An emoji on screen looks like one thing. Inside the text it is often several codepoints. Modern emoji are frequently sequences: a base character, optionally followed by a presentation selector or a skin-tone modifier, optionally followed by a Zero Width Joiner that glues it to another emoji to form something composite. Flags are pairs of letters. The Spanish flag and the family of four are both built by the same kind of machinery: a set of conventions agreed by font vendors and codified by the Unicode Technical Committee.
Emoji vs text presentation
Many emoji codepoints predate the emoji era. The HEAVY BLACK HEART ❤ at U+2764 was added in Unicode 1.1 in 1993, well over a decade before phones rendered it as a colourful pillow. The standard distinguishes two presentations for these dual-purpose codepoints:
- Text presentation: black-on-paper, follows the surrounding font. Forced by appending U+FE0E (VARIATION SELECTOR-15).
- Emoji presentation: colourful, font-independent emoji glyph. Forced by appending U+FE0F (VARIATION SELECTOR-16).
U+2764 → ❤ (default presentation, font-dependent)
U+2764 U+FE0E → ❤︎ (text presentation, VS15)
U+2764 U+FE0F → ❤️ (emoji presentation, VS16)
Each emoji codepoint has a default presentation listed in emoji-data.txt. The pure-emoji codepoints (the ones added specifically for emoji use, in the U+1F300–U+1F9FF range and elsewhere) have Emoji_Presentation = Yes and need no selector. The dual-use codepoints have Emoji_Presentation = No and need a VS16 if you want the colourful form to appear on systems that otherwise default to text.
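To make the selector mechanics concrete, here is a minimal Python sketch. The helper names `with_presentation` and `describe` are illustrative only and not part of any library; the only dependency assumed is the standard `unicodedata` module. It forces a presentation by appending VS15 or VS16, then lists what is actually stored in the string.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation


def with_presentation(base: str, emoji: bool) -> str:
    """Return base with any trailing selector replaced by VS16 or VS15."""
    stripped = base.rstrip(VS15 + VS16)
    return stripped + (VS16 if emoji else VS15)


def describe(text: str) -> list[str]:
    """List each codepoint as 'U+XXXX NAME' for inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


heart = "\u2764"
print(describe(with_presentation(heart, emoji=True)))
# ['U+2764 HEAVY BLACK HEART', 'U+FE0F VARIATION SELECTOR-16']
```

Whether the result is drawn in colour still depends on the font and renderer; the sketch only controls which codepoints are in the string.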
Skin tone modifiers
The five skin tone modifiers were added in Unicode 8.0 in 2015. They are codepoints U+1F3FB through U+1F3FF, named after the Fitzpatrick scale (a dermatological classification of human skin types developed by Thomas B. Fitzpatrick in 1975). Applied to a human emoji, the modifier replaces the default yellow tone with the chosen skin colour.
| Codepoint | Name | Fitzpatrick type | Glyph |
|---|---|---|---|
| U+1F3FB | EMOJI MODIFIER FITZPATRICK TYPE-1-2 | I-II (light) | 🏻 |
| U+1F3FC | EMOJI MODIFIER FITZPATRICK TYPE-3 | III | 🏼 |
| U+1F3FD | EMOJI MODIFIER FITZPATRICK TYPE-4 | IV | 🏽 |
| U+1F3FE | EMOJI MODIFIER FITZPATRICK TYPE-5 | V | 🏾 |
| U+1F3FF | EMOJI MODIFIER FITZPATRICK TYPE-6 | VI (dark) | 🏿 |
👋 U+1F44B WAVING HAND → yellow default
👋🏽 U+1F44B U+1F3FD → medium skin tone
UTF-8 bytes for 👋🏽:
F0 9F 91 8B F0 9F 8F BD (8 bytes for 2 codepoints, 1 grapheme)
Not every human emoji accepts a modifier; the ones that do are listed in emoji-data.txt under the property Emoji_Modifier_Base. The fonts that ship with iOS, Android, Windows, and major Linux desktops all implement the modifier mechanism; a system that lacks the ligature will show the base hand and the colour swatch as two separate glyphs side by side.
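The waving-hand example can be reproduced in a few lines of Python. This is a sketch under stated assumptions: `apply_skin_tone` and the `MODIFIERS` table are hypothetical names, and the function trusts the caller to pass a base that has Emoji_Modifier_Base rather than consulting emoji-data.txt itself.

```python
# Fitzpatrick modifiers, keyed by the type labels used in their Unicode names.
MODIFIERS = {
    "1-2": "\U0001F3FB",
    "3": "\U0001F3FC",
    "4": "\U0001F3FD",
    "5": "\U0001F3FE",
    "6": "\U0001F3FF",
}
VS16 = "\uFE0F"


def apply_skin_tone(base: str, fitzpatrick_type: str) -> str:
    """Append a skin-tone modifier to a single base emoji.

    Assumption: the caller has already checked that base is an
    Emoji_Modifier_Base; this sketch does not validate that.
    """
    base = base.rstrip(VS16)  # the modifier takes the place of a trailing VS16
    return base + MODIFIERS[fitzpatrick_type]


wave = apply_skin_tone("\U0001F44B", "4")
print([f"U+{ord(c):04X}" for c in wave])      # ['U+1F44B', 'U+1F3FD']
print(wave.encode("utf-8").hex(" ").upper())  # F0 9F 91 8B F0 9F 8F BD
print(len(wave), len(wave.encode("utf-8")))   # 2 8
```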
Zero Width Joiner sequences
The Zero Width Joiner at U+200D is the most generative piece of emoji machinery. Originally introduced in Unicode 1.1 for Arabic and Indic ligatures, it was later repurposed for emoji: a sequence of emoji separated by ZWJs becomes a single composite emoji if and only if the font has a ligature for that exact sequence.
The family-of-four emoji is the canonical example:
👨‍👩‍👧‍👦 FAMILY: MAN, WOMAN, GIRL, BOY
Codepoint sequence:
U+1F468 MAN
U+200D ZERO WIDTH JOINER
U+1F469 WOMAN
U+200D ZERO WIDTH JOINER
U+1F467 GIRL
U+200D ZERO WIDTH JOINER
U+1F466 BOY
That is 7 codepoints.
UTF-8 bytes:
F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D
F0 9F 91 A7 E2 80 8D F0 9F 91 A6
→ 25 bytes total. One grapheme cluster.
Other widely supported ZWJ sequences:
| Glyph | Sequence | Codepoints |
|---|---|---|
| 🏳️‍🌈 | WHITE FLAG + VS16 + ZWJ + RAINBOW | U+1F3F3 U+FE0F U+200D U+1F308 |
| 🏴‍☠️ | BLACK FLAG + ZWJ + SKULL AND CROSSBONES + VS16 | U+1F3F4 U+200D U+2620 U+FE0F |
| 👨‍💻 | MAN + ZWJ + PERSONAL COMPUTER | U+1F468 U+200D U+1F4BB |
| 👨‍👩‍👧 | FAMILY: MAN, WOMAN, GIRL | U+1F468 U+200D U+1F469 U+200D U+1F467 |
| 👩🏽‍🚀 | WOMAN + skin tone IV + ZWJ + ROCKET | U+1F469 U+1F3FD U+200D U+1F680 |
| 👨‍❤️‍💋‍👨 | KISS: MAN, MAN | U+1F468 U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468 |
On a font that supports the sequence, the rendered output is a single composite glyph. On a font that does not, you see the individual parts side by side; the ZWJs between them are invisible. The Unicode standard does not require fonts to support any particular ZWJ sequence. The list of recommended sequences is maintained in emoji-zwj-sequences.txt and grows with each release.
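Building and taking apart a ZWJ sequence is plain string joining, which the following Python sketch illustrates. The function names `join_emoji` and `split_emoji` are made up for this example; nothing here checks whether a given sequence is in emoji-zwj-sequences.txt or whether any font supports it.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def join_emoji(*parts: str) -> str:
    """Glue emoji together with ZWJs; rendering as one glyph is up to the font."""
    return ZWJ.join(parts)


def split_emoji(sequence: str) -> list[str]:
    """Recover the parts that a font without the ligature would draw separately."""
    return sequence.split(ZWJ)


# MAN, WOMAN, GIRL, BOY
family = join_emoji("\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466")
print(len(family))                  # 7 codepoints
print(len(family.encode("utf-8")))  # 25 bytes
print(len(split_emoji(family)))     # 4 components
```

The split is also a reasonable model of the fallback rendering: each component is a valid emoji in its own right.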
Flag emoji and Regional Indicators
Country flags use a different mechanism. Unicode does not assign a separate codepoint to each national flag (the political minefield was deliberately avoided). Instead, there are 26 Regional Indicator Symbol codepoints (U+1F1E6 through U+1F1FF), one for each Latin letter A through Z. A pair of Regional Indicators forms a flag if the resulting two-letter sequence corresponds to an ISO 3166-1 alpha-2 country code that the font has a flag for.
🇺🇸 UNITED STATES FLAG
= Regional Indicator U + Regional Indicator S
= U+1F1FA U+1F1F8
🇪🇸 SPAIN FLAG
= Regional Indicator E + Regional Indicator S
= U+1F1EA U+1F1F8
🇯🇵 JAPAN FLAG
= Regional Indicator J + Regional Indicator P
= U+1F1EF U+1F1F5
The Regional Indicators on their own render as styled letters (often as a letter inside a box). The pair-rendering is, again, a font ligature. Windows famously does not ship country-flag glyphs in Segoe UI Emoji: the underlying codepoints are supported, but the font draws them as letter boxes, producing the now-familiar "JP" rather than 🇯🇵. Windows 11 kept that behaviour.
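Because the Regional Indicators sit in the same order as the Latin alphabet, the mapping is simple offset arithmetic. The Python sketch below uses hypothetical helper names (`flag_from_country_code`, `country_code_from_flag`) and deliberately does not check that the code is an assigned ISO 3166-1 entry, so it will happily produce pairs that no font has a flag for.

```python
RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_from_country_code(code: str) -> str:
    """Map a two-letter code such as 'ES' to its pair of Regional Indicators."""
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    return "".join(chr(RI_BASE + ord(ch) - ord("A")) for ch in code.upper())


def country_code_from_flag(flag: str) -> str:
    """Inverse mapping: roughly what a font without flag glyphs displays."""
    return "".join(chr(ord(ch) - RI_BASE + ord("A")) for ch in flag)


spain = flag_from_country_code("es")
print([f"U+{ord(c):04X}" for c in spain])  # ['U+1F1EA', 'U+1F1F8']
print(country_code_from_flag(spain))       # ES
```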
For regional flags below national level (Scotland, England, Wales, Texas), Unicode introduced a separate mechanism in Unicode 10.0 (2017): the Tag Sequence. A black flag U+1F3F4 is followed by a sequence of tag characters from U+E0020–U+E007E spelling out the ISO 3166-2 subdivision code, terminated by U+E007F CANCEL TAG.
🏴󠁧󠁢󠁥󠁮󠁧󠁿 ENGLAND FLAG
= U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F
(black flag + g + b + e + n + g + cancel tag)
→ that is "gbeng": the ISO 3166-2 code GB-ENG, lower-cased and with the hyphen dropped.
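The tag characters mirror ASCII at an offset of 0xE0000, so a tag sequence can be assembled mechanically. The following Python sketch assumes a hypothetical helper named `subdivision_flag`; it does not verify that the subdivision code exists or that any font draws the result.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG
TAG_OFFSET = 0xE0000       # tag character = ASCII codepoint + 0xE0000
CANCEL_TAG = "\U000E007F"  # terminates the tag sequence


def subdivision_flag(iso_3166_2: str) -> str:
    """Build the tag sequence for a subdivision code such as 'GB-ENG'."""
    tag_spec = iso_3166_2.replace("-", "").lower()
    if not (tag_spec.isascii() and tag_spec.isalnum()):
        raise ValueError(f"unexpected subdivision code: {iso_3166_2!r}")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in tag_spec)
    return BLACK_FLAG + tags + CANCEL_TAG


england = subdivision_flag("GB-ENG")
print([f"U+{ord(c):04X}" for c in england])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0065', 'U+E006E', 'U+E0067', 'U+E007F']
```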
Counting emoji
The simplest emoji is one codepoint. The compound family is seven. A kiss with two skin-tone modifiers, 👨🏽‍❤️‍💋‍👨🏿, is ten codepoints and 35 UTF-8 bytes. A user sees one image; "👨🏽‍❤️‍💋‍👨🏿".length in JavaScript is 15 (UTF-16 needs a surrogate pair for each of the five SMP codepoints, while the three ZWJs, the heart, and VS16 are single units). The only correct way to count emoji-as-perceived is grapheme-cluster segmentation. See codepoint, character, glyph, grapheme for the four-way framing.
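The different counts for the kiss sequence can be checked directly. This Python sketch uses an illustrative helper, `measure`, and only standard string encoding; the UTF-16 figure is the number of 16-bit code units, which is what JavaScript's .length reports.

```python
kiss = (
    "\U0001F468\U0001F3FD"        # MAN + Fitzpatrick type-4 modifier
    "\u200D\u2764\uFE0F"          # ZWJ + HEAVY BLACK HEART + VS16
    "\u200D\U0001F48B"            # ZWJ + KISS MARK
    "\u200D\U0001F468\U0001F3FF"  # ZWJ + MAN + Fitzpatrick type-6 modifier
)


def measure(text: str) -> dict[str, int]:
    """Count the same string in three different units."""
    return {
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


print(measure(kiss))
# {'codepoints': 10, 'utf8_bytes': 35, 'utf16_units': 15}
```

None of these three numbers is the perceived count of one. Python's standard library has no grapheme-cluster segmenter, so getting that figure means using an implementation of UAX #29 from elsewhere.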
An "emoji" property in the Unicode database does not mean a codepoint renders as a colourful pictograph. It means the codepoint is eligible for emoji presentation. The ASCII digits 0–9 carry the Emoji property, which is why keycap sequences like 1️⃣ work: digit + VS16 + U+20E3 COMBINING ENCLOSING KEYCAP.
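The keycap recipe is short enough to write out. In this Python sketch the function name `keycap` is hypothetical, and the accepted bases are assumed to be the ten digits plus # and *, which is the set that keycap sequences are defined over.

```python
VS16 = "\uFE0F"    # VARIATION SELECTOR-16
KEYCAP = "\u20E3"  # COMBINING ENCLOSING KEYCAP


def keycap(symbol: str) -> str:
    """Build a keycap sequence: base + VS16 + U+20E3."""
    if len(symbol) != 1 or symbol not in "0123456789#*":
        raise ValueError(f"not a keycap base: {symbol!r}")
    return symbol + VS16 + KEYCAP


one = keycap("1")
print([f"U+{ord(c):04X}" for c in one])  # ['U+0031', 'U+FE0F', 'U+20E3']
```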
What goes wrong
Most emoji failures fall into a small number of categories:
- Missing ZWJ ligature. The font has the individual emoji but not the composite. The compound shows up as four people in a row instead of one family. The codepoints are correct; the font simply lacks the substitution rule.
- Missing variation selector handling. An old system displays a heart as a black outline because it does not honour U+FE0F. Adding the selector explicitly is harmless on modern systems and helpful on old ones.
- Truncation inside a sequence. A backend that truncates by codepoint can leave a dangling ZWJ at the end of a string, which then renders as a small box. Truncate by grapheme cluster instead.
- Surrogate splits. Code that operates on UTF-16 code units rather than codepoints can split a single SMP emoji into a high surrogate and a low surrogate, neither of which is valid on its own. Use codepoint iteration (for...of in JavaScript, iter() in Python).
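The truncation failure in the list above can be avoided by cutting only at cluster boundaries. The Python sketch below is a deliberately simplified, emoji-only approximation and not an implementation of the UAX #29 grapheme-cluster rules: it groups the joiners, selectors, modifiers, tags, and Regional Indicator pairs described in this article and treats everything else as a cluster of one. All function names are hypothetical.

```python
ZWJ = "\u200D"


def _is_extender(ch: str) -> bool:
    """Codepoints that attach to the emoji before them (simplified list)."""
    cp = ord(ch)
    return (
        cp in (0xFE0E, 0xFE0F, 0x20E3)  # variation selectors, enclosing keycap
        or 0x1F3FB <= cp <= 0x1F3FF     # skin-tone modifiers
        or 0xE0020 <= cp <= 0xE007F     # tag characters and CANCEL TAG
    )


def _is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def emoji_clusters(text: str) -> list[str]:
    """Split text into approximate clusters (assumption: emoji rules only)."""
    clusters: list[str] = []
    i, n = 0, len(text)
    while i < n:
        start = i
        i += 1
        # Two Regional Indicators in a row form one flag.
        if (_is_regional_indicator(text[start]) and i < n
                and _is_regional_indicator(text[i])):
            i += 1
        while i < n:
            if _is_extender(text[i]):
                i += 1
            elif text[i] == ZWJ:
                i = min(i + 2, n)  # the ZWJ plus the element it joins, if any
            else:
                break
        clusters.append(text[start:i])
    return clusters


def truncate(text: str, max_codepoints: int) -> str:
    """Keep as many whole clusters as fit within max_codepoints."""
    kept, used = [], 0
    for cluster in emoji_clusters(text):
        if used + len(cluster) > max_codepoints:
            break
        kept.append(cluster)
        used += len(cluster)
    return "".join(kept)


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
text = "Hi " + family
print(text[:5].endswith(ZWJ))      # True: the naive cut leaves a dangling ZWJ
print(truncate(text, 5) == "Hi ")  # True: the whole family is dropped instead
```

For production use, a full grapheme segmenter is the safer choice, since this sketch ignores combining marks, Hangul, and every other non-emoji rule.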
For inspection, the character inspector shows every codepoint inside an emoji with its name and category; the UTF-8 encoder shows the byte counts. The Emoticons block page lists the 80 codepoints at U+1F600–U+1F64F, a block that was introduced in Unicode 6.0 and filled out in later releases.
Further reading
- Character inspector: paste any emoji and list every codepoint inside it.
- UTF-8 encoder: see emoji byte sequences in all three Unicode encodings.
- Codepoint, character, glyph, grapheme: why an emoji is one grapheme but never one byte.
- Emoticons block: the 80 faces and gestures at U+1F600–U+1F64F.
- ❤ U+2764: the dual-use character that needs VS16 to render in colour.
- UTF-8, UTF-16, UTF-32 compared: why JavaScript's emoji length is so misleading.