Emoji

"Emoji" is not one of the 30 General Categories — it's a separate boolean property defined in UTS #51 that cuts across several of them.

The Emoji property is one of the easiest pieces of Unicode to get wrong, because the public mental model ("emoji are a kind of character") doesn't match the standard. Strictly, Emoji is a binary property recorded in the file emoji-data.txt alongside five siblings: Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component, and Extended_Pictographic. The full definition lives in UTS #51. As of Unicode 16.0 about 1,431 codepoints carry Emoji=Yes.

Why the General Categories don't line up

The General Category was settled in the early 1990s, nearly two decades before emoji entered Unicode with version 6.0 in 2010. Rather than break compatibility, the Unicode Technical Committee added the Emoji property as a separate dimension. The consequence is that the codepoints with Emoji=Yes are spread across several General Categories.

Where emoji live (by General Category)

So
Symbol, other — the majority. Pictographic emoji like 🌍 (U+1F30D EARTH GLOBE EUROPE-AFRICA), 🍕 (U+1F355 SLICE OF PIZZA), the smileys 😀 😂 🥰 (U+1F600+), the animals 🐕 🐈 (U+1F415, U+1F408), most of the people emoji (🧑 U+1F9D1 and friends), and the dingbats that became emoji: ❤ U+2764, ✨ U+2728, ✅ U+2705. The snowman ☃ U+2603 is So and emoji.
Po
Punctuation, other — a handful of codepoints, including # U+0023, * U+002A, ‼ U+203C, and ⁉ U+2049. (The ASCII digits 0–9 are also keycap bases, but they are Nd; see below.) The hash and asterisk become the keycap emoji #️⃣ *️⃣ only when followed by U+FE0F and U+20E3.
Nd
Number, decimal digit — the ten ASCII digits U+0030 through U+0039 all carry Emoji=Yes. They render as ordinary digits unless followed by VS16 U+FE0F + COMBINING ENCLOSING KEYCAP U+20E3, which produces the keycap emoji 1️⃣ 2️⃣ … 9️⃣ 0️⃣.
Sm
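Symbol, math — a handful, mainly arrows and geometric shapes that take emoji presentation when paired with VS16: ↔ ⤴ ⤵ (U+2194, U+2934, U+2935) and the squares ◻ ◼ (U+25FB, U+25FC). The more familiar emoji arrows ⬆ ⬇ ⬅ ➡ (U+2B06, U+2B07, U+2B05, U+27A1) are So, not Sm.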
Sk
Symbol, modifier — the five skin-tone modifiers U+1F3FB through U+1F3FF (light, medium-light, medium, medium-dark, dark). These are Emoji_Modifier=Yes and apply to any Emoji_Modifier_Base=Yes codepoint.
Cf
Other, format — no Cf codepoint carries Emoji=Yes itself, but two kinds appear inside emoji sequences and are Emoji_Component=Yes: ZWJ U+200D, and the tag characters U+E0020 through U+E007F used in subdivision flags 🏴󠁧󠁢󠁥󠁮󠁧󠁿. The regional-indicator letters U+1F1E6 through U+1F1FF are So, not Cf, and Variation Selector-16 U+FE0F, which appears in many sequences, is Mn.
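The spread across categories can be checked with Python's standard unicodedata module. The sketch below is only illustrative: the sample list and the categories() helper are assumptions made for this example, and the module reports General Category only, not the Emoji property itself.

```python
import unicodedata

# Hand-picked sample codepoints that take part in emoji (illustrative only).
SAMPLES = [
    "\U0001F600",  # grinning face
    "\u2603",      # snowman
    "#",           # keycap base
    "1",           # keycap base
    "\u2194",      # left right arrow
    "\U0001F3FB",  # light skin-tone modifier
    "\U0001F1FA",  # regional indicator symbol letter U
    "\u200D",      # zero width joiner
]

def categories(chars):
    """Map each character to its General Category, keyed by U+XXXX label."""
    return {f"U+{ord(c):04X}": unicodedata.category(c) for c in chars}

for label, gc in categories(SAMPLES).items():
    print(label, gc)
```

The printed categories differ from line to line even though every sample is involved in emoji rendering, which is why a General Category test cannot stand in for the Emoji property.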

The six emoji properties

UTS #51 defines six related properties:

  • Emoji — the base property; true for any codepoint that can be displayed as emoji.
  • Emoji_Presentation — true if the codepoint defaults to emoji style (colorful, square) without needing a variation selector. Newer emoji are nearly all Emoji_Presentation=Yes; older ones inherited from Wingdings / Webdings / Zapf Dingbats default to text style.
  • Emoji_Modifier — the five skin-tone modifiers U+1F3FB–U+1F3FF.
  • Emoji_Modifier_Base — any emoji that accepts a skin-tone modifier (people emoji, hand gestures, etc.).
  • Emoji_Component — building blocks used inside sequences (skin tones, hair colors, regional indicators, ZWJ).
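  • Extended_Pictographic — a broader set used by grapheme cluster segmentation. It covers most Emoji=Yes codepoints (but not the keycap bases, regional indicators, or skin-tone modifiers) plus additional symbols and reserved codepoints set aside for future emoji, all of which need to be treated as pictographic for breaking purposes.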
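Because Python's standard library does not expose these properties, one common approach is to read them from emoji-data.txt directly. The following is a minimal sketch, assuming the file's "codepoints ; property # comment" layout; the SAMPLE text is an abbreviated excerpt written for illustration (its comments are simplified), and parse_emoji_data is a hypothetical helper name.

```python
from collections import defaultdict

# A few lines in the emoji-data.txt layout; the real file is far longer.
SAMPLE = """\
0023          ; Emoji                # number sign
0030..0039    ; Emoji                # digit zero..digit nine
1F600         ; Emoji                # grinning face
1F600         ; Emoji_Presentation   # grinning face
1F3FB..1F3FF  ; Emoji_Modifier       # skin-tone modifiers
1F44B         ; Emoji_Modifier_Base  # waving hand
200D          ; Emoji_Component      # zero width joiner
"""

def parse_emoji_data(text):
    """Return {property name: set of codepoints} for emoji-data.txt-style text."""
    props = defaultdict(set)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        cp_field, prop = (part.strip() for part in line.split(";"))
        first, _, last = cp_field.partition("..")  # single codepoint or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        props[prop].update(range(start, end + 1))
    return props

props = parse_emoji_data(SAMPLE)
print(sorted(props))                       # property names found in the excerpt
print(ord("5") in props["Emoji"])          # digits carry Emoji=Yes
print(0x200D in props["Emoji"])            # ZWJ is a component, not Emoji=Yes
```

Keeping one set per property makes it straightforward to ask questions such as "Emoji but not Emoji_Presentation" with ordinary set operations.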

Sequences are not codepoints

Most "emoji" you actually see are sequences, not single codepoints. The family 👨‍👩‍👧‍👦 is seven codepoints: four people joined by three ZWJs. The Scottish flag 🏴󠁧󠁢󠁳󠁣󠁴󠁿 is also seven codepoints (WAVING BLACK FLAG U+1F3F4 + tag-letters g, b, s, c, t + CANCEL TAG). The firefighter emoji 🧑‍🚒 is a person (U+1F9D1) joined by a ZWJ to a fire engine (U+1F692). None of these sequences is a codepoint with its own General Category. Each component codepoint has its own category, while the sequence is recognised as a single grapheme cluster by UAX #29 segmentation and gets a single emoji glyph from a competent font.
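To see what a sequence is made of, it helps to list its codepoints one by one. This is a small sketch using Python's standard unicodedata module; the constants and the describe() helper are names chosen for this example, and the sequences are written with escapes so that the invisible joiners and tag characters are explicit.

```python
import unicodedata

# Sequences spelled out with escapes so every component is visible in source.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
FIREFIGHTER = "\U0001F9D1\u200D\U0001F692"
SCOTLAND = (
    "\U0001F3F4"                                        # base flag
    "\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074"  # tag letters g b s c t
    "\U000E007F"                                        # cancel tag
)

def describe(seq):
    """Return (label, General Category, name) for each codepoint in seq."""
    rows = []
    for ch in seq:
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", unicodedata.category(ch), name))
    return rows

for title, seq in [("family", FAMILY), ("firefighter", FIREFIGHTER), ("scotland", SCOTLAND)]:
    print(title, "has", len(seq), "codepoints")
    for label, gc, name in describe(seq):
        print("  ", label, gc, name)
```

In Python, len() on a str counts codepoints, so the totals printed here are component counts rather than the number of glyphs a reader sees.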

Because of this, the answer to "how many emoji are there?" depends on what you count. The number of codepoints with Emoji=Yes is around 1,431. The number of distinct emoji shown to users — counting skin-tone variants, gender variants, ZWJ family sequences, country flags, subdivision flags — is around 3,790 (Unicode 16.0 figures, per emoji-test.txt).

Practical implications

For developers, the consequences are:

  • String length is the wrong metric for "number of emoji" in a string. Use grapheme-cluster segmentation (Intl.Segmenter in JS, ICU's BreakIterator, or Swift's String, whose count is measured in Characters, i.e. grapheme clusters).
  • Substring operations must be cluster-aware or you will split a family emoji into a lone man followed by a dangling ZWJ and the rest of the family.
  • Searching for "any emoji" in a string requires testing the Emoji property, not a General Category. Modern regex flavours expose this as \p{Emoji}, but note that it also matches the ASCII digits, # and *, which carry Emoji=Yes; \p{Emoji_Presentation} or \p{Extended_Pictographic} is often a better filter.
  • Sorting on the General Category will give surprising results because keycap-base digits land in Nd, # and * in Po, and the skin-tone modifiers in Sk, far from the So codepoints that make up most emoji.
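The gap between codepoint count and user-perceived count can be illustrated without third-party libraries. The sketch below is a deliberately simplified approximation, not an implementation of UAX #29: it only glues ZWJ, VS16, the keycap mark, skin-tone modifiers, tag characters, and regional-indicator pairs onto the preceding cluster. The function names are hypothetical, and production code should use a real segmenter such as ICU.

```python
ZWJ = 0x200D

def is_extender(cp):
    """Codepoints that attach to the preceding cluster in this simplified model."""
    return (
        cp == 0xFE0F                   # VARIATION SELECTOR-16
        or cp == 0x20E3                # COMBINING ENCLOSING KEYCAP
        or 0x1F3FB <= cp <= 0x1F3FF    # skin-tone modifiers
        or 0xE0020 <= cp <= 0xE007F    # tag characters
    )

def is_regional_indicator(cp):
    return 0x1F1E6 <= cp <= 0x1F1FF

def rough_clusters(text):
    """Split text into approximate emoji-aware clusters (NOT full UAX #29)."""
    clusters = []
    for ch in text:
        cp = ord(ch)
        prev = clusters[-1] if clusters else ""
        joins_prev = bool(prev) and (
            is_extender(cp)
            or cp == ZWJ
            or ord(prev[-1]) == ZWJ    # whatever follows a ZWJ stays attached
            or (is_regional_indicator(cp) and len(prev) == 1
                and is_regional_indicator(ord(prev)))  # second half of a flag
        )
        if joins_prev:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
text = "a" + family + "1\uFE0F\u20E3" + "\U0001F1FA\U0001F1F8"
print(len(text), "codepoints,", len(rough_clusters(text)), "clusters")
```

Slicing by cluster (for example rough_clusters(text)[:2]) keeps the family intact, whereas slicing the raw string by index can cut it at a ZWJ.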

Example characters

U+2603 · So · ☃ · Snowman
U+2764 · So · ❤ · Heavy Black Heart
U+2714 · So · ✔ · Heavy Check Mark
U+1F600 · So · 😀 · Grinning Face
U+1F602 · So · 😂 · Face With Tears of Joy
U+1F308 · So · 🌈 · Rainbow
U+1F30D · So · 🌍 · Earth Globe E/A
U+1F355 · So · 🍕 · Slice of Pizza
U+1F415 · So · 🐕 · Dog
U+1F9D1 · So · 🧑 · Adult (gender-neutral)
U+0023+FE0F+20E3 · #️⃣ · Keycap: # (base Po)
U+0031+FE0F+20E3 · 1️⃣ · Keycap: 1 (base Nd)
U+1F3FB · Sk · 🏻 · Light Skin Tone
U+1F3FF · Sk · 🏿 · Dark Skin Tone
U+1F1FA U+1F1F8 · 🇺🇸 · Flag: United States
ZWJ sequence · 👨‍👩‍👧‍👦 · Family: Man, Woman, Girl, Boy
