CATEGORY · C · OTHER

Other

Control codes, invisible format characters, the UTF-16 surrogates, the Private Use Areas, and every codepoint that isn't yet assigned to a character.

The Other group is the catch-all bin of the General Category system. It collects the codepoints that are not graphic characters: control codes inherited from ASCII, invisible format characters that influence display, the surrogate range that exists only for UTF-16, the three Private Use Areas, and the roughly 819,000 codepoints that haven't been assigned to anything yet. Although most of them have no standard visible form, these codepoints are responsible for a disproportionate share of bugs in text-processing systems.

The subcategories

Cc
Other, control — exactly 65 codepoints. The C0 controls U+0000 through U+001F inherited from ASCII, the DEL character U+007F, and the C1 controls U+0080 through U+009F inherited from ISO/IEC 6429. These include NULL, the line-ending characters LF and CR, the horizontal TAB U+0009, the form feed, the escape character used by ANSI terminal codes, and the C1 controls like CSI U+009B that legacy systems used to introduce escape sequences.
Cf
Other, format: invisible characters that affect rendering or interpretation without producing a glyph. ZERO WIDTH JOINER U+200D (ZWJ), ZERO WIDTH NON-JOINER U+200C (ZWNJ), the bidi controls U+202A–U+202E and U+2066–U+2069, U+FEFF (the BYTE ORDER MARK at the start of a stream, ZERO WIDTH NO-BREAK SPACE anywhere else), SOFT HYPHEN U+00AD, LEFT-TO-RIGHT MARK U+200E, RIGHT-TO-LEFT MARK U+200F, and the language tag characters in plane 14. The variation selectors VS1–VS256 behave like format controls but are classified as Mn (Nonspacing Mark), not Cf.
Cs
Other, surrogate — exactly 2,048 codepoints from U+D800 through U+DFFF. These are reserved permanently and will never be assigned to characters. They exist only to allow UTF-16 to encode codepoints above U+FFFF as pairs of 16-bit code units. Encoding a lone surrogate as UTF-8 is invalid; well-formed UTF-8 cannot contain bytes that would represent U+D800–U+DFFF.
Co
Other, private use — 137,468 codepoints across three ranges. The BMP Private Use Area at U+E000–U+F8FF (6,400 codepoints), Supplementary Private Use Area-A at U+F0000–U+FFFFD (65,534), and SPUA-B at U+100000–U+10FFFD (65,534). These codepoints are guaranteed never to be assigned by Unicode and are free for private agreements — vendor logos, conscript fonts, internal markup.
Cn
Other, unassigned: every codepoint that has no character assigned to it, about 819,000 in Unicode 16.0. This includes the 66 designated noncharacters (U+FDD0–U+FDEF plus the last two codepoints of each plane, U+FFFE/U+FFFF through U+10FFFE/U+10FFFF), which are reserved for internal use by applications and are not intended for open interchange.
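The subcategory sizes above can be checked against the Unicode database that ships with a Python interpreter. The sketch below is only an illustration (the helper name `count_other_categories` is made up for this example). The Cc, Cs, and Co totals are fixed by the standard, while the Cf and Cn totals depend on the Unicode version bundled with the interpreter, reported by `unicodedata.unidata_version`.

```python
import unicodedata
from collections import Counter


def count_other_categories() -> Counter:
    """Count codepoints in each C* subcategory across the whole codespace."""
    counts = Counter()
    for cp in range(0x110000):  # U+0000 through U+10FFFF
        category = unicodedata.category(chr(cp))
        if category.startswith("C"):
            counts[category] += 1
    return counts


counts = count_other_categories()
print("Unicode database version:", unicodedata.unidata_version)
for category in ("Cc", "Cf", "Cs", "Co", "Cn"):
    print(category, counts[category])

# Variation selectors are nonspacing marks, not format characters.
print("U+FE0F is", unicodedata.category("\ufe0f"))
```

Scanning all 1,114,112 codepoints takes well under a few seconds on typical hardware, so there is little reason to hard-code these tables when the database is available.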

Control characters: the ASCII inheritance

The 32 C0 control codes plus DEL come straight from ASCII (1963/1967). Some are still in active use: LF U+000A and CR U+000D as line terminators, HT U+0009 as horizontal tab, and BEL U+0007, which still rings in some terminal emulators. Most of the rest are obsolete legacies of teletype protocols: ACK, NAK, ENQ, EOT, the SO/SI shift codes, the device controls DC1–DC4. ESC U+001B is the entry point to ANSI terminal escape sequences, which is why sending untrusted text containing it to a terminal can move the cursor, change colors, or overwrite what is on screen. The C1 controls U+0080–U+009F are even more obscure; the best known is CSI U+009B (Control Sequence Introducer), used by the VT220 and successor terminals. In normal text these should never appear, and their presence often signals a decoding error, typically Windows-1252 bytes 0x80–0x9F read as ISO 8859-1.
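A common defensive step is to flag or drop control characters before text reaches a terminal or a log. The following is a minimal sketch, assuming that tab, LF, and CR are the only controls worth keeping; the names `KEEP`, `find_controls`, and `strip_controls` are illustrative, not part of any library.

```python
import unicodedata

# Assumption: these three controls are legitimate in ordinary text.
KEEP = {"\t", "\n", "\r"}


def find_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every Cc character not in KEEP."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in KEEP
    ]


def strip_controls(text: str) -> str:
    """Remove every Cc character except the ones in KEEP."""
    return "".join(
        ch for ch in text if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


# Two ESC-introduced sequences and one C1 CSI-introduced sequence.
sample = "ok\tline\n\x1b[31mred\x1b[0m \x9b0m"
print(find_controls(sample))
print(repr(strip_controls(sample)))
```

Stripping only the control character leaves the rest of the escape sequence (such as `[31m`) behind as visible text, which is usually the safer failure mode: the sequence becomes inert and the tampering stays visible.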

Format characters: the invisible workforce

Cf codepoints are where the action is in modern Unicode. ZWJ U+200D asks the renderer to join the characters on either side of it into a single visual unit: it is the engine behind family emoji 👨‍👩‍👧‍👦, profession and gender sequences such as 🧑‍⚕️, and Devanagari conjunct half-forms. ZWNJ U+200C does the opposite: it prevents shaping that would otherwise occur, and is used in Persian text to break the connection between two letters that would normally join. Variation selectors VS1–VS256 (U+FE00–U+FE0F plus U+E0100–U+E01EF) work alongside these characters, following a base character to request a specific glyph variant (VS15/VS16 toggle text vs emoji presentation), although their General Category is Mn rather than Cf. Bidi controls push and pop directional embeddings, overrides, and isolates; the "Trojan Source" attack published in 2021 hid hostile code in source files by using controls such as RLO, RLI, and LRI to make the rendered order of the text differ from its logical order.
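Because these characters are invisible, the practical way to see them is to list codepoints rather than look at rendered text. The sketch below first breaks the family emoji into its parts, then scans source text for the bidi controls in U+202A–U+202E and U+2066–U+2069. The function name `find_bidi_controls` and the sample strings are assumptions made for illustration; a real linter would also need a policy for legitimate right-to-left text in comments and string literals.

```python
import unicodedata

# A ZWJ sequence: man, woman, girl, boy joined by three U+200D characters.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print([f"U+{ord(ch):04X}" for ch in family])

# Bidi embedding, override, and isolate controls.
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(source: str):
    """Yield (line, column, character name) for each bidi control found."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, unicodedata.name(ch)


suspicious = "a = 1\nb = '\u202e'\n"
for hit in find_bidi_controls(suspicious):
    print(hit)
```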

Surrogates: the UTF-16 scaffolding

Surrogates are an artifact of UTF-16. When UCS-2 needed to escape the 16-bit cage in 1996, the Unicode designers reserved U+D800–U+DBFF as high surrogates and U+DC00–U+DFFF as low surrogates, with the property that any codepoint from U+10000 through U+10FFFF could be encoded as a (high, low) pair. The math is fixed: code = 0x10000 + ((high − 0xD800) × 0x400) + (low − 0xDC00). UTF-8 and UTF-32 do not need surrogates and, when well-formed, cannot contain them. A JavaScript String is a sequence of UTF-16 code units, so it can hold a "lone surrogate" with no partner; such strings are not well-formed Unicode, and TextEncoder replaces each lone surrogate with U+FFFD instead of encoding it. Round-tripping them losslessly requires a superset encoding such as WTF-8.
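The pair arithmetic is small enough to write out in both directions. This is a sketch of the formula given above, with hypothetical helper names `to_surrogates` and `from_surrogates`; the final lines show a strict UTF-8 encoder (here Python's) refusing a lone surrogate.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary codepoint into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    offset = cp - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF) # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid (high, low) surrogate pair")
    return 0x10000 + ((high - 0xD800) * 0x400) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(f"back    -> U+{from_surrogates(high, low):04X}")

# A lone surrogate has no well-formed UTF-8 representation.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as exc:
    print("lone surrogate rejected:", exc.reason)
```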

Private Use and the unencoded

The three Private Use Areas hold 137,468 codepoints that Unicode promises never to assign. They are commonly used for conscripts (Tengwar at U+E000+, Klingon pIqaD at U+F8D0–U+F8FF in the ConScript registry), vendor symbols (Apple's logo at U+F8FF), medievalist scholarly mappings (the MUFI registry uses thousands of PUA codepoints for ligatures and abbreviations not in regular Unicode), and game/UI icon fonts. Two organisations, the ConScript Unicode Registry and MUFI, have built consensus around PUA assignments, but neither agreement is binding outside its participants. PUA text is therefore inherently context-dependent: the same codepoint can mean different things in different documents. U+F8FF alone is the Apple logo on Apple platforms and the last codepoint of the Klingon block in the ConScript registry.
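Since PUA codepoints only make sense alongside a private agreement or a specific font, it can be useful to flag them when text crosses a system boundary. The sketch below hard-codes the three ranges listed earlier; `PUA_RANGES`, `pua_area`, and `find_private_use` are names invented for this example.

```python
from typing import Optional

# (label, first codepoint, last codepoint) for the three Private Use Areas.
PUA_RANGES = (
    ("BMP PUA", 0xE000, 0xF8FF),
    ("SPUA-A", 0xF0000, 0xFFFFD),
    ("SPUA-B", 0x100000, 0x10FFFD),
)


def pua_area(ch: str) -> Optional[str]:
    """Return the name of the Private Use Area containing ch, or None."""
    cp = ord(ch)
    for label, first, last in PUA_RANGES:
        if first <= cp <= last:
            return label
    return None


def find_private_use(text: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', area) for each private-use character."""
    hits = []
    for i, ch in enumerate(text):
        area = pua_area(ch)
        if area is not None:
            hits.append((i, f"U+{ord(ch):04X}", area))
    return hits


total = sum(last - first + 1 for _, first, last in PUA_RANGES)
print("private-use codepoints:", total)
print(find_private_use("a\uf8ffb\U000F0000"))
```

Whether to reject, strip, or merely log such characters is a policy decision; icon fonts and scholarly corpora rely on them legitimately.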

Example characters

U+0000 · Cc · Null
U+0009 · Cc · Horizontal Tab
U+000A · Cc · Line Feed
U+000D · Cc · Carriage Return
U+001B · Cc · Escape
U+007F · Cc · Delete
U+00AD · Cf · Soft Hyphen
U+200B · Cf · Zero Width Space
U+200C · Cf · Zero Width Non-Joiner
U+200D · Cf · Zero Width Joiner
U+200E · Cf · Left-to-Right Mark
U+FEFF · Cf · Byte Order Mark
U+FE0F · Mn · Variation Selector-16 (a nonspacing mark, listed here because it acts like a format control)
U+D800 · Cs · High Surrogate (not a character)
U+F8FF · Co · Last BMP PUA codepoint
U+E000 · Co · First BMP PUA codepoint

Related