Letters (L) — Unicode general category Lu, Ll, Lt, Lm, Lo

The Letter group is by far the largest of the seven major General Categories. Roughly 134,000 of the 154,998 assigned codepoints in Unicode 16.0 carry an L-category. The bulk of that count comes from one subcategory, Lo, because every Han ideograph in the CJK Unified Ideographs block and its extensions is an Lo letter. The remaining four — Lu, Ll, Lt, Lm — together account for under 5,000 codepoints, but they are the ones you spend the most time thinking about as a developer, because they participate in case mapping, identifier syntax and word boundaries.

The subcategories

Lu: Letter, uppercase — uppercase forms of cased scripts. The Latin A–Z, the Greek Α–Ω, the Cyrillic А–Я, plus Armenian, Coptic, Glagolitic, Cherokee (since Unicode 8.0 added its uppercase), Adlam, Osage and the cased Georgian. Examples: A, B, Ω, Ж, Ɣ.
Ll: Letter, lowercase — the matching lowercase forms. Most Lu/Ll pairs are linked by a Simple Uppercase Mapping and a Simple Lowercase Mapping, which is what String.prototype.toUpperCase() uses. Examples: a, b, ω, ж, ɣ.
Lt: Letter, titlecase — a small population of 31 digraphs used in Croatian, Serbian, Slovenian and a few archaic Latin orthographies, where the first letter is capitalised but the second remains lowercase (compare DŽ, Dž and dž). Each digraph has its own three precomposed codepoints. Example: ǅ (U+01C5).
Lm: Letter, modifier — small letters that semantically modify the preceding base. Used for IPA superscripts, Arabic letter superscripts, tone marks in transcription. Unlike combining marks (Mn), modifier letters take horizontal space. Example: ʰ (U+02B0), ʷ (U+02B7), ˆ-shaped phonetic marks.
Lo: Letter, other — letters in scripts that do not have case. The CJK ideographs, all Arabic letters, Hebrew letters, every Indic consonant, the Japanese kana, Korean Hangul syllables, Tibetan, Thai, Mongolian and every other non-cased script. Examples: 中, あ, א, ا, ㄱ.

Case mapping

The Lu/Ll relationship is one of the trickiest pieces of basic Unicode plumbing. Most letters have a simple 1:1 mapping — A ↔ a, Ä ↔ ä, Ω ↔ ω. But Unicode also defines full case mappings, where a single codepoint can expand on case change: ß (U+00DF) uppercases to SS, and the Greek lowercase sigma at the end of a word, ς (U+03C2), maps back to Σ (U+03A3) the same as medial sigma. A few codepoints are case-insensitive only via the caseFolding file — folding I (U+0049) and İ (U+0130) is the famous Turkish trap, because the Turkish dotless capital I lowercases to ı (U+0131), while Turkish dotted İ lowercases to i. Software that assumes str.toUpperCase().toLowerCase() === str.toLowerCase() will silently break Turkish text.

Regex and identifier syntax

The Unicode-aware regex shorthand \p{L} matches any codepoint in the Letter group. \p{Lu}, \p{Ll}, \p{Lt}, \p{Lm} and \p{Lo} match each subcategory specifically. Programming-language identifier rules are typically built on L + Nd + Pc + Lo + Lm: this is what Python 3, Java, and modern JavaScript permit. The UAX #31 Identifier and Pattern Syntax document formalises exactly which subcategories are eligible to start an identifier (L and a few Other_ID_Start) and which can continue it (L, M, Nd, Pc, plus Other_ID_Continue).

Example characters

A small selection of letters from across the L subcategories. Click any with an existing detail page; the rest are shown for reference.

The subcategories

Case mapping

Regex and identifier syntax

Example characters

Related