Casing and collation in Unicode

The two operations that look easiest in ASCII — uppercasing and sorting — are the two that go strangest at Unicode scale. An English programmer typing "INSTANBUL".toLowerCase() expects a single answer. A Turkish user reading the result might see two. A Greek text round-tripped through a naive lowercaser changes the shape of its final letter. A list of names sorted by codepoint puts Zoë after Zoe and before Zora, which is exactly wrong in every locale where anyone reads it. The Unicode standard documents the rules that make these operations correct; the cost is that almost none of them are pure functions of the string alone.

The four case operations

Unicode defines four case-related transformations, each backed by data in the Unicode Character Database:

Uppercase mapping: From any character to the uppercase equivalent. Defined by the Uppercase_Mapping property (simple) and SpecialCasing.txt (full, context-sensitive).
Lowercase mapping: From any character to the lowercase equivalent. Symmetric counterpart.
Titlecase mapping: From any character to the titlecase form, used at the start of a word. Differs from uppercase for the few digraph characters where the second letter is conventionally not capitalised (Dutch IJ, ǅ, ǈ, ǋ).
Case folding: A one-way transformation for case-insensitive comparison. Defined by CaseFolding.txt. Always lowercases, sometimes more aggressively than the plain lowercase mapping.

The four operations look similar and produce different output on a handful of characters that matter disproportionately. Uppercasing ß (U+00DF, German sharp s) used to be undefined; under full case mapping it becomes SS. Case-folding ß produces ss — useful when you want "GROßE".toLowerCase() compared to "grosse" to match. Uppercasing the Latin ligature ﬃ (U+FB03) yields FFI. The output of a case operation can be longer than the input.

Simple versus full case mapping

The standard documents two flavours of each mapping. The simple mapping is one-to-one: each input codepoint maps to exactly one output codepoint, or to itself. The full mapping permits one-to-many: a single input codepoint can expand to a sequence.

Input	Simple uppercase	Full uppercase
a	A	A
ß (U+00DF)	ß (no change)	SS
ﬃ (U+FB03)	ﬃ (no change)	FFI
ﬄ (U+FB04)	ﬄ (no change)	FFL
ŉ (U+0149, preceded by apostrophe)	ŉ	ʼN (two codepoints)

The full mapping is what most modern languages provide. JavaScript's String.prototype.toUpperCase uses full case mapping. Java's String.toUpperCase() uses full mapping. Python's str.upper() uses full mapping. The simple mapping is mostly useful for one-codepoint-in-one-codepoint-out APIs at the codepoint level (Character.toUpperCase(int) in Java).

Case folding

Case folding is not lowercasing. It is a separate transformation designed for case-insensitive comparison, and it makes choices that lowercase does not.

ß (U+00DF) → ss: Full case folding decomposes the sharp s, so STRAßE case-folded matches strasse.
µ (U+00B5 micro sign) → μ (U+03BC Greek small mu): The Latin-1 micro sign folds to the Greek letter mu. Lowercasing leaves the micro sign alone; case folding canonicalises to the Greek letter that visually matches it.
ϐ (U+03D0 Greek beta symbol) → β (U+03B2): The alternate beta glyph folds to the regular beta.
ς (U+03C2 final sigma) → σ (U+03C3): Folding collapses both Greek sigmas to the medial form, which lowercasing does not do.

The rule of thumb: lowercase for display, case-fold for comparison. Languages expose this distinction unevenly. Python provides str.casefold() separately from str.lower(). Java exposes case folding only through ICU. JavaScript does not expose case folding directly — you can approximate it with toLowerCase() on most strings but not on German or Greek, where the difference matters.

The Turkish I

The most famous trap in Unicode casing is the dotted-versus-dotless I problem. In Turkish, Azerbaijani, and a few related orthographies, the Latin letter I has two case pairs, not one:

Locale: en-US                    Locale: tr-TR
i (U+0069)  →  I (U+0049)        i (U+0069)  →  İ (U+0130)
                                 ı (U+0131)  →  I (U+0049)

In English the lowercase of I is i and the uppercase of i is I. In Turkish, the lowercase of I is ı (dotless i, U+0131) and the uppercase of i is İ (capital I with dot above, U+0130). The distinction is not optional. Turkish keyboards have separate keys for the dotted and dotless forms; a name spelled with the wrong one is a different name.

The consequence in code:

"TITLE".toLowerCase()                    // "title"
"TITLE".toLocaleLowerCase('tr-TR')        // "tıtle"  (note dotless i)

"title".toUpperCase()                     // "TITLE"
"title".toLocaleLowerCase('tr-TR')        // "tıtle"
"title".toLocaleUpperCase('tr-TR')        // "TİTLE"  (note dotted I)

If the runtime's default locale is Turkish, calling toLowerCase on the string "FILE" produces "fıle", not "file". A historic class of bugs — Microsoft's tooling has tripped on it more than once — comes from naively case-folding identifiers, HTTP headers, configuration keys, or SQL keywords on systems with Turkish locales. The fix is to use locale-invariant or root-locale casing for anything that is supposed to be ASCII-equivalent: "FILE".toLocaleLowerCase('en-US') or "FILE".toLowerCase('en-US'), or in environments without locale parameters, the lowercase mapping from CaseFolding.txt entries flagged C and S only.

Java's String.equalsIgnoreCase is locale-independent and immune to this. String.toLowerCase() without an argument is locale-dependent and is not. The widely repeated advice "never use toLowerCase for comparison" comes from exactly here.

Greek final sigma

Greek has two lowercase sigmas: σ (U+03C3) inside a word, ς (U+03C2) at the end. Uppercase is a single letter Σ (U+03A3). Lowercasing requires looking ahead: a Σ followed by another letter becomes σ; a Σ followed by a word boundary becomes ς. This is one of the entries in SpecialCasing.txt — a context-sensitive rule that is correct only in the Greek locale.

"ΣΥΣΤΗΜΑ".toLowerCase()              // "σύστημα" — correct in en-US too,
                                      //   because the rule is final-position
"ΣΥΣΤΗΜΑΤΑ".toLowerCase()             // "συστήματα" — note both internal σ
"ΣΥΣΤΗΜΑ".toLocaleLowerCase('el')     // "σύστημα"

The Σ-to-σ-or-ς rule is one of the few default-locale context-sensitive case rules that JavaScript engines apply unconditionally. Stripping it would silently break Greek text in any browser.

Dutch IJ and titlecase

Dutch treats the digraph ij as a single unit. At the start of a sentence the convention is to capitalise both letters: IJsland, not Ijsland. Unicode includes a single precomposed codepoint for the digraph (U+0132 / U+0133), and a number of locale-aware tools handle the two-letter sequence specially in titlecasing. Most modern environments do not, treating "ijsland".toTitleCase() as Ijsland in any locale. Strictly correct Dutch titlecasing requires either the precomposed codepoint or a locale-aware tailoring.

The Unicode Collation Algorithm

Sorting Unicode strings by codepoint value produces nonsense. Z is U+005A, a is U+0061, so codepoint order puts every uppercase letter before every lowercase one. é is U+00E9, which puts every accented letter after the entire ASCII range. ñ sorts after n in English but ought to sort between n and o as a distinct letter in Spanish (in older conventions) or as a variant of n (in modern conventions). The Unicode Collation Algorithm (UTS #10, UCA) provides a layered comparison that captures these distinctions correctly.

The UCA assigns each character a sort key with up to four weight levels:

L1 — Primary: Base letter. a, b, c, d, e have distinct L1 weights; a, A, à, á, ä, ǎ all share the L1 weight of a.
L2 — Secondary: Diacritics. a < à < á < â < ä at L2 in the standard ordering.
L3 — Tertiary: Case and width. a < A at L3.
L4 — Quaternary: Punctuation, variant differences (full-width vs half-width, etc.).

Comparison proceeds in level order. If two strings differ at L1, that decides the order. If they tie at L1, L2 decides. If they tie at L2, L3 decides. The result is that case differences and accent differences are recognised but subordinated to base-letter differences, so café sorts next to cafe and not far away from it, while cafe still comes before caff.

Implementations almost always use ICU. JavaScript exposes the algorithm through Intl.Collator:

const list = ['Zoë', 'Zoe', 'Zora', 'aardvark', 'Åke'];

list.sort();
// ["Zoe", "Zora", "Zoë", "aardvark", "Åke"] — codepoint order, wrong

list.sort(new Intl.Collator('en').compare);
// ["aardvark", "Åke", "Zoe", "Zoë", "Zora"] — UCA at L3

Locale tailorings

The default UCA ordering is a starting point. Every locale tailors it. The most visible cases:

Swedish (sv): Sorts å, ä, ö as distinct letters after z. Åke sorts after Zoë, not between A and B.
German (de): Two variants. Phonebook order folds ä to ae, ö to oe, ü to ue; dictionary order treats them as a, o, u with a secondary difference.
Spanish (es): Historically treated ch as a single letter sorting after c, and ll as a single letter sorting after l. The Real Academia abolished this in 1994; modern Spanish locales sort by individual letter. Some legacy collations still apply the old rule.
Czech (cs): č sorts after c, ch sorts as a single letter after h, ř sorts after r, š after s, ž after z.
Vietnamese (vi): Distinguishes diacritic-marked vowels as distinct letters with their own primary weight; tone marks sort at a secondary level.
Japanese (ja): Sorts by kana reading. CJK ideographs without a registered reading sort by radical-stroke, then codepoint.

None of these tailorings are derivable from the default UCA ordering. They live in CLDR (the Unicode Common Locale Data Repository) and are applied through Intl.Collator with the appropriate locale tag.

Codepoint order versus collation order

Storing strings in codepoint order — what every binary B-tree or array sort does by default — is useful for one thing only: deterministic equality comparison and indexing. It is not useful for display. Any user-facing list of strings should be ordered by a collator.

["École", "École", "Ecole", "Ezra"].sort();
// codepoint order — uppercase first, then accented:
// ["Ecole", "Ezra", "École", "École"]

["École", "École", "Ecole", "Ezra"]
  .sort(new Intl.Collator('fr').compare);
// French locale UCA:
// ["Ecole", "École", "École", "Ezra"]

The collator also offers sensitivity options to flatten levels. sensitivity: 'base' compares at L1 only (case- and accent-insensitive). sensitivity: 'accent' compares at L1+L2 (case-insensitive, accent-sensitive). sensitivity: 'case' compares at L1+L3 (accent-insensitive, case-sensitive). sensitivity: 'variant' is the default and compares at all four levels.

Collation is what equality looks like when you grade it by how much the user cares about the difference.

Practical recommendations

Comparing identifiers: Case-fold with the root locale, normalise to NFKC, apply the IdentifierStatus filter if accepting Unicode identifiers. Never use the system default locale.
Displaying lists: Sort with Intl.Collator in the user's locale. The default sensitivity is correct for almost every list a user reads.
Searching: Indexed by L1-only collation key plus NFKC for fuzziness; full collation key for ranking.
Storing usernames: NFKC, case fold, then check for confusables under UTS #39. Reject collisions on the folded key, not the displayed name.
Database collations: PostgreSQL, MySQL, SQL Server all expose ICU-based collations. Pick a stable one (an explicit ICU locale, not the OS default) so sort order is portable across machines.

The combination of NFC for storage, NFKC + case-fold for identifiers, and Intl.Collator with a named locale for display order covers the great majority of real applications. Resist the temptation to roll any of these by hand; the data tables are large and the edge cases are unforgiving.

What to remember

Case mapping can change string length, can depend on locale, and can depend on context within the string. Case folding is the right operation for comparison and is not the same as lowercase. The Turkish I and Greek final sigma are the two unmissable rules; the IJ digraph and the German sharp s are the ones to know about. Sorting by codepoint is wrong; the Unicode Collation Algorithm gives layered comparisons that match how readers expect letters to sort, with locale tailorings layered on top through CLDR. Use the library, do not roll your own.

The four case operations

Simple versus full case mapping

Case folding

The Turkish I

Greek final sigma

Dutch IJ and titlecase

The Unicode Collation Algorithm

Locale tailorings

Codepoint order versus collation order

Practical recommendations

What to remember

Further reading