The same word can be spelled in Unicode in more than one way, and the user typing it has no idea which spelling their keyboard chose. Normalization is the standardised process of putting a string into a single canonical spelling, so that two strings the user thinks are the same actually compare as equal. The full specification is Unicode Standard Annex #15. The version that matters in code is the one-line decision: which form, NFC, NFD, NFKC, or NFKD?
Two kinds of equivalence
Unicode defines two relations on strings:
- Canonical equivalence
- Two strings represent the same abstract character. é as one codepoint (U+00E9) and é as two (U+0065 U+0301) are canonically equivalent — the standard requires that conforming software treat them as the same character.
- Compatibility equivalence
- Two strings represent the same character in a looser sense that may lose formatting distinctions. The superscript digit ² (U+00B2) is compatibility-equivalent to the digit 2 (U+0032). The full-width Latin letter Ａ (U+FF21) is compatibility-equivalent to A (U+0041). The decomposition discards visual distinctions deliberately.
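The two relations can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `canonically_equal` and `compatibility_equal` are made up for this illustration and are not part of any library.

```python
import unicodedata

composed = "\u00e9"      # é as a single codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent (compared via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equal(a: str, b: str) -> bool:
    """True if a and b are compatibility-equivalent (compared via NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# A raw comparison sees two different strings.
print(composed == decomposed)
# Canonical equivalence sees one character.
print(canonically_equal(composed, decomposed))
# Superscript two matches the digit 2 only under compatibility equivalence.
print(canonically_equal("\u00b2", "2"), compatibility_equal("\u00b2", "2"))
```

Any pair that is canonically equivalent is also compatibility-equivalent, but not the other way round.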
The four normalization forms are the cross product of two choices: which equivalence to apply (canonical or compatibility) and whether the result is composed or decomposed:
| Form | Equivalence | Result shape | Use case |
|---|---|---|---|
| NFD | Canonical | Decomposed | Per-character analysis, accent stripping. |
| NFC | Canonical | Composed | Storage, interchange, transmission. The W3C default. |
| NFKD | Compatibility | Decomposed | Search indexes, fuzzy match. |
| NFKC | Compatibility | Composed | Identifier comparison, login systems. |
The four forms on café
Start with the string café typed in the worst possible way: the e followed by a combining acute accent.
| Input | Codepoints |
|---|---|
| café (as typed) | U+0063 U+0061 U+0066 U+0065 U+0301 |
| Form | Codepoints out | Length |
|---|---|---|
| NFD | U+0063 U+0061 U+0066 U+0065 U+0301 | 5 |
| NFC | U+0063 U+0061 U+0066 U+00E9 | 4 |
| NFKD | U+0063 U+0061 U+0066 U+0065 U+0301 | 5 |
| NFKC | U+0063 U+0061 U+0066 U+00E9 | 4 |
For this string, NFC and NFKC produce identical results, as do NFD and NFKD. The compatibility forms differ from the canonical forms only when a character has a compatibility decomposition. The classic examples follow.
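The table above can be reproduced with a few lines of Python. This is a small sketch using the standard `unicodedata` module; the `codepoints` helper is a hypothetical name introduced only for display.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Format a string as space-separated U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


typed = "cafe\u0301"  # "café" typed as e + COMBINING ACUTE ACCENT

print(f"input {codepoints(typed)}  length={len(typed)}")
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    out = unicodedata.normalize(form, typed)
    print(f"{form:5} {codepoints(out)}  length={len(out)}")
```

Note that `len` counts codepoints, which is why the composed forms report 4 and the decomposed forms report 5 for a word the reader sees as four letters.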
The four forms on ﬃ
U+FB03 is the Latin small ligature ffi, a single codepoint for the historic typographic ligature. Its canonical decomposition is empty — there is no canonically equivalent multi-codepoint form. Its compatibility decomposition is three separate letters.
| Form | Result |
|---|---|
| Input | ﬃ (U+FB03) |
| NFD | ﬃ (U+FB03), unchanged |
| NFC | ﬃ (U+FB03), unchanged |
| NFKD | ffi (U+0066 U+0066 U+0069) |
| NFKC | ffi (U+0066 U+0066 U+0069) |
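The compatibility forms expand the ligature into three separate letters, which is what you want if you are searching for office inside a document where someone has typed oﬃce with the ligature. Compatibility mapping of this kind is also part of domain-name processing: the Nameprep step of IDNA 2003 applies NFKC, and the UTS #46 mapping is based on NFKC as well, partly so that ligatures cannot be used as visual disguises for ASCII. IDNA 2008 itself requires labels to be in NFC already and disallows characters such as U+FB03 rather than mapping them.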
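The search scenario can be sketched as follows. The function name `search_key` and the sample sentence are assumptions made for the example; the only library call is `unicodedata.normalize`.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold compatibility variants (ligatures, width forms) for matching."""
    return unicodedata.normalize("NFKC", text)


# The word "office" typed with the single-codepoint ligature U+FB03.
document = "Send it to the o\ufb03ce today."
query = "office"

print(query in document)                          # raw substring search
print(search_key(query) in search_key(document))  # search on NFKC keys
```

One design consequence: the NFKC key is longer than the original text here, so match offsets found in the key do not map directly back to positions in the stored document.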
The four forms on ½
U+00BD VULGAR FRACTION ONE HALF behaves similarly. The canonical forms preserve it; the compatibility forms decompose it into digit, fraction slash, digit:
| Form | Result |
|---|---|
| Input | ½ (U+00BD) |
| NFD / NFC | ½ (U+00BD) — unchanged |
| NFKD / NFKC | 1⁄2 (U+0031 U+2044 U+0032) |
Note that NFKC does not produce the ASCII string 1/2: the slash is U+2044 FRACTION SLASH, not the ASCII solidus U+002F. NFKC applies the compatibility decomposition and then recomposes canonically; it removes formatting distinctions but does not substitute look-alike ASCII characters across the digit/symbol boundary in ways that would lose semantics. (The compatibility decomposition is defined per character in the Unicode Character Database and is not user-tailorable.)
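A short check makes the fraction-slash point concrete. This is a sketch with the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

nfc = unicodedata.normalize("NFC", half)
nfkc = unicodedata.normalize("NFKC", half)

print(nfc == half)       # canonical forms leave the character alone
print(nfkc == "1/2")     # compatibility form is not the ASCII string
for ch in nfkc:
    # Show each resulting codepoint with its official Unicode name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Code that needs a plain ASCII `1/2` therefore has to map U+2044 to U+002F itself as a separate, application-specific step.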
Other revealing decompositions
A small gallery of cases where NFKC differs from NFC:
| Input | Codepoint | NFKC | Codepoints out |
|---|---|---|---|
| Ａ (full-width A) | U+FF21 | A | U+0041 |
| ² (superscript 2) | U+00B2 | 2 | U+0032 |
| ｶ (half-width katakana KA) | U+FF76 | カ | U+30AB |
| ℡ (telephone sign) | U+2121 | TEL | U+0054 U+0045 U+004C |
| ㎏ (square kg) | U+338F | kg | U+006B U+0067 |
| 𝐀 (mathematical bold A) | U+1D400 | A | U+0041 |
The mathematical alphanumeric symbols in particular (the styled letters and digits in U+1D400 to U+1D7FF) decompose to their plain counterparts under NFKC, so every styled Latin letter becomes the plain ASCII letter. This is why a user-name field that runs NFKC will see 𝐀𝐝𝐦𝐢𝐧 and Admin as identical.
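The gallery can be verified mechanically. The sketch below hard-codes the rows of the table as escape sequences so that the source file itself cannot be silently normalized; the dictionary name `gallery` is an assumption for the example.

```python
import unicodedata

# Input character -> expected NFKC result, taken from the table above.
gallery = {
    "\uff21": "A",          # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00b2": "2",          # SUPERSCRIPT TWO
    "\uff76": "\u30ab",     # HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA
    "\u2121": "TEL",        # TELEPHONE SIGN
    "\u338f": "kg",         # SQUARE KG
    "\U0001d400": "A",      # MATHEMATICAL BOLD CAPITAL A
}

for source, expected in gallery.items():
    result = unicodedata.normalize("NFKC", source)
    print(f"U+{ord(source):04X} -> {result!r} matches table: {result == expected}")

# "Admin" written entirely in mathematical bold letters.
styled = "\U0001d400\U0001d41d\U0001d426\U0001d422\U0001d427"
print(unicodedata.normalize("NFKC", styled) == "Admin")
print(unicodedata.normalize("NFC", styled) == styled)  # NFC leaves it styled
```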
When to normalize
- Store and transmit
- Use NFC. It is the form the W3C and the IETF recommend for web content and protocol text. Browsers do not normalize HTML automatically; the W3C Character Model document recommends that authoring tools save content as NFC. macOS is the notorious exception: its HFS+ file system stored filenames in a variant of NFD, which is the source of many cross-platform bugs. A file named café.txt created on macOS may not match the same name on Linux, where most software produces NFC and file names are compared byte for byte.
- Compare and search
- Use NFC at minimum on both sides. Use NFKC if you want compatibility-equivalent strings to match (full-width vs half-width, ligatures vs letters, styled vs unstyled).
- Login systems
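- Apply NFKC plus case folding (Unicode's case-insensitive mapping, not ASCII tolower) as a baseline; UAX #31 describes this combination for caseless identifier matching. Published identifier profiles build on the same idea but differ in detail and should be preferred where they apply: the PRECIS framework (RFC 8264) disallows compatibility characters in identifiers, and its username profiles (RFC 8265) map width variants and then normalize to NFC.
- Password fields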
- RFC 8265 (PRECIS OpaqueString) prescribes NFC and a specific allowed-character profile. Do not case-fold passwords.
- Accent-insensitive search
- Apply NFD, then strip combining marks (codepoints in the Mn category). "café" NFD-decomposed becomes "cafe" + combining acute, after which removing combining marks leaves "cafe".
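Two of the recipes in the list above can be sketched in a few lines. Both function names (`strip_accents`, `identifier_key`) are hypothetical, and `identifier_key` is only the baseline NFKC-plus-case-folding comparison key, not an implementation of any PRECIS or IDNA profile.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Accent-insensitive key: NFD, drop nonspacing marks (Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def identifier_key(name: str) -> str:
    """Comparison key for identifiers: NFKC, Unicode case folding, NFKC again.

    The second NFKC pass is there because case folding can produce
    sequences that are no longer normalized.
    """
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


print(strip_accents("caf\u00e9"), strip_accents("cafe\u0301"))
print(identifier_key("\uff21dmin") == identifier_key("ADMIN"))
print(identifier_key("STRASSE") == identifier_key("stra\u00dfe"))
```

The key is meant for comparison and uniqueness checks only; the name the user typed would normally be stored separately (in NFC) for display. Passwords should not go through `identifier_key`, since case folding would weaken them.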
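Most languages provide normalization in the standard library or a platform library: JavaScript's String.prototype.normalize(form), Python's unicodedata.normalize(form, s), Java's java.text.Normalizer.normalize(s, form), and Swift's Foundation string properties such as precomposedStringWithCanonicalMapping. In JavaScript and Python the form argument is the literal string "NFC", "NFD", "NFKC", or "NFKD"; Java takes a Normalizer.Form enum constant with the same names.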
The IDN homograph attack
The most cited security consequence of skipping normalization is the internationalised domain name homograph attack. Consider these two strings:
"apple.com"   all ASCII Latin letters: U+0061 U+0070 ...
"аpple.com"   first letter is Cyrillic а: U+0430 U+0070 ...
The Cyrillic small letter A (U+0430) is visually indistinguishable from the Latin small letter A (U+0061) in most fonts, but they are distinct codepoints. Neither NFC nor NFKC turns one into the other, because they have no shared decomposition. Preventing the attack takes checks beyond normalization: registries and browsers apply mixed-script detection, which rejects or flags a label combining characters from several scripts, and confusable detection. Both are described in UTS #39, the Unicode Security Mechanisms document, which also provides the confusables data file.
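The following sketch shows both halves of that argument: normalization leaves the two labels distinct, and a script check tells them apart. Python's `unicodedata` module does not expose the Unicode Script property, so the `scripts` helper below is a crude stand-in that reads the first word of each character's name; it is an assumption of this example, not the UTS #39 algorithm, and real code should use proper script and confusables data.

```python
import unicodedata

latin = "apple"
spoof = "\u0430pple"  # first letter is CYRILLIC SMALL LETTER A


def scripts(label: str) -> set[str]:
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one (guessed) script."""
    return len(scripts(label)) > 1


# Normalization does not unify the two labels.
for form in ("NFC", "NFKC"):
    same = unicodedata.normalize(form, latin) == unicodedata.normalize(form, spoof)
    print(form, "makes them equal:", same)

print(scripts(latin), looks_mixed_script(latin))
print(scripts(spoof), looks_mixed_script(spoof))
```

A mixed-script test like this would not catch a label written entirely in Cyrillic lookalikes, which is why confusable detection is a separate check.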
Modern browsers display an IDN in Punycode (the all-ASCII form starting with xn--) whenever the name fails their display policy, for example because a label mixes scripts or uses scripts the user's browser is not configured to expect. The malicious example above renders as xn--pple-43d.com in Chrome and Firefox, not as a clickable lookalike.
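The ASCII form quoted above can be derived with Python's built-in `punycode` codec. This sketch only shows the encoding of a single label; it says nothing about when a given browser chooses to display it, and the `to_ascii_label` helper is a hypothetical name that skips the validation a real IDNA implementation performs.

```python
def to_ascii_label(label: str) -> str:
    """Encode one non-ASCII label as 'xn--' + Punycode (no IDNA validation)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


spoof_label = "\u0430pple"  # Cyrillic а followed by Latin "pple"

print(to_ascii_label(spoof_label) + ".com")
print(to_ascii_label("apple") + ".com")

# The encoding is reversible, so the original label can be recovered.
print(b"pple-43d".decode("punycode") == spoof_label)
```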
Normalization is not a security feature by itself. It is the precondition for comparing strings safely. The actual rules for what is allowed in identifiers — scripts, confusables, mixed-script labels — live in UTS #39 and IDNA 2008, layered on top.
What to remember
If you take only one rule from this page: store text as NFC and compare it as NFC. If your system has identifiers (usernames, filenames, hostnames), apply NFKC plus the appropriate identifier profile from PRECIS or IDNA 2008. Run the normalizer on any string you are not sure about before storing it.
Further reading
- Unicode normalizer — paste a string and see all four forms side by side.
- Character inspector — see exactly which codepoints make up a string.
- Codepoint, character, glyph, grapheme — the conceptual basis for canonical equivalence.
- é U+00E9 — the precomposed letter from the running example.
- Bidirectional text & RTL — another layer that looks one way and behaves another.
- What is Unicode? — the standard the equivalence relations are defined in.