Unicode normalization explained

The same word can be spelled in Unicode in more than one way, and the user typing it has no idea which spelling their keyboard chose. Normalization is the standardised process of putting a string into a single canonical spelling, so that two strings the user thinks are the same actually compare as equal. The full specification is Unicode Standard Annex #15. The version that matters in code is the one-line decision: which form, NFC, NFD, NFKC, or NFKD?

Two kinds of equivalence

Unicode defines two relations on strings:

Canonical equivalence: Two strings represent the same abstract character. é as one codepoint (U+00E9) and é as two (U+0065 U+0301) are canonically equivalent; the standard says conforming software must not assume the two are interpreted differently.
Compatibility equivalence: Two strings represent the same character in a looser sense that may lose formatting distinctions. The superscript digit ² (U+00B2) is compatibility-equivalent to the digit 2 (U+0032). The full-width Latin letter Ａ (U+FF21) is compatibility-equivalent to A (U+0041). The decomposition discards visual distinctions deliberately.

The four normalization forms are the cross product of two choices: which equivalence to apply (canonical or compatibility) and whether the result is composed or decomposed:

Form	Equivalence	Result shape	Use case
NFD	Canonical	Decomposed	Per-character analysis, accent stripping.
NFC	Canonical	Composed	Storage, interchange, transmission. The W3C default.
NFKD	Compatibility	Decomposed	Search indexes, fuzzy match.
NFKC	Compatibility	Composed	Identifier comparison, login systems.
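
The difference between the two relations can be checked directly. The following is a minimal sketch using Python's standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # é as a single codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent strings still differ as raw codepoint sequences...
raw_equal = composed == decomposed

# ...but compare equal once both sides are put into the same form.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Compatibility equivalence is only collapsed by the K forms.
superscript_two = "\u00b2"
nfc_two = unicodedata.normalize("NFC", superscript_two)    # stays U+00B2
nfkc_two = unicodedata.normalize("NFKC", superscript_two)  # becomes "2"

print(raw_equal, nfc_equal, nfc_two == superscript_two, nfkc_two)
```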

The four forms on café

Start with the string café typed in the worst possible way: the e followed by a combining acute accent.

Input	Codepoints
café (as typed)	U+0063 U+0061 U+0066 U+0065 U+0301

Form	Codepoints out	Length
NFD	U+0063 U+0061 U+0066 U+0065 U+0301	5
NFC	U+0063 U+0061 U+0066 U+00E9	4
NFKD	U+0063 U+0061 U+0066 U+0065 U+0301	5
NFKC	U+0063 U+0061 U+0066 U+00E9	4
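
The table above can be reproduced with a few lines of Python. This is a sketch rather than a reference implementation; codepoints is a hypothetical helper introduced here for display.

```python
import unicodedata

def codepoints(s: str) -> str:
    """Render a string as space-separated U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# "café" typed as e + combining acute accent (5 codepoints).
typed = "cafe\u0301"

table = {form: unicodedata.normalize(form, typed)
         for form in ("NFD", "NFC", "NFKD", "NFKC")}

for form, out in table.items():
    print(f"{form}\t{codepoints(out)}\t{len(out)}")
```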

For text that contains no compatibility characters, as here, NFC and NFKC produce identical results, as do NFD and NFKD. The compatibility forms differ from the canonical forms only when a character has a compatibility decomposition. The classic examples follow.

The four forms on ﬃ

U+FB03 is the Latin small ligature ffi, a single codepoint for the historic typographic ligature. It has no canonical decomposition, so no multi-codepoint sequence is canonically equivalent to it. Its compatibility decomposition is three separate letters.

Form	Result
Input	ﬃ (U+FB03)
NFD	ﬃ (U+FB03) — unchanged
NFC	ﬃ (U+FB03) — unchanged
NFKD	ffi (U+0066 U+0066 U+0069)
NFKC	ffi (U+0066 U+0066 U+0069)

The compatibility forms expand the ligature into three separate letters, which is what you want if you are searching for office inside a document where someone has typed oﬃce. Domain names rely on the same mapping: IDNA 2003 applied NFKC through Nameprep, and IDNA 2008 disallows characters that change under NFKC, partly to prevent ligatures from being used as visual disguises for ASCII.
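
The search case might look like the sketch below. contains_compat is a hypothetical helper, not a library function, and it assumes both the document and the query can be held in memory and normalized in full.

```python
import unicodedata

def contains_compat(haystack: str, needle: str) -> bool:
    """Substring test after folding both sides with NFKC."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s)
    return fold(needle) in fold(haystack)

# The word "office" typed with the U+FB03 ligature.
document = "Visit the o\ufb03ce on Monday."

plain_hit = "office" in document                 # naive search misses it
compat_hit = contains_compat(document, "office") # NFKC search finds it

print(plain_hit, compat_hit)
```

Because NFKC changes string lengths, match offsets found in the folded text do not map directly back to the original; an index that needs highlighting would have to track that mapping separately.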

The four forms on ½

U+00BD VULGAR FRACTION ONE HALF behaves similarly. The canonical forms preserve it; the compatibility forms decompose it into digit, fraction slash, digit:

Form	Result
Input	½ (U+00BD)
NFD / NFC	½ (U+00BD) — unchanged
NFKD / NFKC	1⁄2 (U+0031 U+2044 U+0032)

Note that NFKC does not produce the ASCII string 1/2: the slash U+2044 is FRACTION SLASH, not the ASCII solidus U+002F. NFKC applies each character's compatibility decomposition and then recomposes canonically; it removes formatting distinctions but does not substitute one distinct character for another, so FRACTION SLASH stays as it is. (The compatibility decomposition is defined per character in the Unicode Character Database and is not user-tailorable.)
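
A short check of this behaviour, as a sketch. The final replace step is an application-level choice assumed here for illustration; it is not part of any normalization form.

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

nfc_half = unicodedata.normalize("NFC", half)    # unchanged
nfkc_half = unicodedata.normalize("NFKC", half)  # "1" + U+2044 + "2"

# Mapping FRACTION SLASH to the ASCII solidus is a separate, optional step.
ascii_half = nfkc_half.replace("\u2044", "/")

# Caution: the decomposition is purely per character, so "3½" becomes
# "3" followed by "1⁄2", which reads as thirty-one halves.
mixed_number = unicodedata.normalize("NFKC", "3\u00bd")

print(nfkc_half == "1/2", ascii_half)
```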

Other revealing decompositions

A small gallery of cases where NFKC differs from NFC:

Input	Codepoint	NFKC	Codepoints out
Ａ (full-width A)	U+FF21	A	U+0041
² (superscript 2)	U+00B2	2	U+0032
ｶ (half-width katakana KA)	U+FF76	カ	U+30AB
℡ (telephone sign)	U+2121	TEL	U+0054 U+0045 U+004C
㎏ (square kg)	U+338F	kg	U+006B U+0067
𝐀 (mathematical bold A)	U+1D400	A	U+0041

The mathematical alphanumerics in particular (the styled letters and digits in the block U+1D400 to U+1D7FF) decompose to the plain letter or digit under NFKC, which for the Latin styles is the ASCII character. This is why a user-name field that runs NFKC will see 𝐀𝐝𝐦𝐢𝐧 and Admin as identical.
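
The gallery and the user-name case can be verified with the sketch below; the dictionary simply restates the table rows as escape sequences.

```python
import unicodedata

# Input character -> expected NFKC result, taken from the table above.
gallery = {
    "\uff21": "A",           # full-width A
    "\u00b2": "2",           # superscript 2
    "\uff76": "\u30ab",      # half-width katakana KA -> full-width KA
    "\u2121": "TEL",         # telephone sign
    "\u338f": "kg",          # square kg
    "\U0001d400": "A",       # mathematical bold A
}

results = {src: unicodedata.normalize("NFKC", src) for src in gallery}

# "Admin" written entirely in mathematical bold letters.
styled_admin = "\U0001d400\U0001d41d\U0001d426\U0001d422\U0001d427"
plain_admin = unicodedata.normalize("NFKC", styled_admin)

print(results == gallery, plain_admin)
```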

When to normalize

Store and transmit: Use NFC. It is the form the W3C recommends for web content, and IETF profiles such as PRECIS use it for protocol strings. Browsers do not normalize HTML automatically; the W3C Character Model document recommends that authoring tools save content as NFC. macOS is notorious here: HFS+ stored filenames in a variant of NFD, which is the source of many cross-platform bugs. A file named café.txt created on macOS may not match the same name on Linux, where filenames are opaque bytes and most software produces NFC.
Compare and search: Use NFC at minimum on both sides. Use NFKC if you want compatibility-equivalent strings to match (full-width vs half-width, ligatures vs letters, styled vs unstyled).
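Login systems: Apply NFKC plus case-folding (Unicode's case-insensitive comparison, not ASCII tolower) when comparing identifiers. The standards take related but stricter routes: the PRECIS username profile (RFC 8265, built on the RFC 8264 framework) maps full-width and half-width characters, lowercases, applies NFC, and disallows characters that have compatibility decompositions, and IDNA 2008 likewise disallows characters that are unstable under NFKC and case-folding.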
Password fields: RFC 8265 (PRECIS OpaqueString) prescribes NFC and a specific allowed-character profile. Do not case-fold passwords.
Accent-insensitive search: Apply NFD, then strip combining marks (codepoints in the Mn category). "café" NFD-decomposed becomes "cafe" + combining acute, after which removing combining marks leaves "cafe".
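
The accent-insensitive recipe in the last item can be sketched as follows. strip_accents is a hypothetical helper; the final NFC pass is an addition assumed here so the result is in the recommended storage form.

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Decompose with NFD, drop nonspacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("caf\u00e9"), strip_accents("cafe\u0301"))
```

This removes every nonspacing mark, including ones that change meaning in some languages, so it suits search keys better than displayed text.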

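Most languages provide normalization in or near the standard library: JavaScript's String.prototype.normalize(form), Python's unicodedata.normalize(form, s), Java's java.text.Normalizer.normalize(s, form), and in Swift the Foundation string properties such as precomposedStringWithCanonicalMapping (NFC) and decomposedStringWithCanonicalMapping (NFD). In JavaScript and Python the form argument is the literal string "NFC", "NFD", "NFKC", or "NFKD"; Java takes the Normalizer.Form enum constant of the same name.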
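
As an illustration of the Python call, here is a sketch of a comparison key for identifiers that combines NFKC with Unicode case-folding. identifier_key is a hypothetical name, and the function approximates the idea only; it is not an implementation of PRECIS or IDNA 2008, which also restrict which characters are allowed.

```python
import unicodedata

def identifier_key(name: str) -> str:
    """NFKC, then Unicode case-folding, then NFKC again.

    The second pass is there because case-folding can produce
    sequences that are no longer normalized.
    """
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

# "Admin" in mathematical bold, and with a full-width first letter.
bold_admin = "\U0001d400\U0001d41d\U0001d426\U0001d422\U0001d427"
wide_admin = "\uff21dmin"

print(identifier_key(bold_admin), identifier_key(wide_admin))
```

The key is meant for comparison and uniqueness checks; the name as the user typed it can still be stored separately for display.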

The IDN homograph attack

The most cited security consequence of skipping normalization is the internationalised domain name homograph attack. Consider these two strings:

"apple.com"   ASCII Latin letters    U+0061 U+0070 ...
"аpple.com"   first letter is Cyrillic а  (U+0430 U+0070 ...)

The Cyrillic small letter A (U+0430) is visually indistinguishable from the Latin small letter A (U+0061) in most fonts, but they are distinct codepoints. Neither NFC nor NFKC turns one into the other, because neither character decomposes into the other. Defences are therefore layered on top of normalization: registries and browsers apply mixed-script checks that reject or flag a label combining characters from several scripts, and confusable detection catches labels that look like another name. UTS #39, the Unicode Security Mechanisms document, defines these checks and provides the confusables data file.
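
The sketch below shows that normalization leaves the two labels distinct and then applies a deliberately crude mixed-script test. Python's unicodedata module does not expose the Script property, so script_guess (a hypothetical helper) uses the first word of the character name as a stand-in. That is an assumption made for illustration; a real check would use the Unicode script data and the UTS #39 rules.

```python
import unicodedata

def script_guess(ch: str) -> str:
    """Heuristic only: first word of the character name, e.g. 'LATIN'."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in_label(label: str) -> set:
    """Set of guessed scripts for the letters in a label."""
    return {script_guess(ch) for ch in label if ch.isalpha()}

latin_label = "apple"
spoof_label = "\u0430pple"   # first letter is CYRILLIC SMALL LETTER A

# Normalization does not unify the two labels.
still_different = unicodedata.normalize("NFKC", spoof_label) != latin_label

print(still_different, scripts_in_label(latin_label), scripts_in_label(spoof_label))
```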

Modern browsers fall back to displaying an IDN in Punycode (the all-ASCII form starting with xn--) when a label fails their display policy, for example when it mixes scripts in a suspicious way. The malicious example above appears as xn--pple-43d.com in Chrome and Firefox, not as a convincing lookalike.
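
The Punycode form of the spoofed label can be derived with Python's built-in punycode codec, as in this sketch. It encodes a single label only and skips the validation steps a real IDNA implementation performs.

```python
# The label with a Cyrillic first letter (U+0430), without the ".com" suffix.
spoof_label = "\u0430pple"

# Punycode-encode the label and add the ACE prefix used for IDN labels.
ace_label = "xn--" + spoof_label.encode("punycode").decode("ascii")

print(ace_label + ".com")
```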

Normalization is not a security feature by itself. It is the precondition for comparing strings safely. The actual rules for what is allowed in identifiers — scripts, confusables, mixed-script labels — live in UTS #39 and IDNA 2008, layered on top.

What to remember

If you take only one rule from this page: store text as NFC and compare it as NFC. If your system has identifiers (usernames, filenames, hostnames), apply NFKC plus the appropriate identifier profile from PRECIS or IDNA 2008. Run the normalizer on any string you are not sure about before storing it.
