Unicode normalizer

Run any string through all four Unicode normalization forms at once. See which ones change the input, and by how much.

How it works

The four normalization forms are defined by Unicode Standard Annex #15. Each form maps every Unicode string to one well-defined representation, so that strings that look identical but are built from different codepoint sequences compare as equal once normalized. The tool runs the input through each form using the platform's String.prototype.normalize method, then compares each result to the input and reports the differences.
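A rough sketch of that pipeline is shown here. The function name `normalizeAll` and the shape of its report are assumptions made for illustration, not the tool's actual source; the only platform call involved is `String.prototype.normalize`.

```javascript
// Illustrative sketch; `normalizeAll` and its report shape are hypothetical.
const NORMALIZATION_FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

function normalizeAll(input) {
  // Array.from iterates by code point, so astral characters count once.
  const inputLength = Array.from(input).length;
  return NORMALIZATION_FORMS.map((form) => {
    const output = input.normalize(form);
    return {
      form,
      output,
      changed: output !== input,
      codePointDelta: Array.from(output).length - inputLength,
    };
  });
}

// The fi ligature (U+FB01) is only touched by the compatibility forms.
for (const row of normalizeAll("\uFB01")) {
  console.log(row.form, row.changed, row.codePointDelta);
}
// NFC false 0
// NFD false 0
// NFKC true 1
// NFKD true 1
```

Counting code points rather than `length` keeps the reported difference independent of UTF-16 surrogate pairs.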

NFD — Canonical Decomposition. Every precomposed character is broken into a base character plus combining marks, in a canonical order. é (U+00E9) becomes e + U+0301 COMBINING ACUTE ACCENT.
NFC — Canonical Composition. The decomposed form is then recomposed using the canonical composition rules, restoring single codepoints where they exist. This is the form most systems prefer because it's shorter and matches user intuition: é stays as é.
NFKD — Compatibility Decomposition. Like NFD but additionally decomposes characters that have a compatibility mapping — visually-distinct characters that share an underlying identity. The ligature ﬁ (U+FB01) becomes f + i. Superscripts and subscripts lose their formatting. Half-width and full-width forms collapse to their normal-width counterparts.
NFKC — Compatibility Composition. Apply NFKD, then recompose. This is the form to use for identifier matching, search keys, and other contexts where you want ﬁ and fi to compare equal but you don't want unnecessary decomposition.
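The four definitions above can be checked against the two characters they mention. This is a small sketch; `codePoints` and `describeForms` are hypothetical helper names introduced only for the example.

```javascript
// Hypothetical helper: list a string's code points as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Hypothetical helper: code points of `text` under each normalization form.
function describeForms(text) {
  const result = {};
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
    result[form] = codePoints(text.normalize(form)).join(" ");
  }
  return result;
}

console.log(describeForms("\u00E9"));
// NFC and NFKC: "U+00E9"; NFD and NFKD: "U+0065 U+0301"

console.log(describeForms("\uFB01"));
// NFC and NFD: "U+FB01"; NFKC and NFKD: "U+0066 U+0069"
```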

Two principles to remember: canonical normalization (NFC, NFD) preserves appearance and meaning, and the two canonical forms convert into each other without loss; compatibility normalization (NFKC, NFKD) additionally collapses formatting-only distinctions, and that information cannot be recovered afterwards. As a default, use NFC for storage and NFKC for matching.
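The contrast between the two families can be seen in a few lines. The sample strings are chosen for illustration only.

```javascript
// Canonical forms: NFC and NFD convert into each other.
const composed = "caf\u00E9";    // é as one code point
const decomposed = "cafe\u0301"; // e + COMBINING ACUTE ACCENT

const nfdOfComposed = composed.normalize("NFD");
const nfcOfDecomposed = decomposed.normalize("NFC");
console.log(nfdOfComposed === decomposed); // true
console.log(nfcOfDecomposed === composed); // true

// Compatibility forms: the superscript in x² is folded to a plain digit,
// and no normalization form brings it back.
const squared = "x\u00B2";
const folded = squared.normalize("NFKC");
console.log(folded);                               // x2
console.log(folded.normalize("NFC") === squared);  // false
```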

When to normalize

Three situations almost always call for normalization:

Comparison and deduplication. If you're checking whether two strings are "the same" — for example, in a username uniqueness check — apply NFC (or NFKC, depending on policy) to both before comparing. Otherwise a user can register both café (precomposed) and café (decomposed) as distinct accounts.
Search. Index documents in a single normalization form and normalize queries the same way. NFKC is common for search because it bridges visual variants like ligatures.
Storage. Databases generally don't care, but downstream consumers do. Apple's HFS+ filesystem silently stored filenames in a decomposed form (a variant of NFD), which broke a generation of tools that assumed NFC. APFS no longer normalizes, but the legacy behaviour taught a lot of people this lesson the hard way.
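The comparison and search cases above might look like this in practice. This is a minimal sketch under stated assumptions: the helper names (`identityKey`, `searchKey`, `registerUsername`, `indexDocument`, `search`) are hypothetical, the policy of NFC for identity and NFKC for search is one reasonable choice rather than a rule, and the in-memory `Set` and `Map` stand in for a real database or index.

```javascript
// Hypothetical policy helpers: NFC for identity keys, NFKC for search keys.
const identityKey = (name) => name.normalize("NFC");
const searchKey = (text) => text.normalize("NFKC");

// Username uniqueness: store and compare the normalized key.
const takenUsernames = new Set();
function registerUsername(name) {
  const key = identityKey(name);
  if (takenUsernames.has(key)) return false; // an equivalent name is taken
  takenUsernames.add(key);
  return true;
}

// Search: normalize documents at index time and queries at lookup time.
const wordIndex = new Map(); // normalized word -> Set of document ids
function indexDocument(id, text) {
  for (const word of searchKey(text).split(/\s+/).filter(Boolean)) {
    if (!wordIndex.has(word)) wordIndex.set(word, new Set());
    wordIndex.get(word).add(id);
  }
}
function search(query) {
  return [...(wordIndex.get(searchKey(query)) ?? [])];
}

const firstAttempt = registerUsername("caf\u00E9");   // precomposed é
const secondAttempt = registerUsername("cafe\u0301"); // decomposed é
console.log(firstAttempt, secondAttempt); // true false

indexDocument("doc-1", "\uFB01ve apples"); // first word uses the U+FB01 ligature
console.log(search("five").length); // 1
```

A real system would usually add case folding and other policy on top; the sketch isolates the normalization step only.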

Don't normalize blindly. NFKC will turn the mathematical italic letter 𝑎 (U+1D44E) into a plain ASCII a — that's the right behaviour for identifier matching, but the wrong behaviour for math typesetting. Know which property you care about.
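One way to keep that choice explicit is to name the purpose at the call site. The `normalizeFor` function and its purpose table below are hypothetical, shown only to illustrate the idea.

```javascript
// Hypothetical policy table: which form to apply for which purpose.
const FORM_FOR_PURPOSE = new Map([
  ["storage", "NFC"],
  ["matching", "NFKC"],
]);

function normalizeFor(purpose, text) {
  const form = FORM_FOR_PURPOSE.get(purpose);
  if (form === undefined) throw new Error(`unknown purpose: ${purpose}`);
  return text.normalize(form);
}

const mathItalicA = "\u{1D44E}"; // MATHEMATICAL ITALIC SMALL A

// NFC keeps the mathematical letter intact; NFKC folds it to ASCII.
console.log(normalizeFor("storage", mathItalicA) === mathItalicA); // true
console.log(normalizeFor("matching", mathItalicA));                // a
```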

Worked example

The input cafe + U+0301 — ﬁve — ① contains:

cafe + U+0301 — a base letter plus a combining acute accent. NFC composes it to café (4 codepoints instead of 5). NFD and NFKD leave it as-is. NFKC composes it to café.
ﬁ (U+FB01) — the fi ligature. NFC and NFD leave it alone. NFKC and NFKD decompose it to f + i (2 codepoints instead of 1).
① (U+2460) — circled digit one. NFC and NFD preserve it. NFKC and NFKD strip the circle, giving plain 1.
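The breakdown above can be reproduced by asking, for each piece, which forms alter it. `formsThatChange` is a hypothetical helper written for this example.

```javascript
// Hypothetical helper: names of the forms that alter `piece`.
function formsThatChange(piece) {
  return ["NFC", "NFD", "NFKC", "NFKD"].filter(
    (form) => piece.normalize(form) !== piece
  );
}

const pieces = {
  "cafe + U+0301": "cafe\u0301",
  "fi ligature": "\uFB01",
  "circled one": "\u2460",
};

for (const [label, piece] of Object.entries(pieces)) {
  console.log(label + ":", formsThatChange(piece).join(", "));
}
// cafe + U+0301: NFC, NFKC
// fi ligature: NFKC, NFKD
// circled one: NFKC, NFKD
```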

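That's a single string whose three parts each respond differently: the first shrinks under composition, the second grows under compatibility decomposition, and the third is swapped for a plain character with the same codepoint count. The byte-length display under each pane makes the size changes concrete: the decomposed forms (NFD and NFKD) are usually longer because they use more codepoints, while the compatibility forms (NFKC and NFKD) discard formatting and can come out either longer or shorter depending on the input.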
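The sizes for the example input can be computed as follows. This sketch assumes the three pieces are joined by an em dash (U+2014) with a space on either side, as written above, and that sizes are measured in UTF-8; the tool's own display may measure differently. `sizes` is a hypothetical helper.

```javascript
const encoder = new TextEncoder(); // global in browsers and in Node.js

// Hypothetical helper: code point count and UTF-8 byte length of a string.
function sizes(text) {
  return {
    codePoints: Array.from(text).length,
    utf8Bytes: encoder.encode(text).length,
  };
}

const input = "cafe\u0301 \u2014 \uFB01ve \u2014 \u2460";
const sizeReport = { input: sizes(input) };
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  sizeReport[form] = sizes(input.normalize(form));
}
console.log(sizeReport);
// input: 15 code points, 24 bytes
// NFC:   14 code points, 23 bytes
// NFD:   15 code points, 24 bytes
// NFKC:  15 code points, 20 bytes
// NFKD:  16 code points, 21 bytes
```

NFKC shows why both measures matter: it has the same codepoint count as the input, yet it is four bytes shorter, because the ligature and the circled digit give way to ASCII characters.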

Related

Unicode normalization explained — the long-form guide
Character inspector — see which codepoints actually exist in a string
Codepoint converter
UTF-8 encoder
Codepoint, character, glyph, grapheme
What is Unicode?
Latin-1 Supplement — home of the most-normalized accented letters
General Punctuation
