Every character page, block page, and lookup on this site is generated from the public Unicode Character Database (UCD) for version 16.0, the release published by the Unicode Consortium on September 10, 2024. All files are downloaded directly from https://www.unicode.org/Public/16.0.0/ucd/ and the emoji subdirectory at the same versioned root. Nothing on this site is hand-typed from a third-party source; if a fact appears here, it is traceable to one of the files below.
The UCD is largely a set of plain-text, semicolon-delimited tables (NamesList.txt, which has its own layout, is the main exception). Each file covers one slice of the properties defined for each codepoint. The site's generator parses these files and produces one HTML page per character, block, category, and script. When the Unicode Consortium releases a new version, the build is re-run and the changelog is updated.
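As a rough illustration of the semicolon-delimited format, the sketch below reads lines in the general style of the UCD property files: a "#" starts a comment, fields are separated by semicolons, and the first field is either a single codepoint or a start..end range. The function name parse_ucd_lines and the two sample lines are assumptions made for this example; this is not the site's actual generator code.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, fields) for each data line of a UCD-style file."""
    for raw in lines:
        # Everything after '#' is a comment; blank lines carry no data.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        # The first field is one codepoint ("00AA") or a range ("0041..005A").
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end, fields[1:]


# Two lines in the style of Scripts.txt (illustrative sample, not a full file).
SAMPLE = """\
# Sample lines in the style of Scripts.txt
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""

rows = list(parse_ucd_lines(SAMPLE.splitlines()))
for start, end, fields in rows:
    print(f"U+{start:04X}..U+{end:04X} -> {fields}")
```

Keeping ranges as (first, last) pairs rather than expanding them avoids materialising very large ranges one codepoint at a time.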
Core character properties
- UnicodeData.txt
- UCD 16.0. The primary property file. One row per assigned codepoint (large ranges such as the CJK unified ideographs and Hangul syllables are given as a first/last pair of rows) with the official name, general category, canonical combining class, bidi class, decomposition mapping, numeric value (where applicable), the mirrored flag, and case mappings (uppercase, lowercase, titlecase). This is the single most important file in the database.
- Blocks.txt
- Defines the 338 named blocks of Unicode 16.0 and their codepoint ranges. The source for the blocks index and every block detail page.
- Scripts.txt
- Assigns a script property to every codepoint (Latin, Greek, Devanagari, Hiragana, Common, Inherited, etc.). This is what powers the scripts section.
- PropList.txt
- Boolean character properties: White_Space, Bidi_Control, Join_Control, Dash, Hyphen, Quotation_Mark, Terminal_Punctuation, Hex_Digit, and many more. Used for the property tags on character pages.
- DerivedAge.txt
- The Unicode version in which each codepoint was first assigned. The source for the "Added in" field on character pages.
- NamesList.txt
- The formal names file with aliases, cross-references, and informative annotations. This is where the "see also" notes, the formal alias names, and the editorial comments under each codepoint in The Unicode Standard live.
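To make the UnicodeData.txt field list above concrete, here is a minimal sketch that splits one row into named fields. The row has 15 semicolon-separated fields; the class name CharRecord and the function name parse_unicode_data_row are hypothetical names chosen for this example, and only the fields mentioned above are kept.

```python
from typing import NamedTuple, Optional


class CharRecord(NamedTuple):
    codepoint: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    numeric_value: str
    mirrored: bool
    uppercase: Optional[int]
    lowercase: Optional[int]
    titlecase: Optional[int]


def _hex_or_none(field: str) -> Optional[int]:
    """Case-mapping fields are a hex codepoint or empty."""
    return int(field, 16) if field else None


def parse_unicode_data_row(row: str) -> CharRecord:
    f = row.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return CharRecord(
        codepoint=int(f[0], 16),
        name=f[1],
        general_category=f[2],
        combining_class=int(f[3]),
        bidi_class=f[4],
        decomposition=f[5],
        numeric_value=f[8],       # fields 6 and 7 (decimal, digit) are skipped here
        mirrored=(f[9] == "Y"),
        uppercase=_hex_or_none(f[12]),  # fields 10 and 11 are legacy and skipped
        lowercase=_hex_or_none(f[13]),
        titlecase=_hex_or_none(f[14]),
    )


# The row for U+0041 as it appears in UnicodeData.txt.
record = parse_unicode_data_row("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record.name, record.general_category, f"U+{record.lowercase:04X}")
# prints: LATIN CAPITAL LETTER A Lu U+0061
```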
Emoji properties
- emoji-data.txt
- Unicode Emoji 16.0. Marks codepoints with the boolean properties Emoji, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component, and Extended_Pictographic. Drives the emoji category.
- emoji-sequences.txt
- The catalogue of valid emoji keycap sequences, modifier sequences (skin-tone variants), flag sequences (regional indicator pairs), and tag sequences. ZWJ sequences (family, profession, gender) are catalogued in the companion file emoji-zwj-sequences.txt. Background reading: how emoji work.
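The sequence kinds mentioned above are ordinary strings of codepoints. The sketch below builds one flag sequence, one modifier sequence, and one ZWJ sequence by hand. The helper names flag_sequence and codepoints are assumptions for this example. Note that the helper accepts any two letters, whereas only the pairs listed in the emoji data files are recommended flags.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
ZWJ = "\u200D"                  # ZERO WIDTH JOINER


def flag_sequence(region_code: str) -> str:
    """Map a two-letter region code such as 'JP' to a regional indicator pair."""
    code = region_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


def codepoints(s: str) -> str:
    """Render a string as space-separated hex codepoints."""
    return " ".join(f"{ord(c):04X}" for c in s)


flag = flag_sequence("JP")
# Modifier sequence: WAVING HAND SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4.
waving = "\U0001F44B" + "\U0001F3FD"
# ZWJ sequence: MAN, WOMAN, GIRL joined by ZERO WIDTH JOINER.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467"])

print(codepoints(flag))    # 1F1EF 1F1F5
print(codepoints(waving))  # 1F44B 1F3FD
print(codepoints(family))  # 1F468 200D 1F469 200D 1F467
```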
Normalization and case
- DerivedNormalizationProps.txt
- Quick-check tables for NFC, NFD, NFKC, and NFKD. Used by the normalizer tool and the worked examples in the normalization guide.
- CaseFolding.txt
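- The mappings used for case-insensitive matching. Full case folding (status F) handles mappings that change length, such as German ß folding to "ss". The Turkic dotted/dotless i mappings are listed separately (status T) and apply only when matching is tailored for those languages.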
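For readers who want to try normalization and case folding without the raw tables, the sketch below uses Python's standard library. One caveat: the unicodedata module uses the Unicode version bundled with the interpreter, which may not be 16.0, so this illustrates the concepts rather than the exact data behind this site. The helper name caseless_equal is an assumption for this example, and it skips the extra normalization step that strict canonical caseless matching calls for.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# NFC composes, NFD decomposes; both are canonical forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# NFKC also applies compatibility mappings, e.g. the "fi" ligature U+FB01.
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")           # True

# is_normalized (Python 3.8+) checks whether a string is already in a form.
print(unicodedata.is_normalized("NFC", decomposed))              # False


def caseless_equal(a: str, b: str) -> bool:
    """Compare with str.casefold(), which applies full case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal("Straße", "STRASSE"))  # True
# The default folding is not Turkic-tailored: 'I' folds to dotted 'i'.
print("I".casefold() == "i")                # True
```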
Segmentation
- LineBreak.txt
- Line break properties from UAX #14: Unicode Line Breaking Algorithm. Assigns each codepoint a line break class, from which a renderer works out whether it can begin a line, end a line, must be kept with the next character, and so on.
- GraphemeBreakProperty.txt
- From UAX #29: Unicode Text Segmentation. Used to split a string into user-perceived characters (grapheme clusters), including ZWJ sequences and combining-mark clusters.
- WordBreakProperty.txt
- Also from UAX #29. Used for word-boundary detection in search and selection.
- SentenceBreakProperty.txt
- The third UAX #29 table, for splitting text into sentences. Useful for read-aloud, paginators, and natural-language processing.
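To show why grapheme clusters differ from codepoints, here is a deliberately simplified grouping sketch. It is not the UAX #29 algorithm and does not read GraphemeBreakProperty.txt: it only attaches combining marks and ZWJ-joined characters to the preceding cluster, and it ignores Hangul syllables, regional indicator pairs, emoji modifiers, CR LF, and the other rules. The function name rough_clusters is an assumption for this example.

```python
import unicodedata
from typing import List

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def rough_clusters(text: str) -> List[str]:
    """Very rough grapheme grouping; an approximation, NOT full UAX #29."""
    clusters: List[str] = []
    for ch in text:
        attach = bool(clusters) and (
            # Combining marks (nonspacing, enclosing, spacing) extend a cluster.
            unicodedata.category(ch) in ("Mn", "Me", "Mc")
            # A ZWJ sticks to what precedes it ...
            or ch == ZWJ
            # ... and, in this sketch, glues on whatever follows it.
            or clusters[-1].endswith(ZWJ)
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


accented = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467"])  # man, woman, girl

print(len(accented), len(rough_clusters(accented)))  # 3 2
print(len(family), len(rough_clusters(family)))      # 5 1
```

A production implementation would drive these decisions from the property tables and the full rule set in UAX #29 rather than from general categories.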
Refresh cadence
The data is refreshed quarterly and after every Unicode point release. The next major target is Unicode 17.0, expected September 2025; that build will add the new characters and update the property tables. Smaller refreshes in between pick up corrections in NamesList.txt and additions to the emoji catalogue. The changelog records every refresh, and the about page covers the editorial approach for any prose that is added alongside the data.
If you spot a number on this site that does not match the file it claims to come from, please report it. Data corrections take priority over everything else.