Every character page, block page, and lookup on this site is generated from the public Unicode Character Database (UCD) for version 16.0, the release published by the Unicode Consortium on September 10, 2024. The UCD files, including those in its emoji subdirectory, are downloaded directly from https://www.unicode.org/Public/16.0.0/ucd/; the emoji sequence files for the same release are published under https://www.unicode.org/Public/emoji/16.0/. Nothing on this site is hand-typed from a third-party source; if a fact appears here, it is traceable to one of the files below.

The UCD is a set of plain-text tables, most of them semicolon-delimited (NamesList.txt uses its own tab-indented layout). Each file covers one group of character properties. The site's generator parses these files and produces one HTML page per character, block, category, and script. When the Unicode Consortium releases a new version, the build is re-run and the changelog is updated.

Core character properties

UnicodeData.txt
UCD 16.0. The primary property file. One row per assigned codepoint, except that large uniform ranges such as CJK unified ideographs and Hangul syllables are given as a pair of First/Last rows. Each row carries the official name, general category, canonical combining class, bidi class, decomposition mapping, numeric value (where applicable), the mirrored flag, and simple case mappings (uppercase, lowercase, titlecase). This is the single most important file in the database.
Blocks.txt
Defines the 338 named blocks of Unicode 16.0 and their codepoint ranges. The source for the blocks index and every block detail page.
Scripts.txt
Assigns a script property to every codepoint (Latin, Greek, Devanagari, Hiragana, Common, Inherited, etc.). This is what powers the scripts section.
PropList.txt
Boolean character properties: White_Space, Bidi_Control, Join_Control, Dash, Hyphen, Quotation_Mark, Terminal_Punctuation, Hex_Digit, and many more. Used for the property tags on character pages.
DerivedAge.txt
The Unicode version in which each codepoint was first assigned. The source for the "Added in" field on character pages.
NamesList.txt
The formal names file with aliases, cross-references, and informative annotations. This is where the "see also" notes, the formal alias names, and the editorial comments under each codepoint in The Unicode Standard live.
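
As a rough illustration of the UnicodeData.txt row layout described above, the sketch below splits one row into named fields. It is not the site's generator: the `UnicodeDataRow` class and `parse_unicode_data_row` function are hypothetical names chosen for this example, and the First/Last range rows are not handled.

```python
from dataclasses import dataclass


@dataclass
class UnicodeDataRow:
    """A subset of the 15 semicolon-separated fields in a UnicodeData.txt row."""
    codepoint: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str   # empty string when there is no decomposition
    numeric_value: str   # empty string when not applicable
    mirrored: bool
    uppercase: str       # simple case mappings as hex strings, or ""
    lowercase: str
    titlecase: str


def parse_unicode_data_row(line: str) -> UnicodeDataRow:
    """Parse one ordinary row (hypothetical helper; range rows are not handled)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRow(
        codepoint=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
        decomposition=fields[5],
        numeric_value=fields[8],
        mirrored=fields[9] == "Y",
        uppercase=fields[12],
        lowercase=fields[13],
        titlecase=fields[14],
    )


# The row for U+0041 as it appears in UnicodeData.txt.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

row = parse_unicode_data_row(SAMPLE)
print(f"U+{row.codepoint:04X} {row.name} gc={row.general_category} lower={row.lowercase}")
```

Fields 6, 7, 10, and 11 (decimal value, digit value, the Unicode 1.0 name, and the ISO comment) are skipped here only to keep the example short.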

Emoji properties

emoji-data.txt
Unicode Emoji 16.0. Marks codepoints with the boolean properties Emoji, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component, and Extended_Pictographic. Drives the emoji category.
emoji-sequences.txt
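The catalogue of valid emoji sequences that do not use a zero-width joiner: keycap sequences, modifier sequences (skin-tone variants), flag sequences (regional indicator pairs), and tag sequences. ZWJ sequences (family, profession, gender) are listed in the companion file emoji-zwj-sequences.txt. Background reading: how emoji work.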
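
To make the two simplest sequence types concrete, here is a small sketch of how a flag sequence and a modifier sequence are assembled from codepoints. The helper names `flag_sequence`, `with_skin_tone`, and `codepoints` are made up for this example, and the code only builds the sequences; whether a given pair is a recommended flag is decided by the data file, not by this function.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(region_code: str) -> str:
    """Map a two-letter region code to a pair of regional indicator symbols."""
    code = region_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter region code such as 'FR'")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


def with_skin_tone(base: str, modifier: str) -> str:
    """A modifier sequence is the base emoji followed directly by the modifier."""
    return base + modifier


def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


WAVING_HAND = "\U0001F44B"       # an Emoji_Modifier_Base character
MEDIUM_SKIN_TONE = "\U0001F3FD"  # EMOJI MODIFIER FITZPATRICK TYPE-4

print(codepoints(flag_sequence("FR")))
print(codepoints(with_skin_tone(WAVING_HAND, MEDIUM_SKIN_TONE)))
```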

Normalization and case

DerivedNormalizationProps.txt
Quick-check tables for NFC, NFD, NFKC, and NFKD. Used by the normalizer tool and the worked examples in the normalization guide.
CaseFolding.txt
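The mappings used for case-insensitive matching. Full case folding (status F) handles non-trivial cases like German ß folding to "ss". The special mappings for the Turkic dotted and dotless i are listed separately with status T and apply only when a Turkic-aware comparison is requested.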
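
The behaviour these two files describe can be tried out with Python's standard library, as sketched below. Two caveats: Python ships its own copy of the UCD, whose version depends on the interpreter and may not be 16.0, and `caseless_equal` is a hypothetical helper that simplifies the caseless matching defined in the standard.

```python
import unicodedata

# The UCD version bundled with this interpreter; it may differ from 16.0.
print("UCD version in this Python:", unicodedata.unidata_version)

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Report whether each string is already in NFC (requires Python 3.8+).
print(unicodedata.is_normalized("NFC", decomposed))
print(unicodedata.is_normalized("NFC", composed))

# str.casefold() applies full case folding without the Turkic (status T) mappings.
print("Straße".casefold())
print("I".casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after case folding, then NFC (a simplified sketch)."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s.casefold())
    return fold(a) == fold(b)


print(caseless_equal("Straße", "STRASSE"))
```

Normalizing after folding matters because folding can leave canonically equivalent strings in different forms.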

Segmentation

LineBreak.txt
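Line break properties from UAX #14: Unicode Line Breaking Algorithm. Assigns each codepoint a line break class (for example alphabetic, opening punctuation, or non-breaking glue); the algorithm's rules then use those classes to decide where a line may break, where it must break, and where a break is prohibited.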
GraphemeBreakProperty.txt
From UAX #29: Unicode Text Segmentation. Used to split a string into user-perceived characters (grapheme clusters), including ZWJ sequences and combining-mark clusters.
WordBreakProperty.txt
Also from UAX #29. Used for word-boundary detection in search and selection.
SentenceBreakProperty.txt
The third UAX #29 table, for splitting text into sentences. Useful for read-aloud, paginators, and natural-language processing.
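
The segmentation files share a "range ; value # comment" layout, so a property lookup reduces to a search over sorted ranges. The sketch below shows one way to do that with a binary search. The names `parse_ranges` and `make_lookup` are hypothetical, the sample is a four-line excerpt in the style of GraphemeBreakProperty.txt rather than the whole file, and the code only looks up property values; it does not implement the UAX #29 break rules.

```python
import bisect

# Abridged excerpt in the layout of GraphemeBreakProperty.txt (illustrative only).
SAMPLE = """\
000A          ; LF # Cc       <control-000A>
000D          ; CR # Cc       <control-000D>
0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
200D          ; ZWJ # Cf       ZERO WIDTH JOINER
"""


def parse_ranges(text):
    """Return a sorted list of (start, end, value) tuples from range-format lines."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";", 1))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, value))
    ranges.sort()
    return ranges


def make_lookup(ranges, default="Other"):
    """Build a function mapping a codepoint to its property value."""
    starts = [start for start, _, _ in ranges]

    def lookup(codepoint):
        i = bisect.bisect_right(starts, codepoint) - 1
        if i >= 0 and ranges[i][0] <= codepoint <= ranges[i][1]:
            return ranges[i][2]
        return default  # codepoints not listed take the default value

    return lookup


grapheme_break = make_lookup(parse_ranges(SAMPLE))
for cp in (0x0041, 0x0301, 0x200D):
    print(f"U+{cp:04X} -> {grapheme_break(cp)}")
```

A sorted list with binary search keeps the table compact; a generator that needs every codepoint anyway could just as well expand the ranges into a dictionary.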

Refresh cadence

The data is refreshed quarterly and after every Unicode point release. The next major target is Unicode 17.0, expected September 2025; that build will add the new characters and update the property tables. Smaller refreshes in between pick up corrections to NamesList.txt and additions to the emoji catalogue. The changelog records every refresh, and the about page covers the editorial approach for any prose that is added alongside the data.

If you spot a number on this site that does not match the file it claims to come from, please report it. Data corrections take priority over everything else.