HTML character references are pieces of source text, written with ampersands and semicolons, that the browser decodes into the character they stand for before rendering. They were essential in 1995, when documents arrived as ASCII or Latin-1 and there was no way to write U+2014 EM DASH except through an escape. Today, with UTF-8 the default and <meta charset="utf-8"> conventional, character references are mostly a fallback. There are exactly four characters you still must escape, and a small number of contexts where escapes are convenient. The rest is history.

The three forms

HTML defines three syntactic forms for a character reference:

Named
An ampersand, a name, and a semicolon. HTML5 defines 2,231 named references. Example: &euro; for €.
Decimal numeric
An ampersand, a hash, decimal digits of the codepoint, and a semicolon. Example: &#8364; for €.
Hexadecimal numeric
An ampersand, hash-x, hex digits of the codepoint, and a semicolon. Example: &#x20AC; for €.

All three produce the same parsed character in any modern browser. The leading & begins a reference; the ; ends it. In the numeric forms, leading zeros are permitted (&#x0020AC; works); in the named form, the name is case-sensitive (&Theta; is Θ, &theta; is θ).
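As a quick check of the three forms, here is a minimal Python sketch. It assumes the standard library's html.unescape is an acceptable stand-in for a browser's decoder; the variable names are illustrative only.

```python
import html

# The three spellings of the euro sign from the list above, plus the
# hexadecimal form with leading zeros.
forms = ["&euro;", "&#8364;", "&#x20AC;", "&#x0020AC;"]

decoded = [html.unescape(form) for form in forms]

# Every spelling decodes to the same character, U+20AC.
all_same = all(ch == "\u20ac" for ch in decoded)

# Named references are case-sensitive: these are two different letters.
theta_pair = (html.unescape("&Theta;"), html.unescape("&theta;"))

print(decoded, all_same, theta_pair)
```

The numeric forms are plain arithmetic on the code point, which is why leading zeros change nothing.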

The four you actually still need

If your document is served as UTF-8 with the correct content-type or meta charset, you can write the character directly for almost everything. The exceptions are the four characters that have a syntactic role in HTML and would confuse the parser if left literal:

Character  Why escape                         Named   Decimal  Hex
&          Starts a character reference       &amp;   &#38;    &#x26;
<          Starts a tag                       &lt;    &#60;    &#x3C;
>          Ambiguous in some legacy contexts  &gt;    &#62;    &#x3E;
"          Closes attribute values            &quot;  &#34;    &#x22;

The single quote ' is also worth escaping (as &#39;) inside attributes delimited with single quotes. The named form &apos; exists in HTML5 but did not exist in HTML 4 and was unsafe in older browsers; the numeric form is universally supported.

Everything else — accented letters, currency signs, em dashes, emoji — can be written directly in UTF-8 source. The browser parses your literal € exactly as it would parse &euro;. The choice between them is editorial, not technical.
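The escaping described above can be sketched with Python's html.escape. This is one library's take on the rule, not the only correct one, and the sample string is made up for illustration.

```python
import html

raw = 'Tom & Jerry say "1 < 2" and it\'s true'

# quote=True (the default) escapes & < > " and also the single quote.
for_attribute = html.escape(raw)

# quote=False escapes only & < >, which is enough for element content.
for_element = html.escape(raw, quote=False)

# Characters with no syntactic role pass through untouched.
untouched = html.escape("€ é ©")

print(for_attribute)
print(for_element)
print(untouched)
```

Note that html.escape writes the single quote as &#x27; rather than &#39;. Both are numeric references to U+0027, so they decode identically.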

The 30 most useful named entities

Glyph  Named     Codepoint  What it is
&      &amp;     U+0026     Ampersand
<      &lt;      U+003C     Less-than
>      &gt;      U+003E     Greater-than
"      &quot;    U+0022     Quotation mark
'      &apos;    U+0027     Apostrophe
       &nbsp;    U+00A0     Non-breaking space
©      &copy;    U+00A9     Copyright sign
®      &reg;     U+00AE     Registered sign
™      &trade;   U+2122     Trade mark sign
€      &euro;    U+20AC     Euro sign
£      &pound;   U+00A3     Pound sign
¥      &yen;     U+00A5     Yen sign
°      &deg;     U+00B0     Degree sign
±      &plusmn;  U+00B1     Plus-minus sign
×      &times;   U+00D7     Multiplication sign
÷      &divide;  U+00F7     Division sign
—      &mdash;   U+2014     Em dash
–      &ndash;   U+2013     En dash
…      &hellip;  U+2026     Horizontal ellipsis
“      &ldquo;   U+201C     Left double quotation mark
”      &rdquo;   U+201D     Right double quotation mark
‘      &lsquo;   U+2018     Left single quotation mark
’      &rsquo;   U+2019     Right single quotation mark
«      &laquo;   U+00AB     Left guillemet
»      &raquo;   U+00BB     Right guillemet
§      &sect;    U+00A7     Section sign
¶      &para;    U+00B6     Pilcrow
•      &bull;    U+2022     Bullet
←      &larr;    U+2190     Leftwards arrow
→      &rarr;    U+2192     Rightwards arrow

The full HTML5 named-entity list is maintained at html.spec.whatwg.org/entities.json and contains exactly 2,231 entries. Many are obscure mathematical or technical symbols; in practice almost everyone uses fewer than thirty.
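If you would rather inspect the table than take the count on trust, Python ships a copy in html.entities.html5. The sketch below assumes that copy mirrors the spec's entities.json; it prints the size instead of asserting a figure.

```python
from html.entities import html5

# html5 maps each name (without the leading "&") to its character(s).
# Names the spec also accepts without a semicolon appear twice:
# once with ";" and once without.
total = len(html5)
legacy_without_semicolon = sorted(n for n in html5 if not n.endswith(";"))

# Look up a few names from the table above.
sample = {name: html5[name] for name in ("euro;", "mdash;", "nbsp;", "rarr;")}

print(total, len(legacy_without_semicolon))
print(sample)
```

The legacy names without a semicolon are the reason a string such as &amp can still decode in HTML, even though the semicolon should always be written.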

HTML, XML, XHTML — the differences

The three syntaxes treat named entities differently and the differences matter when files cross between contexts.

HTML5
2,231 named entities are recognised. The DOCTYPE is informational, not a DTD reference. Named entities are part of the parser's hard-coded table.
XML 1.0
Only five named entities are predefined: &amp;, &lt;, &gt;, &quot;, &apos;. Any additional names must be declared in a DTD using <!ENTITY> declarations, otherwise the XML parser raises a well-formedness error.
XHTML 1.0
An XML application that imports the HTML 4 entity sets by referencing the public DTD (-//W3C//DTD XHTML 1.0 Strict//EN). An XHTML file without the DOCTYPE declaration cannot use &nbsp; or any other HTML name without a parse error.
SVG (in HTML)
Parsed by the HTML parser; the full HTML named-entity set is available.
SVG (standalone, served as image/svg+xml)
Parsed by the XML parser; only the five XML names are available. &nbsp; in a standalone SVG will break the file.

Inside an HTML document:
  <p>Hello &nbsp; world</p>          ← fine

Inside a standalone SVG served as image/svg+xml:
  <text>Hello &nbsp; world</text>    ← XML parse error
  <text>Hello &#xA0; world</text>    ← works (numeric reference)

When portability across HTML and XML matters — typically for SVG, RSS, Atom, and JSON-embedded XML — prefer the numeric forms. They are valid in every XML application without DTD declarations.
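The difference is easy to reproduce with any XML parser. Here is a small sketch using Python's xml.etree.ElementTree, with no DTD involved; parses_as_xml is a hypothetical helper name, not a library function.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(source: str) -> bool:
    """Return True if ElementTree accepts source as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

named = "<text>Hello &nbsp; world</text>"      # HTML-only name
numeric = "<text>Hello &#xA0; world</text>"    # numeric reference
predefined = "<text>Tom &amp; Jerry</text>"    # one of XML's five names

print(parses_as_xml(named), parses_as_xml(numeric), parses_as_xml(predefined))
```

The first string is rejected because the parser has no declaration for nbsp; the other two are accepted.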

Attribute contexts and the encoding rules

The OWASP cross-site scripting cheat sheet treats HTML escaping as a context-sensitive operation. The character to escape depends on where the value will appear:

HTML element content
Escape & < >. (" and ' are not required here but harmless.)
HTML double-quoted attribute
Escape & and ".
HTML single-quoted attribute
Escape & and '.
HTML unquoted attribute
Escape &, ", ', space, tab, newline, =, <, >, backtick. Or, more practically, always quote your attributes.
URL
Use percent-encoding (RFC 3986), not HTML entities. See the URL encoder.
JavaScript string
Use JavaScript string escapes (\xHH, \uHHHH, \u{HHHHHH}), not HTML entities. The HTML parser does not decode character references inside <script>.
CSS value
Use CSS escapes (\HHHHHH with hex digits and an optional trailing space).

The general rule: HTML entities are for HTML. URLs need percent-encoding, JavaScript needs \u escapes, CSS needs \ escapes. Mixing them is the source of an embarrassing share of XSS bugs.
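To make the one-encoder-per-context rule concrete, here is a hedged Python sketch with one function per context. The function names are hypothetical. The JavaScript and CSS encoders are deliberately over-cautious, escaping everything except ASCII letters and digits, and none of this replaces a vetted templating library.

```python
import html
from urllib.parse import quote

def for_element(text: str) -> str:
    # HTML element content: escape & < > only.
    return html.escape(text, quote=False)

def for_attribute(text: str) -> str:
    # Quoted HTML attribute: escape & < > and both quote characters.
    return html.escape(text, quote=True)

def for_url_component(text: str) -> str:
    # URL component: percent-encode everything outside the unreserved set.
    return quote(text, safe="")

def _is_plain(ch: str) -> bool:
    return ch.isascii() and ch.isalnum()

def for_js_string(text: str) -> str:
    # JavaScript string literal: \uHHHH, or \u{...} above U+FFFF.
    out = []
    for ch in text:
        cp = ord(ch)
        if _is_plain(ch):
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\u{{{cp:X}}}")
    return "".join(out)

def for_css_string(text: str) -> str:
    # CSS value: backslash, hex digits, and a terminating space.
    return "".join(ch if _is_plain(ch) else f"\\{ord(ch):X} " for ch in text)

value = '</script> "x" & y'
print(for_element(value))
print(for_attribute(value))
print(for_url_component(value))
print(for_js_string(value))
print(for_css_string(value))
```

The same input produces five different outputs, which is the point: an encoder chosen for one context gives no protection in another.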

Numeric vs named — which to use

Named entities are easier to read and harder to mistype: &mdash; reads better in source than &#8212;. They can be a few bytes longer than the numeric or literal form, but every reasonable compression algorithm collapses that difference. The arguments against named entities are:

  • They are not portable to XML without DTD declarations.
  • The named-entity table grew gradually and some names predate Unicode (the &OElig; name for Œ, for instance, is older than the official Unicode name).
  • A few have unexpected meanings — &Theta; is the Greek capital theta (U+0398), not the math symbol.

For ordinary editorial use inside an HTML5 document, named entities are fine. For machine-generated content, library APIs that produce escaped output, and any context where XML compatibility matters, prefer numeric. For the four syntactic characters (& < > "), either form is acceptable; named is conventional.
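For machine-generated output, the "prefer numeric" advice can be sketched in a few lines of Python. to_numeric_references is a hypothetical helper, and it assumes its input is a text fragment (no tags), since any markup in the input would be escaped as well.

```python
import html

def to_numeric_references(fragment: str) -> str:
    """Re-encode a text fragment using only XML-safe references."""
    # 1. Decode whatever references are present (named or numeric).
    text = html.unescape(fragment)
    # 2. Re-escape the syntactic characters; html.escape emits only
    #    &amp; &lt; &gt; &quot; and &#x27;, all valid in XML.
    escaped = html.escape(text, quote=True)
    # 3. Turn every non-ASCII character into a decimal reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_numeric_references("caf&eacute; &amp; co&nbsp;&mdash; ok"))
print(to_numeric_references("€"))
```

The result is pure ASCII, so it also survives transports that mangle non-ASCII bytes.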

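The set of HTML named entities is large and has some odd corners. Some names are plain aliases: &empty; and &varnothing; both resolve to ∅ (U+2205), so choosing between them changes nothing in the parsed document. Other names look alike on screen but resolve to different code points: &Sigma; is the Greek capital sigma (U+03A3), while &sum; is the n-ary summation sign (U+2211). When two names produce a similar glyph, use the one whose Unicode code point matches the meaning you want.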
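One way to check what a name resolves to is to look it up in a copy of the entity table. The sketch below uses Python's html.entities.html5; names_for is a hypothetical helper that lists every semicolon-terminated name mapped to a given character.

```python
from html.entities import html5

def names_for(character: str) -> list:
    """Return the sorted entity names that decode to character."""
    return sorted(
        name for name, value in html5.items()
        if value == character and name.endswith(";")
    )

# Both names from the paragraph above map to the same code point.
empty_names = names_for("\u2205")

# Look-alike glyphs can still be different code points.
sigma = html5["Sigma;"]   # Greek capital letter sigma
summation = html5["sum;"] # n-ary summation sign

print(empty_names)
print(f"U+{ord(sigma):04X}", f"U+{ord(summation):04X}")
```

When the two code points printed on the last line differ, the choice of name changes the parsed document, not just the source.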

What to remember

With UTF-8 source and <meta charset="utf-8">, you need to escape four characters in HTML: &, <, >, and ". Everything else is editorial preference. For XML and standalone SVG, prefer numeric references. For URLs and scripts, use the encoding mechanisms appropriate to those contexts. The HTML entity encoder converts strings between literal, named, decimal, and hexadecimal forms in either direction.

Further reading