HTML Unicode (UTF-8)
Reference Manual
Unicode Consortium
The Unicode Consortium develops the Unicode Standard. Its goal is to replace existing character sets with its standard Unicode Transformation Format (UTF).
The Unicode Standard has been widely adopted and is implemented in HTML, XML, Java, JavaScript, email, ASP, and PHP. It is also supported by many operating systems and all modern browsers.
The Unicode Consortium collaborates with leading standard development organizations such as ISO, W3C, and ECMA.
Unicode Character Set
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:
Encoding | Description |
---|---|
UTF-8 | Characters in UTF-8 can be 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for email and web pages. |
UTF-16 | The 16-bit Unicode Transformation Format is a variable-length Unicode character encoding that can encode the entire Unicode character set. Each character takes 2 or 4 bytes. UTF-16 is mainly used in operating systems and environments such as Microsoft Windows, Java, and .NET. |
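The byte lengths described in the table can be checked with a short Python sketch. The sample characters and the helper name `encoded_lengths` are chosen here for illustration only; they are not part of any standard.

```python
# Sample characters chosen for illustration; each needs a different UTF-8 length.
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return (utf8_byte_count, utf16_byte_count) for a single character."""
    utf8 = ch.encode("utf-8")
    # An explicit byte order is used so that no byte order mark is prepended.
    utf16 = ch.encode("utf-16-le")
    return len(utf8), len(utf16)

for ch in samples:
    n8, n16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {n8} byte(s), UTF-16 = {n16} byte(s)")
```

Characters outside the Basic Multilingual Plane, such as U+1F600, take 4 bytes in both encodings; in UTF-16 they are stored as a pair of 16-bit units.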
Tip: The first 128 characters of Unicode (which correspond one-to-one with ASCII) are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded text.
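The ASCII compatibility described in the tip above can be verified with a minimal sketch. The function name `is_ascii_compatible` is a hypothetical helper introduced for this example.

```python
def is_ascii_compatible(text):
    """Return True if text encodes to identical bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside the first 128 code points.
        return False

# Bytes produced by an ASCII encoder decode cleanly as UTF-8.
ascii_bytes = "Hello, HTML!".encode("ascii")
print(ascii_bytes.decode("utf-8"))

print(is_ascii_compatible("Hello, HTML!"))
print(is_ascii_compatible("caf\u00e9"))
```

The second check fails because U+00E9 is not an ASCII character; in UTF-8 it is encoded with two bytes.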
Tip: All HTML 4 processors support UTF-8, and all HTML 5 and XML processors support UTF-8 and UTF-16!
HTML5 Standard: Unicode UTF-8
Due to the limited size of character sets in ISO-8859 and their incompatibility in multilingual environments, the Unicode Consortium developed the Unicode Standard.
The Unicode Standard covers (almost) all characters, punctuation marks, and symbols.
Unicode allows text processing, storage, and transport to be independent of platform and language.
The default character encoding in HTML5 is UTF-8.
Below are some of the Unicode character blocks supported by HTML5:
Block | Decimal | Hexadecimal |
---|---|---|
C0 Controls and Basic Latin | 0-127 | 0000-007F |
C1 Controls and Latin-1 Supplement | 128-255 | 0080-00FF |
Latin Extended-A | 256-383 | 0100-017F |
Latin Extended-B | 384-591 | 0180-024F |
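As a rough illustration of how the ranges in the table above can be used, the following Python sketch looks up which of the four listed ranges a character falls into. The names `BLOCKS` and `block_of` are assumptions made for this example, and the list covers only the rows shown here, not the full Unicode Standard.

```python
# Inclusive code point ranges copied from the table above.
BLOCKS = [
    (0x0000, 0x007F, "C0 Controls and Basic Latin"),
    (0x0080, 0x00FF, "C1 Controls and Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
]

def block_of(ch):
    """Return the name of the listed range containing ch, or None."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None  # outside the four ranges listed here

for ch in ["A", "\u00e9", "\u0153", "\u0192", "\u20ac"]:
    print(f"U+{ord(ch):04X} ({ord(ch)}): {block_of(ch)}")
```

The last sample, U+20AC, lies beyond Latin Extended-B, so the lookup returns `None` for it.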
If an HTML5 page uses a character set other than UTF-8, it must be specified in the <meta> tag, as follows:
Example
English: