
HTML Unicode (UTF-8) Reference Manual


Unicode Consortium

The Unicode Consortium develops the Unicode Standard. Its goal is to replace existing character sets with the standard Unicode Transformation Format (UTF).

The Unicode Standard has been widely adopted and is implemented in HTML, XML, Java, JavaScript, email, ASP, and PHP. It is also supported by many operating systems and all modern browsers.

The Unicode Consortium collaborates with leading standards development organizations such as ISO, W3C, and ECMA.


Unicode Character Set

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:

| Character Set | Description |
|---|---|
| UTF-8 | Characters in UTF-8 can be 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for email and web pages. |
| UTF-16 | The 16-bit Unicode Transformation Format is a variable-length Unicode character encoding that can encode the entire Unicode character repertoire. UTF-16 is mainly used in operating systems and environments such as Microsoft Windows, Java, and .NET. |
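The byte lengths described in the table can be checked directly. The following is a minimal Python sketch, not part of the original reference; the sample characters and the helper name `encoded_lengths` are chosen here purely for illustration.

```python
# Sample characters picked for illustration (assumption, not from the reference):
# U+0041 "A", U+00E9 "e with acute", U+20AC euro sign, U+1F600 an emoji.
SAMPLES = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    utf8 = ch.encode("utf-8")
    # "utf-16-be" is used so that no byte order mark is added to the count.
    utf16 = ch.encode("utf-16-be")
    return len(utf8), len(utf16)


for ch in SAMPLES:
    u8, u16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: {u8} byte(s) in UTF-8, {u16} byte(s) in UTF-16")
```

In this sketch the UTF-8 length grows from 1 to 4 bytes, while UTF-16 uses 2 bytes for the first three characters and 4 bytes (a surrogate pair) for the last one, which is what "variable-length" means for both encodings.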

Tip: The first 128 characters of Unicode (which correspond one-to-one with ASCII) are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded text.
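The ASCII compatibility described in this tip can be illustrated with a short Python sketch. The sample string is a hypothetical example, not taken from the reference.

```python
# Hypothetical sample text containing only ASCII characters.
text = "Hello, HTML!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For ASCII-only text, both encodings produce the same byte sequence.
same_bytes = ascii_bytes == utf8_bytes

# Every byte is below 128, so each character occupies a single octet.
single_octets = all(b < 128 for b in utf8_bytes)

# The same bytes decode to the same text under either encoding.
same_text = utf8_bytes.decode("ascii") == utf8_bytes.decode("utf-8")

print(same_bytes, single_octets, same_text)
```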

Tip: All HTML 4 processors support UTF-8, and all HTML5 and XML processors support UTF-8 and UTF-16.


HTML5 Standard: Unicode UTF-8

Due to the limited size of character sets in ISO-8859 and their incompatibility in multilingual environments, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard covers (almost) all characters, punctuation marks, and symbols.

Unicode allows text processing, storage, and transport to be independent of platform and language.

The default character encoding in HTML5 is UTF-8.
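Because UTF-8 is the default, bytes written in another encoding are misread unless that encoding is known to the reader. The Python sketch below is an illustration only; the helper `decode_as` and the sample word are assumptions introduced here, and the sketch models a decoder in general rather than any particular browser.

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, replacing undecodable bytes
    with U+FFFD (the Unicode replacement character)."""
    return raw.decode(charset, errors="replace")


# Hypothetical sample: "caf" followed by U+00E9, stored as ISO-8859-1 bytes.
raw = "caf\u00e9".encode("iso-8859-1")

# Decoding with the wrong assumption (UTF-8) loses the accented letter.
as_utf8 = decode_as(raw, "utf-8")

# Decoding with the encoding that was actually used recovers the text.
as_latin1 = decode_as(raw, "iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```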

Below are some of the Unicode character ranges (blocks) supported by HTML5:

| Character Set | Decimal | Hexadecimal |
|---|---|---|
| C0 Controls and Basic Latin | 0-127 | 0000-007F |
| C1 Controls and Latin-1 Supplement | 128-255 | 0080-00FF |
| Latin Extended-A | 256-383 | 0100-017F |
| Latin Extended-B | 384-591 | 0180-024F |
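The ranges in the table can be used to look up which block a character belongs to, and the decimal and hexadecimal values correspond to the two forms of HTML numeric character reference (`&#N;` and `&#xH;`). The following Python sketch assumes only the four ranges listed above; the names `BLOCKS`, `block_of`, and `numeric_references` are hypothetical helpers written for this illustration.

```python
import html

# The four ranges from the table above, as (first, last, name).
BLOCKS = [
    (0x0000, 0x007F, "C0 Controls and Basic Latin"),
    (0x0080, 0x00FF, "C1 Controls and Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
]


def block_of(ch: str):
    """Return the name of the listed block containing ch, or None if the
    character falls outside the four ranges covered here."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def numeric_references(ch: str) -> tuple:
    """Return the decimal and hexadecimal HTML numeric character references."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:04X};"


ch = "\u0100"  # first character of Latin Extended-A
dec_ref, hex_ref = numeric_references(ch)
print(block_of(ch), dec_ref, hex_ref)

# Both reference forms resolve back to the same character.
print(html.unescape(dec_ref) == html.unescape(hex_ref) == ch)
```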

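If an HTML5 page uses a character set other than UTF-8, the encoding must be declared in a <meta> tag inside the document's <head>. For example, a page encoded in ISO-8859-1 would declare it as follows:

Example

English:

<meta charset="ISO-8859-1">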
