HTML Character Set
To display an HTML page correctly, a browser must know which character set to use.
The first character set widely used on the World Wide Web was ASCII. ASCII supports the digits 0-9, the upper and lower case English alphabet, and some special characters.
Since many countries use characters that do not belong to ASCII, the default character set in HTML 4 was ISO-8859-1. The current HTML standard expects pages to be encoded as UTF-8.
A web page should declare its character set in a <meta> tag, for example <meta charset="UTF-8">, so that the browser does not have to guess.
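To see why the declaration matters, the following Python sketch decodes the same bytes with two different character sets. The sample page and the variable names are made up for illustration.

```python
# A tiny HTML page that declares its encoding (hypothetical sample content).
page = (
    '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
    "<body>café</body></html>"
)

# What travels over the network is bytes, not characters.
raw = page.encode("utf-8")

# Decoding with the declared character set restores the original text.
as_utf8 = raw.decode("utf-8")

# Decoding with a different character set garbles the non-ASCII character.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8 == page)        # True
print("café" in as_latin1)    # False
print("cafÃ©" in as_latin1)   # True: the two UTF-8 bytes of "é" became two characters
```

The ASCII part of the page survives either way; only the characters outside ASCII depend on the declared character set.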
ISO Character Sets
ISO character sets are standards defined by the International Organization for Standardization (ISO) for different alphabets/languages.
Below is a list of different character sets used around the world:
Character Set | Description | Usage |
---|---|---|
ISO-8859-1 | Latin alphabet part 1 | North America, Western Europe, Latin America, Caribbean, Canada, Africa |
ISO-8859-2 | Latin alphabet part 2 | Eastern Europe |
ISO-8859-3 | Latin alphabet part 3 | Southern Europe (Maltese), Esperanto, miscellaneous others |
ISO-8859-4 | Latin alphabet part 4 | Scandinavian/Baltic (and others not included in ISO-8859-1) |
ISO-8859-5 | Latin/Cyrillic part 5 | Languages using the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian, Macedonian |
ISO-8859-6 | Latin/Arabic part 6 | Languages using the Arabic alphabet |
ISO-8859-7 | Latin/Greek part 7 | Modern Greek, along with Greek-derived mathematical symbols |
ISO-8859-8 | Latin/Hebrew part 8 | Languages using the Hebrew alphabet |
ISO-8859-9 | Latin 5 part 9 | Turkish. Identical to ISO-8859-1 except for Turkish characters replacing the Icelandic ones |
ISO-8859-10 | Latin 6 | Nordic languages, including Sami, Greenlandic (Inuit), and Icelandic |
ISO-8859-15 | Latin 9 (aka Latin 0) | Similar to ISO-8859-1, with the Euro symbol and some other characters replacing less commonly used ones |
ISO-2022-JP | Latin/Japanese part 1 | Japanese |
ISO-2022-JP-2 | Latin/Japanese part 2 | Japanese |
ISO-2022-KR | Latin/Korean part 1 | Korean |
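The practical consequence of the table above is that one byte value can stand for different characters depending on the character set. The following Python sketch illustrates this with two byte values chosen for the example; the names `samples` and `decoded` are illustrative only.

```python
# Byte values above 0x7F are outside ASCII, so their meaning depends
# on which ISO-8859 part is used to decode them.
samples = {
    0xA4: ["iso-8859-1", "iso-8859-15"],
    0xD0: ["iso-8859-1", "iso-8859-9", "iso-8859-5", "iso-8859-7"],
}

decoded = {}
for byte_value, charsets in samples.items():
    for charset in charsets:
        char = bytes([byte_value]).decode(charset)
        decoded[(byte_value, charset)] = char
        print(f"0x{byte_value:02X} in {charset}: {char} (U+{ord(char):04X})")
```

Byte 0xA4 is the generic currency sign in ISO-8859-1 but the Euro symbol in ISO-8859-15, and byte 0xD0 is an Icelandic letter in ISO-8859-1 but a Turkish letter in ISO-8859-9.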
Unicode Standard
Because the character sets listed above are limited in size and do not work well together in multilingual environments, the Unicode Consortium developed the Unicode Standard.
The Unicode Standard covers almost all the characters, punctuation marks, and symbols in the world.
Unicode enables processing, storage, and interchange of text data, regardless of the platform, program, or language.
Unicode Consortium
The Unicode Consortium develops and maintains the Unicode Standard. Its goal is to replace existing character sets with the standard Unicode Transformation Format (UTF).
The Unicode Standard has been successful and is implemented in XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML. Unicode is also supported in many operating systems and all modern browsers.
The Unicode Consortium collaborates with leading standard development organizations such as ISO, W3C, and ECMA.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:
Character Set | Description |
---|---|
UTF-8 | Characters in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for web pages and email. |
UTF-16 | 16-bit Unicode Transformation Format is a variable-length Unicode character encoding that can encode the entire Unicode character repertoire. UTF-16 is mainly used in operating systems and environments such as Microsoft Windows 2000/XP/2003/Vista/CE, Java, and .NET bytecode environments. |
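The variable lengths described in the table can be checked with a short Python sketch. The four sample characters and the `sizes` dictionary are chosen for illustration only.

```python
# Four sample characters, from plain ASCII up to a character outside
# the 16-bit range.
sizes = {}
for char in ["A", "é", "€", "😀"]:
    utf8 = char.encode("utf-8")
    utf16 = char.encode("utf-16-le")  # little-endian, no byte order mark
    sizes[char] = (len(utf8), len(utf16))
    print(f"U+{ord(char):04X}: {len(utf8)} byte(s) in UTF-8, "
          f"{len(utf16)} byte(s) in UTF-16")

# UTF-8 is backward compatible with ASCII: ASCII text encodes to the same bytes.
ascii_compatible = "HTML".encode("utf-8") == "HTML".encode("ascii")
print(ascii_compatible)  # True
```

UTF-8 uses 1, 2, 3, and 4 bytes for these four characters, while UTF-16 uses 2 bytes for the first three and 4 bytes (a surrogate pair) for the last one.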
Tip: The first 256 characters of Unicode correspond to the 256 characters of ISO-8859-1.
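This correspondence can be verified with a minimal Python sketch, assuming Python's built-in "iso-8859-1" codec, which maps every byte value to a character; the variable names are illustrative.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decode them as ISO-8859-1 and compare each character's Unicode
# code point with the byte value it came from.
text = all_bytes.decode("iso-8859-1")
matches = all(ord(char) == value for value, char in zip(all_bytes, text))

print(matches)  # True
```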
Tip: All HTML 4 browsers support UTF-8, and all XHTML and XML processors support UTF-8 and UTF-16!