HTML Character Set
To display an HTML page correctly, a browser must know which character set to use.
The earliest web pages used ASCII, which covers the digits 0-9, the English alphabet (both uppercase and lowercase), and some special characters.
Complete ASCII Reference Manual.
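One way to see the limits of ASCII is to try encoding text with it. The Python sketch below is only an illustration; the helper name `is_ascii_encodable` and the sample strings are made up for this example.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings: plain English, accented Latin, Cyrillic.
samples = ["Hello 123!", "café", "Привет"]
results = {s: is_ascii_encodable(s) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'fits in ASCII' if ok else 'needs a larger character set'}")
```

Only the first string passes, which is why larger character sets were needed once the web spread beyond English.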
Since many countries use characters that are not part of ASCII, HTML 4 browsers used ISO-8859-1 as their default character set. The current HTML standard instead expects pages to be encoded in UTF-8.
Complete ISO-8859-1 Reference Manual.
Whichever character set a web page uses, it should be declared in the <meta> tag so that the browser does not have to guess.
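The declaration only helps if it matches the bytes that are actually saved. As a rough sketch, assuming a page generated from Python, the code below writes the `<meta charset>` declaration and encodes the page with the same character set. The helper names `build_page` and `declared_charset` are hypothetical, and the regular expression is a simplification that only handles the short `<meta charset="...">` form in an ASCII-compatible encoding.

```python
import re
from typing import Optional


def build_page(body_text: str, charset: str = "utf-8") -> bytes:
    """Build a small HTML page and encode it with the charset it declares."""
    html = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{charset}">\n'
        "<title>Charset demo</title>\n"
        "</head>\n"
        f"<body><p>{body_text}</p></body>\n"
        "</html>\n"
    )
    return html.encode(charset)


def declared_charset(page: bytes) -> Optional[str]:
    """Find the declared charset by scanning the raw bytes of the page."""
    match = re.search(rb'<meta\s+charset="([^"]+)"', page)
    return match.group(1).decode("ascii") if match else None


page = build_page("Smörgåsbord costs 5 €")
charset = declared_charset(page)
text = page.decode(charset)  # decodes cleanly because declaration and bytes agree
print(charset)
```

HTML 4 pages usually carry the longer form `<meta http-equiv="Content-Type" content="text/html; charset=...">`, which this sketch does not parse.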
ISO Character Sets
ISO character sets are standards defined by the International Organization for Standardization (ISO) for different alphabets/languages.
Below is a list of different character sets used around the world:
Character Set | Description | Usage |
---|---|---|
ISO-8859-1 | Latin alphabet part 1 | North America, Western Europe, Latin America, Caribbean, Africa |
ISO-8859-2 | Latin alphabet part 2 | Eastern Europe |
ISO-8859-3 | Latin alphabet part 3 | Southern Europe (for example Maltese), Esperanto, and miscellaneous other languages |
ISO-8859-4 | Latin alphabet part 4 | Scandinavian/Baltic (and other parts not included in ISO-8859-1) |
ISO-8859-5 | Latin/Cyrillic part 5 | Languages using the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian, and Macedonian |
ISO-8859-6 | Latin/Arabic part 6 | Languages using the Arabic alphabet |
ISO-8859-7 | Latin/Greek part 7 | Modern Greek, and Greek-derived mathematical symbols |
ISO-8859-8 | Latin/Hebrew part 8 | Languages using the Hebrew alphabet |
ISO-8859-9 | Latin 5 part 9 | Turkish. Same as ISO-8859-1 except for Turkish characters replacing Icelandic letters. |
ISO-8859-10 | Latin 6 | Nordic languages, including Sami (Lapland), Germanic, and Inuit languages |
ISO-8859-15 | Latin 9 (aka Latin 0) | Similar to ISO 8859-1, with the Euro symbol and some other characters replacing less commonly used symbols |
ISO-2022-JP | Latin/Japanese part 1 | Japanese |
ISO-2022-JP-2 | Latin/Japanese part 2 | Japanese |
ISO-2022-KR | Latin/Korean part 1 | Korean |
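Because each ISO-8859 part assigns its own characters to the upper half of the byte range, the same byte can mean different things depending on which part is assumed. The following Python sketch illustrates this with two arbitrarily chosen bytes, using the codec names that ship with Python's standard library.

```python
# One byte, three ISO-8859 parts: Latin, Cyrillic, and Greek.
raw = bytes([0xE4])
encodings = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]
decoded = {name: raw.decode(name) for name in encodings}

for name, ch in decoded.items():
    print(f"byte 0x{raw.hex().upper()} read as {name}: {ch} (U+{ord(ch):04X})")

# ISO-8859-15 differs from ISO-8859-1 in only a few positions.
# Byte 0xA4 is the generic currency sign in part 1 and the euro sign in part 15.
currency = bytes([0xA4])
latin1_char = currency.decode("iso-8859-1")
latin9_char = currency.decode("iso-8859-15")
print(f"0xA4: {latin1_char} in ISO-8859-1, {latin9_char} in ISO-8859-15")
```

This is why a page saved in one ISO-8859 part but read as another shows the wrong letters rather than an error.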
Unicode Standard
The character sets listed above are limited in size and cannot be mixed freely in multilingual documents. To overcome these limitations, the Unicode Consortium developed the Unicode Standard.
The Unicode Standard aims to cover the characters, punctuation, and symbols of nearly all the world's writing systems.
Unicode enables processing, storage, and interchange of text data, regardless of the platform, program, or language.
Unicode Consortium
The Unicode Consortium developed the Unicode Standard. Its goal is to replace existing character sets with Unicode and its standard encodings, the Unicode Transformation Formats (UTF).
The Unicode Standard has been successful and is implemented in XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML. It is also supported by many operating systems and all modern browsers.
The Unicode Consortium collaborates with leading standards development organizations such as ISO, W3C, and ECMA.
Unicode text can be stored using several different encodings. The most commonly used are UTF-8 and UTF-16:
Encoding | Description |
---|---|
UTF-8 | Characters in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for web pages and email. |
UTF-16 | The 16-bit Unicode Transformation Format is a variable-length character encoding (2 or 4 bytes per character) that can encode the entire set of Unicode code points. UTF-16 is primarily used in environments and operating systems such as Microsoft Windows 2000/XP/2003/Vista/CE, Java, and .NET bytecode environments. |
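The byte lengths described in the table can be checked directly. The Python sketch below encodes four sample characters (chosen here only for illustration) and records how many bytes each takes in UTF-8 and UTF-16.

```python
# Sample characters from increasingly high code point ranges.
samples = ["A", "é", "€", "😀"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian variant, written without a byte order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}: UTF-8 uses {len(utf8)} byte(s), UTF-16 uses {len(utf16)}")

# Backward compatibility: pure ASCII text has identical bytes in UTF-8.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_compatible)
```

The explicit `utf-16-be` codec is used so that the byte order mark Python adds for plain `utf-16` does not inflate the counts.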
Note: The first 256 Unicode code points correspond to the 256 ISO-8859-1 characters.
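This correspondence applies to code point numbers, not necessarily to the bytes on disk. A short Python check, offered as a sketch rather than a proof, makes the distinction visible:

```python
# Every ISO-8859-1 byte value decodes to the Unicode code point with the same number.
matches = all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))
print(matches)

# The stored bytes still differ between encodings for characters above 127.
latin1_bytes = "é".encode("iso-8859-1")  # one byte, equal to the code point value
utf8_bytes = "é".encode("utf-8")         # two bytes in UTF-8
print(latin1_bytes, utf8_bytes)
```

So a file saved as ISO-8859-1 is not automatically valid UTF-8, even though both refer to the same code points for these characters.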
Note: All HTML 4 browsers support UTF-8, while all XHTML and XML processors support both UTF-8 and UTF-16.