HTML Character Set

To display an HTML page correctly, a browser must know which character set to use.

The character set used early in the web was ASCII. ASCII supported numbers 0-9, the English alphabet (both uppercase and lowercase), and some special characters.

Since many countries use characters that are not part of ASCII, the default character set in HTML 4 was ISO-8859-1. In HTML5, the default character set is UTF-8.

A web page should declare the character set it uses in a <meta> tag inside the <head> element, so that the browser does not have to guess.
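
As a minimal sketch, an HTML5 page can declare UTF-8 with the charset attribute of <meta>. The older HTML 4 form using http-equiv is shown in a comment. The title and body text are placeholders chosen for illustration, not taken from the original page.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- HTML5 form: declare the encoding as early as possible in <head> -->
  <meta charset="UTF-8">
  <!-- HTML 4 equivalent:
       <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> -->
  <title>Character set example</title>
</head>
<body>
  <!-- The file itself must also be saved as UTF-8 for the literal é to display -->
  <p>Caf&eacute; and café should both display correctly.</p>
</body>
</html>
```

The declaration only tells the browser how to decode the bytes. It does not convert the file, so the encoding named in <meta> has to match the encoding the file was actually saved in.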

ISO Character Sets

ISO character sets are standards defined by the International Organization for Standardization (ISO) for different alphabets/languages.

Below is a list of different character sets used around the world:

Character Set	Description	Usage
ISO-8859-1	Latin alphabet part 1	North America, Western Europe, Latin America, Caribbean, Canada, Africa
ISO-8859-2	Latin alphabet part 2	Eastern Europe
ISO-8859-3	Latin alphabet part 3	SE Europe, Esperanto, other miscellaneous
ISO-8859-4	Latin alphabet part 4	Scandinavian/Baltic (and other parts not included in ISO-8859-1)
ISO-8859-5	Latin/Cyrillic part 5	Languages using the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian, Macedonian
ISO-8859-6	Latin/Arabic part 6	Languages using the Arabic alphabet
ISO-8859-7	Latin/Greek part 7	Modern Greek, and Greek-derived mathematical symbols
ISO-8859-8	Latin/Hebrew part 8	Languages using the Hebrew alphabet
ISO-8859-9	Latin 5 part 9	Turkish. Same as ISO-8859-1 except for Turkish characters replacing Icelandic letters.
ISO-8859-10	Latin 6	Nordic languages, including Sami (Lappish) and Inuit (Greenlandic)
ISO-8859-15	Latin 9 (aka Latin 0)	Similar to ISO-8859-1, with the Euro symbol and some other characters replacing less commonly used symbols
ISO-2022-JP	Latin/Japanese part 1	Japanese
ISO-2022-JP-2	Latin/Japanese part 2	Japanese
ISO-2022-KR	Latin/Korean part 1	Korean

Unicode Standard

Because the character sets listed above are limited in size and are not compatible with one another in multilingual environments, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard covers almost all the characters, punctuation marks, and symbols in the world.

Unicode enables processing, storage, and interchange of text data, regardless of the platform, program, or language.

Unicode Consortium

The Unicode Consortium developed the Unicode Standard. Its goal is to replace existing character sets with the standard Unicode Transformation Format (UTF).

The Unicode Standard has been successful and is implemented in XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML. It is also supported in many operating systems and all modern browsers.

The Unicode Consortium collaborates with leading standards development organizations such as ISO, W3C, and ECMA.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:

Character Set	Description
UTF-8	Characters in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for web pages and email.
UTF-16	UTF-16 is a variable-length character encoding that uses one or two 16-bit code units per character and can encode the entire set of Unicode code points. UTF-16 is primarily used in operating systems and environments such as Microsoft's Windows 2000/XP/2003/Vista/CE and the Java and .NET bytecode environments.
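
The variable lengths described in the table can be checked in a browser. The page below is a rough sketch that assumes a browser with the standard TextEncoder API, which always encodes to UTF-8. The four sample characters and the element id "out" are chosen for illustration only.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>UTF-8 and UTF-16 lengths</title>
</head>
<body>
  <pre id="out"></pre>
  <script>
    // TextEncoder always produces UTF-8 bytes.
    const encoder = new TextEncoder();
    // Escapes are used so the result does not depend on how this file is saved:
    // A, e with acute accent, euro sign, and an emoji outside the 16-bit range.
    const samples = ["A", "\u00E9", "\u20AC", "\u{1F600}"];
    const lines = samples.map(function (ch) {
      const codePoint = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
      // ch.length counts UTF-16 code units, because JavaScript strings are UTF-16.
      return "U+" + codePoint + ": " + encoder.encode(ch).length +
             " byte(s) in UTF-8, " + ch.length + " UTF-16 code unit(s)";
    });
    document.getElementById("out").textContent = lines.join("\n");
  </script>
</body>
</html>
```

Opened in a modern browser, the page should report 1, 2, 3, and 4 bytes in UTF-8 for the four samples. Only the last sample needs two UTF-16 code units.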

Note: The first 256 characters of Unicode correspond to the 256 characters of ISO-8859-1.
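
One way to see this correspondence in HTML is through numeric character references, which refer to Unicode code points regardless of the page's encoding. In the sketch below, the sample characters are chosen for illustration. For the first two, the code point equals the character's byte value in ISO-8859-1.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Code points shared with ISO-8859-1</title>
</head>
<body>
  <!-- U+00E9: the same character is byte 0xE9 (233) in ISO-8859-1 -->
  <p>&#233; (decimal) and &#xE9; (hexadecimal) both display é</p>
  <!-- U+00A9: byte 0xA9 (169) in ISO-8859-1 -->
  <p>&#169; displays ©</p>
  <!-- The euro sign is U+20AC (8364), outside the first 256 code points -->
  <p>&#8364; displays €</p>
</body>
</html>
```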

Note: All HTML 4 browsers support UTF-8, while all XHTML and XML processors support both UTF-8 and UTF-16!
