Easy Tutorial

HTML Character Set


To display an HTML page correctly, a browser must know which character set (character encoding) to use.
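As a rough illustration of what goes wrong when the character set is guessed incorrectly, the Python sketch below encodes a word as UTF-8 and then decodes the same bytes two ways. The sample word and variable names are made up for this example; real browsers follow more elaborate detection rules.

```python
# The same bytes read with two different character sets.
text = "café"
utf8_bytes = text.encode("utf-8")    # the "é" becomes the two bytes C3 A9

wrong = utf8_bytes.decode("cp1252")  # decoder assumes Windows-1252
right = utf8_bytes.decode("utf-8")   # decoder assumes the actual encoding

print(wrong)  # cafÃ©
print(right)  # café
```

The bytes never change; only the character set used to interpret them does.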



What is the correct character encoding in HTML?

The default character encoding in HTML5 is UTF-8.

This has not always been the case. The earliest web pages were encoded in ASCII.

Later, from HTML 2.0 through HTML 4.01, ISO-8859-1 was the standard character set.

With XML and HTML5, UTF-8 became the default, resolving many character encoding problems.

Below is a brief overview of character encoding standards.


In the Beginning: ASCII

Computer information (numbers, text, images) is stored in electronic form as binary 1s and 0s (01000101).

To standardize the storage of alphanumeric characters, ASCII (American Standard Code for Information Interchange) was created. It defined a unique 7-bit binary number for each stored character, supporting digits 0-9, uppercase/lowercase English letters (a-z, A-Z), and some special characters, such as ! $ + - ( ) @ < >.

Because only 7 of the 8 bits in a byte encode the character (the eighth bit was used for transmission parity control), ASCII can represent only 128 different characters. Of these, the first 32 (codes 0-31) are reserved as control codes, as is code 127 (DEL).

The major drawback of ASCII is that it excludes non-English letters.
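The points above can be checked with a short Python sketch. The helper name `ascii_code` is hypothetical and exists only for this illustration.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII code of a single character."""
    return ch.encode("ascii")[0]

# Each supported character maps to a number below 128.
codes = {ch: ascii_code(ch) for ch in "A a 0 @".split()}
print(codes)  # {'A': 65, 'a': 97, '0': 48, '@': 64}

# The letter E stored in one byte, written as binary digits.
bits = format(ascii_code("E"), "08b")
print(bits)  # 01000101

# A non-English letter has no ASCII code at all.
try:
    "é".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```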

ASCII is still widely used today, especially in large computer systems.

For a deeper understanding of ASCII, please refer to the complete ASCII reference manual.


In Windows: ANSI

ANSI (also known as Windows-1252) was the default character set in Windows 95 and earlier Windows systems.

ANSI is an extension of ASCII, adding international characters. It uses a full byte (8 bits) to represent 256 different characters.
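A minimal Python sketch of this, assuming Python's `cp1252` codec as a stand-in for ANSI: ASCII characters keep their codes, and each added international character still fits in a single byte.

```python
# ASCII characters are unchanged in Windows-1252.
ascii_same = "A".encode("cp1252") == "A".encode("ascii")
print(ascii_same)  # True

# International characters use the upper half of the byte range (128-255).
encoded = {ch: ch.encode("cp1252") for ch in ["é", "ñ", "€"]}
for ch, raw in encoded.items():
    print(f"U+{ord(ch):04X} -> byte value {raw[0]} ({len(raw)} byte)")
```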

Since ANSI became the default character set in Windows, all browsers support ANSI.

For a deeper understanding of ANSI, please refer to the complete ANSI reference manual.


In HTML 4: ISO-8859-1

Because most countries use characters that are not part of ASCII, the HTML 2.0 standard changed the default character encoding to ISO-8859-1.

ISO-8859-1 is an extension of ASCII, adding international characters. Like ANSI, it uses a full byte (8 bits) to represent 256 different characters.

Note: When a browser detects ISO-8859-1 in a web page, it usually treats it as ANSI (Windows-1252), because the two are identical except for 32 code positions (128-159), where ANSI places additional printable characters and ISO-8859-1 has control codes.
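The overlap between the two character sets can be checked with a small Python sketch, again assuming Python's `cp1252` and `iso-8859-1` codecs as stand-ins for ANSI and ISO-8859-1. It decodes every possible byte value with both and records where they disagree.

```python
# Collect every byte value that the two encodings decode differently.
differing = []
for value in range(256):
    raw = bytes([value])
    as_latin1 = raw.decode("iso-8859-1")
    try:
        as_cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        as_cp1252 = None  # Python's cp1252 codec leaves a few bytes unassigned
    if as_cp1252 != as_latin1:
        differing.append(value)

print(len(differing))                 # 32
print(differing[0], differing[-1])    # 128 159
```

For example, byte 128 is the euro sign in Windows-1252 but a non-printing control code in ISO-8859-1.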

If an HTML 4 web page uses a character set other than ISO-8859-1, it needs to be specified in the <meta> tag, as shown below:

Example

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8">

Note: The default character set in HTML5 is UTF-8. All HTML 4 processors support UTF-8, and all HTML5 and XML processors support both UTF-8 and UTF-16.

For a deeper understanding of ISO-8859-1, please refer to the complete ISO-8859-1 reference manual.


In HTML5: Unicode (UTF-8)

Because the character sets above are limited in size and incompatible with one another in multilingual environments, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard covers (almost) all characters, punctuation, and symbols in the world's writing systems.

Unicode makes text processing, storage, and transport independent of platform and language.

The default character encoding in HTML5 is UTF-8, which encodes every Unicode character as one to four bytes and leaves ASCII text unchanged.
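As a small illustration of how UTF-8 stores Unicode characters, the Python sketch below encodes four sample characters (chosen arbitrarily for this example) and reports how many bytes each one needs.

```python
# Latin letter, accented letter, euro sign, emoji.
samples = ["A", "é", "€", "😀"]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Plain ASCII text produces the same bytes in UTF-8 as in ASCII.
ascii_compatible = "HTML".encode("utf-8") == "HTML".encode("ascii")
print(ascii_compatible)  # True
```

The byte counts grow from one to four as the code point gets larger, which is how UTF-8 covers the whole Unicode range while staying compact for English text.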

For a deeper understanding of Unicode (UTF-8), please refer to the complete Unicode reference manual.
