HTML Charset (Encoding)

To display an HTML page correctly, a web browser must know which character set (character encoding) to use.

Character encoding is the mapping between characters and the bytes used to store them. To validate or display an HTML document properly, a program must decode its bytes with the same encoding that was used to save the document.
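As a rough illustration of why the encoding matters, the following Python sketch writes text with one encoding and reads the bytes back with another. The helper name roundtrip and the sample word are assumptions made for this example only.

```python
def roundtrip(text, write_encoding, read_encoding):
    """Encode text with one encoding, then decode the bytes with another."""
    raw = text.encode(write_encoding)
    return raw, raw.decode(read_encoding)


# Same encoding on both sides: the text survives unchanged.
raw, good = roundtrip("caf\u00e9", "utf-8", "utf-8")
# raw is b'caf\xc3\xa9' and good is "café"

# Reading UTF-8 bytes as ISO-8859-1 turns the two bytes of "é" into two characters.
_, garbled = roundtrip("caf\u00e9", "utf-8", "iso-8859-1")
# garbled is "cafÃ©"
```

This kind of garbling is what a reader sees when a browser guesses the wrong character set for a page.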

What is Character Encoding?

ASCII (the American Standard Code for Information Interchange) is one of the oldest and most widely supported character encodings for electronic text, and many later encodings, including ISO-8859-1 and UTF-8, are compatible with it.

ASCII was one of the first character encoding standards. It defines 128 characters that could be used on the internet, including:

  • Numbers (0-9),
  • English letters (A-Z and a-z),
  • Some special characters like ! $ + - ( ) @ < > .

ASCII supports only the upper- and lowercase Latin alphabet, the numbers 0-9, some punctuation and symbols, and a set of control characters, for a total of 128 characters.

The HTML charset Attribute

To display an HTML page correctly, a web browser must know the character set used in the page. This is declared in a <meta> tag inside the <head> element:

HTML4

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>

HTML5

<meta charset="UTF-8" />
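To show how a program might read either form of the declaration, here is a minimal Python sketch built on the standard html.parser module. The names MetaCharsetParser and declared_charset are hypothetical, and real browsers follow a more elaborate procedure that also considers the HTTP Content-Type header and a byte order mark.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attr["charset"].strip()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # HTML4 form: content="text/html;charset=ISO-8859-1"
            for part in (attr.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip()


def declared_charset(html_text):
    """Return the charset named in the markup, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset
```

Calling declared_charset on the HTML5 example returns "UTF-8", and on the HTML4 example it returns "ISO-8859-1".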

The ASCII Character Set

ASCII uses the values from 0 to 31 (and 127) for control characters.

ASCII uses the values from 32 to 126 for letters, digits, and symbols.

ASCII is a 7-bit code, so it does not define the values from 128 to 255.
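The three ranges above can be written as a small Python function. The name classify_ascii and its return labels are assumptions chosen for this sketch.

```python
def classify_ascii(value):
    """Place a byte value (0-255) into one of the ranges described above."""
    if 0 <= value <= 31 or value == 127:
        return "control"
    if 32 <= value <= 126:
        return "printable"
    return "not ASCII"


newline = classify_ascii(10)         # "control"
letter_a = classify_ascii(ord("A"))  # "printable" (A is value 65)
delete = classify_ascii(127)         # "control"
high = classify_ascii(200)           # "not ASCII"
```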

Character Sets & Descriptions

The International Organization for Standardization (ISO) created a range of character sets to handle different national alphabets. For documents in English and most other Western European languages, the widely supported encoding ISO-8859-1 is commonly used.

Here is a list of character sets used around the world, along with their descriptions.

Character Set    Description
ISO-8859-1       Latin alphabet part 1. Covers North America, Western Europe, Latin America, the Caribbean, Canada, and Africa.
ISO-8859-2       Latin alphabet part 2. Covers Eastern Europe.
ISO-8859-3       Latin alphabet part 3. Covers SE Europe, Esperanto, and miscellaneous others.
ISO-8859-4       Latin alphabet part 4. Covers Scandinavia/Baltics (and others not in ISO-8859-1).
ISO-8859-5       Latin/Cyrillic alphabet part 5.
ISO-8859-6       Latin/Arabic alphabet part 6.
ISO-8859-7       Latin/Greek alphabet part 7.
ISO-8859-8       Latin/Hebrew alphabet part 8.
ISO-8859-9       Latin 5 alphabet part 9. Same as ISO-8859-1, except that Turkish characters replace the Icelandic ones.
ISO-8859-10      Latin 6. Covers Lappish, Nordic, and Eskimo languages.
ISO-8859-15      Similar to ISO-8859-1, but with a few symbols replaced by other characters, such as the euro sign (€).
ISO-2022-JP      Latin/Japanese alphabet part 1.
ISO-2022-JP-2    Latin/Japanese alphabet part 2.
ISO-2022-KR      Latin/Korean alphabet part 1.
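Because every ISO-8859 character set fits in a single byte, the same byte value can stand for different characters depending on which set is chosen. The Python sketch below illustrates this with the standard codecs; the helper name decode_byte and the sample byte values are assumptions for the example.

```python
def decode_byte(value, encodings):
    """Decode one byte value under several single-byte character sets."""
    raw = bytes([value])
    return {name: raw.decode(name) for name in encodings}


# Byte 0xA4 is the generic currency sign (U+00A4) in ISO-8859-1
# and the euro sign (U+20AC) in ISO-8859-15.
currency = decode_byte(0xA4, ["iso-8859-1", "iso-8859-15"])

# Byte 0xE9 is a Latin, a Cyrillic, or a Greek letter, depending on the set.
letters = decode_byte(0xE9, ["iso-8859-1", "iso-8859-5", "iso-8859-7"])
```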

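The character sets above each cover only a limited group of alphabets, and they are not compatible with one another. Unicode defines a single character set intended to cover the writing systems of the whole world. Unicode therefore specifies encodings that represent a string in special ways so as to make enough space for the huge character set it encompasses.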

Character Set    Description
UTF-8            A Unicode Transformation Format that uses 8-bit units (bytes). A character in UTF-8 is 1 to 4 bytes long, which makes UTF-8 a variable-width encoding.
UTF-16           A Unicode Transformation Format that uses 16-bit units. A character is 1 or 2 units (2 or 4 bytes) long, which makes UTF-16 a variable-width encoding.
UTF-32           A Unicode Transformation Format that uses 32-bit units. It is a fixed-width format: every character is exactly 1 unit (4 bytes) long.
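The widths in the table can be checked with a short Python sketch. The function name encoded_sizes and the sample characters are assumptions; the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each UTF encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


ascii_letter = encoded_sizes("A")      # {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
euro_sign = encoded_sizes("\u20ac")    # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
emoji = encoded_sizes("\U0001F600")    # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

UTF-8 and UTF-16 grow with the character, while UTF-32 stays at 4 bytes in every case.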

The @charset CSS Rule

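You can use the CSS @charset rule to specify the character encoding used in a style sheet. The rule must be the very first thing in the file, and it needs a single space between @charset and the quoted encoding name:

@charset "UTF-8";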
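The @charset rule is matched byte for byte at the very start of the style sheet: the word @charset, one space, the encoding name in double quotes, and a semicolon. A minimal Python sketch of that check follows; the function name stylesheet_charset is hypothetical, and the pattern is a simplification rather than a full CSS parser.

```python
import re

# The exact byte sequence expected at the start of the file: @charset "name";
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def stylesheet_charset(css_bytes):
    """Return the encoding named by a leading @charset rule, or None."""
    match = _CHARSET_RULE.match(css_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace")


valid = stylesheet_charset(b'@charset "UTF-8";\nbody { margin: 0; }')  # "UTF-8"
no_space = stylesheet_charset(b'@charset"UTF-8";')                      # None
not_first = stylesheet_charset(b'\n@charset "UTF-8";')                  # None
```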

© 2023 GDATAMART.COM (All Rights Reserved)