Character encoding

Some years ago, many web developers were asking the same question: how do we make our application work in an international environment? Today encoding issues are far less common, largely thanks to the spread of the Unicode Transformation Format, commonly known as UTF.

In this article we will discover the world of encoding. The first part gives a short historical overview, which helps to understand the changes that occurred over the decades and why we use a UTF-* standard today. The following parts describe the newer encoding standards in more detail, starting with Unicode and ending with UTF and its main branches.

Short historical schema of encoding

Historically, the country that developed the modern IT industry was the USA. The English alphabet doesn't contain any accented letters, which is why the first character sets were exclusively unaccented. The best known of them is ASCII (American Standard Code for Information Interchange). ASCII defines 128 characters: digits (0-9), lowercase and uppercase letters (a-z, A-Z), basic punctuation symbols (., etc.), the blank space and a set of non-printable control characters. It encodes each of them as a 7-bit integer. For example, the letter H is encoded as 1001000 (72) and lowercase h as 1101000 (104). Stored in an 8-bit byte, they become 01001000 and 01101000.
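As a small illustration of the ASCII mapping, here is a minimal Python sketch. The helper name `ascii_bits` is made up for this example; it only relies on the built-in `ord` and `format` functions.

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII pattern of a single character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


for ch in "Hh":
    # The last column pads the value to 8 digits, the form stored in a byte.
    print(ch, ord(ch), ascii_bits(ch), format(ord(ch), "08b"))
# H 72 1001000 01001000
# h 104 1101000 01101000
```

The two letters differ by a single bit, which is how ASCII separates uppercase from lowercase.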

But the IT revolution continued and other countries, such as Japan or France, became important actors on the market. French and Japanese contain more characters than English, so everybody started to create their own character set. In addition, computers already worked with 8-bit bytes, so the values 128-255 could finally be used. The problem is that these values correspond to different letters in different character sets. For example, in ISO 8859-1 the value 249 corresponds to "ù", while in ISO 8859-2 the same value prints "ů".
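The ambiguity of the values above 127 can be reproduced with Python's built-in codecs. This is only a sketch: the variable names are illustrative, and the characters are printed through their code points and Unicode names to keep the output independent of the terminal encoding.

```python
import unicodedata

raw = bytes([249])  # one byte with the value 249 (0xF9)

as_latin1 = raw.decode("iso-8859-1")
as_latin2 = raw.decode("iso-8859-2")

for label, ch in (("ISO 8859-1", as_latin1), ("ISO 8859-2", as_latin2)):
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
# ISO 8859-1 U+00F9 LATIN SMALL LETTER U WITH GRAVE
# ISO 8859-2 U+016F LATIN SMALL LETTER U WITH RING ABOVE
```

The same byte gives two different letters, so a file is readable only if the reader knows which character set the writer used.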

To bring everybody to an agreement, the Unicode Consortium was founded. The purpose of Unicode is to guarantee one encoding standard for most of the world's languages. Thanks to it, developers can mix different languages on the same page, even Japanese with Polish and French. Unicode also changed the way characters are interpreted: it doesn't map characters to bytes but to code points.

What are Unicode code points?

Unicode introduces a sort of map in which every letter of most of the world's alphabets has its own entry. Each entry contains two elements: an identifier, called the code point, and the canonical representation of the letter. The code point is written as U+ followed by a hexadecimal number. For instance, "U+0048" corresponds to the letter "H". Note that Unicode distinguishes uppercase and lowercase letters: the lowercase "h" is represented by U+0068.
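The U+ notation is easy to reproduce. The sketch below assumes nothing beyond the Python standard library; `code_point` is a hypothetical helper name chosen for this example.

```python
import unicodedata


def code_point(ch):
    """Format a character's code point in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("H", "h", "\u00f9"):
    # unicodedata.name returns the official Unicode name of the character
    print(code_point(ch), unicodedata.name(ch))
# U+0048 LATIN CAPITAL LETTER H
# U+0068 LATIN SMALL LETTER H
# U+00F9 LATIN SMALL LETTER U WITH GRAVE
```

At this stage nothing is said about bytes: a code point is only a number assigned to a character.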

In the early fixed-width encoding (UCS-2), the code point values (0048, 0068...) are stored in 2 bytes. The letters "Hh" are then represented by "00 48 00 68". But many developers didn't accept the cost of 2 bytes per character, especially for English text where every second byte would be zero. This led to the encodings called UTF (Unicode Transformation Format or UCS Transformation Format, both terms are synonyms for the same concept), which define how Unicode code points are translated into bytes.

The most popular UTF encoding is UTF-8. Unlike UCS-2, it doesn't store each code point in two bytes. Instead, the first 128 code points (from 0 to 127) are stored in a single byte, and the remaining ones (from 128) are stored in 2, 3 or 4 bytes. Thanks to that, text written only with ASCII characters doesn't require any transformation to conform to the new encoding. So we can deduce that a UTF is a kind of mapping between Unicode code points (U+0048) and their byte sequence representations (the single byte 0x48 in UTF-8).
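To make the difference between the two-byte form and UTF-8 concrete, here is a minimal Python sketch. The helper `hex_bytes` is an illustrative name, and the codec `utf-16-be` is used as a stand-in for the two-byte form, which gives the same bytes for these characters.

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


# Two bytes per character, as in the "00 48 00 68" example
print(hex_bytes("Hh", "utf-16-be"))        # 00 48 00 68

# UTF-8 uses 1 to 4 bytes depending on the code point
print(hex_bytes("H", "utf-8"))             # 48           (U+0048)
print(hex_bytes("\u00f9", "utf-8"))        # c3 b9        (U+00F9)
print(hex_bytes("\u0f40", "utf-8"))        # e0 bd 80     (U+0F40)
print(hex_bytes("\U0001F600", "utf-8"))    # f0 9f 98 80  (U+1F600)
```

The ASCII letter keeps exactly the byte it had in ASCII, which is the backward compatibility described above.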

UTF branches

UTF-8 is one of several ways to transform Unicode code points into a sequence of bytes. Another one is UCS-2, which stores every character in two bytes. But there are also other members of the "UTF-* family":

  1. UTF-7 (7-bit Unicode Transformation Format): unlike e-mail content, e-mail MIME headers can't contain values above the ASCII range, and SMTP, the e-mail transfer standard, doesn't guarantee a proper 8-bit transport. UTF-7 was designed for this case because it represents Unicode text using only 7-bit ASCII characters.
  2. UTF-16 (16-bit Unicode Transformation Format): this branch introduces a new idea to the encoding world. All characters from U+0000 to U+FFFF are encoded with a single 16-bit code unit. The remaining code points (about 1 million) are stored as a pair of 16-bit code units. The code units of such a pair are called surrogates. A surrogate means nothing alone and represents a character only together with its partner. These one million code points mostly hold rarely used symbols, for example historical scripts.
  3. UTF-32 or UCS-4: every character is encoded as a single 32-bit unit. Its main advantage is that every possible character fits in one fixed-size unit. But it's greedier than UTF-16 and UTF-8 because it takes 4 bytes (32 bits) for every character, even for the characters which don't need these 4 bytes of storage.
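The surrogate mechanism of UTF-16 can be sketched in a few lines of Python. The function names `surrogate_pair` and `utf16_units` are invented for this example; the arithmetic follows the usual UTF-16 rule (subtract 0x10000, then split the remaining 20 bits in two halves), and the result is compared with Python's own `utf-16-be` codec.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def utf16_units(text):
    """Return the 16-bit code units of text as hex strings (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]


emoji = "\U0001F600"  # a character outside the first 65 536 code points
print(utf16_units("H"))                                         # ['0048']
print(utf16_units(emoji))                                       # ['d83d', 'de00']
print([format(u, "04x") for u in surrogate_pair(ord(emoji))])   # ['d83d', 'de00']

# UTF-32 always uses 4 bytes, whatever the character
print(len("H".encode("utf-32-be")), len(emoji.encode("utf-32-be")))  # 4 4
```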

Now that we know the different Unicode Transformation Formats, we can point out some of their benefits and disadvantages:

UTF-8
  Benefits:
  • backward compatibility with ASCII (the same encoding size, 1 byte, for the first 128 characters)
  • ideal to store documents written mostly with ASCII characters. It takes less space than UTF-16 or UTF-32.
  Disadvantages:
  • some characters need more space (for example code points from U+0800 to U+FFFF, for which UTF-8 needs 3 bytes and UTF-16 only 2).
UTF-16
  Benefits:
  • better than UTF-8 to store texts in many Asian scripts (Samaritan, Mandaic, Tibetan...). UTF-16 needs only 2 bytes for these characters while UTF-8 requires 3 bytes. We can observe this difference by analyzing the Unicode code points from U+0800 to U+FFFF.
  Disadvantages:
  • not compatible with ASCII-encoded files. UTF-16 uses at least 2 bytes to encode a character while ASCII always uses 1 byte, so the same ASCII text encoded in UTF-16 is always bigger.
UTF-32
  Benefits:
  • covers every Unicode code point with a single code unit
  • the fixed character length simplifies string operations. We don't need to check the size of every character before indexing or comparing strings.
  Disadvantages:
  • each character is stored in 4 bytes while some of them (such as ASCII characters) need less in the other encodings
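The size trade-offs listed above can be checked with a short Python sketch. The function `encoded_sizes` and the two sample strings are assumptions made for this example; the little-endian codec variants are used so that no byte order mark is added to the result.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


english = "Hello, world"          # 12 ASCII characters
tibetan = "\u0f40\u0f41\u0f42"    # 3 Tibetan letters (U+0F40 to U+0F42)

print(encoded_sizes(english))  # {'utf-8': 12, 'utf-16': 24, 'utf-32': 48}
print(encoded_sizes(tibetan))  # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
```

UTF-8 wins for the ASCII text, UTF-16 for the Tibetan one, and UTF-32 is the largest in both cases.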

What is the Basic Multilingual Plane (BMP)?

Previously we saw that Unicode characters are identified by code points written as U+ followed by a number. We also saw that some characters are commonly used and others only rarely. They are stored in different planes. The plane of the common characters contains 65 536 code points and is called the Basic Multilingual Plane (BMP).

The number 65 536 is the count of distinct values that 2 bytes (16 bits) can hold. The characters outside this first plane are considered rarely used. The other planes are:
- 1 : Supplementary Multilingual Plane
- 2 : Supplementary Ideographic Plane
- 3 : Tertiary Ideographic Plane
- 4-13 : Unassigned
- 14 : Supplementary Special-purpose Plane
- 15-16 : Supplementary Private Use Area (A and B)
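Since every plane holds 65 536 code points, the plane of a character can be computed by dropping the lowest 16 bits of its code point. The sketch below is only an illustration: `plane_of`, `plane_name` and the partial `PLANE_NAMES` table are hypothetical names introduced here.

```python
# Partial table, limited to a few of the planes named in this article
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
}


def plane_of(ch):
    """Return the plane number of a character (code point divided by 65 536)."""
    return ord(ch) >> 16


def plane_name(ch):
    """Return a readable plane name, or a generic label for the other planes."""
    number = plane_of(ch)
    return PLANE_NAMES.get(number, f"plane {number}")


for ch in ("H", "\u0f40", "\U0001F600", "\U00020000", "\U000E0001"):
    print(f"U+{ord(ch):04X}", plane_of(ch), plane_name(ch))
```

With this helper, "H" and the Tibetan letter U+0F40 fall in plane 0, while U+1F600 is in plane 1 and U+20000 in plane 2.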

If you want to discover which characters are placed in each plane, please consult the Wikipedia page about Unicode characters.

In this article we saw the evolution of encoding. In the beginning, only English characters were available. ASCII defined only 128 values and was quickly extended by new, language-specific encodings. But those encodings couldn't be mixed in the same application. It's one of the reasons why an organization responsible for encoding standardization was created. Unicode, and with it UTF, makes it possible today to create web applications aimed at Internet users all over the world.

