In this article we will explore the world of character encoding. The first part sketches a short history, which helps to understand how things changed over the decades and why we use the UTF-* standards today. The following sections describe the newer standards in more detail, starting with Unicode and ending with UTF and its main variants.
A short history of encoding
Historically, the modern IT industry developed first in the USA. The English alphabet doesn't contain any accented letters, which is why the first character encodings covered unaccented letters only. The best known of these early character sets is ASCII (American Standard Code for Information Interchange). ASCII encodes 128 characters: digits (0-9), the letters a-z in lower and upper case, basic punctuation symbols, the space and a set of non-printable control characters. Each character is encoded as a 7-bit integer. For example, the letter H is encoded as 1001000 (decimal 72) and lowercase h as 1101000 (decimal 104). Stored in an 8-bit byte, they become 01001000 and 01101000.
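The bit patterns above can be checked with a few lines of Python. This is only a small sketch; the helper name ascii_bits is made up for this illustration.

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII pattern of a single character as a string."""
    code = ord(ch)  # numeric value of the character
    if code > 127:
        raise ValueError(f"{ch!r} is outside the ASCII range")
    return format(code, "07b")  # zero-padded to 7 binary digits


for ch in ("H", "h"):
    print(ch, ord(ch), ascii_bits(ch))
```

Padding to 8 digits instead of 7 (format(code, "08b")) gives the byte-sized form with a leading zero.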
But the IT revolution continued, and other countries, such as Japan and France, became important players in the market. French and Japanese use characters that English does not, so everybody started to create their own character set. In addition, computers already worked with 8-bit bytes, so the values 128-255 could finally be used. The problem is that these values correspond to different letters in different character sets. For example, in ISO 8859-1, byte 249 corresponds to "ù", while in ISO 8859-2 the same byte prints "ů".
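The ambiguity of byte 249 is easy to reproduce. The sketch below decodes the same single byte with two of Python's built-in codecs; the variable names are chosen only for this example.

```python
raw = bytes([249])  # one byte with the value 249 (0xF9)

as_latin1 = raw.decode("iso-8859-1")  # Western European character set
as_latin2 = raw.decode("iso-8859-2")  # Central European character set

# Same byte, two different letters depending on the assumed character set.
print(as_latin1, as_latin2)
```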
To bring everybody to an agreement, the Unicode Consortium was founded. The purpose of Unicode is to provide a single character set covering the world's writing systems. Thanks to it, developers can mix different languages on the same page, even Japanese with Polish and French. Unicode also changes the way characters are interpreted: it doesn't map characters to bytes but to code points.
What are Unicode code points?
Unicode introduces a kind of map in which every letter of most of the world's alphabets has its own entry. Each entry contains two elements: a code point and the character it identifies. A code point is written as U+ followed by a hexadecimal number. For example, U+0048 identifies the letter "H". Note that Unicode distinguishes upper- and lowercase letters: the lowercase "h" is U+0068.
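To see the code point of a character, Python's ord function and the standard unicodedata module are enough. The helper code_point_label below is a hypothetical name used only in this sketch.

```python
import unicodedata


def code_point_label(ch):
    """Format the code point of a character in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("H", "h"):
    # unicodedata.name returns the official Unicode character name
    print(code_point_label(ch), unicodedata.name(ch))
```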
A code point is an abstract number. How it is stored as bytes depends on the encoding. The earliest approach (UCS-2) stored every code point in 2 bytes, so the letters "H" and "h" are represented by "00 48 00 68". But 2 bytes per character doubles the size of plain English text and is not compatible with existing ASCII data, so not every developer accepted it. This led to other ways of encoding Unicode, called UTF (Unicode Transformation Format or UCS Transformation Format; both terms name the same concept).
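The two-byte form "00 48 00 68" can be reproduced as follows. Python has no codec named UCS-2, so this sketch assumes the utf-16-be codec as a stand-in, which produces the same bytes for these two letters.

```python
text = "Hh"

# Big-endian 16-bit units: each of these letters takes exactly two bytes.
data = text.encode("utf-16-be")
two_byte_form = " ".join(f"{byte:02x}" for byte in data)

print(two_byte_form)
```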
The most popular UTF encoding is UTF-8. It doesn't store each code point in two bytes. Instead, the first 128 code points (0 to 127) are stored in a single byte, and the rest (from 128 upwards) are stored in 2, 3 or 4 bytes. Thanks to that, text written only with ASCII characters doesn't require any transformation to be valid UTF-8. So we can say that a UTF is a mapping between Unicode code points (U+0048) and their byte-sequence representations (the single byte 0x48 in UTF-8).
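The variable length of UTF-8 shows up as soon as we encode characters from different ranges. The sample characters below are picked arbitrarily for this sketch, and utf8_hex is a hypothetical helper name.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))


samples = [
    "H",           # U+0048, ASCII range
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji from a supplementary plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s):", utf8_hex(ch))
```

The ASCII letter keeps its single byte, which is why pure ASCII files are already valid UTF-8.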
UTF-8 is one of several ways to transform Unicode code points into a sequence of bytes. Another one is UCS-2, which stores every character in two bytes. But there are other members of the "UTF-* family" too:
- UTF-7 (7-bit Unicode Transformation Format): unlike e-mail content, e-mail MIME headers can't contain values above the ASCII range, and SMTP, the e-mail transfer standard, doesn't guarantee proper 8-bit transport. UTF-7 was proposed for that case: it represents Unicode text using only 7-bit ASCII characters.
- UTF-16 (16-bit Unicode Transformation Format): this variant introduces a new idea. The first 65 536 code points, which include the most commonly used characters, are each encoded with a single 16-bit code unit. The remaining characters (about 1 million) are stored as a pair of 16-bit code units. The units of such a pair are called surrogates, and neither of them represents a character on its own. These remaining characters are mostly rarely used symbols, for example historic scripts.
- UTF-32 or UCS-4: every character is encoded as a single 32-bit unit. Its main advantage is that every code point has the same fixed size. But it's greedier than UTF-16 and UTF-8 because it takes 4 bytes (32 bits) for every character, even for characters which don't need that much storage.
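The difference between UTF-16 and UTF-32 can be observed with Python's built-in codecs. This is a minimal sketch; the helper utf16_units and the sample emoji are assumptions made for this illustration.

```python
def utf16_units(ch):
    """Split the big-endian UTF-16 encoding of a character into 16-bit units."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


letter = "H"          # U+0048, fits in one 16-bit unit
emoji = "\U0001F600"  # U+1F600, needs a surrogate pair in UTF-16

for ch in (letter, emoji):
    units = [f"{unit:04X}" for unit in utf16_units(ch)]
    utf32_size = len(ch.encode("utf-32-be"))
    print(f"U+{ord(ch):04X}", "UTF-16 units:", units, "UTF-32 bytes:", utf32_size)
```

Both characters take 4 bytes in UTF-32, while UTF-16 uses one unit for the letter and two for the emoji.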
Now that we know the different Unicode Transformation Formats, we can point out some of their benefits and disadvantages. They are listed in the table below:
What is the Basic Multilingual Plane (BMP)?
Previously we saw that Unicode characters are identified by a code point written as U+ followed by a number. We also saw that some characters are commonly used and others only rarely. They are grouped into planes. The plane of the common characters contains 65 536 code points and is called the Basic Multilingual Plane (BMP).
65 536 is the number of distinct values that fit in 2 bytes (2^16). The characters outside this first plane (plane 0) are considered rarely used. The other planes are:
- 1 : Supplementary Multilingual Plane
- 2 : Supplementary Ideographic Plane
- 3 : Tertiary Ideographic Plane
- 4-13 : Unassigned
- 14 : Supplementary Special-purpose Plane
- 15-16 : Supplementary Private Use Area planes (A and B)
If you want to discover which characters are placed in each plane, please consult the Wikipedia page about Unicode characters.
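Since each plane holds 65 536 (2^16) code points, the plane number of a character is simply its code point divided by 65 536. The sketch below uses a hypothetical helper called plane_of and a few arbitrarily chosen characters.

```python
def plane_of(ch):
    """Return the Unicode plane number of a character."""
    # Dropping the low 16 bits is the same as an integer division by 65 536.
    return ord(ch) >> 16


samples = [
    "H",           # U+0048, Basic Multilingual Plane
    "\U0001F600",  # U+1F600, an emoji
    "\U00020000",  # U+20000, a CJK ideograph
]

for ch in samples:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```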
In this article we saw the evolution of encoding. In the beginning, only English characters were available. ASCII used only 128 values and was quickly extended by new, language-specific encodings. But those encodings could not be mixed in the same application. It's one of the reasons why an organization responsible for encoding standardization was created. Unicode, and the UTF encodings that followed, make it possible today to create web applications for a global audience.