Hello friends, today we are going to discuss encoding schemes, their usage, and their differences. Although this is a simple topic, it is easy to get confused by the types and the pros/cons of the various encoding schemes.
We’ll cover the basics of this topic, so let’s move step by step.
What is Encoding/Decoding?
We all communicate with each other using some language, e.g. English, Spanish or German. Each language has a set of characters which we use to write, read and understand it.
Do you think that a computer is also able to understand these characters in their simple form?
The answer is no.
Computers don’t understand these characters. Computers understand only one language, i.e. 0/1. These 0s and 1s are signals which are used to maintain a state in computer memory. This state can be accessed later on and transformed into the desired result.
Every character which we type or see on a computer is saved somewhere in the form of 0s and 1s. For example, if I type my name “Deepak Gera”, this name will be converted into a stream of 0s and 1s by some algorithm, and this stream will then be stored somewhere in the computer.
Later on, when I try to access my name, this stream will be read from its memory location and transformed back into characters using the same algorithm that was used for the original transformation.
“The process of transforming characters into a stream of bytes is called Encoding”
“The process of transforming encoded bytes into characters is called Decoding”
Microsoft says “Encoding is the process of transforming a set of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes into a set of Unicode characters.”
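To make the two definitions concrete, here is a minimal Python sketch of a round trip. UTF-8 is only an illustrative choice of encoding here, and the variable names are my own.

```python
# Round trip: characters -> bytes -> characters.
# UTF-8 is an illustrative choice; any encoding that covers these characters would do.
name = "Deepak Gera"

encoded = name.encode("utf-8")                 # encoding: characters -> bytes
bits = " ".join(f"{b:08b}" for b in encoded)   # the same bytes written as 0s and 1s
decoded = encoded.decode("utf-8")              # decoding: bytes -> characters

print(encoded)
print(bits)
print(decoded)
```

The round trip only works because the same encoding is used in both directions; decoding with a different encoding can produce wrong characters or an error.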
Most popular types of Encoding Schemes
There are many encoding schemes, which were invented and evolved over time based on new requirements and on the shortcomings of previous ones. Let’s discuss the most popular schemes and see how they evolved.
ASCII stands for American Standard Code for Information Interchange. It assigns a numeric code to the letters, digits, punctuation marks and control characters found on an English keyboard.
ASCII uses 7 bits to represent a character. With 7 bits we can have a maximum of 2^7, i.e. 128, distinct characters. When ASCII is stored in 8-bit bytes, the 8th bit was historically often used as a parity bit for detecting errors.
The ASCII chart maps each character to its corresponding numeric value. This numeric value is then converted into a 7-bit binary value and stored/transferred.
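The character-to-number-to-bits mapping can be sketched in a few lines of Python. The helper name `ascii_bits` is hypothetical and exists only for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary form of an ASCII character's numeric value."""
    code = ord(ch)                      # numeric value from the character table
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")          # zero-padded to 7 binary digits

for ch in ("A", "a", "0"):
    print(ch, ord(ch), ascii_bits(ch))
```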
But 128 characters are not enough to cover a complete language, so extended variants of ASCII started using the 8th bit to encode more characters (to support “é” in French, for example). Using just one extra bit doubled the size of the original ASCII table to 256 characters (2^8 = 256 characters).
The additional 128 characters are known as extended ASCII. Which characters they contain depends on the particular 8-bit character set (code page) in use.
Extended ASCII solves the problem for languages that are based on the Latin alphabet. But what about other languages, which have completely different character sets? How will those languages be encoded?
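A small Python sketch shows both the gain and the limit of an 8-bit character set. It assumes ISO 8859-1 (Latin-1) as the extended set, which is my choice for illustration; the helper `fits_latin1` is hypothetical.

```python
# Assumption: Latin-1 (ISO 8859-1) stands in for "extended ASCII" here.
e_acute = "\u00e9"                           # é
latin1_bytes = e_acute.encode("latin-1")     # one byte, using the 8th bit

def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(latin1_bytes)
print(fits_latin1("caf\u00e9"))              # Latin-based text
print(fits_latin1("\u0939\u093f"))           # Devanagari text
```

Text in a Latin-based language fits, while text in a script such as Devanagari cannot be represented at all in this 8-bit set.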
That’s the reason behind Unicode. Unicode doesn’t contain every character from every language, but it sure contains a gigantic number of characters.
You can check entire Unicode character set here.
In the Unicode standard, each character is assigned a number called a “code point”. The Unicode code space is divided into 17 groups. Each group is a continuous range of 65,536 (2^16) code points and is called a “plane”.
There are 17 planes, identified by the numbers 0 to 16.
Plane 0 is called the BMP (Basic Multilingual Plane).
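Since each plane holds 2^16 code points, the plane of a character can be found by dropping the lowest 16 bits of its code point. Here is a minimal Python sketch; `plane_of` is a hypothetical helper name.

```python
import sys

def plane_of(ch: str) -> int:
    """Return the Unicode plane (0 to 16) that a single character belongs to."""
    return ord(ch) >> 16        # each plane spans 0x10000 code points

TOTAL_CODE_POINTS = 17 * 2**16  # 17 planes of 65,536 code points each

print(plane_of("A"))            # Latin letter, in the BMP
print(plane_of("\u4e2d"))       # CJK ideograph, also in the BMP
print(plane_of("\U0001F600"))   # emoji, in a supplementary plane
print(TOTAL_CODE_POINTS, hex(sys.maxunicode))
```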
Basic Multilingual Plane:
The first plane, plane 0, the Basic Multilingual Plane (BMP) contains characters for almost all modern languages, and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing. Most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.
Unicode is a superset of ASCII. We need encoding schemes which can transform Unicode code points into sequences of bytes. Below are a few popular schemes which are used for this.
UTF-7, UTF-8, UTF-16, UTF-32
Let’s discuss these
UTF-7 is an encoding that is used to encode Unicode characters by using only the ASCII characters. This encoding has the advantage that even in environments or operating systems that understand only 7-bit ASCII, Unicode characters can be represented and transferred.
For example, some Internet protocols, such as SMTP for email, only allow the 128 ASCII characters; bytes with the 8th bit set are not allowed. All of the other UTF encodings use at least 8 bits, so they cannot be used for such purposes.
All characters which exist in ASCII are written as their normal ASCII codes. All other characters are encoded and also converted to ASCII characters. A + marks the beginning of such an encoded run, and a - (or any other character which cannot occur in the encoding) marks the end.
The German word for cheese, “Käse”, for instance, would be coded as K+AOQ-se. The ASCII characters K, s and e stay the same, while the ä is converted to AOQ (other ASCII characters). The beginning and the end of this coded run are marked with + and -. Decoding works the same way in reverse.
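The “Käse” example can be reproduced with Python’s built-in `utf-7` codec. This is only a sketch to check the example above, not a recommendation to use UTF-7.

```python
word = "K\u00e4se"                      # "Käse"

utf7 = word.encode("utf-7")             # ASCII-only bytes; ä becomes a +...- run
only_ascii = all(b < 128 for b in utf7) # every byte fits in 7 bits
round_trip = utf7.decode("utf-7")       # decoding restores the original word

print(utf7)
print(only_ascii)
print(round_trip)
```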
The main advantage of UTF-7 is its compatibility with the ASCII character set.
Because of issues with robustness and security, you should not use UTF-7 encoding in 8-bit environments where UTF-8 encoding can be used instead.
The main difference between the UTF-8, UTF-16 and UTF-32 character encodings is how many bytes each requires to represent a character.
UTF-8 uses a minimum of one byte, while UTF-16 uses a minimum of 2 bytes. If a character’s code point is greater than 127, the maximum value that fits in 7 bits, UTF-8 takes 2, 3 or 4 bytes, whereas UTF-16 takes either two or four bytes. On the other hand, UTF-32 is a fixed-width encoding scheme and always uses 4 bytes to encode a Unicode code point.
The fundamental difference between UTF-32 and UTF-8/UTF-16 is that the former is a fixed-width encoding scheme, while the latter two are variable-length encodings.
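The byte counts are easy to compare in Python. The sketch below uses the little-endian codec variants so that no byte order mark is added to the output; the sample characters and the helper `encoded_lengths` are my own choices.

```python
# Sample characters: A, é, € and an emoji outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch: str) -> tuple:
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes, UTF-16 jumps from 2 to 4 only for the character outside the BMP, and UTF-32 stays at 4 throughout.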
UTF-8 advantages:
- Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte, which is identical to the US-ASCII representation. This way all US-ASCII strings are valid UTF-8, which provides decent backwards compatibility in many cases.
- No null bytes (except for the null character itself), which allows the use of null-terminated strings. This introduces a great deal of backwards compatibility too.
- UTF-8 is independent of byte order, so you don’t have to worry about the Big Endian / Little Endian issue.
UTF-8 disadvantages:
- Many common characters have different lengths, which makes indexing by code point and counting code points much slower.
- Even though byte order doesn’t matter, UTF-8 text sometimes still has a BOM (byte order mark), which serves to signal that the text is encoded in UTF-8 but also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.
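The ASCII compatibility and the BOM problem described for UTF-8 can both be observed with Python’s standard codecs. In this sketch, `utf-8-sig` is the codec name Python uses for UTF-8 with a BOM; the variable names are my own.

```python
import codecs

ascii_text = "plain ASCII"

# ASCII-only text produces identical bytes in ASCII and UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# "utf-8-sig" prepends the UTF-8 BOM (bytes EF BB BF) when encoding.
with_bom = ascii_text.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# An ASCII-only decoder now fails, although the text itself is pure ASCII.
try:
    with_bom.decode("ascii")
    ascii_compatible = True
except UnicodeDecodeError:
    ascii_compatible = False

stripped = with_bom.decode("utf-8-sig")   # BOM removed
kept = with_bom.decode("utf-8")           # BOM kept as U+FEFF

print(same_bytes, has_bom, ascii_compatible)
print(repr(stripped), repr(kept))
```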
UTF-16 advantages:
- BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese, and Japanese, can be represented with 2 bytes. This speeds up indexing and counting code points when the text does not contain supplementary characters.
- Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of 16-bit chars as the primitive component of the string.
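The pairs of 16-bit values used for supplementary characters are called surrogate pairs. The arithmetic behind them can be sketched in Python as follows; `to_surrogates` is a hypothetical helper, and the result is compared with what Python’s own `utf-16-be` codec produces.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
print("\U0001F600".encode("utf-16-be").hex())
```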
Note: .NET strings are UTF-16 because that’s an excellent fit with the Windows operating system encoding, so no conversion is required.
But why UTF-16?
This is because of history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode had a code space of only 65,536 code points, these days called UCS-2. At that time two bytes were enough to cover all these characters, so UCS-2 was adopted as the standard.
To maintain compatibility with the Windows UCS-2 encoding, UTF-16 was later adopted as the new standard for in-memory strings.
UTF-16 disadvantages:
- Lots of null bytes in US-ASCII strings, which means no null-terminated byte strings and a lot of wasted memory.
- Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn’t. This means programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
- It’s variable length, so counting or indexing code points is costly, though less so than in UTF-8.
UTF-32 advantages:
- It has a fixed length of 4 bytes per code point, which speeds up indexing by code point.
UTF-32 disadvantages:
- It takes the most memory (4 bytes for every code point) compared to UTF-8/16.
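The trade-offs listed above can be seen side by side in a short Python sketch. The helper `code_point_at` is hypothetical, and the little-endian codecs are used only to avoid a byte order mark in the output.

```python
text = "a\U0001F600b"                      # 3 code points, one outside the BMP

# UTF-16: ASCII characters carry a null byte each.
utf16_ascii = "AB".encode("utf-16-le")

# UTF-16: the emoji needs two 16-bit units, so units != code points.
utf16_units = len(text.encode("utf-16-le")) // 2

# UTF-32: every code point is exactly 4 bytes, so indexing is plain arithmetic.
utf32 = text.encode("utf-32-le")

def code_point_at(utf32_le: bytes, index: int) -> str:
    """Return the character at a code point index in UTF-32-LE data."""
    start = index * 4
    return utf32_le[start:start + 4].decode("utf-32-le")

print(utf16_ascii)
print(len(text), utf16_units, len(utf32))
print(code_point_at(utf32, 2))
```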
UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages, such as those written in CJK scripts, are more compact in UTF-16 than in UTF-8). For a specific language, a dedicated legacy encoding can be more compact still.
UTF-16 is commonly used for in-memory strings on platforms such as Windows and .NET because, for BMP text, each character is a single 16-bit unit, which makes it fast to parse and to look up in Unicode character class and other tables. The Unicode string functions in Windows have used UTF-16 for years.
That’s all about encoding for now. In upcoming articles we’ll discuss more about encoding implementation.