
What is Base64 Encoding and how it works

Hello friends, you might have heard the term Base64 encoding here and there. It is another kind of encoding scheme that is worth understanding.

Today we are going to discuss Base64 and its usage.

Base64 Encoding

In simple terms, Base64 encoding is used in environments where, perhaps for legacy reasons, the storage or transfer of data is limited to ASCII characters.

According to Wikipedia:

Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation.

When you have binary data that you want to ship across a network, you generally don't do it by streaming the raw bits and bytes over the wire, because many media are designed for streaming text. Some protocols may interpret your binary data as control characters (like a modem), or your binary data could be corrupted because the underlying protocol thinks you've entered a special character combination (like how FTP translates line endings).

So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings.

Image and file data are commonly transferred using Base64 encoding.
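For instance, here is a minimal sketch of such a round trip in .NET. The byte values (the PNG file signature) and the file name in the comment are only illustrative assumptions, not from the original article:

using System;

class Base64RoundTrip
{
    static void Main()
    {
        // Any binary data; in practice this could come from File.ReadAllBytes("photo.png")
        byte[] binaryData = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

        // Encode: bytes -> ASCII-safe Base64 text
        string base64 = Convert.ToBase64String(binaryData);
        Console.WriteLine(base64);                                    // iVBORw0KGgo=

        // Decode: Base64 text -> original bytes
        byte[] roundTripped = Convert.FromBase64String(base64);
        Console.WriteLine(roundTripped.Length == binaryData.Length);  // True
    }
}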

How does it work?

Base64 encoding takes three bytes at a time, each consisting of eight bits, and represents them as four printable characters from the ASCII standard.

The input bytes are processed in groups of 3 (24 bits), and each group is split into four 6-bit values. If the number of input bytes is not a multiple of 3, padding is added.
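To make the grouping concrete, here is a rough hand-rolled sketch (not the library implementation) that encodes one full 3-byte group into four Base64 characters. The input "Man" and the class name are illustrative choices of mine:

using System;

class ManualBase64Group
{
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static void Main()
    {
        // One full 3-byte (24-bit) group: "Man"
        byte[] group = { (byte)'M', (byte)'a', (byte)'n' };

        // Pack the three bytes into one 24-bit value
        int bits = (group[0] << 16) | (group[1] << 8) | group[2];

        // Slice the 24 bits into four 6-bit indexes and map them to the Base64 alphabet
        char c1 = Alphabet[(bits >> 18) & 0x3F];
        char c2 = Alphabet[(bits >> 12) & 0x3F];
        char c3 = Alphabet[(bits >> 6) & 0x3F];
        char c4 = Alphabet[bits & 0x3F];

        Console.WriteLine($"{c1}{c2}{c3}{c4}");            // TWFu
        Console.WriteLine(Convert.ToBase64String(group));  // TWFu (same result)
    }
}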

Now let's see how it works with examples:

Encode “A” with UTF7 (1 Byte)

I've written a very simple piece of code to show this with an example:

static void Main(string[] args)
{
    // A single ASCII character to encode
    string myData = "A";
    // UTF-7 gives one data byte (65) for "A"
    byte[] dataBytes = System.Text.Encoding.UTF7.GetBytes(myData);
    // Prints the Base64 string "QQ=="
    Console.WriteLine(System.Convert.ToBase64String(dataBytes));
    Console.ReadLine();
}

In the first example I am converting "A". Before discussing how it works, let's look at the output. If I run this code, the output will be as shown below.

[Image: output of the single-byte Base64 encoding example]

Now let’s see how this is calculated.

If you do not know how to encode and decode in C#, you can check the section on encoding using C# later in this post.

In the first line of code, I am assigning the character "A" to a variable.

In the next line, I am using UTF-7 encoding to get the data bytes for "A".

[Image: code and debugger output showing the single data byte]

  1. As we can see, with UTF-7 we get 1 data byte, i.e. 65.

Binary of 65 = 01000001

  2. As we discussed, the data bytes should be a multiple of 3. Our data has only one byte, so we need to add 2 bytes of padding.

2 bytes of padding = 00000000 00000000

So total data bytes are: 01000001 00000000 00000000

  3. Now split the above bits into groups of 6.

First 6-bit group – 010000

Second group – 010000

Third group – 000000

Fourth group – 000000

  4. Now calculate the decimal value of each group.

First group – 010000 – 16 (Decimal)

Second group – 010000 – 16 (Decimal)

Third group – 000000 – It is padding, so the padding symbol is "="

Fourth group – 000000 – It is padding, so the padding symbol is "="

  5. Now use the chart below to get the character corresponding to each decimal value.

[Image: Base64 character set / index table]

First group – 010000 – 16 (Decimal) – Q

Second group – 010000 – 16 (Decimal) – Q

Third group – 000000 – It is padding, so the padding symbol is "="

Fourth group – 000000 – It is padding, so the padding symbol is "="

So the calculated Base64 string is "QQ==".
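You can double-check this result by feeding the single byte 65 straight to Convert.ToBase64String; a minimal check:

using System;

class VerifySingleByte
{
    static void Main()
    {
        byte[] dataBytes = { 65 };                             // the single UTF-7 byte for "A"
        Console.WriteLine(Convert.ToBase64String(dataBytes));  // prints QQ==
    }
}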

 

UTF7 Multibyte Data

Let's consider an example where we have multiple bytes. Change the code to use a non-ASCII character (the five data bytes shown below correspond to the cent sign "¢" in UTF-7).

 

 

The code is the same very simple example as before, with only the input character changed.

Let's look at the output. If I run this code, the output will be as shown below.

[Image: output of the multibyte Base64 encoding example]

 

Now let’s see how this is calculated.

In the first line of code, I am assigning the non-ASCII character to a variable.

In the next line, I am using UTF-7 encoding to get the data bytes.

[Image: code and debugger output showing the five data bytes]

  1. As we can see, with UTF-7 we get 5 data bytes, i.e. 43, 65, 75, 73, 45. Below are the corresponding bits:

43-00101011

65-01000001

75-01001011

73-01001001

45-00101101

  2. As we discussed, the data bytes should be a multiple of 3. Our data has 5 bytes, so we need to add 1 byte of padding.

1 byte of padding = 00000000

So total data bytes are:

00101011 01000001 01001011 01001001 00101101 00000000

  3. Now split the above bits into groups of 6.

First 6-bit group – 001010

Second group – 110100

Third group – 000101

Fourth group – 001011

Fifth group – 010010

Sixth group – 010010

Seventh group – 110100

Eighth group – 000000

  4. Now calculate the decimal value of each group.

First group – 001010 – 10

Second group – 110100 – 52

Third group – 000101 – 5

Fourth group – 001011 – 11

Fifth group – 010010 – 18

Sixth group – 010010 – 18

Seventh group – 110100 – 52

Eighth group – 000000 – It is padding, so the padding symbol is "="

  5. Now use the chart above to get the character corresponding to each decimal value.

First group – 001010 – 10 – K

Second group – 110100 – 52 – 0

Third group – 000101 – 5 – F

Fourth group – 001011 – 11 – L

Fifth group – 010010 – 18 – S

Sixth group – 010010 – 18 – S

Seventh group – 110100 – 52 – 0

Eighth group – 000000 – It is padding, so the padding symbol is "="

So the calculated Base64 string is "K0FLSS0=".
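As before, you can verify the hand calculation by passing the five bytes straight to Convert.ToBase64String; a minimal check:

using System;

class VerifyMultibyte
{
    static void Main()
    {
        // The five UTF-7 data bytes from the example above
        byte[] dataBytes = { 43, 65, 75, 73, 45 };
        Console.WriteLine(Convert.ToBase64String(dataBytes));  // prints K0FLSS0=
    }
}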


Encoding using C#

Hi friends, we discussed encoding and decoding in the previous article, so in this article we are going to see how we can implement encoding using C#.

Let's first summarize encoding and decoding:

Computers don't understand the characters of human languages. Computers understand only one language, i.e. 0 and 1. These 0s and 1s are electrical signals used to maintain a state in computer memory. This state can be accessed later and transformed into the desired result.

Every character we type or see on a computer is saved somewhere in the form of 0s and 1s. For example, if I type my name "Deepak Gera", the name will be converted into a stream of 0s and 1s using some algorithm, and this stream will be stored somewhere in the computer.

Later, when I try to access my name, this stream will be read from the memory location and transformed back into characters using the same algorithm that was used previously.

"The process of transforming characters into a stream of bytes is called Encoding."

"The process of transforming encoded bytes back into characters is called Decoding."

Encoding using C#

Create a console application and write the following code in the Program.cs file.

class Program
{
    static void Main(string[] args)
    {
        string myData = "A";

        // Encode the string into bytes
        byte[] encodedData = Encode(myData);
        // Print the byte values (writing the array directly would only print its type name)
        Console.WriteLine($"Encoded Data: {string.Join(",", encodedData)}");

        // Decode the bytes back into the original string
        string origData = Decode(encodedData);
        Console.WriteLine($"Original Data: {origData}");
        Console.ReadLine();
    }

    public static byte[] Encode(string text)
    {
        // UTF-8 turns the text into a sequence of bytes
        byte[] dataBytes = System.Text.Encoding.UTF8.GetBytes(text);
        return dataBytes;
    }

    public static string Decode(byte[] dataBytes)
    {
        // The same encoding must be used to turn the bytes back into text
        string returntext = System.Text.Encoding.UTF8.GetString(dataBytes);
        return returntext;
    }
}

The above code has 2 methods: one method for encoding in C# and another for decoding.

As you can see, I've used the UTF-8 encoding scheme, so when I debug this code I get 1 byte for the character "A". The ASCII code for "A" is 65. If I use UTF-7, I'll get the same result, since UTF-7 also uses 1 byte for this character.

[Image: debugger output showing the single byte for "A"]

Let's check another character which takes 2 bytes. The "¢" symbol takes 2 bytes in UTF-8, so let's check it with the UTF-8 scheme. We can see below that this character indeed takes 2 bytes.

If we encode the same character using the UTF-7 scheme, it will be converted using ASCII symbols, so it takes more bytes; this is why UTF-7 is considered less efficient for multi-byte characters.

[Image: debugger output showing the bytes for "¢"]
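If you prefer to see the difference in code rather than in the debugger, here is a small sketch (assuming "¢" as the test character) that compares the byte counts under UTF-8 and UTF-7:

using System;
using System.Text;

class CompareCent
{
    static void Main()
    {
        string cent = "\u00A2";  // "¢"

        // UTF-8 uses two bytes: C2 A2
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(cent);
        Console.WriteLine($"UTF-8: {utf8Bytes.Length} bytes -> {BitConverter.ToString(utf8Bytes)}");

        // UTF-7 falls back to ASCII symbols ("+AKI-"), so it needs five bytes
        // (note: UTF-7 is marked obsolete in recent .NET versions; shown here only to match the article)
        byte[] utf7Bytes = Encoding.UTF7.GetBytes(cent);
        Console.WriteLine($"UTF-7: {utf7Bytes.Length} bytes -> {Encoding.ASCII.GetString(utf7Bytes)}");
    }
}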

In the same way, you can perform encoding in C# using different schemes:

ASCII

byte[] dataBytes = System.Text.Encoding.ASCII.GetBytes(text);

UTF-16 (Little Endian)

byte[] dataBytes = System.Text.Encoding.Unicode.GetBytes(text);

UTF-16 (Big Endian)

byte[] dataBytes = System.Text.Encoding.BigEndianUnicode.GetBytes(text);

On little-endian machines, the least significant byte of a multi-byte value is stored first. On big-endian machines, the most significant byte is stored first. You can read more about big endian and little endian separately if you are curious.

UTF-32

byte[] dataBytes = System.Text.Encoding.UTF32.GetBytes(text);
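Here is a small sketch that prints the byte output of several of these schemes side by side; the test character "¢" and the helper method Print are my own additions for illustration:

using System;
using System.Text;

class CompareSchemes
{
    static void Main()
    {
        string text = "\u00A2";  // "¢", code point U+00A2

        Print("ASCII", Encoding.ASCII.GetBytes(text));               // 3F ("?": not representable in ASCII)
        Print("UTF-8", Encoding.UTF8.GetBytes(text));                // C2-A2
        Print("UTF-16 LE", Encoding.Unicode.GetBytes(text));         // A2-00 (least significant byte first)
        Print("UTF-16 BE", Encoding.BigEndianUnicode.GetBytes(text)); // 00-A2 (most significant byte first)
        Print("UTF-32", Encoding.UTF32.GetBytes(text));              // A2-00-00-00
    }

    static void Print(string name, byte[] bytes) =>
        Console.WriteLine($"{name,-10}: {bytes.Length} bytes -> {BitConverter.ToString(bytes)}");
}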

 

 

You can check for yourself and see how the results of these schemes differ.


What are Encoding Schemas, Types & Differences

Hello friends, today we are going to discuss encoding schemes, their usage, and their differences. Although this is a simple topic, one can easily get confused by the types and the pros and cons of the various encoding schemes.

We’ll cover basic information about this topic. So let’s move step by step.

What is Encoding\Decoding

We all communicate with each other using some language, e.g. English, Spanish, German, etc. Each language has a set of characters which we use to write, read, and understand.

Do you think a computer is also able to understand these characters in their raw form?

The answer is No.

Computers don't understand these characters. Computers understand only one language, i.e. 0 and 1. These 0s and 1s are signals used to maintain a state in computer memory. This state can be accessed later and transformed into the desired result.

Every character we type or see on a computer is saved somewhere in the form of 0s and 1s. For example, if I type my name "Deepak Gera", the name will be converted into a stream of 0s and 1s using some algorithm, and this stream will be stored somewhere in the computer.

Later, when I try to access my name, this stream will be read from the memory location and transformed back into characters using the same algorithm that was used previously.

"The process of transforming characters into a stream of bytes is called Encoding."

"The process of transforming encoded bytes back into characters is called Decoding."

Microsoft says: "Encoding is the process of transforming a set of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes into a set of Unicode characters."

Most popular types of Encoding Schemas

There are many encoding schemes which have been devised and evolved over time based on requirements and the shortcomings of previous ones. Let's discuss the most popular schemes and see how they evolved.

ASCII

ASCII stands for American Standard Code for Information Interchange. It is basically used for representing the characters and numerals on a keyboard. ASCII codes are stored as 8-bit sequences and are used to identify letters, numbers, and special characters.

ASCII uses 7 bits to represent a character. With 7 bits, we can have a maximum of 2^7, i.e. 128, distinct characters. The last (8th) bit was historically used as a parity bit for error detection.

Below is the ASCII chart showing the mapping between each character and its corresponding numeric value. This numeric value is then converted into a 7-bit binary value and stored or transferred.

[Image: ASCII chart mapping characters to numeric values]
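As a quick sanity check of this mapping, a couple of lines of C# print the numeric value and the 7-bit binary form of a character:

using System;

class AsciiValue
{
    static void Main()
    {
        char c = 'A';
        int code = (int)c;                                         // 65
        string bits = Convert.ToString(code, 2).PadLeft(7, '0');   // 1000001
        Console.WriteLine($"{c} -> {code} -> {bits}");
    }
}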

But 128 characters are not enough to cover every language, so extended ASCII started using the 8th bit as well to encode more characters (to support "é" in French, for example). Using just one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256).

Below is the list of extended characters supported by extended ASCII.

[Image: extended ASCII character chart]

UTF

Extended ASCII solves the problem for languages that are based on the Latin alphabet, but what about other languages with completely different character sets? How will those languages be encoded?

That's the reason behind Unicode. Unicode doesn't contain every character from every language, but it does contain a gigantic number of characters.

You can browse the entire Unicode character set on the official Unicode website.

In the Unicode standard, each character is assigned a number called a "code point". The Unicode code space is divided into 17 groups, each of which is a continuous range of 65,536 (2^16) code points. Each group is called a "plane".

There are 17 planes, identified by the numbers 0 to 16.

Plane 0 is called the BMP (Basic Multilingual Plane).

Basic Multilingual Plane:

The first plane, plane 0, the Basic Multilingual Plane (BMP), contains characters for almost all modern languages and a large number of symbols. A primary objective of the BMP is to support the unification of prior character sets as well as characters for writing systems. Most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.
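For illustration, the plane of any character can be computed by dividing its code point by 0x10000 (65,536). The sample characters below are my own examples, not from the original text:

using System;

class UnicodePlane
{
    static void Main()
    {
        PrintPlane("A");    // U+0041  -> plane 0 (BMP)
        PrintPlane("€");    // U+20AC  -> plane 0 (BMP)
        PrintPlane("😀");   // U+1F600 -> plane 1 (Supplementary Multilingual Plane)
    }

    static void PrintPlane(string s)
    {
        int codePoint = char.ConvertToUtf32(s, 0);  // handles surrogate pairs
        int plane = codePoint / 0x10000;            // 65,536 code points per plane
        Console.WriteLine($"{s} -> U+{codePoint:X4} -> plane {plane}");
    }
}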

Unicode is a superset of the ASCII character set. We need encoding schemes that can transform Unicode code points into sequences of bytes and back. Below are a few popular schemes used for this:

UTF-7, UTF-8, UTF-16, UTF-32

Let’s discuss these

UTF-7

UTF-7 is an encoding that is used to encode Unicode characters by using only the ASCII characters. This encoding has the advantage that even in environments or operating systems that understand only 7-bit ASCII, Unicode characters can be represented and transferred.

For example, some Internet protocols, such as SMTP for email, only allow the 128 ASCII characters; bytes with the high bit set are not allowed. All of the other UTF encodings use at least 8 bits, so they cannot be used directly for such purposes.

Characters which exist in ASCII are output as their normal ASCII codes. All other characters are encoded into sequences of ASCII characters. The + marks the beginning of such an encoded sequence, and the – (or any other character which cannot occur in the encoding) marks the end.

The German word for cheese, "Käse", for instance, would be encoded as K+AOQ-se. The ASCII characters K, s, and e stay the same, while the ä is converted to AOQ (other ASCII characters). The beginning and the end of this encoded sequence are marked with + and -. Decoding works the same way in reverse.
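You can try to reproduce this with .NET's UTF7Encoding; a small sketch (note that UTF-7 is marked obsolete in recent .NET versions, so this is only for illustration):

using System;
using System.Text;

class Utf7Kaese
{
    static void Main()
    {
        string word = "Käse";

        // The non-ASCII "ä" becomes "+AOQ-", everything else stays plain ASCII
        byte[] utf7Bytes = Encoding.UTF7.GetBytes(word);
        Console.WriteLine(Encoding.ASCII.GetString(utf7Bytes));  // K+AOQ-se

        // Decoding reverses the process
        Console.WriteLine(Encoding.UTF7.GetString(utf7Bytes));   // Käse
    }
}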

Pros:

Compatibility with ASCII character set.

Cons:

Because of issues with robustness and security, you should not use UTF-7 encoding in 8-bit environments where UTF-8 encoding can be used instead.

UTF-8\UTF-16\UTF-32

The main difference between the UTF-8, UTF-16, and UTF-32 character encodings is how many bytes they require to represent a character in memory.

UTF-8 uses a minimum of one byte, while UTF-16 uses a minimum of 2 bytes. If a character's code point is greater than 127 (the ASCII range), UTF-8 may take 2, 3, or 4 bytes, but UTF-16 will take either two or four bytes. UTF-32, on the other hand, is a fixed-width encoding scheme and always uses 4 bytes to encode a Unicode code point.

The fundamental difference between UTF-32 and UTF-8/UTF-16 is that the former is a fixed-width encoding scheme, while the latter two are variable-length encodings.
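Here is a small sketch that prints the byte counts of a few sample characters (my own examples) under each of these encodings:

using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        // One character each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges
        string[] samples = { "A", "é", "€", "😀" };

        Console.WriteLine("char   UTF-8  UTF-16  UTF-32");
        foreach (string s in samples)
        {
            Console.WriteLine($"{s,-6} {Encoding.UTF8.GetByteCount(s),5}  " +
                              $"{Encoding.Unicode.GetByteCount(s),6}  " +
                              $"{Encoding.UTF32.GetByteCount(s),6}");
        }
        // Expected: A -> 1/2/4, é -> 2/2/4, € -> 3/2/4, 😀 -> 4/4/4
    }
}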

UTF-8 pros:

  • Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  • No null bytes (for non-null characters), which allows the use of null-terminated strings; this provides a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don’t have to worry about Big Endian / Little Endian issue.

UTF-8 cons:

  • Many common characters have different lengths, which makes indexing by code point and calculating a code point count slow.
  • Even though byte order doesn’t matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.

UTF-16 pros:

  • BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese, and Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length in bytes is still divisible by two and allows 16-bit code units to be used as the primitive component of the string.

Note: .NET strings are UTF-16 because that’s an excellent fit with the operating system encoding, no conversion is required.

But why UTF-16?

This is because of history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still only had a code space of 65,536 code points (what is these days called the BMP). At that time, two bytes were enough to cover all these characters, so the fixed-width UCS-2 encoding was adopted as the standard.

To maintain compatibility with the Windows UCS-2 encoding, UTF-16 was adopted as the new standard for in-memory representation.

UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn’t. This means programmers have to be aware of surrogate pairs and handle them properly in cases where it matters (see the sketch after this list)!
  • It is variable length, so counting or indexing code points is costly, though less so than with UTF-8.
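To see why code point counting needs care in UTF-16 (the surrogate pair issue mentioned above), here is a minimal sketch using an emoji as an illustrative supplementary character:

using System;
using System.Globalization;

class SurrogatePairs
{
    static void Main()
    {
        // "A" is one UTF-16 code unit; "😀" (U+1F600) needs a surrogate pair (two code units)
        string s = "A😀";

        Console.WriteLine(s.Length);                               // 3 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 user-perceived characters
        Console.WriteLine(char.IsHighSurrogate(s[1]));             // True: s[1] is only half a character
    }
}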

UTF-32 pros:

  • It has a fixed length of 4 bytes per code point, which makes indexing by code point fast.

UTF-32 cons:

  • Takes the most memory (4 bytes for every code point) compared to UTF-8/16.

 

UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages are more compact in UTF-16 than in UTF-8). Each specific language has a more efficient encoding.

UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly to Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.

That's all about encoding for now. In upcoming articles, we'll discuss more about encoding implementation.