
Encoding using C#

Hi Friends, we discussed encoding and decoding in our previous article, so in this article we'll look at how to implement encoding using C#.

Let's first summarize encoding/decoding:

Computers don't understand these characters. Computers understand only one language: 0s and 1s. These 0s and 1s are electrical signals used to maintain state in computer memory. That state can be accessed later and transformed into the desired result.

Every character we type or see on a computer is stored somewhere as 0s and 1s. For example, if I type my name "Deepak Gera", the name is converted into a stream of 0s and 1s by some algorithm, and that stream is stored somewhere in the computer.

Later, when I access my name, the stream is read from its memory location and transformed back into characters using the same algorithm that was used for the original transformation.

"The process of transforming characters into a stream of bytes is called Encoding."

"The process of transforming encoded bytes back into characters is called Decoding."

Encoding using C#

Create a console application and add the following code to the Program.cs file.

using System;

class Program
{
    static void Main(string[] args)
    {
        string myData = "A";
        byte[] encodedData = Encode(myData);
        // Print the bytes themselves; writing the array object directly
        // would only print "System.Byte[]".
        Console.WriteLine($"Encoded Data: {BitConverter.ToString(encodedData)}");

        string origData = Decode(encodedData);
        Console.WriteLine($"Original Data: {origData}");
        Console.ReadLine();
    }

    // Encoding: characters -> bytes.
    public static byte[] Encode(string text)
    {
        byte[] dataBytes = System.Text.Encoding.UTF8.GetBytes(text);
        return dataBytes;
    }

    // Decoding: bytes -> characters.
    public static string Decode(byte[] dataBytes)
    {
        string returntext = System.Text.Encoding.UTF8.GetString(dataBytes);
        return returntext;
    }
}

The code above has two methods: one for encoding in C# and the other for decoding.
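
With the BitConverter formatting used above, the expected console output is:

Encoded Data: 41
Original Data: A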

As you can see, I've used the UTF-8 encoding scheme, so when I debug this code I get 1 byte for the character "A". The ASCII code for "A" is 65. If I use UTF-7 I get the same result, since "A" is also 1 byte in UTF-7.

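A quick way to confirm this without the debugger is to print the byte count and value directly; a minimal sketch:

// "A" is a single byte in UTF-8, and its value is the ASCII code 65 (0x41).
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes("A");
Console.WriteLine(utf8Bytes.Length);   // 1
Console.WriteLine(utf8Bytes[0]);       // 65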

Let's check a character that takes 2 bytes. The "¢" symbol takes 2 bytes in UTF-8, so let's check it with the UTF-8 scheme. We can see below that this character takes 2 bytes.

If we encode the same character using the UTF-7 scheme, it is represented using symbols from the ASCII range, so it takes more bytes; this is why UTF-7 is considered less efficient for multi-byte characters.

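You can reproduce this in code as well; a small sketch:

// "¢" (U+00A2) is outside ASCII, so UTF-8 needs two bytes for it: 0xC2 0xA2.
byte[] centBytes = System.Text.Encoding.UTF8.GetBytes("¢");
Console.WriteLine(BitConverter.ToString(centBytes)); // C2-A2
Console.WriteLine(centBytes.Length);                 // 2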

In the same way, you can perform encoding in C# using different schemes:

ASCII

byte[] dataBytes = System.Text.Encoding.ASCII.GetBytes(text);

UTF-16 (Little Endian)

byte[] dataBytes = System.Text.Encoding.Unicode.GetBytes(text);

UTF-16 (Big Endian)

byte[] dataBytes = System.Text.Encoding.BigEndianUnicode.GetBytes(text);

On little-endian machines, the least significant byte of the binary representation of a multi-byte data type is stored first; on big-endian machines, the most significant byte is stored first. The next article discusses Big Endian vs Little Endian in more detail.
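
To see the byte-order difference in practice, encode the same character with both UTF-16 variants; a minimal sketch:

Console.WriteLine(BitConverter.ToString(System.Text.Encoding.Unicode.GetBytes("A")));          // 41-00 (little endian)
Console.WriteLine(BitConverter.ToString(System.Text.Encoding.BigEndianUnicode.GetBytes("A"))); // 00-41 (big endian)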

UTF-32

byte[] dataBytes = System.Text.Encoding.UTF32.GetBytes(text);


You can check for yourself and see how the results of these schemes differ.
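
For instance, a small comparison sketch (note that ASCII cannot represent "¢" and substitutes a fallback byte, typically '?'):

string sample = "¢";
var schemes = new System.Text.Encoding[]
{
    System.Text.Encoding.ASCII,            // 1 byte, but "¢" becomes '?'
    System.Text.Encoding.UTF8,             // 2 bytes
    System.Text.Encoding.Unicode,          // UTF-16 LE, 2 bytes
    System.Text.Encoding.BigEndianUnicode, // UTF-16 BE, 2 bytes
    System.Text.Encoding.UTF32             // 4 bytes
};

foreach (var scheme in schemes)
{
    byte[] bytes = scheme.GetBytes(sample);
    Console.WriteLine($"{scheme.EncodingName}: {BitConverter.ToString(bytes)}");
}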


Little and Big Endian Mystery

Hello friends, we've already discussed encoding and its types, so let's now talk a little about the "Little and Big Endian Mystery".

It's a small topic, but it plays an important role whenever we talk about data transfer and storage.

What are these Little and Big Endian?

Both are ways of storing multi-byte data types, e.g. int, float, etc.

On little-endian machines, the least significant byte of the binary representation of a multi-byte data type is stored first. On big-endian machines, the most significant byte of the binary representation is stored first.

Their difference is similar to the difference between English and Arabic: English is written and read from left to right, while Arabic is written and read from right to left.

Suppose an integer is stored in 4 bytes; then a variable x with the value 0x01234567 will be stored as follows (addresses increasing from left to right):

Big endian: 01 23 45 67
Little endian: 67 45 23 01
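
You can observe the layout from C# too; a minimal sketch using BitConverter, which returns the bytes exactly as they sit in memory on the current machine:

int x = 0x01234567;
byte[] bytes = BitConverter.GetBytes(x); // in-memory byte order of this machine
Console.WriteLine(BitConverter.ToString(bytes));
// Little-endian machine: 67-45-23-01
// Big-endian machine:    01-23-45-67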

Big Endian’s Advantages

Easier for (most) humans to read:

When examining memory values, the bytes appear in the same order as we write the number. This sometimes also applies to serializing/deserializing values when communicating over networks.

Easier sign checking:

By checking the byte at offset 0, we can easily check the sign.

Easier comparison:

Useful in arbitrary-precision math, as numbers are compared from the most significant digit.

No need for endianness conversion:

No conversion is needed when sending/receiving data to/from the network (network byte order is big endian). This is less useful than it sounds, because network adapters can already swap bytes and copy them to memory in the correct order without the CPU's help, and most modern CPUs can swap bytes themselves.
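
For completeness, .NET exposes the classic htonl/ntohl-style conversions on System.Net.IPAddress; a minimal sketch:

int host = 0x01234567;
// Swaps bytes on little-endian hosts; effectively a no-op on big-endian ones.
int network = System.Net.IPAddress.HostToNetworkOrder(host);
int roundTrip = System.Net.IPAddress.NetworkToHostOrder(network);
Console.WriteLine($"{host:X8} -> {network:X8} -> {roundTrip:X8}"); // 01234567 -> 67452301 -> 01234567 on little endian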

Little Endian’s Advantages

Easier parity checking:

The parity check is easy: by checking the byte at offset 0 we can see whether the number is odd or even.
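
For example, a minimal sketch:

int value = 0x01234567;
// On a little-endian machine the byte at offset 0 is the least significant one,
// so its low bit gives the parity directly (equivalent to value & 1 here).
byte lowByte = BitConverter.GetBytes(value)[0]; // 0x67
Console.WriteLine((lowByte & 1) == 1 ? "odd" : "even"); // odd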

Easier for some people to read:

Arabic, Hebrew and many other languages are written from right to left, so their readers encounter numbers in little-endian order. Some languages also read number values out in little-endian order (like 134 as 4 units, 3 tens and 1 hundred), which makes it easier to know how big the current digit is, and makes the thousands separator less useful.

Natural in computation:

  • Mathematical operations mostly work from the least significant digit to the most significant, so it's much easier to work in little endian.
  • This is extremely useful in arbitrary-precision arithmetic (or any operation wider than the architecture's natural word size, like doing 64-bit maths on a 32-bit computer), because it would be much more painful to read the digits backwards while doing the operations; a toy sketch follows this list.
  • It's also useful on computers with limited memory bandwidth (like some 32-bit ARM microcontrollers with a 16-bit bus, or the Intel 8088 with 16-bit registers but an 8-bit data bus). Such a 32-bit CPU can do the math 16 bits at a time by reading the half word at address A, adding it while still reading the remaining half word at A+2, and then doing the final add, instead of waiting for both reads to finish before adding from the LSB.
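
To illustrate, a toy arbitrary-precision addition sketch (AddLittleEndian is a hypothetical helper, not a library call): with little-endian byte arrays the carry propagates in the same direction the loop walks, from index 0 upward.

// Adds two little-endian byte arrays (least significant byte at index 0).
static byte[] AddLittleEndian(byte[] a, byte[] b)
{
    var result = new byte[Math.Max(a.Length, b.Length) + 1];
    int carry = 0;
    for (int i = 0; i < result.Length; i++)
    {
        int sum = carry
                + (i < a.Length ? a[i] : 0)
                + (i < b.Length ? b[i] : 0);
        result[i] = (byte)(sum & 0xFF); // keep the low byte
        carry = sum >> 8;               // carry the overflow forward
    }
    return result;
}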

Always reads as the same value:

A value always reads back the same if it is read at a size less than or equal to the size at which it was written.

For example, 20 = 0x14 written as a 64-bit value into memory at address A will be 14 00 00 00 00 00 00 00, and will always read back as 20 regardless of whether you use an 8-, 16-, 32- or 64-bit read (or in fact any read at address A of length <= 64 bits, such as 24, 48 or 40 bits). This can be extended to arbitrarily longer types.

In a big-endian system you have to know the size at which the value was written in order to read it correctly. For example, to get the least significant byte you need to read the byte at A+n-1 (where n is the length of the write in bytes) instead of at A.

This property also makes it easy to cast a value to a smaller type, like int32 to int16, because the int16 value always lies at the beginning of the int32.
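
A minimal sketch of this property (on a little-endian machine):

long wide = 20; // 0x14 stored as 8 bytes: 14 00 00 00 00 00 00 00
byte[] memory = BitConverter.GetBytes(wide);
// Narrower reads at offset 0 still yield 20 on a little-endian machine.
Console.WriteLine(memory[0]);                        // 20 (8-bit read)
Console.WriteLine(BitConverter.ToInt16(memory, 0));  // 20 (16-bit read)
Console.WriteLine(BitConverter.ToInt32(memory, 0));  // 20 (32-bit read)
Console.WriteLine(BitConverter.ToInt64(memory, 0));  // 20 (64-bit read)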

How to check Endianness?

Execute the program below on your machine and you'll be able to check.

#include <stdio.h>

/* Prints the bytes of a value in the order they are laid out in memory. */
void byte_order(unsigned char *start, int num)
{
    int i;
    for (i = 0; i < num; i++)
        printf("%.2x ", start[i]);

    printf("\n");
}

int main(void)
{
    int n = 0x01234567;
    byte_order((unsigned char *)&n, sizeof(n));
    return 0;
}

The above program, when run on a big-endian machine, produces "01 23 45 67" as output, while on a little-endian machine it produces "67 45 23 01".
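
In C#, the same check is a one-liner; a minimal sketch:

// True on little-endian machines (x86/x64 and most ARM configurations).
Console.WriteLine(BitConverter.IsLittleEndian ? "Little endian" : "Big endian");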

Final Note

Both big and little endian have their advantages and disadvantages. Even if one were clearly superior (which is not the case), there is no way any legacy architecture could ever switch endianness. You can read more details about Endianness on Wiki.