
Little and Big Endian Mystery

Hello friends, since we’ve already discussed encoding and its types, let’s now talk a bit about the “Little and Big Endian Mystery”.

It’s a small topic, but it plays an important role whenever we talk about data transfer and storage.

What are these Little and Big Endian?

Both are ways of storing multi-byte data types (e.g. int, float, etc.) in memory.

In little endian machines, the least significant byte of the binary representation of a multi-byte data type is stored first, i.e. at the lowest memory address. On the other hand, in big endian machines, the most significant byte of the binary representation is stored first.

Their difference is similar to the difference between English and Arabic.

English is written and read from left to right, while Arabic from right to left.

Suppose an integer is stored as 4 bytes; then a variable x with value 0x01234567 will be stored as follows: on a big endian machine the bytes appear in memory (starting from the lowest address) as 01 23 45 67, while on a little endian machine they appear as 67 45 23 01.

Big Endian’s Advantages

Easier for (most) humans to read:

When examining memory values, the bytes appear in the same order a human would write the number. This sometimes also applies to serializing/deserializing values when communicating over a network.

Easier sign checking:

By checking the byte at offset 0 we can easily check the sign of the value.

Easier comparison:

Useful in arbitrary-precision math, as numbers are compared from the most significant digit.

No need for endianness conversion:

No conversion needed when sending/receiving data to/from the network. This is less useful because network adapters can already swap bytes and copy them to memory in the correct order without the help of the CPU, and most modern CPUs have the ability to swap bytes themselves.

Little Endian’s Advantages

Easier parity checking:

A parity check is easy: by looking at the byte at offset 0 we can see whether the value is odd or even.

Easier for some people to read:

Arabic, Hebrew and many other languages are written from right to left, so their readers see numbers in little-endian order. Some languages also read number values starting from the least significant digit (reading 134 as 4 units, 3 tens and 1 hundred), so it’s easier to know how big the current digit is and a thousands separator is less useful.

Natural in computation:

  • Mathematical operations mostly work from the least significant digit to the most significant digit, so it’s much easier to work in little endian.
  • This is extremely useful in arbitrary-precision arithmetic (or any operation wider than the architecture’s natural word size, like doing 64-bit maths on a 32-bit computer), because it would be much more painful to read the digits backwards before doing the operations. A small add-with-carry sketch is shown after this list.
  • It’s also useful on machines with limited memory bandwidth (like some 32-bit ARM microcontrollers with a 16-bit bus, or the Intel 8088 with 16-bit registers but an 8-bit data bus). A 32-bit CPU can then do the math 16 bits at a time: it adds the half word read at address A while the remaining half word at A+2 is still being read, then does the final add, instead of waiting for both reads to finish before adding from the LSB.
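
As a rough sketch of this point (not from the original article), here is a minimal C example of adding two multi-byte numbers stored as little-endian byte arrays; the carry naturally propagates in the same direction in which the bytes are laid out in memory. The function name add_le is purely illustrative.

#include <stdio.h>

/* Add two n-byte little-endian numbers: a[0] and b[0] are the least
   significant bytes, so the carry propagates in increasing-address order. */
void add_le(const unsigned char *a, const unsigned char *b,
            unsigned char *sum, int n)
{
    int carry = 0, i;
    for (i = 0; i < n; i++) {
        int t = a[i] + b[i] + carry;
        sum[i] = (unsigned char)(t & 0xFF);
        carry = t >> 8;
    }
}

int main(void)
{
    /* 0x01FF + 0x0001 = 0x0200, stored least significant byte first */
    unsigned char a[2] = { 0xFF, 0x01 };
    unsigned char b[2] = { 0x01, 0x00 };
    unsigned char s[2];

    add_le(a, b, s, 2);
    printf("%02x %02x\n", s[0], s[1]);  /* prints "00 02" */
    return 0;
}

On a big-endian layout the same loop would have to walk backwards from the end of the arrays.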

Always reads as the same value:

A value is always read back the same when it is read with a size less than or equal to the size it was written with.

For example, 20 = 0x14 written as a 64-bit value into memory at address A becomes 14 00 00 00 00 00 00 00, and will always be read back as 20 regardless of whether 8-, 16-, 32- or 64-bit reads are used (or, in fact, any read of length <= 64 bits at address A, such as 24, 48 or 40 bits). This property extends to arbitrarily longer types.

In a big-endian system you have to know the size at which the value was written in order to read it back correctly. For example, to get the least significant byte you need to read the byte at A+n-1 (where n is the length of the write in bytes) instead of A.

This property also makes it easy to cast a value to a smaller type, like int32 to int16, because the int16 value always lies at the beginning of the int32.
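
Here is a small hedged sketch of this property; it only produces the results shown in the comments on a little-endian machine, and memcpy is used to read the leading bytes so the example stays well-defined C.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint64_t x = 20;              /* 0x14, written into memory as a 64-bit value */
    uint8_t  r8;
    uint16_t r16;
    uint32_t r32;

    /* On a little-endian machine the least significant byte sits at the
       lowest address, so narrower reads at the same address still see 20. */
    memcpy(&r8,  &x, sizeof r8);
    memcpy(&r16, &x, sizeof r16);
    memcpy(&r32, &x, sizeof r32);
    printf("%d %d %d\n", (int)r8, (int)r16, (int)r32);   /* 20 20 20 */

    /* Same idea for "casting" to a smaller type: the int16 view of an
       int32 value lies at the start of the int32's bytes. */
    int32_t big = 1234;
    int16_t small;
    memcpy(&small, &big, sizeof small);
    printf("%d\n", (int)small);                           /* 1234 on little-endian */
    return 0;
}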

How to check Endianness?

Execute the program below on your machine and you’ll be able to check.

#include <stdio.h>

/* Print the bytes of a value in the order they are laid out in memory. */
void byte_order(char *start, int num)
{
    int i;
    for (i = 0; i < num; i++)
        printf("%.2x ", (unsigned char)start[i]);  /* cast avoids sign extension */

    printf("\n");
}

int main()
{
    int n = 0x01234567;
    byte_order((char *)&n, sizeof(n));  /* pass the size in bytes, not the value */
    return 0;
}

When run on a big endian machine, the above program produces ‘01 23 45 67’ as output, while on a little endian machine it produces ‘67 45 23 01’.

Final Note

Both big and little endian have their advantages and disadvantages. Even if one were clearly superior (which is not the case), there is no way any legacy architecture would ever be able to switch endianness. You can find more details about endianness on Wikipedia.

 


What are Encoding Schemas, Types & Differences

Hello friends, today we are going to discuss encoding schemas, their usage, and their differences. This is a simple topic, but there are many schemas, and one can easily get confused by the types and the pros/cons of each.

We’ll cover basic information about this topic. So let’s move step by step.

What is Encoding/Decoding?

We all communicate with each other using some language, e.g. English, Spanish, German, etc. Each language has a set of characters which we use to write, read and understand it.

Do you think a computer is also able to understand these characters in their raw form?

The answer is No.

Computers don’t understand these characters. Computers understand only one language: 0s and 1s. These 0s and 1s are signals used to maintain a state in computer memory. This state can be accessed later and transformed into the desired results.

Every character which we type or see on a computer is saved somewhere in the form of 0s and 1s. For example, if I type my name “Deepak Gera”, then this name is converted into a stream of 0s and 1s by some algorithm, and this stream is then stored somewhere in the computer.

Later, when I try to access my name, this stream is read from the memory location and transformed back into characters using the same algorithm that was used for the original transformation.

“The process of transforming characters into a stream of bytes is called Encoding.”

“The process of transforming encoded bytes back into characters is called Decoding.”

Microsoft says: “Encoding is the process of transforming a set of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes into a set of Unicode characters.”
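
As a tiny illustrative sketch (not from the original article), the following C program shows the idea: the characters of a name are stored as bytes (here simply their ASCII/UTF-8 codes), and interpreting the same bytes with the same scheme gives the characters back.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text = "Deepak";       /* characters typed by the user */
    unsigned char bytes[16];
    size_t i, len = strlen(text);

    /* "Encoding": the characters are stored as bytes (their ASCII/UTF-8 codes) */
    memcpy(bytes, text, len);
    for (i = 0; i < len; i++)
        printf("%02x ", bytes[i]);     /* 44 65 65 70 61 6b */
    printf("\n");

    /* "Decoding": the same bytes, interpreted with the same scheme, give the characters back */
    for (i = 0; i < len; i++)
        printf("%c", bytes[i]);
    printf("\n");
    return 0;
}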

Most popular types of Encoding Schemas

There are many encoding schemas, which were created and evolved over time based on requirements and the shortcomings of previous ones. Let’s discuss the most popular schemas and see how they evolved.

ASCII

ASCII stands for American Standard Code for Information Interchange. This code is basically used for identifying the characters and numerals typed on a keyboard: letters, numbers and special characters.

ASCII uses 7 bits to represent a character. With 7 bits, we can have a maximum of 2^7, i.e. 128, distinct characters. The 8th bit was originally used as a parity bit for error detection.

Below is the ASCII chart, which shows the mapping of each character to its corresponding numeric value. This numeric value is then converted into a 7-bit binary value and stored or transferred.

[Image: ASCII chart mapping characters to numeric values]

But 128 characters are not enough to capture a complete language, so extended ASCII started using the 8th bit as well to encode more characters (to support “é” in French, for example). Using just one extra bit doubled the original ASCII table, allowing up to 256 characters (2^8 = 256).

Below is the list of extended characters which are supported by extended ASCII.

[Image: extended ASCII character chart]

UTF

Extended ASCII solves the problem for languages that are based on the Latin alphabet, but what about other languages with completely different character sets? How will those languages be encoded?

That’s the reason behind Unicode. Unicode doesn’t contain every character from every language, but it sure contains a gigantic amount of characters.

You can check entire Unicode character set here.

In the Unicode standard, each character is assigned a number called a “code point”. The Unicode code space is divided into 17 groups of 65,536 (2^16) contiguous code points, and each such group is called a “plane”.

There are 17 planes, identified by the numbers 0 to 16.

Plane 0 is called the BMP (Basic Multilingual Plane).

Basic Multilingual Plane:

The first plane, plane 0, the Basic Multilingual Plane (BMP), contains characters for almost all modern languages and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing. Most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.

Unicode is a superset of the ASCII character set. We need encoding schemas that can transform characters (Unicode code points) into bytes. Below are a few popular schemas used for this.

UTF-7, UTF-8, UTF-16, UTF-32

Let’s discuss these

UTF-7

UTF-7 is an encoding that is used to represent Unicode characters using only ASCII characters. Its advantage is that Unicode characters can be represented and transferred even in environments or operating systems that understand only 7-bit ASCII.

For example, some Internet protocols, such as SMTP for email, only allow the 128 ASCII characters; bytes with the high bit set are not allowed. All of the other UTF encodings use at least 8 bits per unit, so they cannot be used directly for such purposes.

Characters which exist in ASCII are kept as their normal ASCII codes. All other characters are encoded and represented with ASCII characters. The + marks the beginning of such an encoded run, and the - (or any other character which cannot occur in the encoding) marks the end.

The German word for cheese, “Käse”, for instance, would be encoded as K+AOQ-se. The ASCII characters K, s and e stay the same, while the ä is converted to AOQ (other ASCII characters). The beginning and the end of this encoded run are marked with + and -. Decoding works the same way in reverse.
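
As a rough sketch of the mechanics (a simplification, not a full UTF-7 encoder), the following C function applies the modified Base64 that UTF-7 uses to the UTF-16 bytes of a single BMP character and wraps the result in + and -. For ä (U+00E4) it prints +AOQ-, matching the example above.

#include <stdio.h>

/* Modified Base64 alphabet used by UTF-7 (no '=' padding is emitted). */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode one BMP code point (cp <= 0xFFFF) as a +...- UTF-7 run. */
void utf7_encode_bmp(unsigned int cp)
{
    unsigned char bytes[2] = { (cp >> 8) & 0xFF, cp & 0xFF };   /* UTF-16, big-endian */
    unsigned int bits = (bytes[0] << 16) | (bytes[1] << 8);     /* 16 bits, left-aligned in 24 */

    /* 16 bits need three 6-bit groups; the last group is zero-padded. */
    printf("+%c%c%c-\n",
           b64[(bits >> 18) & 0x3F],
           b64[(bits >> 12) & 0x3F],
           b64[(bits >> 6) & 0x3F]);
}

int main(void)
{
    utf7_encode_bmp(0x00E4);   /* 'ä' -> prints "+AOQ-" */
    return 0;
}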

Pros:

Compatibility with ASCII character set.

Cons:

Because of issues with robustness and security, you should not use UTF-7 encoding in 8-bit environments where UTF-8 encoding can be used instead.

UTF-8/UTF-16/UTF-32

The main difference between the UTF-8, UTF-16 and UTF-32 character encodings is how many bytes they require to represent a character in memory.

UTF-8 uses a minimum of one byte, while UTF-16 uses a minimum of 2 bytes. If a character’s code point is greater than 127 (the largest value that fits in 7 bits), UTF-8 takes 2, 3 or 4 bytes, whereas UTF-16 takes either two or four bytes. UTF-32, on the other hand, is a fixed-width encoding scheme and always uses 4 bytes to encode a Unicode code point.

The fundamental difference between UTF-32 and UTF-8/UTF-16 is that the former is a fixed-width encoding scheme, while the latter two are variable-length encodings.
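
To make the variable-length behaviour concrete, here is a minimal C sketch (not from the original article) that encodes a single Unicode code point into UTF-8 using the standard 1-, 2-, 3- and 4-byte patterns.

#include <stdio.h>

/* Encode one Unicode code point (cp <= 0x10FFFF) into UTF-8.
   Returns the number of bytes written to out (1 to 4). */
int utf8_encode(unsigned int cp, unsigned char *out)
{
    if (cp <= 0x7F) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {              /* 2 bytes */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp <= 0xFFFF) {             /* 3 bytes */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {                               /* 4 bytes */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    unsigned int samples[] = { 0x41, 0xE9, 0x20AC, 0x1F600 };  /* 'A', 'é', '€', an emoji */
    int i, j, n;

    for (i = 0; i < 4; i++) {
        n = utf8_encode(samples[i], buf);
        printf("U+%04X ->", samples[i]);
        for (j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        printf("\n");                      /* 41 / C3 A9 / E2 82 AC / F0 9F 98 80 */
    }
    return 0;
}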

UTF-8 pros:

  • Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  • No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don’t have to worry about Big Endian / Little Endian issue.

UTF-8 cons:

  • Many common characters have different lengths, which makes indexing by code point and counting code points terribly slow.
  • Even though byte order doesn’t matter, UTF-8 text sometimes still carries a BOM (byte order mark), which serves to signal that the text is encoded in UTF-8; it also breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.

UTF-16 pros:

  • BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, and most Chinese and Japanese characters, can be represented with 2 bytes. This speeds up indexing and counting code points when the text does not contain supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means the total length is still divisible by two and 16-bit code units can be used as the primitive component of the string.

Note: .NET strings are UTF-16 because that’s an excellent fit with the operating system’s encoding; no conversion is required.

But why UTF-16?

This is because of history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 code points, so two bytes were enough to cover every character, and the fixed-width two-byte encoding UCS-2 was adopted as the standard.

To maintain compatibility with Windows’ UCS-2 encoding, UTF-16 was later adopted as the new standard for the in-memory representation of strings.

UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in the US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn’t. This means programmers have to be aware of surrogate pairs and handle them properly in the cases where it matters (see the sketch after this list).
  • It is variable-length, so counting or indexing code points is costly, though less so than with UTF-8.
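
As a small sketch of the surrogate-pair handling mentioned in the list above (illustrative only, not from the original article), the following C function encodes a code point into UTF-16 code units: BMP characters fit in one 16-bit unit, while anything above U+FFFF is split into a high and a low surrogate.

#include <stdio.h>

/* Encode one Unicode code point into UTF-16 code units.
   Returns 1 for BMP characters, 2 for a surrogate pair. */
int utf16_encode(unsigned int cp, unsigned short *out)
{
    if (cp <= 0xFFFF) {                 /* BMP: one 16-bit unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = 0xD800 | (cp >> 10);       /* high surrogate */
    out[1] = 0xDC00 | (cp & 0x3FF);     /* low surrogate */
    return 2;
}

int main(void)
{
    unsigned short units[2];
    int n = utf16_encode(0x1F600, units);    /* an emoji, outside the BMP */
    printf("%d unit(s): %04X %04X\n", n, units[0], n == 2 ? units[1] : 0);
    /* prints: 2 unit(s): D83D DE00 */
    return 0;
}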

UTF-32 pros:

  • It has a fixed length of 4 bytes per code point, which makes indexing by code point fast.

UTF-32 cons:

  • It takes the most memory (4 bytes for every code point) compared to UTF-8/16.

 

UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages are more compact in UTF-16 than in UTF-8). For most individual languages there exists a more space-efficient dedicated encoding, but UTF-8 handles them all uniformly.

UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly to Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.

That’s all about encoding for now. In upcoming articles we’ll discuss more about encoding implementations.


Windows Data Access Components – Introduction

What is MDAC/WDAC?

MDAC stands for Microsoft Data Access Components. It is also called WDAC, i.e. Windows Data Access Components.

This is a set of components that enable your application to access many kinds of data and make use of them. The data can come from SQL Server, Oracle or any other RDBMS, but the source does not have to be an RDBMS at all; it can also be XML files, documents, etc.

MDAC is not a single application or component; multiple libraries together make up the set. The major components which we are going to discuss in this article are:

  • ODBC
  • OLEDB
  • ADO
  • ADO.NET

 

MDAC architecture may be viewed as three layers:

  1. A programming interface layer: Consisting of ADO and ADO.NET
  2. A database access layer: Developed by database vendors such as Oracle and Microsoft (OLE DB, .NET managed providers and ODBC drivers)
  3. The database itself.

These component layers are all made available to applications through the MDAC API.

The latest version of MDAC (2.8) consists of several interacting components, all of which are Windows specific except for ODBC (which is available on several platforms).

Let’s discuss these components.

ODBC

Introduction

The ODBC interface is an industry standard and a component of Microsoft Windows Open Services Architecture (WOSA). The ODBC interface makes it possible for applications to access data from a variety of database management systems (DBMSs). ODBC permits maximum interoperability: an application can access data in diverse DBMSs through a single interface. Furthermore, the application is independent of any particular DBMS from which it accesses data. Users of the application can add software components called drivers, which create an interface between the application and a specific DBMS.

Where to Use

An ODBC driver uses the Open Database Connectivity (ODBC) interface by Microsoft, which allows applications to access data in database management systems (DBMSs) using SQL as the standard for accessing the data.
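
As a hedged sketch of what using the ODBC C API looks like (the data source name MyDsn, the employees table and the name column are made-up assumptions, and error handling is trimmed for brevity):

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

int main(void)
{
    SQLHENV env;
    SQLHDBC dbc;
    SQLHSTMT stmt;
    SQLCHAR name[64];

    /* Environment and connection handles */
    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);

    /* The driver named by the DSN does the DBMS-specific work */
    SQLDriverConnect(dbc, NULL, (SQLCHAR *)"DSN=MyDsn;", SQL_NTS,
                     NULL, 0, NULL, SQL_DRIVER_COMPLETE);

    /* Standard SQL through the same interface, regardless of the DBMS */
    SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
    SQLExecDirect(stmt, (SQLCHAR *)"SELECT name FROM employees", SQL_NTS);
    while (SQL_SUCCEEDED(SQLFetch(stmt))) {
        SQLGetData(stmt, 1, SQL_C_CHAR, name, sizeof(name), NULL);
        printf("%s\n", (char *)name);
    }

    SQLFreeHandle(SQL_HANDLE_STMT, stmt);
    SQLDisconnect(dbc);
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}

The same code runs against any DBMS for which an ODBC driver is installed; only the connection string changes.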

Shortcomings

As mentioned, ODBC is an open standard for accessing SQL-based data, so it is typically used with RDBMSs only. If we want to access data from other sources which are not RDBMSs, we need to look for some other solution.

This is the place where OLEDB comes into picture.

OLEDB

Introduction

As we know, ODBC was specifically meant for SQL databases. Due to this limitation of ODBC, OLEDB came into the picture. OLEDB providers are able to fetch data from other kinds of data sources as well.

OLE DB is an open specification designed to build on the success of ODBC by providing an open standard for accessing all kinds of data.

Where to Use

Whereas ODBC was created to access relational databases, OLE DB is designed for relational and non-relational information sources, including mainframe ISAM/VSAM and hierarchical databases; e-mail and file system stores; text, graphical, and geographical data; custom business objects; and more.

If a database supports ODBC and that database is on a server that doesn’t support OLE, then ODBC is your best choice.

Non-SQL environment: ODBC is designed to work with SQL. If you have a non-SQL environment, then OLE DB is the better choice.

Shortcomings

There are no issues with OLEDB at all if you are using it from native code, i.e. VB6, C, etc. In these cases it works fine.

But if you plan to use it from managed code, there will be some underlying plumbing which converts the managed calls into unmanaged code. This plumbing impacts performance. However, that is not something we can really consider a limitation of OLEDB itself.

ADO

Introduction

ADO is the strategic application programming interface (API) to data and information. It provides consistent, high-performance access to data and supports a variety of development needs, including the creation of front-end database clients and middle-tier business objects that use applications, tools, languages, or Internet browsers.

ADO is designed to be the one data interface needed for single and multi-tier client/server and Web-based data-driven solution development. The primary benefits of ADO are ease of use, high speed, low memory overhead, and a small disk footprint.

ADO provides an easy-to-use interface to OLE DB, which provides the underlying access to data. ADO is implemented for minimal network traffic in key scenarios, and a minimal number of layers between the front end and the data source, all to provide a lightweight, high-performance interface.

Where to Use

ADO is easy to use because it uses a familiar metaphor: the COM automation interface, available from all leading Rapid Application Development (RAD) tools, database tools, and languages on the market today. It is a nice wrapper over OLE DB.

ADO Performance Advantages: As with OLE DB, ADO is designed for high performance. To achieve this, it reduces the amount of solution code developers must write by “flattening” the coding model.

The programmer can create a recordset in code and be ready to retrieve results by setting two properties, then execute a single method to run the query and populate the recordset with results. The ADO approach dramatically decreases the amount and complexity of code that needs to be written by the programmer.

Shortcomings

ADO is based on COM technology and uses OLEDB data providers for accessing data. It has a limited number of data types, which are defined by the COM standard.

ADO works with a connected data architecture. That means that while you access data from the data source, such as when viewing or updating data, the ADO recordset keeps its connection to the data source open.

ADO can’t be integrated well with XML because ADO has only limited XML support.

In ADO, you can create only client-side cursors.

Using a single connection instance, ADO cannot handle multiple transactions.

ADO.NET

Introduction

ADO.NET is a data access technology from the Microsoft .NET Framework that provides communication between relational and non-relational systems through a common set of components. ADO.NET is a set of computer software components that programmers can use to access data and data services from a database.

ADO.NET provides consistent access to data sources such as SQL Server and XML, and to data sources exposed through OLE DB and ODBC. Data-sharing consumer applications can use ADO.NET to connect to these data sources and retrieve, handle, and update the data that they contain.

Where to Use

Wherever we want to access data from managed code, we should use ADO.NET. It hides the underlying implementation and provides a cleaner model to access and manipulate data.

It is designed specifically for the .NET managed environment.

Below are a few differences between ADO and ADO.NET:

[Image: differences between ADO and ADO.NET]

Complete picture

[Image: Windows Data Access Components architecture]

From the above picture we can see that for every data access path, ODBC and OLEDB are the core components, each serving its specific purpose. Both components are equally important and required. ADO.NET has many benefits over ADO, but that doesn’t mean ADO has been completely eliminated: ADO is still COM-based and should be used by COM-based applications.

SNAC (SQL Server Native Client) is also part of the data access components, which we’ve covered here.