Understanding byte order [Understanding Big and Little Endian byte order]

Source: Internet
Author: User

Original address

(This article explains byte order very clearly and is easy to understand.)

Problems with byte order are frustrating, and I want to spare you the grief I experienced. Here's the key:

    • Problem: Computers speak different languages, like people. Some write data "left-to-right" and others "right-to-left".
      1. A machine can read its own data just fine; problems happen when one computer stores data and a different type tries to read it.
    • Solutions
      1. Agree to a common format (i.e., all network traffic follows a single format), or
      2. Always include a header that describes the format of the data. If the header appears backwards, it means the data was stored in the other format and needs to be converted.

Numbers vs. Data

The most important concept is to recognize the difference between a number and the data that represents it.

A number is an abstract concept, such as a count of something. You have ten fingers. The idea of "ten" doesn't change, no matter what representation you use: ten, 10, diez (Spanish), ju (Japanese), 1010 (binary), X (Roman numeral) ... these representations all point to the same concept of "ten".

Contrast this with data. Data is a physical concept, a raw sequence of bits and bytes stored on a computer. Data has no inherent meaning and must be interpreted by whoever is reading it.

Data is like human writing, which is simply marks on paper. There is no inherent meaning in these marks. If we see a line and a circle (like this: |O) we may interpret it to mean "ten".

But we assumed the marks referred to a number. They could have been the letters "IO", a moon of Jupiter. Or perhaps the Greek goddess. Or maybe an abbreviation for input/output. Or someone's initials. Or the number 2 in binary ("10"). The list of possibilities goes on.

The point is that a single piece of data (|O) can be interpreted in many ways, and the meaning is unclear until someone clarifies the intent of the author.

Computers face the same problem. They store data, not abstract concepts, and do so using a sequence of 1's and 0's. Later, they read back the 1's and 0's and try to recreate the abstract concept from the raw data. Depending on the assumptions made, the 1's and 0's can mean very different things.

Why does this problem happen? Well, there's no rule that all computers must use the same language, just like there's no rule that all humans need to. Each type of computer is internally consistent (it can read back its own data), but there are no guarantees about how another type of computer will interpret the data it created.

Basic Concepts

    • Data (bits and bytes, or marks on paper) is meaningless; it must be interpreted to create an abstract concept, like a number.
    • Like humans, computers have different ways to store the same abstract concept (i.e., we have many ways to say "ten": ten, 10, diez, etc.).

Storing Numbers as Data

Thankfully, most computers agree on a few basic data formats (this was not always the case). This gives us a common starting point, which makes our lives a bit easier:

    • A bit has two values (on or off, 1 or 0)
    • A byte is a sequence of 8 bits
      1. The "leftmost" bit in a byte is the biggest. So, the binary sequence 00001001 is the decimal number 9: 00001001 = 2^3 + 2^0 = 8 + 1 = 9.
      2. Bits are numbered from right to left. Bit 0 is the rightmost and the smallest; bit 7 is the leftmost and the largest.

We can use these basic agreements as a building block to exchange data. If we store and read data one byte at a time, it will work on any computer. The concept of a byte is the same on all machines, and the idea of which byte is first, second, third (byte 0, byte 1, byte 2 ...) is the same on all machines.

If computers agree on the order of every byte, what's the problem?

Well, this is fine for single-byte data, like ASCII text. However, a lot of data needs to be stored using multiple bytes, like integers or floating-point numbers. And there is no agreement on how these sequences should be stored.

Byte Example

Consider a sequence of 4 bytes, named W, X, Y and Z. I avoided naming them A, B, C and D because those are hex digits, which would be confusing. Each byte has a value and is made up of 8 bits.

Byte Name:    W       X       Y       Z
Location:     0       1       2       3
Value (hex):  0x12    0x34    0x56    0x78

For example, W is an entire byte, 0x12 in hex or 00010010 in binary. If W were to be interpreted as a number, it would be "18" in decimal (by the way, there's nothing saying we have to interpret it as a number; it could be an ASCII character or something else entirely).

With me so far? We have 4 bytes, W X Y and Z, each with a different value.

Understanding Pointers

Pointers are a key part of programming, especially in the C programming language. A pointer is a number that references a memory location. It is up to us (the programmer) to interpret the data at that location.

In C, when you cast a pointer to a certain type (such as a char * or int *), it tells the computer how to interpret the data at that location. For example, let's declare

void *p = 0; // p is a pointer to an unknown data type
             // p is a NULL pointer; do not dereference
char *c;     // c is a pointer to a char, usually a single byte

Note that we can't get the data from p because we don't know its type. p could be pointing at a single number, a letter, the start of a string, your horoscope, an image; we just don't know how many bytes to read, or how to interpret what's there.

Now, suppose we write

c = (char *) p;

Ah, now this statement tells the computer to point to the same place as p, and to interpret the data as a single character (in C a char is always one byte, which is 8 bits on nearly every current machine; use uint8_t if you need exactly 8 bits). In this case, c would point to memory location 0, or byte W. If we printed *c, we'd get the value in W, which is hex 0x12 (remember that W is a whole byte).

This example does not depend on the type of computer we have; again, all computers agree on what a single byte is (in the past this was not the case).

The example is helpful, even though it is the same on all computers: if we have a pointer to a single byte (char *), we can walk through memory, reading off one byte at a time. We can examine any memory location and the endian-ness of the computer won't matter; every computer will give back the same information.

So, What's the Problem?

Problems happen when computers try to read multiple bytes. Some data types contain multiple bytes, like long integers or floating-point numbers. A single byte has only 256 values, so it can store 0-255.

Now problems start: when you read multi-byte data, where does the biggest byte appear?

    • Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte (lowest address) is the biggest.
    • Little endian machine: Stores data little-end first. When looking at multiple bytes, the first byte is the smallest.

The naming makes sense, eh? Big-endian thinks the big end is first. (By the way, the big-endian/little-endian naming comes from Gulliver's Travels, where the Lilliputians argue over whether to break eggs on the little end or the big end. Sometimes computer debates are just as meaningful :-))

Again, endian-ness does not matter if you have a single byte. If you have one byte, it's the only data you read, so there's only one way to interpret it (again, because computers agree on what a byte is).

Now suppose we have our 4 bytes (W X Y Z) stored the same way on a big-endian and a little-endian machine. That is, memory location 0 is W on both machines, memory location 1 is X, etc.

We can create this arrangement by remembering that bytes are machine-independent. We can walk memory, one byte at a time, and set the values we need. This will work on any machine:

c = 0;     // point to location 0 (won't work on a real machine!)
*c = 0x12; // set W's value
c = 1;     // point to location 1
*c = 0x34; // set X's value
...        // repeat for Y and Z; details left to reader

This code will work on any machine, and we have both set up with bytes W, X, Y and Z in locations 0, 1, 2 and 3.

Interpreting Data

Now let's do an example with multi-byte data (finally!). Quick review: a "short int" is a 2-byte (16-bit) number, which can range from 0-65535 (if unsigned). Let's use it in an example:

short *s; // pointer to a short int (2 bytes)
s = 0;    // point to location 0; *s is the value

So, s is a pointer to a short, and is now looking at byte location 0 (which has W). What happens when we read the value at s?

    • Big endian machine: I think a short is two bytes, so I'll read them off: location s is address 0 (W, or 0x12) and location s + 1 is address 1 (X, or 0x34). Since the first byte is biggest (I'm big-endian!), the number must be 256 * byte 0 + byte 1, or 256*W + X, or 0x1234. I multiplied the first byte by 256 (2^8) because I needed to shift it over 8 bits.
    • Little endian machine: I don't know what Mr. Big Endian is smoking. Yeah, I agree a short is 2 bytes, and I'll read them off just like him: location s is 0x12, and location s + 1 is 0x34. But in my world, the first byte is the littlest! The value of the short is byte 0 + 256 * byte 1, or 256*X + W, or 0x3412.

Keep in mind that both machines start from location s and read memory going upwards. There is no confusion about what location 0 and location 1 mean. There is no confusion that a short is 2 bytes.

But do you see the problem? The big-endian machine thinks *s = 0x1234 and the little-endian machine thinks *s = 0x3412. The same exact data gives two different numbers. Probably not a good thing.

Yet Another Example

Let's do another example with a 4-byte integer for "fun":

int *i; // pointer to an int (4 bytes on a 32-bit machine)
i = 0;  // points to location zero, so *i is the value there

Again we ask: what is the value at i?

    • Big endian machine: An int is 4 bytes, and the first is the largest. I read 4 bytes (W X Y Z) and W is the largest. The number is 0x12345678.
    • Little endian machine: Sure, an int is 4 bytes, but the first is the smallest. I also read W X Y Z, but W belongs way in the back; it's the littlest. The number is 0x78563412.

Same data, different results: not a good thing.

The NUXI Problem

Issues with byte order are sometimes called the NUXI problem: UNIX stored on a big-endian machine can show up as NUXI on a little-endian one.

Suppose we want to store 4 bytes (U, N, I and X) as two shorts: UN and IX. Each letter is an entire byte, like our WXYZ example above. To store the two shorts we would write:

short *s; // pointer to set shorts
s = 0;    // point to location 0
*s = UN;  // store first short: U * 256 + N (fictional code)
s = 2;    // point to next location
*s = IX;  // store second short: I * 256 + X

This code is not specific to a machine. If we store "UN" on a machine and ask to read it back, it had better be "UN"! I don't care about endian issues: if we store a value on one machine and read it back on the same machine, it must be the same value.

However, if we look at memory one byte at a time (using our char * trick), the order could vary. On a big endian machine we see:

Byte:      U  N  I  X
Location:  0  1  2  3

Which makes sense. U is the biggest byte in "UN" and is stored first. The same goes for IX: I is the biggest, and is stored first.

On a little-endian machine we would see:

Byte:      N  U  X  I
Location:  0  1  2  3

And this makes sense also. "N" is the littlest byte in "UN" and is stored first. Again, even though the bytes are stored "backwards" in memory, the little-endian machine knows it is little-endian, and interprets them correctly when reading the values back. Also, note that we can specify hex numbers such as x = 0x1234 on any machine. Even a little-endian machine knows what you mean when you write 0x1234, and won't force you to swap the values yourself (you specify the hex number to write, and it figures out the details and swaps the bytes in memory, under the covers. Tricky.).

This scenario is called the "NUXI" problem because the byte sequence UNIX is interpreted as NUXI on the other type of machine. Again, this is only a problem if you exchange data; each machine is internally consistent.

Exchanging Data Between Endian Machines

Computers are connected; gone are the days when a machine only had to worry about reading its own data. Big- and little-endian machines need to talk and get along. How do they do this?

Solution 1: Use a Common Format

The easiest approach is to agree to a common format for sending data over the network. The standard network order is actually big-endian, but some people get uppity that little-endian didn't win ... we'll just call it "network order".

To convert data to network order, machines call a function hton (host-to-network). On a big-endian machine this won't actually do anything, but we won't talk about that here (the little-endians might get mad).

But it's important to use hton before sending data, even if you are big-endian. Your program may become so popular that it is compiled on different machines, and you want your code to be portable (don't you?).

Similarly, there is a function ntoh (network-to-host) used to read data off the network. You need this to make sure you are correctly interpreting the network data into the host's format. You need to know the type of data you are receiving to decode it properly, and the conversion functions are:

htons() - "Host to Network Short"
htonl() - "Host to Network Long"
ntohs() - "Network to Host Short"
ntohl() - "Network to Host Long"

Remember: a single byte is a single byte, and order does not matter.

These functions are critical when doing low-level networking, such as verifying the checksums in IP packets. If you don't understand endian issues correctly, your life will be painful; take my word on this one. Use the translation functions, and know why they are needed.

Solution 2: Use a Byte Order Mark (BOM)

The other approach is to include a magic number, such as 0xFEFF, before every piece of data. If you read the magic number and it is 0xFEFF, it means the data is in the same format as your machine, and all is well.

If you read the magic number and it is 0xFFFE (it is backwards), it means the data was written in a format different from your own. You'll have to translate it.

A few points to note. First, the number isn't really magic, but programmers often use the term to describe the choice of an arbitrary number (the BOM could have been any sequence of different bytes). It's called a byte-order mark because it indicates the byte order the data was stored in.

Second, the BOM adds overhead to all data that is transmitted. Even if you are only sending 2 bytes of data, you need to include a 2-byte BOM. Ouch!

Unicode uses a BOM when storing multi-byte data (some Unicode character encodings can have 2, 3 or even 4 bytes per character). XML avoids this mess by storing data in UTF-8 by default, which stores Unicode information one byte at a time. And why is this cool?

(Repeated for the 56th time) "Because endian issues don't matter for single bytes".

Right you are.

Again, other problems can arise with a BOM. What if you forget to include the BOM? Do you assume the data was sent in the same format as your own? Do you read the data and see if it looks "backwards" (whatever that means) and try to translate it? What if regular data includes the BOM by coincidence? These situations are not fun.

Why Are There Endian Issues at All? Can't We Just Get Along?

Ah, what a philosophical question.

Each byte-order system has its advantages. Little-endian machines let you read the lowest byte first, without reading the others. You can check whether a number is odd or even (whether the last bit is 0) very easily, which is cool if you're into that kind of thing. Big-endian systems store data in memory the same way we humans think about data (left-to-right), which makes low-level debugging easier.

But why didn't everyone just agree to one system? Why do certain computers have to try and be different?

Let me answer a question with a question: why doesn't everyone speak the same language? Why are some languages written left-to-right, and others right-to-left?

Sometimes communication systems develop independently, and later need to interact.

Epilogue: Parting Thoughts

Endian issues are an example of the general encoding problem: data needs to represent an abstract concept, and later the concept needs to be created from the data. This topic deserves its own article (or series), but you should now have a better understanding of endian issues. More information:

    • Wikipedia entry
    • Endian FAQ
