Talking about encoding and decoding (understanding bytes, UTF-8, ASCII, Unicode)

Introduction

Every time I sit down to write a blog post like this one, I breathe a long sigh of relief: being nagged by a problem and finally working out the underlying principle feels like slowly catching your breath after nearly suffocating. As a computer science student, I am ashamed to admit that in four years of university I never thought carefully about the details of encoding and decoding. All I remembered was one rule my teacher gave us: "When in doubt, use UTF-8." Knowing what to do without knowing why has bothered me many times, and this post is my attempt to finally fix that.

The origin of character encoding

There are plenty of articles on the history of character encoding on the Internet, and in fact understanding that history solves half of the confusion around encodings. Let me try to explain it clearly.

As we know, the computer was invented in the United States, and the encoding problem has been with it since the day it appeared. A computer can only store machine code in the form of 0s and 1s; each 0 or 1 ultimately corresponds to a switch in the hardware being off or on. Writing programs directly in 0s and 1s is obviously unfriendly to programmers, so programming languages appeared. A programming language is made of characters we can read, which greatly liberated programmers; but being friendly to humans also means being unfriendly to the machine, which cannot recognize these characters directly. So encoding and decoding naturally stepped onto the stage of history.

The first encoding format to appear was ASCII. This scheme was created by the Americans, and its basic rule is to represent each character with one byte (8 bits). Since American English needs no more than 128 characters, and one byte can distinguish 256 values, this encoding worked without any problem in the early, English-only computing world.
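
To make this concrete, here is a minimal sketch in Python (a language of my own choosing, not the original post's) of how the ASCII mapping behaves:

    # Each ASCII character maps to a single byte value below 128.
    text = "Hello"
    data = text.encode("ascii")   # str -> bytes
    print(data)                   # b'Hello'
    print(list(data))             # [72, 101, 108, 108, 111]
    print(ord("A"), chr(72))      # 65 H

    # Characters outside ASCII's 128-character set cannot be encoded:
    # "中".encode("ascii") raises UnicodeEncodeError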

Later, as computers spread around the world, every country faced the problem of representing its own language. Chinese, for example, has thousands of commonly used characters, so the single byte of ASCII is clearly not enough. This is where Unicode comes in. Unicode is only a set of representation rules; it does not prescribe a concrete storage format. The prefix uni- means "one" or "unified": Unicode attempts to express all of the world's languages in a single code. However, Unicode only assigns a number (a code point) to each character; it does not specify how many bytes that number occupies in memory, and that is where the mess began. Different implementations applied their own ingenuity, producing encoding forms such as UTF-16 and UTF-32, so Unicode's ideal of unification was still not realized in practice. Only with the spread of the Internet did UTF-8 achieve real unification: it implements the Unicode specification while adding its own rules, encoding each character as a variable-length sequence of one to four bytes (the original design allowed up to six).
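
A small Python sketch of this variable-length behavior; the byte counts follow directly from the UTF-8 rules:

    # UTF-8 spends a different number of bytes on different characters.
    for ch in ["A", "é", "中", "😀"]:
        encoded = ch.encode("utf-8")
        print(ch, encoded, len(encoded))
    # A b'A' 1
    # é b'\xc3\xa9' 2
    # 中 b'\xe4\xb8\xad' 3
    # 😀 b'\xf0\x9f\x98\x80' 4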

About bytes

Many people get confused here, mixing up bytes with the other data types of a programming language. In fact, bytes is the computer's real, native data type, and it is the only format in which data travels across a network. Format strings such as JSON and XML must ultimately be converted to the bytes type before they can be transmitted through a socket. Converting between bytes and string data is exactly what encoding and decoding do, and UTF-8 is the format we specify for that codec step.
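
A minimal Python sketch of the string/bytes round trip described above; the sample message is my own:

    # Encoding turns a str into bytes; decoding turns bytes back into a str.
    message = '{"name": "Alice"}'      # a JSON-formatted string
    raw = message.encode("utf-8")      # str -> bytes, ready for a socket
    print(type(raw), raw)              # <class 'bytes'> b'{"name": "Alice"}'

    restored = raw.decode("utf-8")     # bytes -> str on the receiving side
    print(type(restored), restored)    # <class 'str'> {"name": "Alice"}

    # Decoding with the wrong codec fails or corrupts the text:
    # "中".encode("utf-8").decode("ascii") raises UnicodeDecodeError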

A few words on serialization and deserialization. Serialization can be divided into local and network cases. Local serialization usually means persisting an in-memory object to the local hard disk: the object and its related information are first serialized into a string, and that string is then encoded in some format (such as UTF-8) into the bytes type and written to disk. Deserialization goes the other way: the bytes read from disk are first decoded into a string, and the string is then parsed to rebuild the object.
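
A sketch of this local round trip in Python, using the standard json module; the object and file name are hypothetical examples of my own:

    import json

    obj = {"id": 1, "name": "Alice"}

    # Serialize: object -> JSON string -> UTF-8 bytes -> disk.
    text = json.dumps(obj)                # object -> str
    with open("obj.json", "wb") as f:     # "obj.json" is a made-up path
        f.write(text.encode("utf-8"))     # str -> bytes

    # Deserialize: disk -> bytes -> string -> object.
    with open("obj.json", "rb") as f:
        restored = json.loads(f.read().decode("utf-8"))
    print(restored == obj)                # True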

The data transferred over a network is typically JSON or XML. Here, serialization means converting an object into a JSON-formatted string, and that string must likewise be encoded into the bytes type before transmission. Deserialization first converts the received bytes back into a JSON-formatted string (a step most frameworks tend to do for us), and then the real deserialization parses that string into an object. (Note the distinction between a JSON-formatted string and the object it describes.)
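
A hedged sketch of the network case in Python; the host, port, and payload are placeholders of my own, not from the original post:

    import json
    import socket

    obj = {"action": "ping"}
    payload = json.dumps(obj).encode("utf-8")   # object -> str -> bytes

    # Hypothetical endpoint; only bytes ever travel through the socket.
    with socket.create_connection(("127.0.0.1", 9000)) as sock:
        sock.sendall(payload)
        reply = sock.recv(4096)                        # bytes arrive
        response = json.loads(reply.decode("utf-8"))   # bytes -> str -> object
        print(response)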

Finally

To stress it one more time: do not confuse serialization and deserialization with encoding and decoding. They are concepts from two different dimensions. Once you understand the codec problem, encoding issues in your programs stop being a source of worry; just remember to think them through.
