One, the difference between ASCII, Unicode, and UTF-8
After running into a character-encoding problem, I read a large number of blog posts and ran some tests of my own, and now have a basic understanding of the causes and consequences of encoding problems.
1. Character Set and character encoding
The information stored in a computer is represented as binary numbers, and the characters we see on screen, such as English letters and Chinese characters, are the result of converting those binary numbers. Put simply, the rule by which a character such as 'a' is stored in the computer is called "encoding"; conversely, parsing the binary numbers stored in the computer back into displayed characters is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rules are used during decoding, 'a' may be parsed as 'b', or come out as garbled text.
Character set (charset): the collection of all abstract characters supported by a system. Characters include all kinds of text and symbols: national scripts, punctuation marks, graphic symbols, digits, and so on.
Character encoding: a set of rules that pairs the characters of a natural-language character set (such as an alphabet or a syllabary) with something else (such as numbers or electrical pulses). That is, establishing a correspondence between a symbol set and a numeric system is a basic technique of information processing. People usually express information with collections of symbols (typically text), while computer-based information-processing systems store and process information using combinations of hardware component states. Those state combinations can represent numbers in a numeric system, so character encoding maps characters to numbers the computer can accept; the result is called a digital code.
Common character sets include the ASCII character set, the GB2312 character set, the BIG5 character set, the GB18030 character set, the Unicode character set, and so on. To handle text in these various character sets accurately, the computer needs character encodings so that it can recognize and store all kinds of text.
2. ASCII code
We know that inside a computer, all information is ultimately represented as a binary string. Each bit has two states, 0 and 1, so eight bits can form 256 combinations; eight bits are called a byte. In other words, a single byte can represent 256 different states, and if each state corresponds to one symbol, that is 256 symbols, from 00000000 to 11111111.
In the 1960s, the United States developed a character-encoding scheme that made uniform rules for the relationship between English characters and bits. This is known as ASCII, and it is still in use today.
The ASCII code specifies a total of 128 characters. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) occupy only the lower 7 bits of a byte; the highest bit is uniformly 0.
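The ASCII values quoted above can be checked quickly in Python (used here purely for illustration; the article's Python section comes later):

```python
# Check the ASCII values mentioned above: space is 32, 'A' is 65,
# and the high bit of every ASCII character's byte is 0.
print(ord(' '))                  # 32
print(ord('A'))                  # 65
print(format(ord('A'), '08b'))   # 01000001 — the leading bit is 0
```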
3. Non-ASCII encodings
Encoding 128 symbols is enough for English, but 128 symbols are not enough to represent other languages. In French, for example, letters can carry diacritical marks that cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to add new symbols. For example, é is 130 (binary 10000010) in French encoding. In this way, the encoding systems used in these European countries can represent up to 256 symbols.
However, this created a new problem: different countries have different letters, so even when they all use 256-symbol encodings, the letters represented differ. For example, 130 represents é in French encoding, but the letter Gimel (ג) in Hebrew encoding, and yet another symbol in Russian encoding. In all of these encodings, however, 0–127 represent the same symbols; only the 128–255 range differs.
As for Asian countries, they use even more symbols; there are roughly 100,000 Chinese characters. A single byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
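A minimal sketch verifying the two-bytes-per-character claim, assuming a Python build with the standard gb2312 codec available:

```python
# GB2312 represents one Chinese character with two bytes,
# giving at most 256 x 256 = 65536 code positions in theory.
gb = "中".encode('gb2312')
print(len(gb))    # 2 — one character occupies two bytes
print(256 * 256)  # 65536
```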
Chinese encodings deserve a discussion of their own, which is beyond the scope of this note. It is only pointed out here that, although these encodings use multiple bytes to represent one symbol, the GB-class Chinese character encodings are unrelated to Unicode and UTF-8.
There are many encoding schemes in the world, and the same binary number can be interpreted as different symbols under different schemes. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong encoding, you get garbled text. Why do e-mails so often appear garbled? Because the sender and the recipient use different encodings.
4. Unicode

Imagine an encoding that includes every symbol in the world, with each symbol assigned a unique code: the garbled-text problem would disappear. This is Unicode which, as its name suggests, is an encoding for all symbols.
Unicode is of course a very large collection; its current scale can accommodate more than one million symbols. Each symbol has a different code. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the uppercase English letter A, and U+4E25 represents the Chinese character 严 ("strict"). For the specific symbol correspondence tables, see unicode.org or the dedicated Chinese character correspondence tables.
5. Problems with Unicode
It is important to note that Unicode is just a symbol set: it only specifies each symbol's binary code point, but does not specify how that code point should be stored.
For example, the Unicode code of the Chinese character 严 ("strict") is the hexadecimal number 4E25, which converted to binary is a full 15 bits (100111000100101); that is, this symbol requires at least 2 bytes. Other, larger symbols may require 3 or 4 bytes, or even more.
There are two serious problems here. The first: how do you distinguish Unicode from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols?
The second problem: we already know that one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, then every English letter would necessarily carry two or three bytes of zeros, which is a great waste of storage: text files would become two or three times larger, which is unacceptable.
The result was:

1) A variety of Unicode storage methods appeared; that is, there are many different binary formats that can represent Unicode.
2) Unicode could not be popularized for a long time, until the advent of the Internet.
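Point 1 is easy to demonstrate in Python: the same code point has different byte representations under different storage schemes (the codecs used here ship with any standard Python 3):

```python
# The single code point U+4E25 (严) stored three different ways.
for enc in ('utf-8', 'utf-16-be', 'utf-32-be'):
    print(enc, '严'.encode(enc))
# utf-8     -> 3 bytes
# utf-16-be -> 2 bytes
# utf-32-be -> 4 bytes
```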
6. UTF-8

The popularization of the Internet strongly demanded a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat: UTF-8 is one of the ways Unicode is implemented.
The biggest feature of UTF-8 is that it is a variable-length encoding. It can use 1 to 4 bytes to represent a symbol, with the length varying according to the symbol.
The coding rules for UTF-8 are simple, with only two lines:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the following 7 bits are the symbol's Unicode code. So for English letters, the UTF-8 encoding and the ASCII code are identical.
2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each subsequent byte are set to 10. All of the remaining bits, not mentioned above, are filled with the symbol's Unicode code.
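As a sketch, the two rules can be applied by hand to 严 (U+4E25) and checked against Python's own encoder. Since 0x4E25 needs 15 bits, the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx applies:

```python
cp = ord('严')                        # 0x4E25 = 0100111000100101 in binary
b1 = 0b11100000 | (cp >> 12)          # 1110 + highest 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # 10 + middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)         # 10 + lowest 6 bits
manual = bytes([b1, b2, b3])
print(manual)                  # b'\xe4\xb8\xa5'
print('严'.encode('utf-8'))    # b'\xe4\xb8\xa5' — identical to the hand-built bytes
```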
To summarize: ASCII uses 1 byte per character, a Unicode code point usually requires 2 bytes (or more), and UTF-8, one of the implementations of Unicode, is a variable-length encoding that represents a character in 1–4 bytes depending on the symbol.
Unicode stored directly (e.g. as UTF-16 or UTF-32) is not byte-compatible with ASCII; UTF-8 is compatible with ASCII.
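A small check of the compatibility claim: for an ASCII character, UTF-8 produces exactly the ASCII byte, while UTF-16 does not:

```python
print('A'.encode('ascii'))      # b'A'
print('A'.encode('utf-8'))      # b'A' — identical to ASCII
print('A'.encode('utf-16-be'))  # b'\x00A' — two bytes, not ASCII-compatible
```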
Two, Python encoding problems
Key points of Python encoding problems: string encoding and data-file encoding.
1. Python string encoding problem
In Python 3, strings are encoded in Unicode, which means Python strings support multiple languages:
In [1]: print("中国")
中国

In [2]: type("中国")
Out[2]: str
Because Python's string type str is represented in memory as Unicode, one character corresponds to several bytes. If you want to transmit the text over the network, or save it to disk, you need to convert the str to bytes.
In Python 3, bytes-type data is written with single or double quotation marks prefixed by b:
In [13]: x = b"Hello"

In [14]: type(x)
Out[14]: bytes

In [15]: print(x)
b'Hello'

In [16]: y = "Hello"

In [17]: type(y)
Out[17]: str

In [18]: print(y)
Hello

In [19]: b'中国'   # bytes can only contain ASCII characters, so Chinese cannot appear here; this raises an error
  File "<ipython-input-19-5365596ad95c>", line 1
    b'中国'
         ^
SyntaxError: bytes can only contain ASCII literal characters.
Be aware of the distinction between 'hello' and b'hello': the former is str; the latter, although its content looks the same, is bytes, and each of its elements occupies exactly one byte.
If you are still a little dizzy at this point: decode means decoding, turning a byte string in some encoding into a Unicode string; encode means encoding, turning a Unicode string into a byte string in some encoding. It is important to note that neither process is specifically tied to ASCII.
I used to think the role of decode and encode was to convert strings between Unicode and ASCII; the following passage clears this up:
Probably the most important new feature of Python 3 is a much clearer distinction between text and binary data.
Text is always Unicode, represented by the str type; binary data is represented by the bytes type.
Python 3 does not mix str and bytes in any implicit way, which makes the distinction between them particularly clear. You cannot concatenate a string with a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects a bytes object (or vice versa). This is a good thing.
In any case, the boundary between strings and bytes is unavoidable, and the following diagram is important to keep in mind:
[Figure: Python 3 encoding conversion for strings — a str is encoded into bytes with encode(), and bytes are decoded back into a str with decode(). This diagram is very important.]
A string can be encoded into a bytes object, and a bytes object can be decoded back into a string.
In [1]: y1 = 'Hello'.encode()

In [2]: print(y1)
b'Hello'

In [3]: type(y1)
Out[3]: bytes

In [4]: y2 = y1.decode()

In [5]: print(y2)
Hello

In [6]: type(y2)
Out[6]: str

In [7]: m = "中国"

In [8]: m1 = m.encode()

In [9]: m1
Out[9]: b'\xe4\xb8\xad\xe5\x9b\xbd'

In [10]: type(m1)
Out[10]: bytes

In [11]: m2 = m1.decode()

In [12]: m2
Out[12]: '中国'

In [13]: type(m2)
Out[13]: str
The issue should be understood this way:
A string is an abstract representation of text. Strings consist of characters, and characters are abstract entities unrelated to any particular binary representation. While manipulating strings, we live in blissful ignorance: we can split and slice them, concatenate and search them, without caring how they are represented internally or how many bytes each character takes. Only when we encode strings into bytes objects (for example, to send them over a channel) or decode them from bytes objects (the reverse operation) do we begin to pay attention to this.
The parameter passed to encode and decode is the encoding (or codec). An encoding is a way of representing abstract characters as binary data, and there are many of them; the UTF-8 described above is one.
The encoding is the critical part of this conversion. Apart from the encoding, the bytes object b'\xa420' is just a heap of bits; the encoding gives it meaning. With a different encoding, the meaning of this heap of bits can be very different:
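A small illustration of this point (the byte pair b'\xc3\xa9' is chosen here just for the example): the same bits decode to entirely different characters under different codecs.

```python
raw = b'\xc3\xa9'
print(raw.decode('utf-8'))    # é  — one character under UTF-8
print(raw.decode('latin-1'))  # Ã© — two different characters from the same bits
```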