Before we can fully understand where character encodings come from and how Python handles them, a few basic concepts need to be made clear. We touch, and even use, some of them every day without necessarily understanding them: byte, character, character set, character code, and character encoding.
Byte
The byte is an abstract unit of measurement in computing. A group of 8 binary digits (bits), each 0 or 1, makes up 1 byte (1 byte = 8 bits). The byte is the basic unit of data storage in a computer.
All data in a computer, whether stored in a file on disk or transmitted over a network (text, pictures, video, audio), is made up of bytes.
Character
A character is also an abstract concept: a unit of information, and the collective term for all kinds of letters and symbols. An English letter is a character, a Chinese character is a character, and a punctuation mark is also a character.
Character set
A character set is a collection of characters within some range, and different character sets contain different characters. For example, the ASCII character set has 128 characters in total, including English letters, Arabic numerals, punctuation marks, and control characters, while the GB2312 character set defines 7,445 characters and covers most commonly used Chinese characters.
Character code
A character code (code point) is the number assigned to each character in a character set. For example, the ASCII character set uses the 128 consecutive numbers 0-127 to represent its 128 characters, and the character code for "A" is 65. (Strictly speaking the code is the binary value 01000001; ASCII tables list it in decimal only because that is easier for people to remember.)
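As a small illustration of character codes, Python's built-in ord() and chr() convert between a character and its number. The variable names here are only for this sketch.

```python
# Built-in functions that convert between a character and its code.
code = ord("A")              # character -> character code
bits = format(code, "08b")   # the same number written as 8 binary digits
char = chr(65)               # character code -> character

print(code, bits, char)      # 65 01000001 A
```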
Character encoding
A character encoding is a concrete scheme for mapping the character codes of a character set to byte streams. Common character encodings include ASCII, UTF-8, and GBK. In a sense, a character set comes with a corresponding character encoding; for example, the ASCII character set corresponds to the ASCII encoding. The ASCII encoding specifies that every character is encoded in the low 7 bits of a single byte. For example, the code for "A" is 65 and its single-byte representation is 0x41, so what gets written to a storage device is the binary value 01000001.
Encoding and decoding
Encoding is the process of converting characters into a byte stream; decoding is the reverse process of parsing a byte stream back into characters.
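In Python 3 these two directions correspond to str.encode() and bytes.decode(). The following is a minimal sketch; the choice of UTF-8 and of the sample text (the letter A followed by the Chinese character 严, U+4E25) is an assumption made for illustration.

```python
text = "A\u4e25"                  # "A" followed by the character 严 (U+4E25)

data = text.encode("utf-8")       # encoding: str -> bytes
restored = data.decode("utf-8")   # decoding: bytes -> str

print(data)               # b'A\xe4\xb8\xa5'
print(restored == text)   # True
```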
The evolution of character encoding
1. ASCII code
We know that inside a computer all information is ultimately represented as a string of binary digits. Each bit has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, one byte can represent 256 different states, and if each state corresponds to one symbol, that is 256 symbols, from 00000000 to 11111111.
The computer was invented in the United States, and in the English-speaking world the commonly used characters are very limited: 26 letters (in upper and lower case), 10 digits, punctuation, and control characters. A single byte of storage is more than enough to represent all of them.
The American National Standards Institute (ANSI) developed a character encoding that standardized the relationship between English characters and bit patterns. This is ASCII (American Standard Code for Information Interchange), which is still in use today.
ASCII specifies 128 characters in total. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed, codes 0-31 and 127) occupy only the last 7 bits of a byte; the first bit is always 0.
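The 7-bit limit can be checked with a short sketch. The sample string is arbitrary, and é is used only as an example of a character outside the 128-symbol table.

```python
ascii_bytes = "Hello, ASCII!".encode("ascii")

# Every ASCII byte is below 128, so its highest bit (value 0x80) is never set.
top_bit_clear = all(byte & 0x80 == 0 for byte in ascii_bytes)
print(top_bit_clear)   # True

# A character outside the 128-symbol table cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")      # é, LATIN SMALL LETTER E WITH ACUTE
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)       # False
```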
2. Non-ASCII encoding
128 symbols are enough to encode English, but not enough to represent other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. As a result, some European countries decided to use the idle highest bit of the byte to add new symbols. For example, the code for é in French is 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols.
However, this created new problems. Different countries have different letters, so even though they all use 256-symbol encodings, the same number stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; only the range 128-255 differs.
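The ambiguity of byte 130 can be reproduced in Python. This sketch assumes three legacy DOS code pages (cp437, cp862, cp866) as stand-ins for the "French", "Hebrew", and "Russian" encodings; the text does not name the specific encodings, so that choice is an assumption.

```python
raw = bytes([130])   # the single byte 0x82

# The three code pages are assumed examples; other national encodings
# would give different results for the same byte.
as_western = raw.decode("cp437")   # é
as_hebrew = raw.decode("cp862")    # ג (Gimel)
as_russian = raw.decode("cp866")   # В (Cyrillic capital Ve)

# Bytes in the range 0-127 mean the same thing in all three.
shared = {b"A".decode(name) for name in ("cp437", "cp862", "cp866")}

print(as_western, as_hebrew, as_russian, shared)
```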
Asian languages use even more symbols; Chinese alone has about 100,000 characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used for a single symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols.
3. Unicode
As the previous section mentioned, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong encoding, you get garbled text. Why do e-mails often appear garbled? Because the sender and the recipient are using different encodings.
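Garbled text of this kind is easy to reproduce. In the sketch below the sender is assumed to write UTF-8 and the receiver is assumed to read Latin-1; both choices are only examples of a mismatch.

```python
original = "caf\u00e9"                   # "café"
wire_bytes = original.encode("utf-8")    # sender writes UTF-8: 63 61 66 c3 a9

garbled = wire_bytes.decode("latin-1")   # receiver wrongly assumes Latin-1
correct = wire_bytes.decode("utf-8")     # receiver uses the right encoding

print(garbled)   # cafÃ©
print(correct)   # café
```

The bytes are identical in both cases; only the interpretation differs.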
Imagine an encoding that includes every symbol in the world and gives each one a unique character code; the garbled-text problem would then disappear. This is Unicode: as its name indicates, an encoding for all symbols.
Unicode is also a way of assigning codes to characters. Its counterpart in the ISO/IEC 10646 standard is formally named the "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS is sometimes loosely read as "Unicode Character Set", but that is not its official expansion.
UCS specifies which multi-byte code is assigned to each character. How those codes are stored and transmitted is specified by the UTF (UCS Transformation Format) encodings; common ones include UTF-8, UTF-7, and UTF-16.
There are two forms of UCS: UCS-2 and UCS-4. UCS-2 uses two bytes, 16 bits in total, and can theoretically represent at most 65,536 characters. That is far from enough for all the characters in the world, since Chinese characters alone number nearly 100,000. So the Unicode 4.0 specification defines an additional set of character codes, UCS-4, which uses 4 bytes (actually only 31 bits, because the highest bit must be 0). In theory this can fully cover the symbols used in all languages.
For the specific table of symbols, consult unicode.org or a dedicated table for Chinese characters.
Unicode acts like a translator between humans and computers, mapping the characters that humans understand to numeric character codes that a computer can process.
4. Limitations of Unicode
It is important to note that Unicode is just a set of symbols: it specifies the binary code of each symbol, but not how that binary code should be stored.
For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which as a binary number is a full 15 bits (100111000100101). In other words, representing this symbol takes at least 2 bytes. Other, larger symbols may take 3 or 4 bytes, or even more.
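The numbers in this example can be checked directly. This is a small sketch using only built-ins; the variable names are illustrative.

```python
code_point = ord("\u4e25")              # the character 严

hex_form = format(code_point, "X")      # '4E25'
binary_form = format(code_point, "b")   # '100111000100101'
bits_needed = code_point.bit_length()   # 15
bytes_needed = (bits_needed + 7) // 8   # round up to whole bytes: 2

print(hex_form, binary_form, bits_needed, bytes_needed)
```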
The first problem is how to distinguish Unicode from ASCII: how does the computer know that three bytes represent one symbol rather than three separate symbols?
The second problem is that when Unicode text is stored or sent over a network, not every character actually needs two bytes. We already know that a single byte is enough for English letters (the characters covered by ASCII). If every symbol's code had to be written with the same fixed width of two or more bytes, every English letter would carry at least one byte of zeros. That is a great waste of storage: text files would become two or three times larger, which is unacceptable.
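The storage overhead can be measured with a short sketch. It assumes Python's fixed-width codecs utf-16-be and utf-32-be as stand-ins for "two bytes per character" and "four bytes per character"; for plain English letters they behave exactly that way.

```python
text = "hello"

one_byte = text.encode("ascii")        # 1 byte per letter
two_bytes = text.encode("utf-16-be")   # 2 bytes per letter, no byte order mark
four_bytes = text.encode("utf-32-be")  # 4 bytes per letter, no byte order mark

print(len(one_byte), len(two_bytes), len(four_bytes))   # 5 10 20
print(two_bytes)                # b'\x00h\x00e\x00l\x00l\x00o'

zero_bytes = two_bytes.count(0) # one wasted zero byte per letter
print(zero_bytes)               # 5
```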
These two problems had two consequences: 1) multiple storage formats for Unicode emerged, meaning there are many different binary formats that can represent the Unicode character set; 2) Unicode could not be widely adopted for a long time, until the rise of the Internet.
5. UTF-8
Unicode arrived alongside computer networking, so how to transmit Unicode over a network was a question that had to be answered. This led to a number of UTF (UCS Transformation Format) standards. As the names suggest, UTF-8 handles data in units of 8 bits and UTF-16 in units of 16 bits. The step from a Unicode code to a UTF is not a direct copy of the number; it goes through conversion rules and algorithms.
The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the ways Unicode is implemented.
The biggest feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the number of bytes depending on the symbol.
The encoding rules of UTF-8 are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are identical.
2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the Unicode code of the symbol, with any unused leading positions filled with 0.
The following table summarizes the encoding rules; the letter x marks the bits available for encoding.
Unicode symbol range (hex) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding is carried out.
We know the Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, the x positions in the format are filled in from back to front, and any leftover x positions are set to 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
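The two rules and the table can be turned into a few lines of bit manipulation. The function name utf8_encode is hypothetical and the code is a teaching sketch, not a replacement for Python's own codec: it does not reject the surrogate range D800-DFFF, which the built-in encoder refuses.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point <= 0x7F:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point is outside the Unicode range")


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())                            # E4B8A5
print(" ".join(format(b, "08b") for b in encoded))      # 11100100 10111000 10100101
```

The result can be compared against the built-in codec with chr(0x4E25).encode("utf-8").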
6. Conversion between Unicode and UTF-8
From the example in the previous section, the Unicode code of 严 is 4E25 and its UTF-8 encoding is E4B8A5; the two are not the same. Conversion between them can be done by a program.
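One possible form of such a program is sketched here in Python. Both helper names are hypothetical, and each helper handles a single character written as a hexadecimal string.

```python
def code_point_to_utf8_hex(code_hex: str) -> str:
    """'4E25' -> 'E4B8A5': Unicode code to UTF-8 bytes, both as hex text."""
    return chr(int(code_hex, 16)).encode("utf-8").hex().upper()


def utf8_hex_to_code_point(utf8_hex: str) -> str:
    """'E4B8A5' -> '4E25': UTF-8 bytes back to the Unicode code."""
    char = bytes.fromhex(utf8_hex).decode("utf-8")   # must decode to one character
    return format(ord(char), "04X")


print(code_point_to_utf8_hex("4E25"))     # E4B8A5
print(utf8_hex_to_code_point("E4B8A5"))   # 4E25
```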
On Windows, one of the simplest ways to convert is to use the built-in Notepad program, Notepad.exe. After opening a file, click "Save As" in the "File" menu; a dialog box appears with an "Encoding" drop-down list at the bottom.
It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1) ANSI is the default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
2) Unicode refers to the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.
3) Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.
4) UTF-8 is the encoding described in the previous section.
After choosing an encoding, click the "Save" button and the file is converted to that encoding immediately.
7. Little endian and big endian
As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.
These two odd names come from Gulliver's Travels by the British writer Jonathan Swift. In the book, civil war breaks out in Lilliput over whether an egg should be cracked open at the big end (Big-endian) or the little end (Little-endian). Six wars were fought over this; one emperor lost his life and another lost his throne.
Therefore, storing the first byte first is "big endian", and storing the second byte first is "little endian".
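The two byte orders can be seen side by side with int.to_bytes(). The sketch also compares the result with Python's utf-16-be and utf-16-le codecs, which produce the same two bytes for this character.

```python
code_point = 0x4E25   # the character 严

big = code_point.to_bytes(2, byteorder="big")        # 4E first, then 25
little = code_point.to_bytes(2, byteorder="little")  # 25 first, then 4E

print(big.hex(), little.hex())   # 4e25 254e

# The UTF-16 codecs give the same bytes for this character.
same_big = chr(code_point).encode("utf-16-be") == big
same_little = chr(code_point).encode("utf-16-le") == little
print(same_big, same_little)     # True True
```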
A question then naturally arises: how does the computer know which byte order a particular file uses?
The Unicode specification defines a character that can be placed at the very beginning of a file to indicate the byte order. Its name is "ZERO WIDTH NO-BREAK SPACE" and its code is FEFF. This is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
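A check of this kind might look like the following sketch. The function name detect_utf16_byte_order is hypothetical; the two constants come from Python's codecs module. It only looks at the two-byte marks described here, so it would misread a UTF-32 little endian file, whose mark also begins with FF FE.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from the first two bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:    # FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:    # FF FE
        return "little"
    return "unknown"


print(detect_utf16_byte_order(b"\xfe\xff\x4e\x25"))   # big
print(detect_utf16_byte_order(b"\xff\xfe\x25\x4e"))   # little
print(detect_utf16_byte_order(b"plain text"))         # unknown
```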
8. Example
Here is an example.
Open Notepad (Notepad.exe), create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.
Then use the hex mode of the text editor UltraEdit to see how each file is encoded internally.
1) ANSI: the file consists of the two bytes "D1 CF", which is the GB2312 encoding of 严. This also implies that GB2312 is stored in big endian order.
2) Unicode: the file consists of the four bytes "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.
3) Unicode big endian: the file consists of the four bytes "FE FF 4E 25", where "FE FF" indicates big endian storage.
4) UTF-8: the file consists of the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8, and the last three, "E4 B8 A5", are the encoding of 严 itself, stored in the same order as the encoding.
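The same four byte sequences can be reproduced without Notepad. This sketch assumes that Python's gb2312, utf-16-le, utf-16-be, and utf-8-sig codecs correspond to the four Notepad options on a Simplified Chinese system; that mapping is an assumption for illustration, not something Notepad documents.

```python
import codecs

char = "\u4e25"   # the character 严

ansi = char.encode("gb2312")                                  # d1 cf
unicode_le = codecs.BOM_UTF16_LE + char.encode("utf-16-le")   # ff fe 25 4e
unicode_be = codecs.BOM_UTF16_BE + char.encode("utf-16-be")   # fe ff 4e 25
utf8_with_mark = char.encode("utf-8-sig")                     # ef bb bf e4 b8 a5

for label, data in [("ANSI", ansi), ("Unicode", unicode_le),
                    ("Unicode big endian", unicode_be), ("UTF-8", utf8_with_mark)]:
    print(label, data.hex().upper())
```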
Reference sources:
https://www.cnblogs.com/gavin-num1/p/5170247.html
https://foofish.net/python-unicode-error.html
Understand the coding applications in Python