In programming we often run into baffling mojibake (garbled-text) problems. Many people simply search the web for an answer and copy someone else's example over, which is a quick way to make the symptom go away. But as a rigorous developer, if you never understand the mechanism behind garbled text at its source, and from there find the fundamental fix, you will never escape the shadow of the code monkey. So let's look at the ins and outs of character encoding in computers.
ASCII
As is well known, all data in a computer, whether text, images, video, or audio files, is ultimately stored in binary form, as strings of bits like 01010101. Characters, however, cannot be represented that way directly; they need a mapping to bits. Since the computer was invented in the United States, the earliest character code followed the American standard: ASCII (American Standard Code for Information Interchange). ASCII defines 128 characters in total; for example, the uppercase letter A is encoded as 65 (binary 01000001) and the symbol @ as 64 (binary 01000000). Of these 128 symbols, 0~31 and 127 (33 in total) are control or communication characters, while 32~126 are assigned to printable characters found on the keyboard. Every ASCII character occupies only the lower 7 bits of a single byte, with the highest bit always set to 0.
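A quick check in the Python interpreter (these values are fixed by the ASCII standard, so any Python 2 session will show the same):

>>> ord('A')     # the code for uppercase A
65
>>> chr(64)      # the character at code 64
'@'
>>> bin(65)      # only 7 bits are needed; the high bit stays 0
'0b1000001'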
Later, in order to represent European letters beyond the English alphabet, extended ASCII appeared. Extended ASCII keeps the original 128 characters and adds another 128, for 256 in total. The added codes all have the highest bit set to 1, so the extension remains fully compatible with ASCII. It can represent characters such as the phonetic symbol æ (code 145, binary 10010001) and the French letter é (code 130, binary 10000010), and so on.
This extension can represent phonetic symbols and most non-English European letters, but it is not an international standard: the characters assigned to codes 128~255 differ from country to country, which produced a variety of different extended ASCII encodings. For example, the ISO 8859-1 character set, also known as Latin-1, covers commonly used Western European characters, including letters needed for German and French; the ISO 8859-2 character set, also known as Latin-2, collects Eastern European characters; the ISO 8859-3 character set, also known as Latin-3, collects Southern European characters; and so on.
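The practical consequence is that the same byte value means different characters under different code pages. A small sketch in Python 2 (cp437 is the old IBM PC code page and cp1252 the Windows Western code page; both are extended ASCII variants shipped with Python's standard codecs):

>>> chr(130).decode('cp437')    # under the IBM PC code page, byte 130 is é
u'\xe9'
>>> chr(130).decode('cp1252')   # under Windows-1252, the same byte is a low quote mark
u'\u201a'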
Is this enough? Obviously not. Chinese characters, for one, cannot be expressed in ASCII, and even extended ASCII falls far short.
GBK
Chinese people made great efforts to put this great invention, the computer, to use with Chinese. GB2312 was one result of that effort: it was published in 1980 and took effect on May 1, 1981, marking an important step in the adoption of computers in China. GB2312 contains 6,763 Chinese characters and is also compatible with ASCII. It basically satisfies the needs of computerized Chinese text processing: the characters it includes cover 99.75% of usage frequency in mainland China, though GB2312 cannot handle some ancient and traditional Chinese characters. Later, an encoding called GBK was created on the basis of GB2312 and officially released in 1995. GBK includes not only all the Chinese characters and non-Chinese symbols of GB2312, but also traditional characters and the Chinese-derived characters that appear in Korean. For example, the character 乭 in the name of the famous Korean Go player Lee Sedol (李世乭) has the GBK code 0x8168 (0x denotes hexadecimal). Online character-set tables let you look up the GBK code of any Chinese character.
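You can also query such codes from Python 2, whose standard codecs include GBK. A hedged check of the 0x8168 claim above (乭 is U+4E6D, as we will see again in the Unicode section):

>>> u'\u4e6d'.encode('gbk')   # 乭; expect the two bytes 0x81 0x68
'\x81h'                       # repr shows 0x68 as the letter 'h'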
GBK typically uses two bytes to represent a character; an English letter takes one byte, identical to its ASCII encoding, so GBK is also compatible with ASCII, though not with any of the extended ASCII encodings. This can be seen from how its code ranges are laid out.
GBK's double-byte codes span the overall range 0x8140~0xFEFE (1000000101000000~1111111011111110): the lead byte lies between 0x81 and 0xFE, and the trail byte between 0x40 and 0xFE. Note that the highest bit of the lead byte is always 1. A decoder can therefore scan the stream byte by byte: a byte whose highest bit is 0 is parsed as an ASCII character, while a byte whose highest bit is 1 begins a two-byte character together with the byte that follows.
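A minimal demonstration in Python 2 (你 is U+4F60 and has the GBK code 0xC4E3, which we will meet again below; note the high bits):

>>> u'A\u4f60'.encode('gbk')   # the letter A followed by 你
'A\xc4\xe3'                    # 0x41 with high bit 0, then 0xC4 0xE3, both with high bit 1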
Unicode
There are many languages in the world; is there one encoding that covers the characters of all of them? The answer is yes: Unicode was designed to meet exactly this need. Unicode is a huge collection, currently able to accommodate more than a million symbols, each with its own distinct number. With this many characters, a single byte is obviously not enough to represent them all in binary. In its full form, Unicode uses up to 4 bytes per character, and this number is called the character's code point. For example, U+0639 is the Arabic letter ain, U+0041 is the uppercase English letter A, and U+4E6D is the character 乭. You can look up the symbol tables at unicode.org.
Using 4 bytes for every character is clearly not very sensible: many English letters need only one byte, and padding them to four wastes a great deal of space. Hence the UTF-8 encoding.
Unicode itself only specifies how characters map to code points; it does not specify how they are stored and transmitted. UTF-8 is one way of implementing Unicode. It uses 1 to 4 bytes per character, varying the length with the character to be represented: an English letter takes 1 byte, while a Chinese character typically takes 3.
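The variable length is easy to see in Python 2:

>>> len(u'a'.encode('utf-8'))
1
>>> len(u'\u6211'.encode('utf-8'))   # 我, a common Chinese character
3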
This raises a question: the data in a computer is just a continuous stream of bits, so how can an encoding both carry a character's code point from the Unicode table and let the computer tell that a given byte in the stream is a single-byte English letter rather than one piece of a character encoded with two or three bytes? UTF-8's designers solved this cleverly.
Characters that ASCII can represent take only one byte in UTF-8, identical to their ASCII form. For a multi-byte (n-byte) character, the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the remaining bytes are set to 10. All the bits left over are filled with the character's Unicode code point.
Unicode Symbol Range | UTF-8 Encoding method
(hex) | (binary)
-----------------------+---------------------------------------------
0000 0000~0000 007F | 0xxxxxxx
0000 0080~0000 07FF | 110xxxxx 10xxxxxx
0000 0800~0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000~0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
This scheme is easy to understand: if a byte's first bit is 0, that byte is a character by itself; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the character occupies. For example, the Unicode code point of 我 is 0x6211, binary 110001000010001, which falls within the range of the third row (0000 0800~0000 FFFF), so 我 needs three bytes in the format 1110xxxx 10xxxxxx 10xxxxxx. Starting from the lowest bit of 我's code point, the x positions are filled in from right to left, and any remaining x positions are padded with 0. This gives the UTF-8 encoding of 我 as 11100110 10001000 10010001, or E68891 in hexadecimal, and these are the bytes finally stored in the computer.
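We can reproduce this bit-packing by hand and check it against Python's built-in codec. A minimal sketch in Python 2 (utf8_3byte is a made-up helper name, and it only handles the three-byte row of the table above):

def utf8_3byte(cp):
    # Pack a code point in the range 0x0800~0xFFFF into
    # the pattern 1110xxxx 10xxxxxx 10xxxxxx from the table.
    b1 = 0xE0 | (cp >> 12)           # top 4 bits of the code point
    b2 = 0x80 | ((cp >> 6) & 0x3F)   # middle 6 bits
    b3 = 0x80 | (cp & 0x3F)          # low 6 bits
    return chr(b1) + chr(b2) + chr(b3)

print repr(utf8_3byte(0x6211))         # '\xe6\x88\x91', i.e. E68891
print repr(u'\u6211'.encode('utf-8'))  # the built-in codec agrees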
One misconception is worth pointing out here: there are many online "UTF-8 conversion" tools that claim to convert Chinese characters to UTF-8, but most of them merely return the character's Unicode code point, not the UTF-8 bytes actually used in storage and transmission.
In addition to UTF-8, Unicode is also implemented as UTF-16 and UTF-32. UTF-16 uses two bytes per character (four for characters outside the basic plane), and UTF-32 always uses the full 4 bytes, corresponding one-to-one with the code points. Whatever the representation, a given character's Unicode code point is the same; only the way the code point is converted for storage and transmission differs.
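This, too, can be observed from Python 2 (the -be suffix just pins the byte order so that no byte-order mark is emitted):

>>> u'\u6211'.encode('utf-16-be')   # 我 in UTF-16: the raw code point in 2 bytes
'b\x11'
>>> u'\u6211'.encode('utf-32-be')   # 我 in UTF-32: the same code point padded to 4 bytes
'\x00\x00b\x11'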
Python character encoding
Now let's talk about encoding in Python itself.
Python's default encoding is ASCII, which has to do with its birth: Python appeared in 1989, before Unicode had been released, so ASCII was the only practical choice at the time. Many improvements were made later to make it suitable for non-English users.
If you change nothing, Python (2.x) treats all source code, comments included, as ASCII.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
If Chinese appears in an interactive session, Python stores it using Windows' default encoding; in a Simplified Chinese environment that is GBK:
>>> s = '你好'
>>> s
'\xc4\xe3\xba\xc3'
When compiling a source file, however, Python still assumes ASCII.
Suppose the file string.py contains this single line:

print '我'

Running it produces:

  File "string.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file d:/mygit/taobaospider/string.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
If you want to use Chinese in your code, you must declare the file's encoding at the top (on the first or second line). For example, to set the encoding to UTF-8:
# coding=utf-8
Or, if the file begins with a shebang line, put the declaration on the second line:

#!/usr/bin/python
# coding=utf-8
In this way, you can use Chinese in your code.
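Putting it all together, a minimal end-to-end sketch (Python 2; the decode/encode round trip is the standard recipe, though the target console encoding varies by platform):

# -*- coding: utf-8 -*-
# save this file as UTF-8
s = '你好'                # a byte string: the UTF-8 bytes from the source file
u = s.decode('utf-8')     # decode the bytes into a unicode object
print repr(s)             # '\xe4\xbd\xa0\xe5\xa5\xbd'
print repr(u)             # u'\u4f60\u597d'
print u.encode('gbk')     # re-encode for, e.g., a GBK console on Chinese Windows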