Source: http://www.jb51.net/article/92006.htm
The following is the full text:
I spent the last couple of days writing a crawler to monitor a web page and track its changes, but after running overnight it hit a problem... I hope someone can generously enlighten me!
I'm using Python 3. The error is thrown when decoding the HTML response, and the code is:
response = urllib.request.urlopen(dsturl)
content = response.read().decode('utf-8')
It throws this error:
File "./unxingcrawler_p3.py", line A, in getnewphones
    content = response.read().decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 24137: invalid start byte
It ran fine before; the problem only appeared after a night of running... What I couldn't understand was: why can't a page that declares itself to be UTF-8 encoded be parsed as UTF-8?
Later, after a tip from a helpful netizen, I finally discovered that what I needed was decode('utf-8', 'ignore').
To understand Python's encoding problems thoroughly, I'm sharing the following; I hope it helps you get familiar with encoding issues in Python.
1. Starting from bytes
A byte consists of eight bits, each representing 0 or 1, so a byte can represent 2^8 = 256 numbers, from 00000000 to 11111111. ASCII encoding uses one byte (originally the highest bit of the byte was reserved as a parity bit), so ASCII actually uses only 7 bits of the byte to represent a character, giving 2^7 = 128 characters. For example, the ASCII code 01000001 (decimal 65) represents the character 'A'; adding 32 gives 01100001 (decimal 97), which represents the character 'a'. Now open Python, call the chr and ord functions, and you can see that Python translates ASCII codes for us.
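As a quick check, a minimal sketch of the chr/ord round trip described above:

```python
# chr() maps an ASCII code to its character; ord() does the reverse.
print(chr(65))       # the ASCII code 65 (binary 01000001) is 'A'
print(chr(65 + 32))  # adding 32 gives 97, the lowercase 'a'
print(ord('a'))      # and ord() recovers the code: 97
```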
The code 00000000 represents the null character, so ASCII actually includes only 127 usable characters: letters, punctuation marks, special symbols and so on. Because ASCII was born in the United States, it is sufficient for English, whose words are built from letters. But speakers of Chinese, Japanese, Korean and other languages were not satisfied: Chinese is written in characters, and even all 256 values of a single byte are not nearly enough.
So Unicode encoding was later introduced. Unicode characters are usually encoded in two bytes, which can represent 256*256 characters in total; this is known as UCS-2. Some rare characters use four bytes, the so-called UCS-4, which shows that the Unicode standard is still evolving. But UCS-4 appears rarely, so for now remember: the original ASCII used one byte per character, but because different languages have different numbers of characters, people moved to two bytes and a unified, multi-language Unicode encoding.
In Unicode, the original 127 ASCII characters only need a zero byte padded in front: for example, the character 'a', 01100001, becomes 00000000 01100001 in Unicode. Soon the Americans were unhappy about sharing one big pot with the whole world: English that used to need only one byte now takes two, wasting storage space and transfer speed.
People put their wits to work again, and UTF-8 appeared. To solve the space-waste problem, UTF-8 is a variable-length encoding: from one byte for English letters, to the usual three bytes for Chinese, up to six bytes for some rare characters. Besides solving the space problem, UTF-8 has a magical extra feature: it is compatible with good old ASCII, so some old-fashioned software can keep working under UTF-8.
Note that, apart from the English letters, a Chinese character's Unicode code and its UTF-8 encoding are usually different. For example, the Chinese character '中' is 01001110 00101101 in Unicode, but 11100100 10111000 10101101 in UTF-8.
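These byte values can be verified directly in Python; a small sketch (using the big-endian UTF-16 codec here to show the two-byte UCS-2 form of the code point):

```python
ch = '中'
# UTF-8 form: three bytes, 11100100 10111000 10101101
print(ch.encode('utf-8'))      # b'\xe4\xb8\xad'
# Two-byte Unicode (UCS-2) form, big-endian: 01001110 00101101
print(ch.encode('utf-16-be'))  # the bytes 0x4e 0x2d
```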
Our motherland, of course, has its own set of standards: GB2312 and GBK. They are fairly rarely seen now; UTF-8 is usually used directly.
2. The default encoding in Python 3
Python 3's default is UTF-8. We can use the following code:
import sys
sys.getdefaultencoding()
to view Python 3's default encoding.
3. encode and decode in Python 3
Character handling in Python 3 constantly uses the decode and encode functions; especially when crawling web pages, proficiency with these two functions pays off. encode converts the characters we see intuitively into the byte form used inside the computer; decode does the opposite, converting the byte form back into the intuitive, human-readable characters.
\x indicates that what follows is hexadecimal, and \xe4\xb8\xad is the binary 11100100 10111000 10101101. In other words, the Chinese character '中', once encoded into byte form, is 11100100 10111000 10101101. Likewise, if we take 11100100 10111000 10101101, i.e. \xe4\xb8\xad, and decode it back, we get the character '中'. The complete form should be b'\xe4\xb8\xad': in Python 3, a byte-form string must carry the b prefix, written as b'xxxx'.
Python 3's default encoding is UTF-8, so what we see is Python handling these characters as UTF-8. Thus even if we deliberately encode the character to UTF-8 with encode('utf-8'), the result is the same: b'\xe4\xb8\xad'.
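A minimal round trip illustrating the above:

```python
s = '中'
b = s.encode()            # str -> bytes, using the default UTF-8
print(b)                  # b'\xe4\xb8\xad'
print(s.encode('utf-8'))  # explicitly UTF-8: the same b'\xe4\xb8\xad'
print(b.decode())         # bytes -> str, back to '中'
```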
With this understood, and knowing that UTF-8 is compatible with ASCII, we can guess that the character 'A', whose ASCII code of 65 we all recited in school, should decode out directly. Decimal 65 converted to hexadecimal is 41; let's try it:
b'\x41'.decode()
The result is indeed the character 'A'.
4. Encoding conversion in Python 3
It is said that characters are Unicode-encoded in the computer's memory; they only become UTF-8 when written to a file, stored on a hard disk, or sent from a server to a client (for example, the front-end code of a web page). But what I care about is how to see what these characters look like as Unicode bytes, revealing their true face in memory. Here is the magic mirror:
xxxx.encode/decode('unicode-escape')
b'\\u4e2d' or b'\u4e2d': the extra backslash seems to make no difference. I also found that in the shell, printing '\u4e2d' directly and entering b'\u4e2d'.decode('unicode-escape') give the same result, both printing the Chinese character '中'; in contrast, '\u4e2d'.decode('unicode-escape') raises an error, because in Python 3 a str has no decode method. Note that Python 3 not only supports Unicode, but a '\uxxxx'-format Unicode escape written directly in a string literal is recognized by itself and is simply of type str.
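A short sketch of the unicode-escape behaviour described above:

```python
# encode to the \uXXXX escape form (as bytes)
print('中'.encode('unicode-escape'))        # b'\\u4e2d'
# decode the escaped byte form back to the character
print(b'\\u4e2d'.decode('unicode-escape'))  # '中'
# a \uXXXX escape inside a str literal is already the character itself
print('\u4e2d')                             # '中'
```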
If we have a Unicode bytecode, how do we turn it into a UTF-8 bytecode? With the above understood, the idea is clear: first decode, then encode. The code is as follows:
xxx.decode('unicode-escape').encode()
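For example, converting the Unicode-escape bytes for '中' into UTF-8 bytes:

```python
raw = b'\\u4e2d'  # the Unicode escape, in byte form
# decode interprets the escape into the str '中';
# encode then produces its UTF-8 byte form
utf8 = raw.decode('unicode-escape').encode()
print(utf8)       # b'\xe4\xb8\xad'
```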
5. One last extension
Remember ord from earlier? Times change, and big-brother ASCII was absorbed into Unicode, but ord still has its niche. Try ord('中'): the output is 20013. What is 20013? Try hex(ord('中')): the output is '0x4e2d', i.e. 20013 is the decimal value of the 0x4e2d we met countless times above. hex here is the function that converts to hexadecimal; anyone who has studied microcontrollers will certainly be familiar with it.
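The two calls from this paragraph, spelled out:

```python
print(ord('中'))       # 20013: the code point as a decimal integer
print(hex(ord('中')))  # '0x4e2d': the same value in hexadecimal
```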
One final point, from someone else's question I saw online. When we write characters like '\u4e2d' in code, Python 3 knows what we mean. But when Python reads the six characters '\u4e2d' from a file, doesn't the computer fail to recognize them? Later an answer was given, as follows:
import codecs
file = codecs.open("a.txt", "r", "unicode-escape")
u = file.read()
print(u)
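A self-contained sketch of that answer (the file name a.txt follows the article; here we also write the file first so the example runs on its own):

```python
import codecs

# write the six literal characters \u4e2d into the file
with open("a.txt", "w") as f:
    f.write("\\u4e2d")

# codecs.open with the 'unicode-escape' codec interprets the
# escape sequence while reading, yielding the character itself
with codecs.open("a.txt", "r", "unicode-escape") as f:
    u = f.read()
print(u)  # 中
```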