Python3 Encoding Problem Summary

Source: Internet
Author: User

I spent the last two days writing a monitoring web crawler whose job is to track changes to a web page, but after running overnight it hit a problem. Any pointers are welcome!
I'm using Python3. The error is thrown when decoding the HTML response, and the code is as follows:

import urllib.request
response = urllib.request.urlopen(dsturl)
content = response.read().decode('utf-8')

This throws the error:

  File "./unxingcrawler_p3.py", line, in getnewphones
    content = response.read().decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb2 in position 24137: invalid start byte

It had been running without problems before; the error only appeared after a night of running. What I most wanted to understand was why a page that declares itself UTF-8 encoded could not be decoded as UTF-8.

Later, thanks to a reminder from a helpful reader, I found that I needed to use decode('utf-8', 'ignore').
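A minimal sketch of what the second argument to decode changes. The byte string below is made up for illustration (the real page content is not available); it simply places a stray 0xb2 byte where a UTF-8 start byte is expected, as in the traceback.

```python
# Hypothetical payload: ASCII text with one stray 0xb2 byte in the middle.
raw = b"price: \xb2 100"

try:
    raw.decode("utf-8")            # strict error handling is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(exc)

ignored = raw.decode("utf-8", "ignore")     # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")   # substitutes U+FFFD for the bad byte
print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' discards data without any trace, so 'replace' can be easier to debug because the damaged positions stay visible in the output.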

To understand Python's encoding issues thoroughly, I am sharing the following notes in the hope that they help you understand them too.

1. Starting from the byte

A byte consists of eight bits, each representing 0 or 1, so a byte can represent 2^8 = 256 values, from 00000000 to 11111111. ASCII uses one byte per character but leaves the highest bit out of the encoding (it was traditionally available as a parity bit), so it actually uses 7 bits and represents 2^7 = 128 characters. For example, 01000001 (decimal 65) is the character 'A', and adding 32 gives 01100001 (decimal 97), the character 'a'. Open Python and call the chr and ord functions, and you can see Python performing the ASCII conversion for us: chr(65) returns 'A' and ord('a') returns 97.
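The chr and ord calls described above can be tried directly in an interpreter. The following short sketch also prints the bit patterns, using the built-in format function for the binary view:

```python
upper_code = ord("A")   # code of the uppercase letter
lower_code = ord("a")   # code of the lowercase letter

print(chr(65))                        # A
print(lower_code)                     # 97
print(format(upper_code, "08b"))      # 01000001
print(format(lower_code, "08b"))      # 01100001
print(lower_code - upper_code)        # 32
```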

The first code, 00000000, is the null character, so ASCII really contains only 127 other characters: letters, digits, punctuation marks, special symbols, control codes, and so on. Since ASCII was born in the United States, it is sufficient for English, which is written with letters. Chinese, Japanese, Korean, and other languages are not so well served. Chinese is written character by character, and even all 256 values of a full byte are nowhere near enough.

As a result, Unicode appeared. A Unicode character is usually represented with two bytes, giving 256*256 = 65,536 possible characters; this is called UCS-2. Some rare characters need four bytes, known as UCS-4, which shows that the Unicode standard is still evolving. UCS-4 shows up less often, so for now just remember: the original ASCII encoding used one byte per character, but because languages differ in how many characters they have, people moved to two bytes, and so the unified, multilingual Unicode encoding came about.

In Unicode, the original ASCII characters only need to be preceded by an all-zero byte. For example, the character 'a' mentioned above, 01100001, becomes 00000000 01100001. Soon the Americans were unhappy: English text that used to need one byte per character now needed two, which wasted storage space and slowed transmission.

People are clever, so UTF-8 was invented. To address the wasted space, UTF-8 is a variable-length encoding: one byte for English letters, usually three bytes for Chinese characters, and four bytes for rarer characters (the original design allowed sequences of up to six bytes, but the current standard stops at four). Besides solving the space problem, UTF-8 has another remarkable feature: it is compatible with its predecessor ASCII, so some old software can keep working with UTF-8 data.
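To make the variable-length property concrete, here is a small sketch that measures how many bytes a few sample characters occupy in UTF-8. The sample characters are chosen for illustration only.

```python
# An ASCII letter, a common Chinese character, and an emoji outside the
# two-byte range (sample characters picked for illustration).
samples = ["a", "中", "\U0001F600"]

lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(lengths)

# ASCII compatibility: pure ASCII text yields identical bytes either way.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
print(ascii_compatible)
```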

Note that, apart from English letters, a character's Unicode code point and its UTF-8 encoding are usually different. For example, the character '中' is 01001110 00101101 in Unicode, whereas in UTF-8 it is 11100100 10111000 10101101.
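The two bit patterns above can be reproduced with a few lines of Python. The helper name to_bits is made up for this sketch.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(format(b, "08b") for b in data)

ch = "中"
code_point_bits = to_bits(ord(ch).to_bytes(2, "big"))   # the code point as two bytes
utf8_bits = to_bits(ch.encode("utf-8"))                 # the UTF-8 encoding

print(code_point_bits)   # 01001110 00101101
print(utf8_bits)         # 11100100 10111000 10101101
```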

China naturally has its own standards as well: GB2312 and GBK. They are seen less often nowadays, and UTF-8 is usually used directly.
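As a rough illustration of how these encodings differ, the sketch below encodes the same character with GBK and with UTF-8, then tries to decode the GBK bytes as UTF-8. Mixed-in legacy bytes are one plausible way to get an "invalid start byte" style error on a page that declares UTF-8, but that is an assumption here, not something the original post confirms.

```python
gbk_bytes = "中".encode("gbk")       # two bytes in GBK
utf8_bytes = "中".encode("utf-8")    # three bytes in UTF-8
print(gbk_bytes, utf8_bytes)

try:
    gbk_bytes.decode("utf-8")        # wrong codec for these bytes
    mismatch_detected = False
except UnicodeDecodeError as exc:
    mismatch_detected = True
    print(exc)

round_trip = gbk_bytes.decode("gbk")  # the matching codec restores the text
print(round_trip)
```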

2. Default encoding in Python3

The default encoding in Python3 is UTF-8. We can check it with the following code:

import sys
sys.getdefaultencoding()

This returns the default encoding for Python3, 'utf-8'.

3. encode and decode in Python3

Working with character encodings in Python3 mostly means using the encode and decode functions. They are especially useful when crawling web pages, so it pays to be fluent with both. encode converts the characters we can read directly into the byte form used inside the computer. decode does the opposite: it converts bytes back into the readable, "human" form.

For example, '中'.encode() returns b'\xe4\xb8\xad'. The digits after \x are hexadecimal, so \xe4\xb8\xad is binary 11100100 10111000 10101101. In other words, the Chinese character '中' encoded into byte form is 11100100 10111000 10101101. Likewise, decoding 11100100 10111000 10101101, that is \xe4\xb8\xad, gives back the Chinese character '中'. Written out in full it is b'\xe4\xb8\xad': in Python3, a byte string must carry the prefix b, written in the form b'xxxx'.

The default encoding for Python3 is UTF-8, so Python handles these characters as UTF-8. That is why, even if we explicitly ask encode for UTF-8 with '中'.encode('utf-8'), the result is the same: b'\xe4\xb8\xad'.

With this in mind, and knowing that UTF-8 is compatible with ASCII, we can guess that 65, the ASCII code for 'A' that everyone memorized in college, can also be decoded here. Decimal 65 is 41 in hexadecimal, so we try:

b'\x41'.decode()

Sure enough, the result is the character 'A'.
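The encode and decode calls discussed in this section can be gathered into one short session. The variable names are only for illustration.

```python
text = "中"

default_encoded = text.encode()           # no argument: UTF-8 is used
explicit_encoded = text.encode("utf-8")   # same codec, named explicitly
decoded = b"\xe4\xb8\xad".decode()        # bytes back to str
ascii_decoded = b"\x41".decode()          # a single ASCII byte is valid UTF-8

print(default_encoded)    # b'\xe4\xb8\xad'
print(explicit_encoded)   # b'\xe4\xb8\xad'
print(decoded)            # 中
print(ascii_decoded)      # A
```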

4. Encoding conversion in Python3

It is said that in the computer's memory characters are uniformly held as Unicode, and they become UTF-8 only when written to a file, stored on disk, or sent from a server to a client (for example, the source of a web page). What I actually care about more is how to display these characters in their Unicode byte form and reveal what they really look like in memory. Here is a trick for that:

xxxx.encode/decode('unicode-escape')

'中'.encode('unicode-escape') returns b'\\u4e2d'. Whether you write b'\\u4e2d' or b'\u4e2d', the extra backslash seems to make no difference, because \u is not an escape sequence inside a bytes literal (newer Python versions do emit a warning for the single-backslash form). You can also see that, in the shell, typing '\u4e2d' directly and typing b'\u4e2d'.decode('unicode-escape') give the same result: both display the Chinese character '中'. By contrast, '\u4e2d'.decode('unicode-escape') raises an error, because a str has no decode method. This shows that Python3 not only supports Unicode but also recognizes a '\uxxxx' escape as a Unicode character and treats it as an ordinary str.
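A short sketch of the unicode-escape behavior described above. It uses only the double-backslash form of the bytes literal so that it runs without warnings on current Python versions.

```python
escaped = "中".encode("unicode-escape")        # six ASCII bytes: \ u 4 e 2 d
restored = escaped.decode("unicode-escape")   # back to the one-character str
literal = "\u4e2d"                            # in a str literal, \u4e2d is already '中'
has_decode = hasattr(literal, "decode")       # str objects have no decode method

print(escaped)      # b'\\u4e2d'
print(restored)     # 中
print(literal)      # 中
print(has_decode)   # False
```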

If we have a character in Unicode-escape byte form, how do we turn it into UTF-8 bytes? With the above understood, the idea is clear: first decode, then encode. The code is as follows:

xxx.decode('unicode-escape').encode()
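Applied to a concrete value, the decode-then-encode idea looks like this. The function name escape_to_utf8 is hypothetical and exists only for this sketch.

```python
def escape_to_utf8(escaped: bytes) -> bytes:
    """Turn unicode-escape bytes such as b'\\u4e2d' into UTF-8 bytes."""
    return escaped.decode("unicode-escape").encode("utf-8")

utf8_bytes = escape_to_utf8(b"\\u4e2d")
print(utf8_bytes)   # b'\xe4\xb8\xad'
```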

Final extensions

Remember ord from earlier? Times have changed and ASCII has been absorbed into Unicode, but ord is still useful. Try ord('中') and the output is 20013. What is 20013? Try hex(ord('中')) and the output is '0x4e2d': 20013 is the decimal value of the 4e2d we have met so many times above. As an aside, hex is the function that converts a number to hexadecimal; anyone who has worked with microcontrollers will be familiar with hex.
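The ord and hex calls above, plus chr for the reverse direction, in one short sketch:

```python
code = ord("中")       # Unicode code point as an integer
hex_code = hex(code)   # the same number as a hexadecimal string
back = chr(0x4e2d)     # from the code point back to the character

print(code)       # 20013
print(hex_code)   # 0x4e2d
print(back)       # 中
```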

One last extension, based on a question I saw someone ask online. When we type something like '\u4e2d' in source code, Python3 knows what we mean. But when Python reads a file that contains the text \u4e2d, does the computer fail to recognize it? Someone later gave an answer, as follows:

import codecs

# Open the file with the unicode-escape codec so \uXXXX sequences are decoded.
file = codecs.open("A.txt", "r", "unicode-escape")
u = file.read()
file.close()
print(u)
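A self-contained variant of the same idea, offered as a sketch rather than the original answer: it creates its own temporary file (the file name and contents are made up) and decodes the raw bytes in one step instead of using codecs.open.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"            # hypothetical file name
    # The file literally contains the 12 ASCII characters \u4e2d\u6587.
    path.write_bytes(b"\\u4e2d\\u6587")

    raw_text = path.read_text(encoding="ascii")            # escapes left untouched
    decoded = path.read_bytes().decode("unicode-escape")   # escapes interpreted

print(raw_text)   # \u4e2d\u6587
print(decoded)    # 中文
```

One caveat: the unicode-escape codec treats bytes that are not part of an escape sequence as Latin-1, so this approach suits files whose non-ASCII text is stored entirely as \uXXXX escapes.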
