Python beginners often stumble over character encoding. I ran into it too, so I read a large number of blog posts, ran some tests of my own, and have now basically worked out the causes and consequences of the encoding problem. The code below runs on Python 3.5 and serves as the example throughout (please ignore the bad variable names).
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import urllib.request


def get_html(url):
    # Read the raw bytes of the page and decode them from GBK into a str.
    download_page = urllib.request.urlopen(url).read().decode('GBK')
    return download_page


def get_image(html):
    # Collect every src="...jpg" value found in the page.
    img_list = re.findall(r'src="(.+?\.jpg)"', html)
    x = 1
    for img_url in img_list:
        print("Downloading image " + str(x))
        # The page uses protocol-relative URLs, so prepend the scheme.
        urllib.request.urlretrieve("https:" + img_url.lstrip(), 'd:\\list\\%s.jpg' % x)
        x += 1


html_page = get_html("https://mm.taobao.com/self/aiShow.htm?spm=719.7763510.1998643336.43.xMxXj5&userId=722569871")
get_image(html_page)
This is a very simple web image crawler; there is nothing special or difficult about the code itself. Our main concern is encoding. Because this is a Python 3.5 environment, the second line of the code (the coding declaration) can be omitted.
Most articles on the web fail to point out one thing clearly: the encoding problem is really two problems, "code encoding" and "page or file encoding". In other words, when you write code you have to consider both the encoding of the source file itself and the encoding of the web page or file that your code processes. These are two separate matters and must not be confused!
1. First, code encoding:
Code encoding is the encoding in which your source file is stored on disk after you type characters into a text editor or IDE (vim, Notepad, UltraEdit, and so on) and save it. Python 2.x assumes by default that source files are ASCII. ASCII is a 7-bit code, normally stored one character per byte, that defines only 128 characters: English letters, digits, punctuation, and control codes. Put simply, it supports English but not Chinese. So if you type Chinese into a Python 2 source file, the interpreter reports an error when it reads the file. The usual fix is a coding declaration on the first or second line of the file, such as #coding:utf-8. This line tells Python that the characters in the source file are stored as UTF-8 (your editor must actually save the file in that encoding), and since UTF-8 can represent Chinese, you can write name = 'Jack' and name2 = '张三' directly. In Python 3.x, the default source encoding is UTF-8 and strings are Unicode, which covers both English and Chinese, so the declaration on the second line is no longer necessary and you can use Chinese directly.
However, #coding:utf-8 does nothing for the encoding of the web pages and data files you process. It applies only to the source code itself!
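As a small illustration of the point above, here is a minimal sketch assuming Python 3. It uses a Chinese name as sample text; the other variable names are made up for the example. It shows that a string is counted in characters, that the same text occupies a different number of bytes in different encodings, and that ASCII cannot represent Chinese at all.

```python
# Python 3 reads source files as UTF-8 by default, and str holds Unicode text.
name = 'Jack'
name2 = '张三'

# len() on a str counts characters, not bytes.
char_count = len(name2)

# The same text takes a different number of bytes depending on the encoding.
utf8_bytes = name2.encode('utf-8')
gbk_bytes = name2.encode('gbk')

# ASCII has no code for Chinese characters, so encoding to it fails.
try:
    name2.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(char_count, len(utf8_bytes), len(gbk_bytes), ascii_ok)
```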
2. Next, the encoding of the web page or file being processed:
Python code routinely works with data from web pages and the file system, in crawlers of all kinds or when reading files. Above we settled the encoding of the code itself, but not the encoding of the objects it processes. What does that mean? Take the code at the beginning of the article: it crawls the images on a web page, and that page is encoded in GBK (a Chinese character encoding). You can confirm this by viewing the page source.
After the page is read, its content sits in memory as raw bytes in GBK. Python 3, however, processes text in memory as Unicode strings! This is where the problem arises.
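Checking the page source by eye can also be done in code. The following is only a sketch: the sample bytes and the helper name sniff_charset are hypothetical, and a real page may declare its encoding elsewhere (for example in the HTTP Content-Type header) or not at all.

```python
import re

# Hypothetical page source as raw bytes, standing in for urlopen(url).read().
raw_page = b'<html><head><meta charset="gbk"><title>demo</title></head></html>'


def sniff_charset(raw, default='utf-8'):
    """Return the charset named in the page source, or the default if none is found."""
    # A bytes pattern is used because the page has not been decoded yet.
    match = re.search(rb'charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default


print(sniff_charset(raw_page))
```

The value returned can then be passed to decode() in place of a hard-coded 'GBK'.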
The error you run into at this point is caused by this encoding conflict. We need to decode!
The method for this is decode('encoding name'), which converts bytes in some other encoding to Unicode, for example decode('GBK') or decode('utf-8'). Once decoded, the crawled page is held in memory as a Unicode string and can be parsed, matched, and searched normally without errors.
The reverse operation is encode(), which converts a Unicode string into bytes in another encoding, typically UTF-8.
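One way this conflict can surface, assuming the undecoded page bytes are handed straight to re.findall with the same string pattern as in the crawler, is sketched below. The sample tag is made up, and this may not be the exact error the crawler produced.

```python
import re

# Stand-in for the raw bytes returned by urlopen(url).read(); the tag is made up.
raw_page = '<img src="//img.example.com/a.jpg" alt="图片">'.encode('gbk')

pattern = r'src="(.+?\.jpg)"'

# A str pattern cannot be applied to bytes, so this raises TypeError.
try:
    re.findall(pattern, raw_page)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# After decoding, the same pattern works on the resulting str.
decoded = raw_page.decode('gbk')
matches = re.findall(pattern, decoded)
print(matches)
```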
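A minimal sketch of decoding, assuming Python 3 and a short GBK byte string chosen for illustration. It also shows what happens when the wrong encoding name is given, which is the usual cause of decoding errors.

```python
# The two characters 中文 as GBK bytes, standing in for data read from a page.
gbk_bytes = b'\xd6\xd0\xce\xc4'

# decode() turns bytes in a named encoding into a Unicode str.
text = gbk_bytes.decode('gbk')
print(type(text).__name__, text)

# Naming the wrong encoding fails, because these bytes are not valid UTF-8.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

# errors='replace' avoids the exception but substitutes U+FFFD for bad bytes,
# so the original text is lost.
lenient = gbk_bytes.decode('utf-8', errors='replace')
print(wrong_codec_ok, lenient)
```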
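The sketch below, assuming Python 3, shows encode() and a conversion from GBK to UTF-8, which always passes through a Unicode string in between. The file part uses a temporary directory and a made-up file name; passing encoding= to open() makes Python do the encoding and decoding for you when reading and writing files.

```python
import os
import tempfile

text = '中文'

# encode() turns a Unicode str into bytes in the named encoding.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Converting GBK bytes to UTF-8 bytes: decode first, then encode.
converted = gbk_bytes.decode('gbk').encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name

    # Text mode with encoding='gbk' encodes on write...
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)

    # ...so the bytes on disk are GBK...
    with open(path, 'rb') as f:
        on_disk = f.read()

    # ...and decodes on read when given the same encoding.
    with open(path, 'r', encoding='gbk') as f:
        read_back = f.read()

print(utf8_bytes, gbk_bytes, read_back)
```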
Summary: the encoding problem is not as complicated as it seems. As long as we keep the two parts clearly separate, it is easy to deal with. You can experiment by removing the decode('GBK') call from the code above. If anything here is incorrect, please correct me!