Solving the garbled-text problem in Python web crawlers
Garbled crawler output comes in many forms: not only garbled Chinese text and encoding-conversion errors, but also garbled Japanese, Korean, Russian, Tibetan, and other scripts. Because the solution is the same in every case, it is described here once for all of them.
Why crawler output becomes garbled
The encoding of the source webpage differs from the encoding the program applies to the page after crawling it.
For example, if the source page is a GBK-encoded byte stream but, after fetching it, the program encodes the output as UTF-8 and writes it to a storage file, garbled characters are inevitable. Conversely, when the program reads the fetched bytes with the same encoding the source page used, nothing is garbled; likewise, converting everything to one unified character encoding, as sketched after the notation below, avoids garbling.
Note
- A: the encoding of the source webpage
- B: the encoding the program uses directly
- C: the unified character encoding everything is converted to
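A minimal sketch of that unified conversion in Python 2, assuming A is known to be gbk and C is utf-8 (page.html is a hypothetical file holding the fetched bytes):

# a sketch: source encoding A = 'gbk', unified target encoding C = 'utf-8'
raw = open('page.html', 'rb').read()       # bytes in encoding A
text = raw.decode('gbk', 'ignore')         # bytes (A) -> unicode
open('page_utf8.html', 'wb').write(text.encode('utf-8'))  # unicode -> bytes (C)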
Solution to the garbled output
First, determine the encoding A of the source webpage. It is usually declared in one of three places on the page.
1. The Content-Type field of the HTTP header
The server uses this header to tell the browser how to interpret the page content; it typically reads "text/html; charset=UTF-8".
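With urllib2 in Python 2, the charset parameter can be read straight off the response headers (a sketch; the URL is the placeholder used elsewhere in this article):

import urllib2

response = urllib2.urlopen('http://www.bkjia.com/')
# info() returns the response headers; getparam() reads a Content-Type parameter
charset = response.info().getparam('charset')  # e.g. 'UTF-8', or None if absent
print charset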
2. The meta charset tag
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
3. The document charset defined in the page's scripts
<script type="text/javascript">
if (document.charset) {
    alert(document.charset + "!!!!");
    document.charset = 'GBK';
    alert(document.charset);
} else if (document.characterSet) {
    alert(document.characterSet + "????");
    document.characterSet = 'GBK';
    alert(document.characterSet);
}
</script>
When determining the source page's encoding, check these three places in the order given; their priority decreases in the same order.
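A sketch of that priority order (the meta-tag regex is a deliberate simplification; real pages vary):

import re
import urllib2

def detect_page_encoding(url):
    response = urllib2.urlopen(url)
    html = response.read()
    # 1. charset parameter in the HTTP Content-Type header
    charset = response.info().getparam('charset')
    if charset:
        return charset
    # 2. charset declared in a meta tag inside the page
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE)
    if m:
        return m.group(1)
    # 3. nothing declared: leave detection to chardet (next section)
    return None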
If none of these three places yields any encoding information, the usual fallback is a third-party encoding-detection tool such as chardet.
Installation: pip install chardet
Documentation: http://chardet.readthedocs.io/en/latest/usage.html
Detecting character encodings with chardet in Python
chardet makes it easy to detect the encoding of a string or file. Although an HTML page may carry a charset tag, the tag is sometimes wrong, and in those cases chardet helps a great deal.
A chardet example
>>> import urllib
>>> rawdata = urllib.urlopen('http://www.bkjia.com/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.99, 'encoding': 'GB2312'}
chardet's detect function can be called directly on the bytes in question. It returns a dictionary with two entries: 'confidence', the reliability of the detection, and 'encoding', the detected encoding.
How should Chinese character encodings be handled when developing your own crawler?
All of the following applies to Python 2.7. Without explicit encoding handling, the crawler collects garbled text; the fix is to normalize all fetched HTML to a single UTF-8 encoding. Note that on very short inputs chardet can misreport the encoding (for example, labelling a short UTF-8 string as windows-1252) simply because it has too little data to recognize the encoding reliably.
First, an interactive session (note how chardet misjudges the single character because the input is too short):

>>> import chardet
>>> a = 'abc'
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 1.0, 'encoding': 'ascii'}
>>> a = '我'          # a UTF-8 byte string
>>> chardet.detect(a)
{'confidence': 0.73, 'encoding': 'windows-1252'}
>>> a.decode('windows-1252')
u'\xe6\u02c6\u2018'
>>> type(a.decode('windows-1252'))
<type 'unicode'>
>>> type(a.decode('windows-1252').encode('utf-8'))
<type 'str'>
>>> chardet.detect(a.decode('windows-1252').encode('utf-8'))
{'confidence': 0.87625, 'encoding': 'utf-8'}
>>> a = '我是中国人'    # a longer UTF-8 byte string is detected correctly
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 0.9690625, 'encoding': 'utf-8'}

Then a complete script that normalizes a fetched page to UTF-8:

# -*- coding: utf-8 -*-
import chardet
import urllib2

# fetch the webpage
html = urllib2.urlopen('http://www.bkjia.com/').read()
print html
mychar = chardet.detect(html)
print mychar
bianma = mychar['encoding']
if bianma == 'utf-8' or bianma == 'UTF-8':
    html = html.decode('utf-8', 'ignore').encode('utf-8')
else:
    html = html.decode('gb2312', 'ignore').encode('utf-8')
print html
print chardet.detect(html)
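The script above hard-codes gb2312 for everything chardet does not report as UTF-8. A variant that trusts the detected encoding when chardet is confident may be more robust (a sketch; the 0.8 threshold and the GBK fallback are assumptions, not part of the original article):

# -*- coding: utf-8 -*-
import chardet
import urllib2

html = urllib2.urlopen('http://www.bkjia.com/').read()
detected = chardet.detect(html)
# trust chardet when it is confident; otherwise assume GBK (a superset of GB2312)
encoding = detected['encoding'] if detected['confidence'] > 0.8 else 'gbk'
html = html.decode(encoding, 'ignore').encode('utf-8')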
Python code file encoding
By default, .py files are treated as ASCII. When a source file contains Chinese characters, Python cannot convert them under the default ASCII codec and raises SyntaxError: Non-ASCII character. You need to add an encoding declaration on the first line of the file:
# -*- coding: utf-8 -*-
print '中文'
A string literal entered this way is processed according to the file's declared encoding, UTF-8.
To store a string as unicode instead, use the following form:
s1 = u'中文'  # the u prefix marks the literal as a unicode string
decode is a method available on any byte string; it converts the string to unicode, and its parameter names the encoding of the source string.
encode is likewise a method on any string; it converts the string to the encoding named by its parameter.
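Putting the two methods together in Python 2 ('中文' is the Chinese literal from the example above):

# -*- coding: utf-8 -*-
s = '中文'                 # byte string in the file's encoding, UTF-8
u = s.decode('utf-8')      # decode: bytes -> unicode
g = u.encode('gbk')        # encode: unicode -> GBK bytes
print type(s), type(u), type(g)  # <type 'str'> <type 'unicode'> <type 'str'>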
That is all for this article. I hope it is helpful for your learning.