Solving the garbled-text problem in Python web crawlers
Garbled crawler output comes in many forms: not only garbled Chinese text and encoding-conversion errors, but also garbled Japanese, Korean, Russian, Tibetan, and other scripts. Because the solution is the same in every case, it is described here once for all of them.
Why crawler output becomes garbled
The encoding of the source webpage differs from the encoding the program applies to the page after crawling it.
For example, if the source page is a GBK-encoded byte stream but, after fetching it, the program encodes the output as UTF-8 and writes it to a storage file, garbled characters are inevitable. Conversely, when the program reads the fetched bytes with the same encoding the source page used, nothing is garbled; likewise, converting everything to one unified character encoding, as sketched after the notation below, avoids garbling.
Note
- A: the encoding of the source webpage
- B: the encoding the program uses directly
- C: the unified character encoding everything is converted to
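A minimal sketch of that unified conversion in Python 2, assuming A is known to be gbk and C is utf-8 (page.html is a hypothetical file holding the fetched bytes):

# a sketch: source encoding A = 'gbk', unified target encoding C = 'utf-8'
raw = open('page.html', 'rb').read()       # bytes in encoding A
text = raw.decode('gbk', 'ignore')         # bytes (A) -> unicode
open('page_utf8.html', 'wb').write(text.encode('utf-8'))  # unicode -> bytes (C)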
Solution to the garbled output
First, determine the encoding A of the source webpage. It is usually declared in one of three places on the page.
1. The Content-Type field of the HTTP header
The server uses this header to tell the browser how to interpret the page content; it typically reads "text/html; charset=UTF-8".
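With urllib2 in Python 2, the charset parameter can be read straight off the response headers (a sketch; the URL is the placeholder used elsewhere in this article):

import urllib2

response = urllib2.urlopen('http://www.bkjia.com/')
# info() returns the response headers; getparam() reads a Content-Type parameter
charset = response.info().getparam('charset')  # e.g. 'UTF-8', or None if absent
print charset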
2. The meta charset tag
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
3. The document charset defined in the page's scripts
<script type="text/javascript">
if (document.charset) {
    alert(document.charset + "!!!!");
    document.charset = 'GBK';
    alert(document.charset);
} else if (document.characterSet) {
    alert(document.characterSet + "????");
    document.characterSet = 'GBK';
    alert(document.characterSet);
}
</script>
When determining the source page's encoding, check these three places in the order given; their priority decreases in the same order.
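A sketch of that priority order (the meta-tag regex is a deliberate simplification; real pages vary):

import re
import urllib2

def detect_page_encoding(url):
    response = urllib2.urlopen(url)
    html = response.read()
    # 1. charset parameter in the HTTP Content-Type header
    charset = response.info().getparam('charset')
    if charset:
        return charset
    # 2. charset declared in a meta tag inside the page
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE)
    if m:
        return m.group(1)
    # 3. nothing declared: leave detection to chardet (next section)
    return None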
If none of these three places yields any encoding information, the usual fallback is a third-party encoding-detection tool such as chardet.
Installation: pip install chardet
Documentation: http://chardet.readthedocs.io/en/latest/usage.html
Detecting character encodings with chardet in Python
chardet makes it easy to detect the encoding of a string or file. Although an HTML page may carry a charset tag, the tag is sometimes wrong, and in those cases chardet helps a great deal.
A chardet example
>>> import urllib
>>> rawdata = urllib.urlopen('http://www.bkjia.com/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.99, 'encoding': 'GB2312'}
chardet's detect function can be called directly on the bytes in question. It returns a dictionary with two entries: 'confidence', the reliability of the detection, and 'encoding', the detected encoding.
How should Chinese character encodings be handled when developing your own crawler?
All of the following applies to Python 2.7. Without explicit encoding handling, the crawler collects garbled text; the fix is to normalize all fetched HTML to a single UTF-8 encoding. Note that on very short inputs chardet can misreport the encoding (for example, labelling a short UTF-8 string as windows-1252) simply because it has too little data to recognize the encoding reliably.
First, an interactive session (note how chardet misjudges the single character because the input is too short):

>>> import chardet
>>> a = 'abc'
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 1.0, 'encoding': 'ascii'}
>>> a = '我'          # a UTF-8 byte string
>>> chardet.detect(a)
{'confidence': 0.73, 'encoding': 'windows-1252'}
>>> a.decode('windows-1252')
u'\xe6\u02c6\u2018'
>>> type(a.decode('windows-1252'))
<type 'unicode'>
>>> type(a.decode('windows-1252').encode('utf-8'))
<type 'str'>
>>> chardet.detect(a.decode('windows-1252').encode('utf-8'))
{'confidence': 0.87625, 'encoding': 'utf-8'}
>>> a = '我是中国人'    # a longer UTF-8 byte string is detected correctly
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 0.9690625, 'encoding': 'utf-8'}

Then a complete script that normalizes a fetched page to UTF-8:

# -*- coding: utf-8 -*-
import chardet
import urllib2

# fetch the webpage
html = urllib2.urlopen('http://www.bkjia.com/').read()
print html
mychar = chardet.detect(html)
print mychar
bianma = mychar['encoding']
if bianma == 'utf-8' or bianma == 'UTF-8':
    html = html.decode('utf-8', 'ignore').encode('utf-8')
else:
    html = html.decode('gb2312', 'ignore').encode('utf-8')
print html
print chardet.detect(html)
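The script above hard-codes gb2312 for everything chardet does not report as UTF-8. A variant that trusts the detected encoding when chardet is confident may be more robust (a sketch; the 0.8 threshold and the GBK fallback are assumptions, not part of the original article):

# -*- coding: utf-8 -*-
import chardet
import urllib2

html = urllib2.urlopen('http://www.bkjia.com/').read()
detected = chardet.detect(html)
# trust chardet when it is confident; otherwise assume GBK (a superset of GB2312)
encoding = detected['encoding'] if detected['confidence'] > 0.8 else 'gbk'
html = html.decode(encoding, 'ignore').encode('utf-8')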
Python code file encoding
By default, .py files are treated as ASCII. When a source file contains Chinese characters, Python cannot convert them under the default ASCII codec and raises SyntaxError: Non-ASCII character. You need to add an encoding declaration on the first line of the file:
# -*- coding: utf-8 -*-
print '中文'
A string literal entered this way is processed according to the file's declared encoding, UTF-8.
To store a string as unicode instead, use the following form:
s1 = u'中文'  # the u prefix marks the literal as a unicode string
decode is a method available on any byte string; it converts the string to unicode, and its parameter names the encoding of the source string.
encode is likewise a method on any string; it converts the string to the encoding named by its parameter.
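Putting the two methods together in Python 2 ('中文' is the Chinese literal from the example above):

# -*- coding: utf-8 -*-
s = '中文'                 # byte string in the file's encoding, UTF-8
u = s.decode('utf-8')      # decode: bytes -> unicode
g = u.encode('gbk')        # encode: unicode -> GBK bytes
print type(s), type(u), type(g)  # <type 'str'> <type 'unicode'> <type 'str'>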
That is all for this article. I hope it is helpful for your learning.