Solving garbled text in Python web crawlers

Source: Internet
Author: User
Tags: python, web crawler

Crawlers run into many kinds of garbled text: not only garbled Chinese and failed encoding conversions, but also garbled Japanese, Korean, Russian, Tibetan, and other scripts. Because the cause and the fix are the same in every case, they are covered together here.

Why crawled pages come out garbled

Garbled text appears when the encoding of the source webpage differs from the encoding the program assumes when it processes the captured bytes.
For example, if the source webpage is a GBK-encoded byte stream and the program treats the captured bytes as UTF-8 when writing them to the storage file, the output will be garbled. Conversely, when the encoding the program uses matches the encoding of the source webpage, nothing is garbled, and the text can then be converted to one unified character encoding without damage.
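The mismatch can be reproduced in a few lines. This is a minimal sketch written for Python 3 (the rest of this article targets Python 2.7); the sample string and variable names are assumptions chosen for illustration.

```python
# Simulate the body of a GBK-encoded page as it arrives over the network.
text = "中文网页"
raw = text.encode("gbk")  # bytes in the source encoding

# Wrong: treat the GBK bytes as UTF-8. Invalid sequences become U+FFFD.
garbled = raw.decode("utf-8", errors="replace")

# Right: decode with the source encoding, then store as UTF-8.
decoded = raw.decode("gbk")
stored = decoded.encode("utf-8")

print(garbled == text)                 # False
print(decoded == text)                 # True
print(stored.decode("utf-8") == text)  # True
```

The bytes never change; only the encoding used to interpret them decides whether the text is readable.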

Notation

  • A: the encoding of the source webpage,
  • B: the encoding the program uses directly,
  • C: the unified encoding that everything is converted to.

Fixing garbled text

First determine encoding A of the source webpage. It is usually declared in one of three places.

1. Content-Type in the HTTP header
The server's response headers tell the browser about the page content. The Content-Type entry is typically written as "text/html; charset=UTF-8".

2. meta charset

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

3. document.charset set by a script in the webpage head

<script type="text/javascript">
if (document.charset) {
  alert(document.charset + "!!!!");
  document.charset = 'GBK';
  alert(document.charset);
} else if (document.characterSet) {
  alert(document.characterSet + "????");
  document.characterSet = 'GBK';
  alert(document.characterSet);
}
</script>

When determining the encoding of the source webpage, check these three places in order; the earlier a source appears in the list, the higher its priority.
If none of the three provides encoding information, a third-party encoding detection library such as chardet is generally used.
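The first two checks can be sketched as follows, in Python 3 with only the standard library. The function names, the regular expression, and the 4096-byte scan limit are assumptions made for this illustration, not part of the original article; a charset set by a script (the third place) cannot be read this way and is not covered.

```python
import re
from email.message import Message

# Matches <meta charset="..."> as well as the http-equiv form,
# where "charset=..." sits inside the content attribute.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased; None when absent


def charset_from_meta(html_bytes, scan_limit=4096):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = META_CHARSET_RE.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def declared_encoding(content_type, html_bytes):
    """Header first, then <meta>; None means fall back to detection."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)


page = b'<html><head><meta charset="gb2312"></head></html>'
print(declared_encoding("text/html; charset=UTF-8", page))  # utf-8
print(declared_encoding("text/html", page))                 # gb2312
```

When `declared_encoding` returns None, the caller would hand the raw bytes to a detector such as chardet.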

Installation: pip install chardet

http://chardet.readthedocs.io/en/latest/usage.html

Detecting character encoding with chardet

chardet makes it easy to detect the encoding of a string or a file. An HTML page may carry a charset declaration, but the declaration is sometimes wrong, and that is where chardet helps.
chardet example

    >>> import urllib
    >>> rawdata = urllib.urlopen('http://www.bkjia.com/').read()
    >>> import chardet
    >>> chardet.detect(rawdata)
    {'confidence': 0.99, 'encoding': 'GB2312'}

The detect function takes the raw data directly and returns a dictionary with two entries: the detected encoding and the confidence of the detection.
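Because the result carries a confidence value, a crawler can refuse to trust weak guesses. The helper below is a hypothetical sketch in Python 3: it works on a chardet-style result dictionary without importing chardet, and the 0.8 threshold and the UTF-8 fallback are assumptions for illustration.

```python
def encoding_from_detection(result, min_confidence=0.8, fallback="utf-8"):
    """Pick an encoding name from a chardet-style result dictionary.

    `result` is expected to look like
    {'confidence': 0.99, 'encoding': 'GB2312'}; the encoding may be None
    when nothing could be detected.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding.lower()


print(encoding_from_detection({"confidence": 0.99, "encoding": "GB2312"}))
# gb2312
print(encoding_from_detection({"confidence": 0.73, "encoding": "windows-1252"}))
# utf-8
```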

How should Chinese character encoding be handled when writing your own crawler?
Everything below applies to Python 2.7. Without any handling, the collected text is garbled. The solution is to convert the HTML to one unified encoding, UTF-8. If chardet reports windows-1252, treat the result with suspicion: its detection is imperfect, and very short input in particular can be misidentified.

    >>> import chardet
    >>> a = 'abc'
    >>> type(a)
    str
    >>> chardet.detect(a)
    {'confidence': 1.0, 'encoding': 'ascii'}
    >>> a = "我"
    >>> chardet.detect(a)
    {'confidence': 0.73, 'encoding': 'windows-1252'}
    >>> a.decode('windows-1252')
    u'\xe6\u02c6\u2018'
    >>> type(a.decode('windows-1252'))
    unicode
    >>> type(a.decode('windows-1252').encode('utf-8'))
    str
    >>> chardet.detect(a.decode('windows-1252').encode('utf-8'))
    {'confidence': 0.87625, 'encoding': 'utf-8'}
    >>> a = "我是中国人"
    >>> type(a)
    str
    >>> chardet.detect(a)
    {'confidence': 0.9690625, 'encoding': 'utf-8'}

    # -*- coding: utf-8 -*-
    import chardet
    import urllib2

    # Capture the webpage as a raw byte string.
    html = urllib2.urlopen('http://www.bkjia.com/').read()
    print html

    # Detect the encoding of the raw bytes.
    mychar = chardet.detect(html)
    print mychar
    bianma = mychar['encoding']

    # Convert everything to UTF-8; anything not detected as UTF-8 is treated as GB2312.
    if bianma == 'utf-8' or bianma == 'UTF-8':
        html = html.decode('utf-8', 'ignore').encode('utf-8')
    else:
        html = html.decode('gb2312', 'ignore').encode('utf-8')
    print html
    print chardet.detect(html)
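For readers on Python 3, the conversion step of the script above might look like the sketch below. It is not the original author's code: the function name `to_utf8` is hypothetical, and unlike the script it decodes with the detected encoding whenever Python knows that codec, falling back to GB2312 only when the name is missing or unknown.

```python
import codecs


def to_utf8(raw, detected_encoding, default="gb2312"):
    """Decode raw page bytes and re-encode them as UTF-8.

    Uses the detected encoding when Python has a codec for it,
    otherwise `default`. Undecodable bytes are dropped ('ignore'),
    as in the Python 2 script.
    """
    try:
        source = codecs.lookup(detected_encoding or default).name
    except LookupError:
        source = default
    return raw.decode(source, "ignore").encode("utf-8")


page = "我是中国人".encode("gb2312")
print(to_utf8(page, "GB2312").decode("utf-8"))  # 我是中国人
```

Dropping undecodable bytes keeps the crawler running but loses data silently, so logging such pages may be worth considering.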

Encoding of Python source files
In Python 2, .py files are treated as ASCII by default. If a file contains Chinese text (or any other non-ASCII character) without an encoding declaration, the interpreter raises SyntaxError: Non-ASCII character. To avoid this, declare the encoding on the first line of the source file:

    # -*- coding: utf-8 -*-
    print '中文'

String literals written directly like this are byte strings encoded according to the encoding declared for the source file, here UTF-8.
To get a unicode string instead, use the u prefix:

    s1 = u'中文'  # the u prefix stores the text as a unicode object

decode converts a byte string (str) to unicode; its argument names the encoding of the source string.
encode converts a unicode string to a byte string in the encoding named by its argument.

That is all for this article. I hope it helps with your learning.
