Solving garbled text in Python web crawlers

Source: Internet
Author: User
Tags: python web crawler
This article describes in detail how to fix garbled text in Python web crawlers. It has some reference value for interested readers.

Crawlers run into many kinds of garbled text: not only garbled Chinese characters and encoding-conversion problems, but also garbled Japanese, Korean, Russian, and Tibetan. Because the solution is the same in every case, they are covered together here.

Causes of garbled text in web crawlers

Garbled text appears when the encoding of the source webpage does not match the encoding the crawler uses to process it.
For example, if the source webpage is a GBK-encoded byte stream and, after fetching it, the program treats it as UTF-8 and writes it to the storage file, the output will inevitably be garbled. Conversely, when the program handles the page in the same encoding as the source, the text is not garbled, and once everything has been converted to one unified character encoding it stays intact.
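The mismatch described above can be reproduced in a few lines. This is only a minimal sketch: the sample string is made up for illustration, and GBK and UTF-8 are used simply because they are the pair named in the example.

```python
# Hypothetical sample text; any non-ASCII string would show the same effect.
text = "中文网页"

# Bytes as the server would send them when the page is GBK-encoded.
raw = text.encode("gbk")

# The program wrongly assumes UTF-8. errors="replace" avoids an exception
# and substitutes U+FFFD for undecodable bytes, which is the garbling.
wrong = raw.decode("utf-8", errors="replace")

# The program uses the page's real encoding, so the text survives.
right = raw.decode("gbk")

print(wrong == text)  # False
print(right == text)  # True
```

Without `errors="replace"`, the wrong decode would typically raise `UnicodeDecodeError` instead of silently producing garbled output.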

Note

  • Encoding A: the encoding of the source webpage,

  • Encoding B: the encoding the program uses directly,

  • Encoding C: the unified encoding that everything is converted to.
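The relationship between A, B, and C can be sketched as a small conversion helper. It assumes Python 3, where the program's working representation (B) is the Unicode `str` type; the function name `transcode` and the UTF-8 default for C are illustrative choices, not something the original article prescribes.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Convert bytes in encoding A to bytes in the unified encoding C."""
    # A -> B: decode the fetched bytes with the source page's encoding.
    text = raw.decode(source_encoding)
    # B -> C: encode the in-memory text with the unified target encoding.
    return text.encode(target_encoding)


# Usage: a GBK page stored as UTF-8.
gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk")
print(utf8_bytes.decode("utf-8"))
```

The key point is that the decode step must use A; if it uses any other encoding, no choice of C will repair the result.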

Fixing garbled text

First determine encoding A of the source webpage. Encoding A can usually be found in one of three places.

1. Content-Type of the HTTP header
The server can use response headers to tell the browser about the page content. The Content-Type header is typically written as "text/html; charset=utf-8".
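One possible way to read the charset out of a Content-Type value is the standard library's `email.message.Message`, which parses MIME-style header parameters. The helper name `charset_from_content_type` is hypothetical, and how the header value is obtained from the HTTP response is left out.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

Many servers omit the charset parameter or report it incorrectly, so a `None` result here usually means falling back to the other locations.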

2. meta charset
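A rough way to look for a meta charset declaration is to scan the first part of the raw bytes with a regular expression before decoding anything. The sketch below is an assumption-laden shortcut, not a full HTML parser: the helper name, the 2048-byte scan window, and the pattern itself are illustrative choices.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def charset_from_meta(raw: bytes, scan_bytes: int = 2048):
    """Return the charset named in a <meta> tag near the top of the page, or None."""
    match = _META_CHARSET.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta charset="GBK"><title>t</title></head></html>'
print(charset_from_meta(page))
```

Searching the undecoded bytes works because the declaration itself is written in ASCII, which the common web encodings represent identically.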


3. Document definition in the webpage header
