This article describes in detail how to solve the garbled-text (mojibake) problem in Python web crawlers. It has some reference value; interested readers can refer to it.
Crawler mojibake comes in many forms: not only garbled Chinese text and encoding-conversion issues, but also garbled Japanese, Korean, Russian, and Tibetan text. Because the solution is the same in every case, it is described here once.
Why web crawlers produce garbled text
The encoding of the source webpage is inconsistent with the encoding the crawler uses after capturing it.
For example, if the source webpage is a GBK-encoded byte stream, but after capturing it our program writes it to a storage file using UTF-8, garbled characters are inevitable. Conversely, when the encoding the program uses matches the encoding of the source webpage, i.e. the character encoding is unified, no mojibake occurs.
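A minimal sketch of this mismatch: decoding GBK bytes as if they were UTF-8 garbles the text, while decoding with the matching codec recovers it. (The sample string is illustrative; any non-ASCII text behaves the same way.)

```python
# Source page delivered as a GBK-encoded byte stream.
raw = "爬虫".encode("gbk")

# Wrong: treating the bytes as UTF-8 produces mojibake
# (replacement characters instead of the original Hanzi).
wrong = raw.decode("utf-8", errors="replace")

# Right: decode with the source encoding first, then re-encode
# to UTF-8 for storage if a unified output encoding is wanted.
right = raw.decode("gbk")
stored = right.encode("utf-8")

print(wrong)   # garbled
print(right)   # 爬虫
```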
How to fix the garbled text
Determine the encoding A of the source webpage. Encoding A is usually found in one of three places in the webpage:
1. The Content-Type field of the HTTP header
The server sends this header to tell the browser what kind of content the page contains and how it is encoded. A typical Content-Type value is written as "text/html; charset=utf-8".
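One way to pull the charset out of such a header value, sketched here with the standard-library `email.message.Message` (which already knows how to parse MIME-style `key=value` parameters), so no third-party library is assumed:

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or None if the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

With `requests`, the same information is exposed as `response.encoding`; the helper above is useful when working with raw header strings.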
2. The meta charset declaration in the HTML, e.g. `<meta charset="utf-8">` or the older `<meta http-equiv="Content-Type" content="text/html; charset=utf-8">`
3. The document declaration in the webpage header, e.g. an XML declaration such as `<?xml version="1.0" encoding="gb2312"?>`
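When the HTTP header carries no charset, the meta declaration (location 2) has to be sniffed out of the raw bytes before the page can be decoded. A simple regex sketch that handles both the HTML5 and the older http-equiv form (the regex and function name are illustrative, not from any particular library):

```python
import re

# Matches charset= inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)',
                          re.IGNORECASE)

def sniff_meta_charset(html_bytes):
    """Return the declared meta charset (lowercase), or None."""
    m = META_CHARSET.search(html_bytes)
    return m.group(1).decode("ascii").lower() if m else None

print(sniff_meta_charset(b'<meta charset="GBK">'))  # gbk
print(sniff_meta_charset(
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8">'))        # utf-8
```

If none of the three locations declares an encoding, a statistical detector such as the third-party `chardet` package can guess it from the byte stream itself.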