When writing a crawler for personal use, you will find that some web pages are encoded in UTF-8, some in gb2312, and some in gbk. If the crawler does not handle this, pages whose encoding differs from the one it assumes come out garbled. The solution is to convert every page's html into a uniform UTF-8 encoding.
Python 2.7:
# coding: utf-8
import urllib2

import chardet

# Fetch the web page as raw bytes
line = "http://www.pythontab.com"
html_1 = urllib2.urlopen(line, timeout=120).read()

# Let chardet guess the encoding from the raw bytes
encoding_dict = chardet.detect(html_1)
print encoding_dict
web_encoding = encoding_dict['encoding']

# Convert to UTF-8 so that the entire html is no longer garbled
if web_encoding == 'utf-8' or web_encoding == 'UTF-8':
    html = html_1
else:
    # gbk is a superset of gb2312, so decoding as gbk covers both
    html = html_1.decode('gbk', 'ignore').encode('utf-8')
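The snippet above treats every page that is not detected as UTF-8 as gbk. If the crawler really only meets UTF-8 and GB-family pages, the same normalization can be sketched without the detection step: try a strict UTF-8 decode first and fall back to a GB codec only when that fails. This is a swapped-in technique, not the approach above, and it rests on some assumptions: `to_utf8` is a hypothetical helper name, the fallback uses gb18030 (a superset of gbk and gb2312) instead of gbk, and undecodable bytes are replaced, not dropped.

```python
def to_utf8(raw):
    """Return the bytes of a page re-encoded as UTF-8.

    Assumes the page is either UTF-8 or a GB-family encoding
    (gb2312, gbk, gb18030). Other encodings are not handled.
    """
    try:
        # A strict decode fails on any byte sequence that is not valid UTF-8
        raw.decode('utf-8')
        return raw
    except UnicodeDecodeError:
        # gb18030 is a superset of gbk and gb2312, so one codec covers all three;
        # 'replace' substitutes U+FFFD for undecodable bytes instead of dropping them
        return raw.decode('gb18030', 'replace').encode('utf-8')


# Usage: the same text arrives in two encodings and leaves as identical UTF-8
sample = u'\u4e2d\u6587\u7f51\u9875'
from_gbk = to_utf8(sample.encode('gbk'))
from_utf8 = to_utf8(sample.encode('utf-8'))
```

The strict decode works as a check because Chinese text encoded in gbk is very rarely valid UTF-8. For pages in other encodings, detection with chardet as shown above is still needed.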