BeautifulSoup Encoding
BeautifulSoup uses Unicode internally. It automatically detects the encoding of the input document and converts the document to Unicode.
BeautifulSoup encoding detection sequence
BeautifulSoup tries the following encodings in order and uses the first one that works:
- The fromEncoding argument passed when the soup object is created;
- The encoding declared in the XML/HTML document itself;
- The encoding indicated by the first few bytes of the file. Only UTF-*, EBCDIC, or ASCII can be identified at this stage;
- If chardet is installed, the encoding that chardet detects;
- UTF-8
- Windows-1252
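The detection sequence above can be sketched with the Python 3 standard library alone. This is an illustrative approximation, not BeautifulSoup's actual implementation: the names candidate_encodings and decode_markup are hypothetical, the regular expression for declared encodings is a simplification, EBCDIC sniffing is omitted, and the chardet step is only marked by a comment.

```python
import codecs
import re

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified match for encoding="..." (XML declaration) or charset=... (HTML meta).
DECLARED = re.compile(rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)


def candidate_encodings(data, override=None):
    """Yield encodings to try, in the order of the list above."""
    if override:
        yield override                      # 1. caller-supplied encoding
    match = DECLARED.search(data[:1024])
    if match:
        yield match.group(1).decode("ascii")  # 2. encoding declared in the document
    for bom, name in BOMS:
        if data.startswith(bom):
            yield name                      # 3. encoding implied by the leading bytes
            break
    # 4. A fuller implementation would ask chardet here, if it is installed.
    yield "utf-8"                           # 5. first fallback
    yield "windows-1252"                    # 6. second fallback


def decode_markup(data, override=None):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in candidate_encodings(data, override):
        try:
            text = data.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or invalid bytes: try the next candidate
        return text.lstrip("\ufeff"), name  # drop a decoded byte-order mark, if any
    # Last resort: keep going with replacement characters instead of failing.
    return data.decode("windows-1252", errors="replace"), "windows-1252"
```

For example, decode_markup(b'<meta charset="gb2312"><p>\xd6\xd0\xce\xc4</p>') picks up the declared gb2312 encoding, while a byte string with no declaration and no byte-order mark falls through to UTF-8 and then Windows-1252.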
The BeautifulSoup source file contains this line:
DEFAULT_OUTPUT_ENCODING = "utf-8"
In other words, BeautifulSoup encodes its output as UTF-8 by default, regardless of the encoding of the input document.
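The difference between the input encoding and the default output encoding can be illustrated without BeautifulSoup itself. The following is a minimal sketch using only Python 3 string methods; the sample markup and the variable names are assumptions chosen for illustration.

```python
# Bytes as they might arrive from a GB2312-encoded page (assumed sample input).
raw = "<p>中文</p>".encode("gb2312")

# Step 1: decode to Unicode, the form a parser works with internally.
text = raw.decode("gb2312")

# Step 2: serialize. UTF-8 mirrors the default output encoding described above;
# any other encoding has to be requested explicitly.
as_utf8 = text.encode("utf-8")
as_gbk = text.encode("gbk")

print(repr(as_utf8))
print(repr(as_gbk))
```

The same Unicode text yields different bytes depending on the output encoding, which is why the encoding used when writing a document out matters as much as the one detected when reading it in.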
Encoding handling in BeautifulSoup
The originalEncoding attribute of the soup object holds the encoding that BeautifulSoup detected for the document.
import urllib2
from BeautifulSoup import BeautifulSoup

doc = urllib2.urlopen("http://www.pythonclub.org/")
soup = BeautifulSoup(doc)
soup.originalEncoding  # u'utf-8'
Handling Chinese encodings in BeautifulSoup