Recently, while scraping web pages, most pages came through fine, but a small number came back garbled. After several days of debugging, the cause turned out to be illegal characters in the page content. Recording the fix here.
1. Under normal circumstances, you can use

import chardet
thischarset = chardet.detect(strs)["encoding"]

to get the encoding of a file or page, or read it directly from the page's charset=xxxx declaration.
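The detection step might look like this in Python 3; a minimal sketch, assuming the third-party chardet package is installed (pip install chardet). Note that chardet.detect() takes bytes, not str, and returns a dict with "encoding" and "confidence" keys:

```python
import chardet

# Bytes as fetched from a page or file -- the encoding is unknown in advance.
# (Sample Chinese text encoded as UTF-8 for illustration.)
raw = "采集网页时经常会遇到中文乱码问题，需要先检测编码。".encode("utf-8")

result = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
charset = result["encoding"]

# Decode with the detected charset to get proper text.
text = raw.decode(charset)
```

Detection is statistical, so on very short inputs the guess can be wrong or the confidence low; checking result["confidence"] before trusting the answer is a reasonable precaution.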
2. If the content contains special characters that are invalid in the declared encoding, decoding with that encoding produces garbled output or fails outright. Since the illegal characters in the content are the cause, you can tell the decoder to ignore them:

strs = strs.decode("utf-8", "ignore").encode("utf-8")

The second argument of decode controls how illegal characters are handled; it defaults to "strict", which raises an exception.
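A self-contained illustration of the three common error handlers (the stray 0xFF/0xFE bytes below are fabricated to simulate a page with illegal characters; 0xFF can never appear in valid UTF-8):

```python
# Valid UTF-8 with two invalid bytes spliced into the middle,
# simulating page content that contains illegal characters.
raw = "正常内容".encode("utf-8") + b"\xff\xfe" + "继续".encode("utf-8")

# The default 'strict' handler raises UnicodeDecodeError on the bad bytes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # this is the failure the article describes

# 'ignore' silently drops the illegal bytes.
clean = raw.decode("utf-8", "ignore")

# 'replace' keeps a U+FFFD replacement marker where each bad byte was,
# which makes the damage visible instead of hiding it.
marked = raw.decode("utf-8", "replace")
```

"ignore" gives the cleanest output, but "replace" is often the better choice while debugging, because the U+FFFD markers show exactly where the illegal bytes were.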
That's the whole solution to the garbled-Chinese problem when collecting pages with Python. I hope it helps.