This article shows how to use the chardet package in Python to detect how a string is encoded. It is shared here for reference; the details are as follows:
I recently used Python to crawl some online data and ran into encoding problems. They were a real headache, so here is a summary of the solutions I used.
To view a file's encoding in Vim on Linux, use the command :set fileencoding
Python has a powerful encoding detection package, chardet, which is very simple to use. On Linux it can be installed with pip install chardet
    import chardet

    # Open in binary mode so chardet receives the raw, undecoded bytes.
    f = open('file', 'rb')
    fencoding = chardet.detect(f.read())
    f.close()
    print fencoding
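The output of fencoding has the form {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}. Detection is probabilistic: chardet can only estimate how likely it is that the data uses a given encoding, although the result is usually fairly accurate. The input parameter must be of type str (a byte string in Python 2).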
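Because the result is only a probability, it can help to check the confidence before trusting the reported encoding. The following is a minimal sketch and not part of chardet: the helper name pick_encoding, the 0.8 threshold, and the UTF-8 fallback are all assumptions chosen for illustration. It takes a result dictionary in the format shown above.

```python
def pick_encoding(result, min_confidence=0.8, fallback='utf-8'):
    """Return the detected encoding if it looks trustworthy, else the fallback.

    `result` is a dict shaped like the detection output, e.g.
    {'confidence': 0.966, 'encoding': 'GB2312'}.
    The threshold and fallback values are illustrative assumptions.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # No encoding reported, or the detector was not confident enough.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


sample = {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}
print(pick_encoding(sample))
```

With chardet installed, this could be called as pick_encoding(chardet.detect(data)), where data is the byte string read from the file.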
Once you know the encoding of a str in Python, you can use decode and encode to convert between encodings.
The general process is: call decode on the str, passing the encoding it is currently in, to obtain a unicode string; then call encode on the unicode string, passing the target encoding, to obtain a str in that encoding. In Python 2, str and unicode are two different types.
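The decode-then-encode process can be sketched as follows. This example is written for Python 3, where the byte string type is called bytes and the Unicode string type is called str (in Python 2 these are str and unicode respectively). The function name transcode is a hypothetical helper, not a library call.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert a byte string from one encoding to another via Unicode."""
    text = data.decode(source_encoding)   # byte string -> Unicode string
    return text.encode(target_encoding)   # Unicode string -> byte string


# GBK bytes for the two characters U+4E2D U+6587 (the word for "Chinese").
gbk_bytes = b'\xd6\xd0\xce\xc4'
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
print(repr(utf8_bytes))
```

The intermediate Unicode string is what makes the conversion possible: there is no direct GBK-to-UTF-8 step, only a decode followed by an encode.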
In general, the default encoding on Chinese-language Windows is GBK, while the default on Linux is UTF-8.
Python programming involves three related concepts: the system encoding, the Python encoding, and the file encoding.
System encoding: the default encoding the editor uses when writing source files. It determines how the contents of a source file are encoded into a binary stream and saved to disk. On Linux, view it with the locale command.
Python encoding: the decoding method set within Python. If none is set, Python 2 defaults to ASCII. If the Python source file contains no Chinese characters, this setting should not cause any problems.
To set it: put # -*- coding: utf-8 -*- at the top of the source file (it must be on the first or second line), which tells Python to decode the source file as UTF-8, or use
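    import sys

    # Python 2 only: reload(sys) restores setdefaultencoding,
    # which is removed from the sys module at startup.
    reload(sys)
    sys.setdefaultencoding('UTF-8')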
File encoding: the encoding of the text file itself. On Linux, view it in Vim with :set fileencoding.
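The encodings that a running interpreter and its environment actually use can also be inspected from Python itself. This is a small sketch using only the standard library; the values depend on the machine and its locale settings, so no expected output is shown. The function name report_encodings is hypothetical.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings in effect for this interpreter and locale."""
    return {
        # The interpreter's default string encoding
        # (the value that sys.setdefaultencoding changed in Python 2).
        'python_default': sys.getdefaultencoding(),
        # The encoding suggested by the system locale settings.
        'locale_preferred': locale.getpreferredencoding(),
        # The encoding of standard output; may be None when redirected.
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(report_encodings().items()):
    print('%s: %s' % (name, value))
```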
In general, output is garbled because the text was not encoded in the encoding that the system uses to decode it.
For example, take print s, where s is of type str. The default encoding on Linux is UTF-8, so s should be encoded as UTF-8 before it is output. If s is GBK-encoded, output it like this: print s.decode('GBK').encode('UTF8'), and the Chinese text will display correctly.
The same applies on Windows: the Windows encoding is GBK, so s must be encoded as GBK before it is output.
Inside a Python program, text is generally handled as the unicode type, so it can simply be encoded just before it is output.
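One way to follow this pattern is to decode at the point where data enters the program and encode only at the point where it leaves. The sketch below assumes Python 3 and uses io.open from the standard library with explicit encodings; the file names and the helper name convert_file are placeholders for illustration.

```python
import io
import os
import tempfile


def convert_file(src_path, src_encoding, dst_path, dst_encoding):
    """Read text in one encoding and write it back out in another."""
    # Decoding happens once, on input: `text` is a Unicode string.
    with io.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    # Any processing would operate on the Unicode string here.
    # Encoding happens once, on output.
    with io.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)
    return text


# Demonstration with temporary files (placeholder names).
workdir = tempfile.mkdtemp()
gbk_path = os.path.join(workdir, 'input_gbk.txt')
utf8_path = os.path.join(workdir, 'output_utf8.txt')

with io.open(gbk_path, 'wb') as f:
    f.write(b'\xd6\xd0\xce\xc4')  # GBK bytes for U+4E2D U+6587

convert_file(gbk_path, 'gbk', utf8_path, 'utf-8')

with io.open(utf8_path, 'rb') as f:
    print(repr(f.read()))
```

Keeping the byte-level work at the edges means the rest of the program never has to know which encoding the input or the terminal uses.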
Hopefully this article will help you with Python programming.