Using chardet in Python to detect how strings are encoded

Source: Internet
Author: User
This example describes how to use the chardet package in Python to detect how a string is encoded. It is shared here for reference. The details are as follows:

I have recently been using Python to crawl data from the web and ran into encoding problems. They were a real headache, so here is a summary of the solutions I used.

On Linux, the Vim command for viewing a file's encoding is :set fileencoding
Python has a powerful encoding detection package, chardet, which is very simple to use. On Linux it can be installed with pip install chardet

    import chardet

    # Read the raw bytes; chardet.detect() inspects undecoded byte data
    with open('file', 'rb') as f:
        fencoding = chardet.detect(f.read())
    print(fencoding)

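The output in fencoding has the form {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}. chardet can only report the probability that the data is in a given encoding, so the result is a guess, although in practice a fairly accurate one. The input parameter must be of type str (a byte string in Python 2; in Python 3, pass bytes).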
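
Since the result is only a probability, a caller may want to accept the guess only above some confidence level. The following is a minimal sketch of that idea. The helper name pick_encoding, the 0.8 threshold, and the UTF-8 fallback are illustrative assumptions, not part of chardet. It works on a dictionary of the shape shown above, so it runs without chardet installed.

```python
def pick_encoding(result, min_confidence=0.8, fallback="utf-8"):
    """Return the detected encoding if it looks trustworthy, else the fallback.

    `result` is a dict shaped like the output shown above, e.g.
    {'confidence': 0.96, 'encoding': 'GB2312'}. The threshold and the
    fallback are arbitrary choices made for this illustration.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # The encoding may be missing (None) when nothing could be detected
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The sample result quoted in the article
sample = {"confidence": 0.96630842899499614, "encoding": "GB2312"}
print(pick_encoding(sample))
print(pick_encoding({"confidence": 0.3, "encoding": "ascii"}))
```

A suitable threshold depends on the data: short inputs tend to give less reliable guesses, so a stricter value may be appropriate for them.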

Once you know the encoding of a str in Python, you can use decode and encode to convert it between encodings.

The general process is: call decode on the str, passing the encoding it is currently in, to obtain a unicode string; then call encode on the unicode string, passing the target encoding, to convert it to that encoding. In Python 2, str and unicode are two different types.
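
The decode-then-encode process can be sketched as follows. The example is written for Python 3, where the byte string this article calls str is bytes and the unicode type is str. The helper name transcode is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via Unicode text."""
    text = data.decode(source_encoding)   # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes


sample_text = "\u4e2d\u6587"              # two Chinese characters
gbk_bytes = sample_text.encode("gbk")     # stand-in for data from a GBK source
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(utf8_bytes == sample_text.encode("utf-8"))
```

The source encoding passed to transcode must be the real encoding of the data. If it is unknown, this is where a detection result would be used.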

In general, the default encoding on Windows is GBK and the default encoding on Linux is UTF-8.
Python programming involves three related concepts: the system encoding, the Python encoding, and the file encoding.

System encoding: the encoding the source editor uses by default when writing files. It determines how all the content of a source file is encoded into a binary stream when it is saved to disk. On Linux it can be viewed with the locale command.

Python encoding: the decoding method set inside Python. If it is not set, Python (2) defaults to ASCII decoding. If no Chinese appears in the Python source file, this setting should not cause any problems.

To set it: at the beginning of the source file (it must be on the first or second line) add # -*- coding: utf-8 -*-, which tells Python to decode the source file as UTF-8, or:

    import sys
    reload(sys)  # Python 2 only: makes setdefaultencoding available again
    sys.setdefaultencoding('UTF-8')

File encoding: the encoding of the text file itself. On Linux, view it in Vim with :set fileencoding.
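
The system and Python encodings can also be inspected from inside Python. The following is a small sketch for Python 3 using only the standard library. The function name report_encodings is made up for this example, and the values printed depend on the machine it runs on.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python sees in the current environment."""
    return {
        # Encoding suggested by the system locale settings
        "locale_preferred": locale.getpreferredencoding(False),
        # Default used by str.encode() / bytes.decode() when none is given
        "python_default": sys.getdefaultencoding(),
        # Encoding used when printing; may be None if stdout is replaced
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in report_encodings().items():
    print(name, "=", value)
```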

In general, output appears garbled because it was not encoded in the encoding the system uses to decode it.

For example, take print s, where s is of type str. The default encoding on Linux is UTF-8, so s should be encoded as UTF-8 before it is output. If s is GBK-encoded, it should be output like this: print s.decode('gbk').encode('utf8'), which displays the Chinese correctly.
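
To make the cause of garbled output concrete, the sketch below (Python 3, standard library only) decodes GBK bytes first with the wrong codec and then with the right one. The sample text and variable names are illustrative.

```python
text = "\u4e2d\u6587"                 # two Chinese characters
gbk_bytes = text.encode("gbk")        # what a GBK source would hand us

# Wrong: treating GBK bytes as UTF-8. errors="replace" substitutes U+FFFD
# for undecodable bytes instead of raising UnicodeDecodeError.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Right: decode with the real encoding, then encode for a UTF-8 terminal.
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")

print(garbled != text)
print(utf8_bytes.decode("utf-8") == text)
```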

The same applies on Windows: the Windows encoding is GBK, so s must be encoded as GBK before it is output.

Inside a Python program, text is generally handled as the unicode type. It can then simply be encoded just before output.
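
One reason to work with Unicode internally is that operations such as length and slicing then act on characters rather than bytes. The following is a short Python 3 sketch of decoding at input and encoding at output. The sample data and names are assumptions for illustration.

```python
raw = "\u4e2d\u6587".encode("gbk")    # incoming bytes, here GBK-encoded

text = raw.decode("gbk")              # decode once, at the input boundary
first_char = text[0]                  # slicing works per character
output = text.encode("utf-8")         # encode once, just before output

# Byte lengths differ by encoding; the character count does not
print(len(raw), len(text), len(output))   # prints: 4 2 6
```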

Hopefully this article will help you with Python programming.
