Using chardet in Python to detect string encodings

Source: Internet
Author: User

This article shows how to use chardet in Python to determine the encoding of a string. It is shared here for reference; the details are as follows:

While recently using Python to crawl data from the web, I ran into encoding problems. They were a real headache, so I have summarized the solution I ended up using.

In Vim on Linux, the command :set fileencoding shows the encoding of the current file.
chardet is a powerful encoding-detection package for Python and is very simple to use. On Linux it can be installed with pip install chardet.

import chardet

# Read the raw bytes; chardet.detect() works on a byte string
with open('file', 'rb') as f:
    fencoding = chardet.detect(f.read())
print(fencoding)

The output of fencoding has the form {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}. chardet cannot identify the encoding with certainty; it only reports the probability that the data uses a particular encoding. In practice the results are fairly accurate. The input argument is of type str (a byte string in Python 2).
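A detection result of the shape shown above can be used to drive the decoding step. The following is a minimal sketch written for Python 3, where bytes plays the role of this article's str. The helper name decode_detected and the 0.5 confidence threshold are illustrative assumptions, not part of chardet.

```python
def decode_detected(raw, result, min_confidence=0.5):
    """Decode raw bytes using a chardet-style result dict.

    `result` is expected to look like the dict returned by chardet.detect(),
    e.g. {'confidence': 0.97, 'encoding': 'GB2312'}.
    The function name and the threshold are hypothetical choices.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        raise ValueError('encoding could not be determined reliably')
    return raw.decode(encoding)


# Typical use (requires chardet to be installed):
#   import chardet
#   text = decode_detected(raw, chardet.detect(raw))
sample = '中文'.encode('gb2312')
text = decode_detected(sample, {'confidence': 0.97, 'encoding': 'GB2312'})
```

Rejecting low-confidence results is a design choice: decoding with a wrong guess tends to produce garbled text silently, whereas an exception makes the problem visible.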

Once you know the encoding of a str in Python, you can use decode and encode to convert between encodings.

The general process is: call decode on the str, passing its current encoding, to obtain a unicode string; then call encode on the unicode string, passing the target encoding, to obtain a str in that encoding. In Python 2, str and unicode are two different types.
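The decode-then-encode process described above can be sketched as follows. This is written for Python 3, where the byte string is bytes and the Unicode string is str; the function name transcode is an illustrative assumption.

```python
def transcode(raw, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via a Unicode string."""
    text = raw.decode(source_encoding)     # bytes -> Unicode text
    return text.encode(target_encoding)    # Unicode text -> bytes


gbk_bytes = '编码'.encode('gbk')
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
```

The two byte strings differ, but both decode to the same Unicode text when read with their own encoding.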

In general, the default encoding on (Chinese-language) Windows is GBK, and the default on Linux is UTF-8.
Python programming involves three related concepts: the system encoding, the Python encoding, and the file encoding.

System encoding: the encoding that the source editor uses by default when writing files. It determines how the content of a source file is turned into the byte stream that is saved to disk. On Linux it can be viewed with the locale command.

Python encoding: the decoding method configured inside Python. If none is set, Python 2 defaults to ASCII. If the Python source file contains no Chinese characters, this setting should not cause any problems.
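The system and Python encodings can also be inspected from inside Python. The sketch below uses only standard-library calls and is written for Python 3, where the Python default is UTF-8 rather than the ASCII default of Python 2 described above. The function name report_encodings and the dictionary keys are illustrative assumptions.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python sees in the current environment."""
    return {
        # Encoding suggested by the locale settings
        'locale': locale.getpreferredencoding(False),
        # Default used by str.encode() / bytes.decode() when none is given
        'python_default': sys.getdefaultencoding(),
        # Encoding used for file names
        'filesystem': sys.getfilesystemencoding(),
    }


encodings = report_encodings()
for name, value in sorted(encodings.items()):
    print(name, value)
```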

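Setting method: at the beginning of the source file (on the first or second line), add # -*- coding: utf-8 -*-, which tells Python to decode the source file as UTF-8. In Python 2, the default used for implicit conversions between str and unicode can also be changed at runtime: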

import sys
reload(sys)  # Python 2 only: restores setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf-8')

File encoding: the encoding of the text file itself. In Vim on Linux it can be viewed with :set fileencoding.

In general, output is garbled because the text is not encoded in the encoding that the system uses to decode it.

For example, take print s, where s is of type str. On Linux the system default encoding is UTF-8, so s must be UTF-8 encoded when it is output. If s is GBK-encoded, it should be output like this: print s.decode('gbk').encode('utf8'). The Chinese text then displays correctly.
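The mismatch described above can be reproduced in a few lines. This sketch is written for Python 3 and uses a sample string chosen for illustration; it compares decoding GBK bytes as UTF-8 (the mistake) with decoding them as GBK first (the fix).

```python
text = '中文'
gbk_bytes = text.encode('gbk')

# Decoding GBK bytes as UTF-8 is the mismatch that produces garbled output.
try:
    gbk_bytes.decode('utf-8')
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# With errors='replace' the call succeeds, but the characters are lost.
garbled = gbk_bytes.decode('utf-8', errors='replace')

# Decoding with the real encoding first, then encoding as UTF-8, works.
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
```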

The same applies on Windows: the default encoding there is GBK, so s must be GBK-encoded when it is output.
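Choosing the output encoding per platform can be sketched as below, for Python 3. The function name encode_for_output is an illustrative assumption; falling back to the locale's preferred encoding and replacing unencodable characters are design choices of this sketch, not something the article prescribes.

```python
import locale


def encode_for_output(text, encoding=None):
    """Encode Unicode text for the current platform.

    If no encoding is given, fall back to the locale's preferred encoding.
    Characters the target encoding cannot represent are replaced with '?'.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors='replace')


as_gbk = encode_for_output('中文', 'gbk')      # what a GBK console expects
as_utf8 = encode_for_output('中文', 'utf-8')   # what a UTF-8 terminal expects
```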

Inside a Python program it is generally best to work with the unicode type throughout. The text then only needs to be encoded once, directly before output.
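One reason to process text as Unicode internally is that operations such as slicing then work on characters rather than bytes. The sketch below, for Python 3, decodes once at the input, truncates by characters, and encodes once at the output; the function name truncate_text and its parameters are illustrative assumptions.

```python
def truncate_text(raw, source_encoding, max_chars, target_encoding='utf-8'):
    """Keep at most max_chars characters, working on Unicode internally."""
    text = raw.decode(source_encoding)         # decode once, at the input edge
    shortened = text[:max_chars]               # slice by characters, not bytes
    return shortened.encode(target_encoding)   # encode once, at the output edge


raw = '中文编码'.encode('gbk')
result = truncate_text(raw, 'gbk', 2)
```

Slicing the raw bytes directly could cut a multi-byte character in half, leaving a sequence that no longer decodes cleanly.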

I hope this article helps with your Python programming.
