How Python handles Chinese characters (UTF-8, GBK, and Unicode), reprinted

Source: Internet
Author: User

How Python processes Chinese characters (UTF-8, gbk, and unicode)

Reprinted from:


The first line of every source file is always this encoding declaration:

# coding: utf-8  



1. What is UTF-8/gbk/unicode encoding?

Let me explain this in simple terms. The full details are complicated, and nobody is going to look them up anyway.
UTF-8 is the common encoding on Unix, and it can encode Chinese characters. It should be the only Chinese encoding that opens on Unix without garbled characters.
GBK is the Chinese character encoding used on Windows; GB2312 is a subset of GBK, so GB2312 text is also valid GBK. Text in this encoding shows up garbled in Unix environments.

In such output, English is displayed normally but the Chinese characters are garbled. Text that is garbled this way is generally GBK (tested only on a Mac; I do not know whether other Unix systems garble it the same way).
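To make the garbling concrete, here is a minimal sketch written for Python 3, where text is str (Unicode) and encoded data is bytes; the article itself uses Python 2. The sample text and the variable names are only illustrative. It shows that the same two Chinese characters become different bytes in UTF-8 and in GBK, and that GBK bytes cannot be read as UTF-8.

```python
# Python 3 sketch: str holds Unicode text, bytes holds encoded data.
text = '\u4e2d\u6587'  # two Chinese characters (U+4E2D, U+6587)

utf8_bytes = text.encode('utf-8')  # 3 bytes per character here
gbk_bytes = text.encode('gbk')     # 2 bytes per character here

print(len(utf8_bytes), len(gbk_bytes))

# Strictly decoding GBK bytes as UTF-8 raises an error...
try:
    gbk_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# ...and a lenient decode yields replacement characters (garbled text).
garbled = gbk_bytes.decode('utf-8', errors='replace')
print(decoded_ok, repr(garbled))
```

ASCII text is encoded identically in both encodings, which is why the English parts of a GBK file still display normally.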
Unicode is not a byte encoding like the other two; it is the character set that assigns a number to every character, and in Python it serves as the intermediate form. Every conversion between UTF-8 and GBK has to go through unicode. Put bluntly, UTF-8 and GBK cannot be converted to each other directly; the text must first be decoded to unicode and then encoded again.
Schematically: UTF-8 <-> unicode <-> GBK.

On a Mac, PyCharm can display Chinese characters only once they have been converted to unicode. For example:

# coding: utf-8
s = 'I am a string of Chinese characters'
print(s)

Here s is a string of UTF-8 encoded Chinese characters. When it is printed, the UTF-8 bytes are converted to unicode and then shown as Chinese characters.



2. How to check the encoding of a document or string
First install the chardet module:
pip install chardet

# coding: utf-8
import chardet

s = 'haha I am a tested Chinese character.'
print(chardet.detect(s))

Output: {'confidence': 0.99, 'encoding': 'utf-8'}
This method only reports the most likely encoding of the string. Here it is 0.99 confident that the string is UTF-8, and the string really is UTF-8. As long as the string is long enough, the reported confidence is generally 0.99.
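In Python 3, chardet.detect expects bytes rather than text, so the string has to be encoded (or the file read in binary mode) first. The following is only a sketch under that assumption: guess_encoding is a hypothetical helper name, and the import is guarded so the snippet still runs when chardet is not installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's guess for a bytes object, or None without chardet."""
    if chardet is None:
        return None
    # The result is a dict that includes 'encoding' and 'confidence'.
    return chardet.detect(data)


# Repeat the sample so the detector has more than a few bytes to work with.
sample = '\u4e2d\u6587'.encode('utf-8') * 20
print(guess_encoding(sample))
```

The reported encoding is a guess, so the result is best treated as a hint, especially for short inputs.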

Of course, you can also use the file command. Run file -i on the file in question; the output ends with the MIME type and the charset, for example:

text/x-java; charset=utf-8



3. How to convert various encodings

Python has two useful functions: decode() and encode().
decode('utf-8') converts a UTF-8 encoded string to unicode. Of course, you can also pass 'gbk' instead.
encode('gbk') converts unicode to GBK. Of course, you can also pass 'utf-8' instead.
If I know that a string s is encoded in UTF-8, how can it be converted to GBK? Decode it to unicode first, then encode the result: s.decode('utf-8').encode('gbk')
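The same two-step conversion can be sketched in Python 3, where decode() exists only on bytes and encode() only on str. The function name utf8_to_gbk is hypothetical and used only for illustration.

```python
def utf8_to_gbk(data):
    """Convert UTF-8 bytes to GBK bytes by going through Unicode text."""
    text = data.decode('utf-8')  # bytes -> str (Unicode)
    return text.encode('gbk')    # str (Unicode) -> bytes


utf8_bytes = '\u4e2d\u6587'.encode('utf-8')
gbk_bytes = utf8_to_gbk(utf8_bytes)

# Decoding the result as GBK gives back the original text.
print(gbk_bytes.decode('gbk') == '\u4e2d\u6587')
```

Swapping the two codec names gives the reverse conversion, from GBK to UTF-8.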




4. Why convert encodings?

When using NLPIR for word segmentation, there are strict requirements on the encoding of the input document. You can set the encoding of the input document when the function is initialized.
However, the source documents may be UTF-8 in some cases and GBK in others, so they have to be converted to one uniform encoding first. If the encodings are mixed, an error is returned.
I will write up how to call NLPIR from Python later.
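One way to bring mixed input into a uniform encoding is sketched below in Python 3. This is not the author's method: it assumes every document is either UTF-8 or GBK, tries UTF-8 first, and falls back to GBK. The name to_utf8 is hypothetical.

```python
def to_utf8(data):
    """Return data re-encoded as UTF-8, assuming it is UTF-8 or GBK."""
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume GBK (an assumption of this sketch).
        text = data.decode('gbk')
    return text.encode('utf-8')


sample = '\u4e2d\u6587'
from_gbk = to_utf8(sample.encode('gbk'))
from_utf8 = to_utf8(sample.encode('utf-8'))
print(from_gbk == from_utf8)  # True
```

Trying UTF-8 first is a heuristic: GBK bytes are rarely valid UTF-8, but a short GBK input can occasionally pass the UTF-8 check, so a detector such as chardet may be a better fit for arbitrary input.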



System encoding settings on CentOS 6

SYSFONT = "latarcyrheb-sun16"


my.cnf of MySQL

character_set_server = utf8mb4
default_character_set = utf8


SQL Server 2014 character encoding: SQL Server has no setting for declaring UTF-8 as the character encoding; it uses collations only.



If anything here is wrong, corrections and criticism are welcome.
