How Python processes Chinese characters (UTF-8, gbk, and unicode)
Reprinted from: http://blog.csdn.net/chixujohnny/article/details/51782826
The first line of the file should always be this default encoding declaration:
# coding: utf-8
1. What is UTF-8/gbk/unicode encoding?
Let me explain this simply and roughly; if I made it complicated, nobody would read it anyway.
UTF-8 is the common encoding on Unix. It can encode Chinese characters, and it should be the only Chinese-capable encoding that opens correctly on Unix without garbling.
gbk is the Chinese character encoding used on Windows (GB2312 is a subset of gbk). This encoding comes out garbled in a Unix environment, roughly like this:
In such output the English displays normally, but the Chinese characters are broken. Generally, when every Chinese character turns into the same sort of garbled blob, the text is gbk encoded (I have only tested this on a mac; I don't know whether other Unix systems garble it the same way).
Unicode is the underlying binary representation that both UTF-8 and gbk translate through. To put it bluntly, UTF-8 and gbk cannot be converted into each other directly; the conversion always has to pass through unicode.
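For example, the same two Chinese characters become completely different byte sequences under the two encodings; a minimal Python 2 sketch (the byte values in the comments are for the string '中文'):
# coding: utf-8
u = u'中文'                      # the unicode form of two Chinese characters
print repr(u.encode('utf-8'))    # '\xe4\xb8\xad\xe6\x96\x87'  -- 3 bytes per character
print repr(u.encode('gbk'))      # '\xd6\xd0\xce\xc4'          -- 2 bytes per character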
For ease of understanding, the conversion path looks like this: utf-8 <-> unicode <-> gbk, with unicode always in the middle.
On a mac, PyCharm can only display Chinese characters correctly as unicode. For example:
# coding: utf-8
s = '我是一串中文字符串'  # "I am a string of Chinese characters"
print s
Here s is a str of Chinese characters encoded in UTF-8. When it is printed, the UTF-8 bytes are converted to unicode and then output, which is how the Chinese characters get displayed.
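As a minimal Python 2 sketch of the difference between the UTF-8 byte string and the unicode object (the lengths in the comments assume the two-character string '中文'):
# coding: utf-8
s = '中文'                # a UTF-8 encoded byte string (str)
u = s.decode('utf-8')     # the corresponding unicode object
print type(s), len(s)     # <type 'str'> 6   -- three UTF-8 bytes per character
print type(u), len(u)     # <type 'unicode'> 2
print u                   # displays the characters when the terminal encoding allows it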
2. How to check the encoding of a document or string
First install the chardet module:
pip install chardet
# coding: utf-8
import chardet
s = '哈哈我是用来测试的中文'  # "haha, I am Chinese text used for testing"
print chardet.detect(s)
Output: {'confidence': 0.99, 'encoding': 'utf-8'}
This method only tells you the most likely encoding of the string. Here it reports UTF-8 with a confidence of 0.99, and the string really is UTF-8 encoded. As long as the string is long enough, the confidence usually comes out around 0.99.
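The same works for a whole document by reading its raw bytes first; a minimal sketch, where the filename doc.txt is just a placeholder:
# coding: utf-8
import chardet

with open('doc.txt', 'rb') as f:   # 'doc.txt' is a placeholder filename
    raw = f.read()
print chardet.detect(raw)          # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}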
Of course, you can also use the file command
file -i mssql.py
mssql.py: text/x-java; charset=utf-8
3. How to convert various encodings
Python has two useful methods: decode() and encode().
decode('utf-8') converts from UTF-8 encoding to unicode; of course, you can also write 'gbk' inside the parentheses if the source string is gbk encoded.
encode('gbk') encodes unicode into gbk; of course, you can also write 'utf-8' inside the parentheses instead.
If I know that a string is encoded in UTF-8, how can it be converted to gbk?
s.decode('utf-8').encode('gbk')
Illustrated step by step:
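A minimal Python 2 sketch of that round trip; it also shows that the gbk bytes cannot simply be decoded as UTF-8:
# coding: utf-8
s = '中文'                             # a UTF-8 encoded str
g = s.decode('utf-8').encode('gbk')    # utf-8 -> unicode -> gbk
print g.decode('gbk') == s.decode('utf-8')   # True: the round trip through unicode is lossless here

try:
    g.decode('utf-8')                  # these gbk bytes are not valid UTF-8
except UnicodeDecodeError:
    print 'gbk bytes cannot be decoded directly as utf-8'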
4. Why convert encodings at all?
When NLPIR is used for word segmentation, it has strict requirements on the encoding of the input documents; you set the encoding of the input source documents when the library is initialized.
However, the source documents may sometimes be UTF-8 and sometimes gbk, so they have to be normalized to a single format; if the encodings are not consistent, an error is returned.
I will write a separate post later about how to call NLPIR from Python.
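A minimal sketch of that normalization step, assuming chardet is available; the helper name to_utf8 and the 'ignore' error handling are my own choices, not part of the NLPIR API:
# coding: utf-8
import chardet

def to_utf8(raw):
    """Normalize a byte string of unknown encoding to UTF-8."""
    guess = chardet.detect(raw).get('encoding') or 'utf-8'
    return raw.decode(guess, 'ignore').encode('utf-8')

# every document is converted to UTF-8 before being handed to the segmenter,
# e.g. docs = [to_utf8(open(path, 'rb').read()) for path in paths]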
System encoding of CentOS 6:
cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
SYSFONT="latarcyrheb-sun16"
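To see which encodings Python actually picks up from such an environment, a small sketch (Python 2):
# coding: utf-8
import sys, locale

print sys.getdefaultencoding()        # usually 'ascii' on Python 2
print sys.stdout.encoding             # the terminal encoding, e.g. 'UTF-8' (None when piped)
print locale.getpreferredencoding()   # follows LANG, e.g. 'UTF-8'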
my.cnf of MySQL:
character_set_server = utf8mb4
default_character_set = utf8
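On the client side, the connection charset should match these server settings; a minimal sketch assuming the pymysql driver, with placeholder connection values:
# coding: utf-8
import pymysql

# connect with an explicit charset so the client matches the server settings above
conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='test', charset='utf8mb4')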
Character encoding in SQL Server 2014: SQL Server 2014 has no statement to set UTF-8 as the character encoding; encoding is controlled only through collations.
If anything here is wrong, you are welcome to point it out.