How Python processes Chinese characters (UTF-8, gbk, and unicode)
Reprinted from: http://blog.csdn.net/chixujohnny/article/details/51782826
The first line of the file should always be this default encoding declaration:
# coding: utf-8
1. What is UTF-8/gbk/unicode encoding?
Let me explain this simply and roughly; if I made it complicated, nobody would read it anyway.
UTF-8 is the common encoding on Unix. It can encode Chinese characters, and it should be the only Chinese-capable encoding that opens correctly on Unix without garbling.
gbk is the Chinese character encoding used on Windows (GB2312 is a subset of gbk). This encoding comes out garbled in a Unix environment, roughly like this:
In such output the English displays normally, but the Chinese characters are broken. Generally, when every Chinese character turns into the same sort of garbled blob, the text is gbk encoded (I have only tested this on a mac; I don't know whether other Unix systems garble it the same way).
Unicode is the underlying binary representation that both UTF-8 and gbk translate through. To put it bluntly, UTF-8 and gbk cannot be converted into each other directly; the conversion always has to pass through unicode.
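For example, the same two Chinese characters become completely different byte sequences under the two encodings; a minimal Python 2 sketch (the byte values in the comments are for the string '中文'):
# coding: utf-8
u = u'中文'                      # the unicode form of two Chinese characters
print repr(u.encode('utf-8'))    # '\xe4\xb8\xad\xe6\x96\x87'  -- 3 bytes per character
print repr(u.encode('gbk'))      # '\xd6\xd0\xce\xc4'          -- 2 bytes per character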
For ease of understanding, the conversion path looks like this: utf-8 <-> unicode <-> gbk, with unicode always in the middle.
On a mac, PyCharm can only display Chinese characters correctly as unicode. For example:
# coding: utf-8
s = '我是一串中文字符串'  # "I am a string of Chinese characters"
print s
Here s is a str of Chinese characters encoded in UTF-8. When it is printed, the UTF-8 bytes are converted to unicode and then output, which is how the Chinese characters get displayed.
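As a minimal Python 2 sketch of the difference between the UTF-8 byte string and the unicode object (the lengths in the comments assume the two-character string '中文'):
# coding: utf-8
s = '中文'                # a UTF-8 encoded byte string (str)
u = s.decode('utf-8')     # the corresponding unicode object
print type(s), len(s)     # <type 'str'> 6   -- three UTF-8 bytes per character
print type(u), len(u)     # <type 'unicode'> 2
print u                   # displays the characters when the terminal encoding allows it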
2. How to check the encoding of a document or string
First install the chardet module:
pip install chardet
# coding: utf-8
import chardet
s = '哈哈我是用来测试的中文'  # "haha, I am Chinese text used for testing"
print chardet.detect(s)
Output: {'confidence': 0.99, 'encoding': 'utf-8'}
This method only tells you the most likely encoding of the string. Here it reports UTF-8 with a confidence of 0.99, and the string really is UTF-8 encoded. As long as the string is long enough, the confidence usually comes out around 0.99.
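The same works for a whole document by reading its raw bytes first; a minimal sketch, where the filename doc.txt is just a placeholder:
# coding: utf-8
import chardet

with open('doc.txt', 'rb') as f:   # 'doc.txt' is a placeholder filename
    raw = f.read()
print chardet.detect(raw)          # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}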
Of course, you can also use the file command
file -i mssql.py
mssql.py: text/x-java; charset=utf-8
3. How to convert various encodings
Python has two useful methods: decode() and encode().
decode('utf-8') converts from UTF-8 encoding to unicode; of course, you can also write 'gbk' inside the parentheses if the source string is gbk encoded.
encode('gbk') encodes unicode into gbk; of course, you can also write 'utf-8' inside the parentheses instead.
If I know that a string is encoded in UTF-8, how can it be converted to gbk?
s.decode('utf-8').encode('gbk')
Illustrated step by step:
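A minimal Python 2 sketch of that round trip; it also shows that the gbk bytes cannot simply be decoded as UTF-8:
# coding: utf-8
s = '中文'                             # a UTF-8 encoded str
g = s.decode('utf-8').encode('gbk')    # utf-8 -> unicode -> gbk
print g.decode('gbk') == s.decode('utf-8')   # True: the round trip through unicode is lossless here

try:
    g.decode('utf-8')                  # these gbk bytes are not valid UTF-8
except UnicodeDecodeError:
    print 'gbk bytes cannot be decoded directly as utf-8'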
4. Why convert encodings at all?
When NLPIR is used for word segmentation, it has strict requirements on the encoding of the input documents; you set the encoding of the input source documents when the library is initialized.
However, the source documents may sometimes be UTF-8 and sometimes gbk, so they have to be normalized to a single format; if the encodings are not consistent, an error is returned.
I will write a separate post later about how to call NLPIR from Python.
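A minimal sketch of that normalization step, assuming chardet is available; the helper name to_utf8 and the 'ignore' error handling are my own choices, not part of the NLPIR API:
# coding: utf-8
import chardet

def to_utf8(raw):
    """Normalize a byte string of unknown encoding to UTF-8."""
    guess = chardet.detect(raw).get('encoding') or 'utf-8'
    return raw.decode(guess, 'ignore').encode('utf-8')

# every document is converted to UTF-8 before being handed to the segmenter,
# e.g. docs = [to_utf8(open(path, 'rb').read()) for path in paths]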
System encoding of CentOS 6:
cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
SYSFONT="latarcyrheb-sun16"
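To see which encodings Python actually picks up from such an environment, a small sketch (Python 2):
# coding: utf-8
import sys, locale

print sys.getdefaultencoding()        # usually 'ascii' on Python 2
print sys.stdout.encoding             # the terminal encoding, e.g. 'UTF-8' (None when piped)
print locale.getpreferredencoding()   # follows LANG, e.g. 'UTF-8'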
my.cnf of MySQL:
character_set_server = utf8mb4
default_character_set = utf8
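On the client side, the connection charset should match these server settings; a minimal sketch assuming the pymysql driver, with placeholder connection values:
# coding: utf-8
import pymysql

# connect with an explicit charset so the client matches the server settings above
conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='test', charset='utf8mb4')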
Character encoding in SQL Server 2014: SQL Server 2014 has no statement to set UTF-8 as the character encoding; encoding is controlled only through collations.
If anything here is wrong, you are welcome to point it out.