Characters, character sets, encodings, and some problems encountered in Python (part 2)

Source: Internet
Author: User

This article summarizes what I learned from reading many blog posts. Many thanks to their authors for sharing so generously!

Links to the articles I drew on are listed at the end. If I have missed a reference, please email me and I will add the link.

If anything here infringes on your rights, contact me and I will remove it.

This is the second part. It focuses on encoding and on some encoding problems encountered in Python, with a practical slant.

The previous part introduced the concepts of characters and character sets, along with some simple Python code examples, and was more conceptual.

Address: http://www.cnblogs.com/echo-coding/p/7435118.html

Encoding problems definitely have a long history, and they are a real nuisance for newcomers (especially on Windows)...

II. decode and encode (Python encoding)

Characters, character sets, and character encodings were described above in preparation for this section.

Important concepts:

System encoding: the operating system's default encoding. Normally a Chinese Windows system defaults to GBK and a Linux system defaults to UTF-8. It can be queried with locale.getdefaultlocale() and changed with locale.setlocale(), and it is relevant to encode.

Use Python's built-in locale module to check the default encoding of the command line (that is, the system encoding) and to set it:

The result shows that the internal encoding of the current system is cp936, which is nearly the same as GBK. In fact, the Chinese editions of Windows XP and Windows 7 use cp936 (GBK) internally.

Tip: on Linux the default encoding is UTF-8; on Chinese Windows it is GBK.
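The check described above can be sketched as follows. This is a minimal sketch for Python 3, not the original listing: it uses locale.getpreferredencoding() because locale.getdefaultlocale() is deprecated in recent Python 3 releases, and the helper name system_encoding is only illustrative.

```python
import codecs
import locale


def system_encoding():
    """Return the encoding the locale settings prefer for text data."""
    # False: report the current setting without calling setlocale() first.
    return locale.getpreferredencoding(False)


enc = system_encoding()
# Normalise spelling variants ("UTF-8", "utf8", "cp936", ...) to the codec name.
codec_name = codecs.lookup(enc).name
print("system encoding:", enc, "->", codec_name)
```

On a Chinese Windows console this would be expected to report cp936, and on most Linux systems UTF-8, in line with the tip above.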

Python encoding: the decoding method configured inside Python. If it is not set, Python 2 uses ASCII by default. If the Python source file contains no Chinese characters, it does not matter how this is set.

To permanently set Python's default encoding to UTF-8, create a new sitecustomize.py file in Python's Lib\site-packages folder. The content is:

Restart the Python interpreter and run sys.getdefaultencoding(). The encoding is now utf8, and it stays that way across restarts. This works because Python loads this file at startup, so the default encoding is set without your having to add the fix to your code by hand every time. It is a permanent solution.
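The content of the file did not survive in this copy. A minimal sketch of what such a sitecustomize.py usually contains is shown here, assuming Python 2 and the sys.setdefaultencoding() call mentioned later in this article. The hasattr guard is an addition so that the file is harmless under Python 3, where the function does not exist.

```python
# sitecustomize.py (reconstruction under the assumptions stated above)
import sys

# Python 2 removes sys.setdefaultencoding() once startup has finished,
# so it is only reachable from a file loaded during startup, such as this one.
# Python 3 has no such function and already defaults to UTF-8; the guard
# simply skips the call there.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding("utf-8")
```

After a restart, sys.getdefaultencoding() should then report utf-8.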

File encoding: the encoding in which text is stored. Note that sys.getfilesystemencoding() reports the encoding used for file names, not for file contents.

Reading and writing files:

In Python 2, when a file is opened with the built-in open(), read() returns a str (bytes). After reading, you need to decode() it to unicode using the file's actual encoding. When calling write(), if the argument is unicode, encode() it with the encoding you want to write. If it is a str in some other encoding, first decode() it with its own encoding to get unicode, then encode() it with the target encoding. If unicode is passed directly to write(), Python first encodes it with the default encoding (ASCII unless changed) and then writes.
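The decode-after-read and encode-before-write steps can be sketched as follows. This sketch is written for Python 3, where binary mode ("rb"/"wb") plays the role of Python 2's byte-oriented open(); the file name demo_gbk.txt is only illustrative.

```python
import os
import tempfile

text = u"\u4e2d\u6587"  # two Chinese characters, written as escapes

# Write: encode the text explicitly, then write the resulting bytes.
path = os.path.join(tempfile.mkdtemp(), "demo_gbk.txt")
with open(path, "wb") as f:
    f.write(text.encode("gbk"))

# Read: read() on a binary-mode file returns raw bytes...
with open(path, "rb") as f:
    raw = f.read()

# ...which must be decoded with the encoding the file was written in.
decoded = raw.decode("gbk")

# Re-encoding for a different target: decode first, then encode.
as_utf8 = raw.decode("gbk").encode("utf-8")

print(repr(raw), repr(as_utf8), decoded == text)
```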

In addition, the codecs module provides an open() function that lets you specify an encoding when opening a file. Reading from a file opened this way returns unicode. When writing, if the argument is unicode, it is encoded with the encoding given to open() and then written. If the argument is a str, it is first decoded to unicode with the default encoding (ASCII unless changed) and then handled as above. Compared with the built-in open(), this approach is less prone to encoding problems. codecs.open() gives you a unicode channel directly.
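A small sketch of the codecs.open() behaviour described above. It runs under Python 3 as well, although recent Python 3 releases steer users toward the built-in open(path, encoding=...) instead. The file name is illustrative.

```python
import codecs
import os
import tempfile

text = u"\u4e2d\u6587"  # two Chinese characters, written as escapes
path = os.path.join(tempfile.mkdtemp(), "demo_codecs.txt")

# The encoding is given once at open(); write() then accepts text directly.
with codecs.open(path, "w", encoding="gbk") as f:
    f.write(text)

# Reading through the same kind of wrapper yields decoded text, not bytes.
with codecs.open(path, "r", encoding="gbk") as f:
    read_back = f.read()

# The bytes on disk are GBK, which a plain binary read confirms.
with open(path, "rb") as f:
    on_disk = f.read()

print(read_back == text, repr(on_disk))
```

The point of the design is that encoding and decoding happen at one place, the open() call, instead of at every read() and write().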

Encoding in Python code (source encoding):

1. If no encoding is specified for a string literal in Python 2 code, its encoding is that of the source file. For example, take str = 'Chinese', where the literal stands for text written in Chinese characters. In a UTF-8 encoded source file the string is UTF-8 encoded; in a gb2312 file it is gb2312 encoded. So how do you know the encoding of the source file itself?

(1) Specify the source file's encoding yourself: add "# -*- coding: utf-8 -*-" at the top of the file to declare that it is UTF-8 encoded. Strings with no explicit encoding are then UTF-8.

The "# -*- coding: utf-8 -*-" line at the top of a file appears to serve three purposes:

1. It is required if the code contains Chinese, even in comments (otherwise Python 2 reports an error and cannot parse the file).

2. More advanced editors (such as my Emacs) use the header declaration to choose the encoding in which the file is saved.

3. The interpreter uses the header declaration to decode a unicode literal such as u"Life is short" when creating the unicode object, so the header declaration must match the encoding in which the file is actually stored.

(2) When no source encoding is declared, Python 2 falls back to its default encoding when reading the file (normally ASCII, even though on Chinese Windows the file is typically saved as cp936 (GBK)). Use sys.getdefaultencoding() and sys.setdefaultencoding('...') to get and set the default encoding.
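Point 1 above, that the bytes behind a Python 2 string literal depend on the encoding of the source file, can be illustrated by encoding the same text both ways. This is a Python 3 sketch that emulates the effect with explicit encode() calls rather than with two differently saved files.

```python
text = u"\u4e2d\u6587"  # two Chinese characters, written as escapes

# What a Python 2 str literal would hold in a UTF-8 source file...
utf8_bytes = text.encode("utf-8")
# ...and what the same literal would hold in a GBK/gb2312 source file.
gbk_bytes = text.encode("gbk")

print(repr(utf8_bytes))
print(repr(gbk_bytes))
print(len(text), len(utf8_bytes), len(gbk_bytes))
```

The same two characters occupy six bytes in UTF-8 and four in GBK, which is why a mismatched header declaration produces wrong text.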

Terminal input/output encoding: sys.stdin.encoding and sys.stdout.encoding. They must match the locale encoding for a str to print correctly.

print converts text according to sys.stdout.encoding.

How print displays a value

When print is called in Python 2.7 on a variable var, some character processing happens first. If var is a str, its bytes are sent to the terminal as they are. If var is unicode, it is first encoded into a str (the encoding is taken from stdout) and then displayed on the terminal. If the encoding of a str differs from the encoding the terminal is set to, garbled characters may appear.

An encoding error raised when printing a string comes from sys.stdout.encoding: the text being printed must be representable in the encoding given by sys.stdout.encoding, otherwise an encoding error occurs.

If the console cannot display Chinese characters properly, check the console encoding, which on Windows is determined by the operating system.

My operating system is Windows 8 (GBK).

The console encoding determines the value of sys.stdout.encoding, so here sys.stdout.encoding == 'cp936'.
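The encoding step that print performs on unicode text can be imitated by hand. In this sketch the helper encode_for_terminal is hypothetical and only mimics that step; it is not how print is implemented.

```python
import sys

text = u"\u4e2d\u6587"  # two Chinese characters, written as escapes


def encode_for_terminal(value, encoding):
    """Encode text the way print would for a terminal using `encoding`.

    Returns the bytes on success, or None if the text cannot be
    represented in that encoding (the case where print raises).
    """
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


# A cp936 (GBK) console, as on Chinese Windows, can represent the text.
print(repr(encode_for_terminal(text, "cp936")))
# An ASCII-only stdout cannot, which is where the encoding error comes from.
print(repr(encode_for_terminal(text, "ascii")))
# The encoding actually in effect for this process's stdout.
print(getattr(sys.stdout, "encoding", None))
```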

Decode & encode:

Decode: decoding (from another encoding such as UTF-8 or GBK to unicode).

Encode: encoding (from unicode to another encoding such as UTF-8 or GBK).

Simply put, bytes must be decoded with the same rules that were used to encode them. Otherwise the result is garbled text, or an error is raised outright.

The catch is that the system has a default encoding. If your file is clearly UTF-8 encoded but gets decoded as GBK, the result is garbled text or an outright error.
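Both failure modes can be reproduced in a few lines. This is a minimal Python 3 sketch; errors="replace" is used in the garbling case only so that the example keeps running whatever the byte pairs happen to map to.

```python
text = u"\u4e2d\u6587"  # two Chinese characters, written as escapes

# Matching rules: encoding and decoding with the same codec round-trips.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Mismatched rules, case 1: garbled text (UTF-8 bytes read as GBK).
garbled = utf8_bytes.decode("gbk", errors="replace")

# Mismatched rules, case 2: an outright error (GBK bytes read as UTF-8).
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(round_trip == text, garbled != text, decode_failed)
```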

About printing:

When you print a str, you are really sending a byte stream straight to the shell. If the byte stream's encoding differs from the shell's encoding, the output is garbled.

When you print unicode, it is automatically encoded into the shell's encoding, so the output is not garbled.

Other commands:

File system encoding: sys.getfilesystemencoding()

Terminal input encoding: sys.stdin.encoding

Terminal output encoding: sys.stdout.encoding
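The values listed above can be collected in one place for debugging. This is a small sketch; encoding_report is a hypothetical helper name, and the default encoding discussed earlier is included for convenience.

```python
import sys


def encoding_report():
    """Collect the encodings named above into a dict."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        # stdin/stdout may be replaced or missing (for example under a GUI
        # launcher), so the attribute is read defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(encoding_report().items()):
    print(name, "=", value)
```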

Some suggestions:

1. Use a character encoding declaration, and use the same declaration in every source file of a project;

2. Drop str and use unicode throughout: prefixing string literals with u eliminates about 90% of encoding problems;

3. Use codecs.open() in place of the built-in open();

4. Encodings to avoid at all costs: MBCS/DBCS and UTF-16;

5. Set defaultencoding (the default value is ascii);

6. Save the source file in the same encoding as the # coding: xxx declaration in its header.
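Suggestions 2 and 3 amount to decoding at the point where data enters the program, working on text only, and encoding at the point where it leaves. A minimal sketch of that pattern follows; to_text and to_bytes are hypothetical helper names, not functions from any library.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding it first if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Decode at the input edge, combine as text, encode at the output edge.
incoming = b"\xd6\xd0\xce\xc4"              # GBK bytes from some external source
greeting = to_text(incoming, "gbk") + u"!"  # all work happens on text
outgoing = to_bytes(greeting, "utf-8")

print(repr(outgoing))
```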

Others:

The major difference between Python 3 and Python 2 is that Python 3 itself uses Unicode for strings by default. There is no longer a distinction between "abc" and u"abc": the string "abc" is Unicode by default and no longer represents text in the local encoding.
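A short check of this under Python 3 (a sketch; the variable names are illustrative):

```python
plain = "abc"
prefixed = u"abc"

# Both literals produce the same type and compare equal.
same_type = type(plain) is type(prefixed)
equal = plain == prefixed

# A str holds characters, not encoded bytes: two characters, length two.
chinese = "\u4e2d\u6587"
char_count = len(chinese)

# Bytes only appear after an explicit encode(), and are a separate type.
encoded = chinese.encode("utf-8")

print(same_type, equal, char_count, type(encoded).__name__, len(encoded))
```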

A common claim is that from Python 2.7 onward setdefaultencoding is unnecessary, because declaring the header and calling setdefaultencoding amount to the same thing.

In fact the two serve different purposes:

1. # coding: utf-8 defines the encoding of the source file. Without it, the source cannot contain Chinese strings;

2. sys.getdefaultencoding() returns the default encoding used when strings are converted implicitly.

Root cause: strings in Python 2

Python has pulled a number of tricks to keep its syntax concise and easy to use. Blurring the line between byte strings and text strings is one of them.

In Python 2 there are three main string types: unicode (text string), str (byte string, binary data), and basestring, the parent class of the first two.

In language design, whether a sequence of bytes should count as a string is in fact controversial. The well-known Java and C# voted against, while Python sided with the supporters. In many cases, text operations such as regular expression matching and character replacement are not needed for bytes. Python treats bytes as characters, so the two types share the same set of operations.

As a consequence, Python 2 tries to convert bytes automatically when needed, for example when comparing bytes and text with == or when concatenating them. Without an encoding, conversion between the two types is impossible, so Python needs a default encoding. In the Python 2 era ASCII was the most popular choice (or so one could argue), so Python 2 chose ASCII. But as everyone knows, ASCII is of little use wherever conversion is actually needed (what can you do with 128 characters?).
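The automatic conversion described above can be imitated under Python 3 by doing the ASCII decode by hand. In this sketch, implicit_py2_decode is a hypothetical stand-in for the step Python 2 performed silently; it is not an actual Python 2 API.

```python
def implicit_py2_decode(byte_string):
    """Mimic Python 2's automatic str -> unicode step, which used ASCII."""
    return byte_string.decode("ascii")


# Pure-ASCII bytes convert silently, which hides the problem during testing.
ok = implicit_py2_decode(b"abc") + u"def"

# Any byte above 0x7F makes the same conversion fail.
try:
    implicit_py2_decode(b"\xe4\xb8\xad\xe6\x96\x87") + u"def"
    failed = False
except UnicodeDecodeError:
    failed = True

# Python 3 refuses to guess at all: mixing bytes and str is a TypeError.
try:
    b"abc" + u"def"
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(ok, failed, mixed_allowed)
```

This is why Python 2 code often worked with English test data and then broke on the first Chinese input.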

After many years of discussion, Python 3 finally got it right. Strings are Unicode and the default encoding is UTF-8, so conversions that rely on the default no longer break on non-ASCII text.

A puzzling case:

Open IPython and start by running:

After that finishes, run a .py file (such as tb.py), and something strange happens:

So far I have found no solution...

References and blogs:

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

http://blog.chinaunix.net/uid-200142-id-4461708.html

http://www.cnblogs.com/evening/archive/2012/04/19/2457440.html

http://blog.csdn.net/olanlanxiari/article/details/48201231

http://www.jb51.net/article/87739.htm

http://www.cnblogs.com/JohnABC/p/4015504.html

Https://m.baidu.com/pu=sz@1321_2001/from=0/bd_page_type=1/ssid=0/uid=0/pu=sz%401321_2001%2Cta%40utouch_1_10.2_3_602/baiduid=F9234C37D7B29B954D4706B9B57F904E/w=0_10_unicode+gbk%E7%BC%96%E7%A0%81/t=wap/l=3/tc? Ref = www_utouch & lid = 13436452897767398859 & order = 3 & vit = osres & tj = www_normal_3_0_10_title & m = 8 & srd = 1 & dict = 20 & title = % E6 % B7 % B1 % E5 % 85% A5 % E7 % 90% 86% E8 % A7 % A3-% E5 % AD % 97% E7 % AC % A6 % E7 % BC % 96% E7 % A0 % 81 ASCII % 2CGB2312% 2 CGBK % 2 CUnicode % 2CUTF-8 -... & sec = 21874 & di = 38f964cdb7b434e5 & bdenc = 1 & nsrc = Beijing

Https://m.baidu.com/pu=sz@1321_2001/from=0/bd_page_type=1/ssid=0/uid=0/pu=sz%401321_2001%2Cta%40utouch_1_10.2_3_602/baiduid=F9234C37D7B29B954D4706B9B57F904E/w=0_10_unicode+gbk%E7%BC%96%E7%A0%81/t=wap/l=3/tc? Ref = www_utouch & lid = 13436452897767398859 & order = 1 & vit = osres & tj = www_normal_1_0_10_title & m = 8 & srd = 1 & dict = 30 & title = ASCIIUnicodeGBK % E5 % 92% 8CUTF-8% E5 % AD % 97% E7 % AC % A6 % E7 % BC % 96% E7 % A0 % 81% E7 % 9A % 84% E5 % 8C % BA % E5 % 88% AB % E8 % 81% 94... _ % E5 % 8D % 9A % E5 % AE % A2 % E5 % 9B % AD & sec = 21874 & di = 265cf5a20e05fae9 & bdenc = 1 & nsrc = Beijing

http://www.cnblogs.com/work115/p/5924446.html

https://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
