Encoding methods and problems in Windows, Python, and Notepad++

Last Update:2016-07-10 Source: Internet

Author: User


Conclusions first:

- The default encoding used by Windows cmd is GBK (ANSI). If Chinese text from a bat file or a Python script must be displayed in cmd and comes out garbled, first check whether the cause is that the text is none of the following: 1. Chinese encoded in GBK; 2. converted to GBK in the code; or 3. in Python code, a Unicode object.
- Do not edit and save your own UTF-8 encoded files with the text editor built into Windows (Notepad), because Windows adds a BOM to UTF-8 content by default. Merely viewing a file is harmless, because no BOM is added unless the file is saved; once it is saved, Windows adds the BOM automatically.
- Note the Notepad++ setting under Settings -> Preferences -> New Document, the default encoding for new documents. It is recommended to keep the Windows default, ANSI. Otherwise a newly created UTF-8 document will get a BOM if it is later opened and saved in Notepad, which leads to obscure runtime errors.
- Writing .py files in Notepad++ under Windows does not cause the BOM problem as long as the file is never edited in Notepad. But because the output sometimes has to be displayed in cmd, it is best not to use UTF-8; use ANSI, that is, GBK.
- The #!coding=utf-8 or #!coding=gbk line at the top of Python code is addressed to the Python interpreter. The conclusion is that the declaration should match the encoding of the document itself, so #!coding=gbk is the best choice for Python code under Windows.

With the conclusions stated, here are the details.

This document was prompted by a problem with writing Python code in Notepad++:

1 Background:

The Python Script plugin is installed in Notepad++, so that Notepad++ can be used as an IDE-like editor.
Because some Python code written earlier involved Chinese paths and Chinese comments, #!coding=utf-8 was used directly in Notepad++, but the .py file itself was saved in the default GBK encoding.
The Notepad++ setting "use UTF-8 to open ANSI files" had previously been enabled. This is a strange feature: the file itself is still encoded in GBK, but Notepad++ transcodes it to UTF-8 for display after opening it and then saves it in the original encoding, GBK. This caused a series of misunderstandings in the later analysis, which are explained in detail below.

2 How the problem appears:

Python code with this "mismatched encoding" works without problems when it is edited directly in Notepad++ and run with Python Script or with the Python interpreter: Chinese paths are recognized, files under those paths are read correctly, and so on.
When the same code is opened in PyCharm, the Chinese in the code is displayed as garbled text in the IDE and the code cannot be run.

3 Analysis of the problem

3.1 Encoding of characters

ASCII code

Eight bits can be combined into 256 states, and such a group is called a byte. In other words, a byte can represent 256 different states, each corresponding to one symbol, for a total of 256 symbols, from 00000000 to 11111111.

The ASCII code specifies a total of 128 characters; for example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the last 7 bits of a byte, and the first bit is always 0.
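As a quick check of the values above, here is a minimal Python 3 sketch (the article's own snippets are Python 2, but the built-ins ord and format behave the same way for these characters). It prints the code and the 8-bit pattern of the space and of the uppercase letter A.

```python
# Print the ASCII code and its 8-bit binary form for two characters.
for ch in (" ", "A"):
    code = ord(ch)
    print(repr(ch), code, format(code, "08b"))

space_bits = format(ord(" "), "08b")  # 8-bit pattern of the space
a_bits = format(ord("A"), "08b")      # 8-bit pattern of the letter A
```

In both patterns the leading bit is 0, which is the property UTF-8 relies on later in this article.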

Non-ASCII encoding

Encoding 128 symbols is enough for English, but not for other languages. In French, for example, letters can carry accent marks that ASCII cannot represent. As a result, some European countries decided to use the idle highest bit of the byte to add new symbols. For example, the code for é in French is 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols. However, this creates a new problem. Different countries have different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, but in the Hebrew encoding it represents the letter Gimel (ג), and in the Russian encoding it represents yet another symbol. In all of these encodings, however, the symbols for 0–127 are the same; only the 128–255 range differs.
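The conflict can be reproduced with Python's built-in codecs. The sketch below (Python 3) assumes three legacy DOS code pages, cp437, cp862 and cp866, as stand-ins for the French, Hebrew and Russian encodings; the article does not say which exact encodings it has in mind, so treat the choice as an illustration only.

```python
raw = bytes([130])  # the single byte 0x82 (binary 10000010)

# Assumed stand-ins: cp437 (Western European), cp862 (Hebrew), cp866 (Russian).
decoded = {cp: raw.decode(cp) for cp in ("cp437", "cp862", "cp866")}
for cp, ch in decoded.items():
    print(cp, "U+%04X" % ord(ch), ch)

# Bytes in the range 0-127 mean the same thing in all three code pages.
low_range_agrees = all(bytes([65]).decode(cp) == "A" for cp in decoded)
```

The same byte value yields three different letters, while the byte 65 is "A" in every one of them.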

Asian languages use even more symbols; there are about 100,000 Chinese characters. A single byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols. Chinese encoding deserves an article of its own and is not covered in this note. It is only pointed out here that, although they also use multiple bytes per symbol, the GB family of Chinese encodings is unrelated to Unicode and UTF-8.

Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if it is interpreted with the wrong encoding, the text comes out garbled. Why do e-mails often appear garbled? Because the sender and the recipient use different encodings.

Imagine an encoding that includes every symbol in the world. If each symbol is given a unique code, the garbling problem disappears. This is Unicode, which, as its name indicates, is an encoding of all symbols. Unicode is of course a very large set; its current size can accommodate more than one million symbols. Each symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full symbol table, see unicode.org or a dedicated table of Chinese characters. It is important to note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which in binary is a full 15 bits (100111000100101); that is, this symbol needs at least 2 bytes. Other, larger symbols may need 3 or 4 bytes, or even more.

There are two serious problems here. The first is: how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter; if Unicode uniformly required three or four bytes per symbol, then two or three bytes of every English letter would be 0, which is a great waste of storage, and text files would become two or three times larger. That is unacceptable. The results were: 1) several storage methods for Unicode appeared, which means there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the rise of the Internet.
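The storage argument is easy to put into numbers. The sketch below (Python 3) uses UTF-32 as an example of a fixed four-bytes-per-symbol scheme and compares it with UTF-8 for a short English word; the word "hello" is just an illustrative input.

```python
text = "hello"  # illustrative English-only input

fixed_width = text.encode("utf-32-be")  # four bytes per character, no BOM
variable_width = text.encode("utf-8")   # one byte per ASCII character

utf32_size = len(fixed_width)
utf8_size = len(variable_width)
zero_bytes = fixed_width.count(0)       # padding bytes that carry no information
print(utf32_size, utf8_size, zero_bytes)
```

For English text, three out of every four bytes in the fixed-width form are zero, which is the waste the paragraph above describes.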

UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the ways Unicode is implemented. One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol. The encoding rules of UTF-8 are simple; there are only two:
1. For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are identical.
2. For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules, and the letter x represents the bits that are available for encoding.

Unicode symbol range | UTF-8 encoding
(hex)                | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a character by itself; if the first bit is 1, the number of consecutive leading 1 bits is the number of bytes the current character occupies.
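The reading rule can be written down directly. The following Python 3 sketch counts the leading 1 bits of a first byte; the function name utf8_sequence_length is made up for this illustration and is not part of any library.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    length = 0
    mask = 0b10000000
    while first_byte & mask:  # count consecutive leading 1 bits
        length += 1
        mask >>= 1
    if length == 1 or length > 4:
        # 10xxxxxx is a continuation byte; more than four 1 bits is not in the table
        raise ValueError("not a valid UTF-8 first byte: 0x%02x" % first_byte)
    return length


print(utf8_sequence_length(0x41), utf8_sequence_length(0xE4))
```

A byte of the form 10xxxxxx is rejected because, per the table, it can only appear after a first byte.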

Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding is done. The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, the x positions in the format are filled in from back to front, and any remaining positions are padded with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
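The same filling procedure can be sketched with bit operations. This Python 3 example handles only the three-byte row of the table; the helper name encode_three_byte is hypothetical, and the result is compared against Python's own UTF-8 encoder.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Fill the 1110xxxx 10xxxxxx 10xxxxxx template for U+0800..U+FFFF."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    first = 0b11100000 | (code_point >> 12)             # top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)  # middle 6 bits
    third = 0b10000000 | (code_point & 0b111111)          # last 6 bits
    return bytes([first, second, third])


manual = encode_three_byte(0x4E25)       # the character 严
builtin = "\u4e25".encode("utf-8")       # Python's own encoder, for comparison
print(manual.hex(), builtin.hex())
```

Both routes give the same three bytes, matching the hand calculation above.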

Encoding and encoding format conversion under Windows

Using the example from the previous section, the Unicode code of 严 is 4E25 and its UTF-8 encoding is E4B8A5; the two are different. The conversion between them can be done with Windows Notepad.

After opening a file, click "Save As" in the "File" menu. A dialog box appears, with an "Encoding" drop-down list at the bottom.

It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1. ANSI is the default. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
2. Unicode refers to UCS-2, which stores each character's Unicode code directly in two bytes. This option uses the little endian format. (That is, this conversion under Windows is little endian, so the bytes appear reversed relative to the reading order, as explained below.)
3. Unicode big endian corresponds to the previous option. The next section explains the meaning of little endian and big endian.
4. UTF-8 is the encoding described in the previous section.

After selecting an encoding, click "Save" and the file's encoding is converted immediately.

Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one being 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way. In other words, putting the first (high) byte in front is big endian, and putting the second (low) byte in front is little endian.

Then, naturally, a question arises: how does the computer know which byte order a particular file uses? The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. It is named "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and is denoted by FEFF. This is exactly two bytes, and FF is 1 larger than FE.
1. If the first two bytes of a text file are FE FF, the file is big endian;
2. If the first two bytes are FF FE, the file is little endian.
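The two rules can be checked with a few lines of Python 3. The sketch below encodes 严 in both byte orders and inspects the first two bytes of some data; detect_utf16_order is a hypothetical helper written for this example, while the BOM constants come from the standard codecs module.

```python
import codecs

text = "\u4e25"  # the character 严, code U+4E25

big = text.encode("utf-16-be")     # high byte first
little = text.encode("utf-16-le")  # low byte first


def detect_utf16_order(data: bytes) -> str:
    """Report the byte order announced by the first two bytes, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little endian"
    return "unknown"


print(big.hex(), little.hex())
print(detect_utf16_order(codecs.BOM_UTF16_LE + little))
```

Without the marker the function cannot tell the two orders apart, which is exactly why the marker exists.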

Take 我o as an example:

**big endian FE FF**

我: \u 62 11

o: \u 00 6F

**little endian FF FE**

我: \u 11 62

o: \u 6F 00

**UTF-8**

我: E6 88 91

o: 6F

PS: On Windows, a BOM marker, EF BB BF, is added at the beginning.

**ANSI (on Windows, GBK or GB2312 by default)**

我: CE D2

o: 6F
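The byte values listed above can be reproduced with Python 3's built-in codecs. This is only a sketch: it assumes that Python's "gbk" codec stands in for the Windows ANSI code page on a Simplified Chinese system, and that "utf-8-sig" models the BOM that Notepad adds.

```python
text = "\u6211o"  # 我 followed by the letter o


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join("%02X" % b for b in data)


dumps = {enc: hex_dump(text.encode(enc))
         for enc in ("utf-16-be", "utf-16-le", "utf-8", "gbk")}
for enc, dump in dumps.items():
    print(enc, dump)

# "utf-8-sig" writes the EF BB BF marker in front of the UTF-8 bytes.
with_bom = hex_dump(text.encode("utf-8-sig"))
```

Only the UTF-16 variants change with byte order; UTF-8 and GBK are byte sequences with a fixed order.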

PS: Some websites offer this kind of encoding conversion, but web-based conversion is often error-prone because of the browser's own encoding issues, so test the results.

3.2 Encoding of characters in notepad++

With these encodings understood: what is the actual format of the document that I saved as GBK but declared with #!coding=utf-8? I tried viewing it with a program that shows the raw bytes of a file and found something strange. When the file was dragged into a binary viewer such as Beyond Compare, it showed GBK encoding (the Chinese character 我 appeared as CE D2), but the HEX plugin in Notepad++ showed UTF-8 encoding. At first I thought Notepad++ was automatically adjusting the character encoding based on the #!coding=utf-8 header of the file. The real cause was the option mentioned at the beginning of this article: under the default encoding for new documents, "use UTF-8 to open ANSI files" was selected.

Therefore, every GBK-encoded file opened in Notepad++ is converted to UTF-8 for display, so the HEX plugin in Notepad++ shows the file as UTF-8 encoded, and clicking the "Display as ANSI" menu item in Notepad++ makes the text appear garbled. But on exit the file goes back to GBK.

Solution: clear the option mentioned above; see the beginning of this article.

3.3 Encoding of characters in Python

This section uses the character 哈 as the example for explaining every problem. The various encodings of 哈 are as follows:

Unicode (UTF-16 LE):  C854
UTF-8:                E59388
GBK:                  B9FE

3.3.1 str and unicode in Python

What exactly are str and unicode in Python? "Unicode" in Python generally refers to a Unicode object; for example, the Unicode object for '哈哈' is u'\u54c8\u54c8'. A str is a byte array that holds the result of encoding a Unicode object (with UTF-8, GBK, cp936, GB2312, and so on). It is just a stream of bytes with no other meaning; to make the content of this byte stream display meaningfully, it must be decoded with the correct encoding. For example:

Encoding the Unicode object 哈哈 as UTF-8 gives the str s_utf8. s_utf8 is a byte array that stores '\xe5\x93\x88\xe5\x93\x88'. But if this UTF-8 encoded byte array is printed directly, it shows garbled text. Why?

Because the print statement is implemented by handing the output to the operating system, and the operating system decodes the incoming byte stream according to the system encoding. This explains why the UTF-8 string "哈哈" is output as "鍝堝搱": when '\xe5\x93\x88\xe5\x93\x88' is interpreted as GBK, it displays as "鍝堝搱".

At the same time, because a str records a byte array that is only the storage format of some encoding, what it looks like when written to a file or printed depends entirely on the encoding used to decode it. One more note on print: when a Unicode object is passed to print, it is converted internally to the local default encoding (this is why a Unicode object such as su displays correctly when printed directly).
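The garbled output can be simulated without a Windows console. The sketch below is Python 3, where bytes plays the role of the article's Python 2 str and str plays the role of unicode; shown_by_console is a hypothetical helper that models a console decoding whatever bytes it receives with its own encoding.

```python
text = "\u54c8\u54c8"          # 哈哈 as a text (Unicode) object
s_utf8 = text.encode("utf-8")  # the byte array b'\xe5\x93\x88\xe5\x93\x88'


def shown_by_console(data: bytes, console_encoding: str) -> str:
    """Model of a console: it decodes incoming bytes with its own encoding."""
    return data.decode(console_encoding)


garbled = shown_by_console(s_utf8, "gbk")              # UTF-8 bytes on a GBK console
correct = shown_by_console(text.encode("gbk"), "gbk")  # bytes converted to GBK first
print(garbled, correct)
```

Six UTF-8 bytes are read as three two-byte GBK characters, which is why two characters turn into three.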

3.3.2 Conversion between str and Unicode objects

Conversion between str and Unicode objects is done with encode and decode, used as follows:

Convert the GBK-encoded '哈哈' to Unicode, and then to UTF-8.
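A minimal sketch of that conversion, written for Python 3 (where the GBK-encoded value is a bytes object and the Unicode object is a str); in the article's Python 2 the same decode and encode calls apply to str and unicode. The variable names are illustrative.

```python
s_gbk = b"\xb9\xfe\xb9\xfe"     # 哈哈 encoded in GBK

text = s_gbk.decode("gbk")      # bytes -> Unicode object
s_utf8 = text.encode("utf-8")   # Unicode object -> UTF-8 bytes

print(repr(text), s_utf8)
```

The Unicode object is always the intermediate step: there is no direct GBK-to-UTF-8 call, only decode followed by encode.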

3.3.3 Working with files in different encodings

Create a file Test.txt in ANSI format with the content "abc中文".

Read it with Python:

    # coding=gbk
    print open("Test.txt").read()

Result: abc中文

Change the file format to UTF-8:

Result: abc涓枃

Obviously, if the file format is not GBK, the content needs to be decoded:

    # coding=gbk
    import codecs
    print open("Test.txt").read().decode("utf-8")

Result: abc中文

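I edited the Test.txt above with EditPlus. When I instead edit it with the Notepad that ships with Windows and save it in UTF-8 format, running the script reports an error:

    Traceback (most recent call last):
      File "ChineseTest.py", line 3, in <module>
        print open("Test.txt").read().decode("utf-8")
    UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence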

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8. So these bytes must be removed when the file is read; the codecs module in Python defines a constant for them:

    # coding=gbk
    import codecs
    data = open("Test.txt").read()
    # Drop the three-byte UTF-8 BOM if the editor added one.
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    print data.decode("utf-8")

Result: abc中文
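As an alternative to slicing the bytes by hand, Python also ships a "utf-8-sig" codec that skips the BOM when it is present. The sketch below is Python 3 and builds the bytes in memory instead of reading Test.txt, so the BOM-prefixed data is an assumption about what Notepad writes, based on the description above.

```python
import codecs

# BOM first, then the UTF-8 bytes of "abc中文", as described above.
raw = codecs.BOM_UTF8 + "abc\u4e2d\u6587".encode("utf-8")

with_bom = raw.decode("utf-8")   # keeps U+FEFF as the first character
clean = raw.decode("utf-8-sig")  # drops the BOM if it is present

try:
    with_bom.encode("gbk")       # what printing to a GBK console attempts
    bom_fits_gbk = True
except UnicodeEncodeError:
    bom_fits_gbk = False

print(repr(with_bom[0]), clean == "abc\u4e2d\u6587", bom_fits_gbk)
```

The failed GBK encoding of U+FEFF is the source of the error message shown earlier: the decode succeeds, and it is the later output step that breaks.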

3.3.4 File encoding format and the role of the encoding declaration

What roles do the encoding format of the source file and the encoding declaration play for strings?

First, the file encoding format: the encoding format of the file determines how the strings in the source file are encoded.

Second, the encoding declaration, that is, the # coding=gbk line at the top of each file. It has three roles:
1. It declares that non-ASCII characters, usually Chinese, appear in the source file;
2. In advanced IDEs (for example PyCharm), the IDE saves the file in the encoding you specify;
3. It determines the encoding used to decode '哈' into Unicode for a declaration such as u'哈' in the source code. This is also the confusing part.

See this example:

    #coding:gbk
    ss = u'哈哈'
    print repr(ss)
    print 'ss:%s' % ss

Save this code as UTF-8 text and run it. What do you think it will output? The first guess is surely:

    u'\u54c8\u54c8'
    ss:哈哈

But the actual output is:

    u'\u935d\u581d\u6431'
    ss:鍝堝搱
Why is this happening? Here the encoding declaration comes into play. When ss = u'哈哈' runs, the process can be divided into the following steps:
  1. **Get the encoding of '哈哈': this is determined by the file encoding format**, giving '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoded form of 哈哈).
  2. When converting to Unicode, **'\xe5\x93\x88\xe5\x93\x88' is decoded not with UTF-8 but with the encoding named in the declaration, GBK**. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK gives '鍝堝搱', and the Unicode codes of these three characters are u'\u935d\u581d\u6431', which explains why print repr(ss) outputs u'\u935d\u581d\u6431'.

OK, this is a little convoluted. Let's analyze the next example:

```python
#!coding=utf-8
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss
```

Save this example in GBK encoding and run it. The result, unexpectedly, is:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte

Why is there a UTF-8 decoding error here? Recall the previous example and it becomes clear:
1. In the first step, because the file encoding is GBK, the encoding obtained for '哈哈' is the GBK form '\xb9\xfe\xb9\xfe'.
2. When converting to Unicode, '\xb9\xfe\xb9\xfe' is decoded with UTF-8. Checking it against the UTF-8 table shows that this sequence does not exist in UTF-8 (0xb9 is 10111001, which can only be a following byte, never a first byte), so the error above is reported.
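Both examples follow the same two-step model, which can be sketched in Python 3. The function decode_literal is a hypothetical stand-in for what the Python 2 interpreter does with a u'...' literal according to the explanation above (bytes taken from the file encoding, decoding done with the declared encoding); it is not an actual interpreter API.

```python
def decode_literal(text: str, file_encoding: str, declared_encoding: str) -> str:
    """Model of a u'...' literal: file bytes are decoded with the declaration."""
    raw = text.encode(file_encoding)       # step 1: bytes as stored in the file
    return raw.decode(declared_encoding)   # step 2: decode with the declaration


haha = "\u54c8\u54c8"  # 哈哈

# File saved as UTF-8 but declared as GBK: wrong characters, no error.
case1 = decode_literal(haha, "utf-8", "gbk")

# File saved as GBK but declared as UTF-8: decoding fails outright.
try:
    decode_literal(haha, "gbk", "utf-8")
    case2 = "no error"
except UnicodeDecodeError as exc:
    case2 = "UnicodeDecodeError at byte 0x%02x" % exc.object[exc.start]

# Declaration matches the file encoding: the text survives unchanged.
case3 = decode_literal(haha, "gbk", "gbk")
print(repr(case1), case2, repr(case3))
```

Only the matching case returns the original text, which is the reason for the advice to keep the declaration and the file encoding identical.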

So, everything is clear.

Summary

1. Notepad++ does not determine or change a file's encoding based on the #!coding header of the file; the #!coding line is addressed to the Python interpreter.
2. On typing Chinese in Python code written in Notepad++:
- If you create a new txt file in Windows, change its suffix to py, and open it in Notepad++ to write Python code: because the file created by Notepad is ANSI encoded and Python does not recognize Chinese by default, you have to add #!coding=gbk or #!coding=utf-8 (either works). In this case the cmd display is not garbled.
- If the file has been changed to UTF-8 encoding, Python seems to recognize it directly, even without #!coding at the top. But because cmd displays GBK, to avoid garbled output you can first convert to Unicode with .decode('utf-8'), or prefix the Chinese string with u, as in u'中文'.
3. The garbled text when opening the file in PyCharm was caused by the mismatch between the declaration and the actual encoding in Notepad++, since PyCharm decodes the document according to the declaration. So it is still recommended to use GBK uniformly under Windows.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
