Python 2.7 encoding
0. Preface
Cause: While writing a data preprocessing program I ran into a problem. The search method of the re module either found nothing or returned wrong or garbled results.
Investigation: After reading up on the subject and experimenting, I found that UTF-8 encoded strings of type str do not work reliably with search. A str is a byte string, there is no fixed one-to-one correspondence between its bytes and characters, and a regular expression applied to a byte string matches bytes rather than characters.
Result: Make both the regular expression and the target string unicode. In a unicode string each element corresponds to one character, so the regular expression matches character by character.
Afterwards: It occurred to me that I should write up the encoding problem so as not to fall into the same pit again. Hence this article.
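The mismatch described above can be reproduced in a few lines. This is only a minimal sketch, not the original preprocessing program: the sample character and the pattern are made up for illustration, and the code is written so that it runs on both Python 2.7 and Python 3 (hence the explicit u'' and b'' prefixes).

```python
import re

char = u'\u4e2d'             # one Chinese character (illustrative sample)
raw = char.encode('utf-8')   # the same character as three UTF-8 bytes

# On a unicode string "." consumes one character, so "^.$" matches.
text_match = re.match(u'^.$', char) is not None

# On a byte string "." consumes one byte, so "^.$" cannot span three bytes.
byte_match = re.match(b'^.$', raw) is not None

print('%d %s' % (len(char), text_match))   # 1 True
print('%d %s' % (len(raw), byte_match))    # 3 False
```

The same pattern gives different answers purely because of the type of the string it is applied to.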
1. ASCII, Unicode, UTF-8
ASCII: the earliest encoding. It contains only 128 characters (codes 0 to 127): English letters, digits, punctuation marks and some control and other symbols. One byte represents one character.
Unicode: One byte is not enough to cover the characters of all the world's languages, so Unicode assigns a unique number (a code point) to every character. A common way to store these code points uses two bytes per character, with four bytes for less common characters. When this article speaks loosely of "unicode encoding" it means this kind of representation, where one element corresponds to one character. Strictly speaking Unicode is the character set, and how many bytes a character takes depends on the encoding used to store it.
UTF-8: ASCII characters need only one byte, but a two-byte representation spends two bytes on them as well. A text made up entirely of English (ASCII characters) would waste a lot of space, which is why UTF-8 was created. UTF-8 is a variable-length encoding: the number of bytes depends on the character. ASCII characters are encoded in 1 byte, Chinese characters usually in 3 bytes, and some uncommon characters in 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4.)
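The byte counts above are easy to check. The sketch below picks four sample characters (my own choice, not from the original text) and prints how many bytes each takes in UTF-8 and in a two-byte-based encoding (UTF-16). It runs on both Python 2.7 and Python 3.

```python
# Sample characters: an ASCII letter, an accented Latin letter,
# a common Chinese character and a rare one outside the two-byte range.
samples = [u'A', u'\u00e9', u'\u4e2d', u'\U00020000']

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
utf16_lengths = [len(ch.encode('utf-16-le')) for ch in samples]

print(utf8_lengths)    # [1, 2, 3, 4]
print(utf16_lengths)   # [2, 2, 2, 4]
```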
In memory, programs generally hold text as Unicode.
In Python, the recommendation is to use unicode inside the program and UTF-8 for saving and reading files: decode from UTF-8 after reading a file from disk, and encode to UTF-8 before writing one (for decode and encode, see section 4).
2. Encoding declaration
By default, Python 2.7 interprets source files as ASCII.
If the source file contains non-ASCII characters and no encoding is declared at the top of the file (on the first or second line), a SyntaxError is raised.
Declaring utf8 tells the interpreter to read the source code as UTF-8. With the declaration in place, Chinese characters in the source file no longer cause an error.

    # encoding=utf8
    # Without the line above, this file fails with a SyntaxError.
    # The string says: "the interpreter uses the declared encoding to read the Python code".
    print '解释器用相应的编码来解释 python 代码'
3. str and unicode in Python 2.7
When debugging, you will find that strings in Python 2.7 come in two types: unicode and str.
str is a byte string: text that has been converted to bytes using some encoding. There is no fixed one-to-one correspondence between characters and bytes.
unicode is a string of Unicode characters. Each element corresponds to one character, one to one (internally, CPython 2.7 stores each element in two or four bytes, depending on how the interpreter was built).
A plain string literal has type str. It is a byte string, encoded according to the encoding declared at the top of the source file.
A literal prefixed with u has type unicode and holds Unicode characters directly.

    s1 = '字节串'
    print type(s1)  # <type 'str'>: bytes in the encoding declared at the top of the file
    print len(s1)   # 9: the file is UTF-8, each Chinese character takes three bytes
    s2 = u'统一码'
    print type(s2)  # <type 'unicode'>
    print len(s2)   # 3: unicode counts characters
    # From this perspective, unicode is the true string type.
4. encode and decode in Python 2.7
encode: encodes the unicode type to obtain the str type. That is, unicode -> encode (with the specified encoding) -> str.
decode: decodes the str type to obtain the unicode type. That is, str -> decode (with the specified encoding) -> unicode.
Note: always specify the encoding for both encode and decode. If it is omitted, Python 2.7 falls back to the default system encoding (see section 5).
The reason is that a conversion involves two representations, the one you start from and the one you end with. One of them is always unicode, so only the other has to be named. This holds for decoding as well as for encoding.
Both methods convert between unicode and str using the specified encoding.

    s3 = u'统一码'.encode('utf8')
    print type(s3)  # <type 'str'>
    s4 = '字节串'.decode('utf8')
    print type(s4)  # <type 'unicode'>
Improper use of encode (not recommended): calling encode on a str. encode needs a unicode object, so Python first decodes the str to unicode using the default system encoding and then encodes the result with the encoding you specified. (Note that the system encoding here is not the encoding declared at the top of the file. See section 5.)
Improper use of decode: calling decode on a unicode object. Python first encodes it to str with the default system encoding, which raises an error as soon as the string contains non-ASCII characters.
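The hidden step in the improper use of encode can be written out explicitly. The sketch below assumes the default system encoding is ascii (the stock Python 2.7 setting) and spells the implicit decode out by hand, so it behaves the same on Python 2.7 and Python 3. The sample string is illustrative.

```python
# UTF-8 bytes of a three-character Chinese string (illustrative sample).
raw = u'\u5b57\u8282\u4e32'.encode('utf-8')

# What Python 2.7 effectively does for raw.encode('utf-8') when the
# default system encoding is ascii (an assumption, see the text above):
try:
    raw.decode('ascii').encode('utf-8')
    outcome = 'ok'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

# Naming the real encoding in the decode step makes the round trip work.
round_trip = raw.decode('utf-8').encode('utf-8')

print(outcome)             # UnicodeDecodeError
print(round_trip == raw)   # True
```

This is why an encode call can surprisingly fail with a decode error: the failure comes from the implicit first step, not from the encoding you asked for.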
5. Changing the default system encoding
Python 2.7's default system encoding is ascii. It can be changed when needed.
This encoding differs from the one declared at the top of the file. The declaration describes the encoding of the source file's content, while the default system encoding is what some Python operations fall back on. Examples are the implicit decode when encode is called on a str, and the implicit encode when unicode is written to a file (see section 7 for details).
Import sysreload (sys) sys. setdefaultencoding ('utf8') s = 'byte string str's. encode ('utf8') # equivalent to s. decode (system code ). encode ('utf8 ')
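Before changing the default it can help to look at what it currently is. The sketch below uses sys.getdefaultencoding() for that, and also shows that a conversion which names its encoding explicitly does not depend on the default at all. The sample string is illustrative, and the code runs on both Python 2.7 and Python 3.

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)   # ascii on a stock Python 2.7, utf-8 on Python 3

# An explicit encoding gives the same result whatever the default is.
text = u'\u5b57\u8282\u4e32'
data = text.encode('utf-8')
print(len(data))          # 9
```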
6. Checking a file's encoding

    import chardet

    def detect_encoding(filename):
        # Read the raw bytes and let chardet guess their encoding.
        with open(filename, 'rb') as f:
            data = f.read()
        return chardet.detect(data)
7. Reading and writing files (wordy, but very important)
First of all, remember that files are read and written as str, that is, as bytes.
The built-in open returns str (bytes) when reading a file. To get the correct unicode you must decode with the correct encoding, so you need to know the encoding of the file.
Writing works the same way. What gets written is a str, that is, bytes in some encoding. Be sure to use the right encoding; generally, the text is encoded to UTF-8 before it is written.
If you pass unicode to write, Python encodes it to str using the default system encoding before writing, because writing to a file requires str. A str is written as is; anything else is converted to str first.
The simple rule: with the built-in open, write str whenever possible. That avoids the default encoding, so you do not have to change it.
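The rule for the built-in open can be sketched as follows. The file name and sample text are hypothetical, and the files are opened in binary mode so that the example handles plain bytes on both Python 2.7 and Python 3.

```python
import os
import shutil
import tempfile

text = u'\u5b57\u8282\u4e32'              # unicode used inside the program
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'demo.txt')   # hypothetical scratch file

# Write: encode explicitly, then hand the bytes to the file.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Read: get bytes back, then decode explicitly.
with open(path, 'rb') as f:
    raw = f.read()
decoded = raw.decode('utf-8')

shutil.rmtree(tmpdir)                     # clean up the scratch directory

print(len(raw))          # 9
print(decoded == text)   # True
```

Because both conversions name UTF-8 explicitly, the default system encoding never comes into play.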
The open function in the codecs module accepts an encoding. It guarantees that the bytes read and written are in the specified encoding.
When reading a file, the bytes read are decoded to unicode using the specified encoding.
When writing a file: unicode is encoded to str with the specified encoding. A str is first decoded to unicode with the default system encoding, and the result is then encoded to str with the specified encoding.
The simple rule: with codecs.open, use unicode whenever possible. That avoids the default encoding, so you do not have to change it.
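A minimal sketch of the codecs.open variant follows, again with a hypothetical scratch file and sample text. Here unicode goes in and comes out, and the module does the encoding and decoding. The code runs on both Python 2.7 and Python 3.

```python
import codecs
import os
import shutil
import tempfile

text = u'\u7edf\u4e00\u7801'              # illustrative three-character string
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'demo.txt')   # hypothetical scratch file

with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(text)            # unicode in, UTF-8 bytes on disk

with codecs.open(path, 'r', encoding='utf-8') as f:
    loaded = f.read()        # UTF-8 bytes on disk, unicode out

size_on_disk = os.path.getsize(path)
shutil.rmtree(tmpdir)        # clean up the scratch directory

print(loaded == text)    # True
print(size_on_disk)      # 9
```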
Note: For other ways of reading and writing files, you need to debug and check for yourself which type you get. For example, when I read an Excel file in Python, the values came back as unicode rather than str.
8. General handling points
(1) First, set both the source file encoding and the default system encoding to utf8.
(2) Use the unicode type throughout program execution.
(3) Reading and writing files with the built-in open gives and takes str, so decode after reading and encode before writing.
Summary:
Set the default encoding to utf8;
Reading a file yields str: str -> decode('utf8') -> unicode
Program processing: unicode
Writing a file: unicode -> encode('utf8') -> str, then write the str
Of course, all of this presumes that every file is in UTF-8, including the source files and the data files being read and written.
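The three steps of the summary can be put together in one small function. This is only a sketch under stated assumptions: clean_file is a hypothetical helper, the whitespace-collapsing regular expression stands in for whatever processing a real program does, and all files are assumed to be UTF-8. The code runs on both Python 2.7 and Python 3.

```python
import os
import re
import shutil
import tempfile

def clean_file(src, dst, encoding='utf-8'):
    """Hypothetical helper: read bytes, process as unicode, write bytes."""
    with open(src, 'rb') as f:
        text = f.read().decode(encoding)       # str -> decode -> unicode
    # Processing happens on unicode: collapse runs of ASCII and
    # ideographic spaces into a single ASCII space.
    text = re.sub(u'[\u3000 ]+', u' ', text)
    with open(dst, 'wb') as f:
        f.write(text.encode(encoding))         # unicode -> encode -> str
    return text

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'in.txt')           # hypothetical input file
dst = os.path.join(tmpdir, 'out.txt')          # hypothetical output file
with open(src, 'wb') as f:
    f.write(u'\u5b57\u3000\u3000\u8282 \u4e32'.encode('utf-8'))

result = clean_file(src, dst)
with open(dst, 'rb') as f:
    written = f.read()
shutil.rmtree(tmpdir)                          # clean up the scratch directory

print(result == u'\u5b57 \u8282 \u4e32')       # True
print(written == result.encode('utf-8'))       # True
```

The regular expression works here for the reason given at the start of the article: both the pattern and the text are unicode.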
A few more remarks:
Using unicode throughout the program is only a recommendation. When you run into an encoding problem, check whether unicode is being used consistently. (The problem at the start of this article was exactly such a case.)
Making everything unicode can be cumbersome. An alternative is to keep str encoded in UTF-8 most of the time and convert to unicode only where unicode is required.
In fact, once the ideas above are clear, any encoding problem can be tracked down.
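For the fallback mentioned above (keep UTF-8 str by default and convert only where unicode is needed), a pair of small conversion helpers can keep the conversions in one place. Both function names are hypothetical, and UTF-8 is assumed as the byte encoding. The code runs on both Python 2.7 and Python 3.

```python
def to_unicode(value, encoding='utf-8'):
    """Hypothetical helper: return unicode, decoding byte strings if needed."""
    if isinstance(value, bytes):       # bytes is an alias of str on Python 2.7
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: return a byte string, encoding unicode if needed."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

print(to_unicode(b'\xe4\xb8\xad') == u'\u4e2d')   # True
print(to_bytes(u'\u4e2d') == b'\xe4\xb8\xad')     # True
```

Calling to_unicode right before a regular expression search and to_bytes right before a write keeps the rest of the program free of explicit encode and decode calls.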