Quick Start with Python Character Encoding
Preface
Even for people who are familiar with Python, handling character encoding completely and reliably is especially difficult.
This article focuses on Python 2.7. Encoding handling in Python 3 has been greatly improved, but the underlying principle is the same; only the commands change.
After reading this article, you should be able to handle text processing and platform-specific encoding problems (on Windows, for example) with ease.
Reading suggestions
This article consists of the following parts:
1. Principle
2. Operations
3. Recommended usage habits
4. Troubleshooting
If you want to know my usage habits, you can jump to the Recommended usage habits section.
If you only want to solve a problem, you can jump to the Troubleshooting section.
I hope this article will help you.
Principle
For ease of understanding, this section skips the theory and relies on an analogy. If you want to go deeper, look up the theory behind the individual encodings.
First, let's explain why we run into all kinds of encoding problems:
1. Because we do not use one uniform encoding
2. Because we do not use the correct command (to pass data)
Let's talk about what encoding is. Python encoding seems complicated, but in fact there are only two kinds: Unicode and binary.
1. Unicode is the familiar one; it looks like \u0000.
2. Binary encoding is also very simple; it looks like \x00\x00. The utf-8 and cp936 we usually see are both binary encodings.
3. Binary encoding is concrete, like 10001100, and can be stored as is, while Unicode is abstract and cannot be stored that way.
# coding=utf8
# Unicode encoding demo
print('Unicode:')
print(repr(u'Unicode encoding'))
# Binary encoding demo
print(u'binary encoding:')
print(repr('Unicode encoding'))
# Just look at the output. There is no need to dig into the code.
Also, we can only perform operations between values of the same kind of encoding.
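To make the abstract/concrete distinction visible, here is a minimal sketch. The sample text and variable names are my own assumptions, and I use explicit u'' and b'' prefixes so the block behaves the same on Python 2.7 and Python 3. It encodes one Unicode string into the two binary encodings mentioned above and shows that the stored bytes differ.

```python
# Sketch: one abstract Unicode string, two concrete binary encodings.
# The sample text is an illustrative assumption, not taken from the article.

text = u'\u4e2d\u6587'             # two abstract Unicode code points

as_utf8 = text.encode('utf-8')     # concrete bytes: 3 per character here
as_cp936 = text.encode('cp936')    # concrete bytes: 2 per character here

# The same text is stored as different byte sequences.
print(repr(as_utf8))
print(repr(as_cp936))

# Decoding with the matching codec recovers the same Unicode string.
from_utf8 = as_utf8.decode('utf-8')
from_cp936 = as_cp936.decode('cp936')
```

Both byte strings can be written to disk as they are; the Unicode string has to pass through one of them first.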
A simple analogy
Let's compare a string of data to a roast duck. Humans and ducks look at a roast duck very differently.
What we see is tonight's dinner; what the duck sees is one of its own kin.
Take the duck to a roast duck shop and an encoding error comes back,
because the duck sees a shop full of its kin where we see a menu.
The names we are familiar with, such as utf-8, unicode, and ucs-bom, are these different ways of looking at the same data.
This is the core of encoding and is very important.
Finally, let's talk about the Python environment.
1. The source code itself is decoded as ASCII. If the file contains anything that cannot be decoded as ASCII, you have to tell Python how to decode it.
2. A large number of built-in commands accept Unicode by default.
# The line that tells Python is the next one. If the file contains non-ASCII content and you delete it, an error is raised.
# coding=utf8
print(u'test code')
Operations
Without further ado, here is how to construct content in each kind of encoding:
# coding=utf8
# Adding u before a string constructs a Unicode string
unicodeString = u'unicode string'
# With nothing before the string, it is constructed in the default encoding (the first line sets it to utf8 here)
utf8String = 'utf-8 string'
# Of course, without the first line, the default encoding is ASCII
So how can they be converted? It's also very simple:
# Continuing from the previous program
# Convert Unicode to binary: utf8
unicodeString.encode('utf8')
# Convert binary encoding to Unicode according to its encoding type
utf8String.decode('utf8')
# If the binary data is mixed with something strange, you can use a special decode policy
print(repr('utf8 characters\x00 string'.decode('utf8', 'replace')))
So what happens when we operate on them:
# Continuing from the previous program
# Once both are converted to the same encoding, we can operate on them (add them, for example)
print(repr(unicodeString + utf8String.decode('utf8')))
print(repr(unicodeString.encode('utf8') + utf8String))
# Without conversion we are back at the roast duck shop:
# this raises UnicodeDecodeError when utf8String contains non-ASCII bytes
unicodeString + utf8String
# Encoding conversion requires us to tell the program how to do it
# Every decode operation produces Unicode, which makes the result easy to pass to the many built-in commands that accept Unicode
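The failure above only shows up when the binary string holds non-ASCII bytes, so here is a small sketch that forces it. The sample values and variable names are assumptions of mine, and the explicit u'' and b'' prefixes keep it runnable on both Python 2.7 and Python 3.

```python
# Sketch: mixing Unicode with binary data fails; converting first works.
# Sample values are illustrative assumptions.

text = u'\u4e2d'            # Unicode
data = b'\xe4\xb8\xad'      # the same character as UTF-8 bytes

try:
    mixed = text + data
    mix_failed = False
except (TypeError, UnicodeDecodeError):
    # Python 3 refuses outright with TypeError. Python 2.7 attempts an
    # implicit ASCII decode of the bytes and raises UnicodeDecodeError.
    mix_failed = True

# Convert one side explicitly so both operands are the same kind.
both_unicode = text + data.decode('utf-8')
both_bytes = text.encode('utf-8') + data

print(mix_failed)
print(repr(both_bytes))
```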
So we need to determine the encoding the program uses. This is exactly what we have to tell the program.
1. On the one hand, make sure both operands share the same encoding in string operations.
2. On the other hand, when calling commands you did not write yourself, generally pass Unicode, or use a command that accepts binary encoding.
# coding=utf8
# Example of writing a file
# Unicode
with open('Unicode.txt', 'w') as f:
    f.write(u'unicode test')
# Or use the command that accepts binary encoding
with open('Utf8.txt', 'wb') as f:
    f.write('utf8 test')
# You can try it the other way round; naturally, an error is reported
# Binary commands let you perform an operation (writing a file) without knowing how to decode the data
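As a hedged variation on the file example, the sketch below uses io.open, which is available in Python 2.7 and is the built-in open in Python 3, so the text-mode encoding is stated explicitly instead of left to the default. The use of io.open, the temporary folder, and the file names are my assumptions, not the article's original approach.

```python
import io
import os
import tempfile

# Sketch: text mode takes Unicode, binary mode takes bytes.
# File names and sample text are illustrative assumptions.
text = u'\u4e2d\u6587 test'
folder = tempfile.mkdtemp()
text_path = os.path.join(folder, 'unicode.txt')
bytes_path = os.path.join(folder, 'raw.bin')

# Text mode with an explicit encoding: pass Unicode, io.open encodes it.
with io.open(text_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Binary mode: pass bytes, nothing is encoded or decoded for you.
raw = text.encode('cp936')
with io.open(bytes_path, 'wb') as f:
    f.write(raw)

# Read both files back as raw bytes to see what was actually stored.
with io.open(text_path, 'rb') as f:
    stored_text_bytes = f.read()
with io.open(bytes_path, 'rb') as f:
    stored_raw = f.read()
```

The binary file keeps whatever bytes it was given, which is why binary mode is the safe choice when the encoding of the data is unknown.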
Recommended usage habits
I believe that by now you have a working understanding of encoding.
To recap, why do we run into all kinds of encoding problems?
1. Because we do not use one uniform encoding
2. Because we do not use the correct command (to pass data)
So here I want to reiterate my mantra: determine the encoding, and interact only between like kinds.
1. Whenever you hit a problem, ask yourself: which encoding am I using?
2. Then ask: which encoding should both sides share in order to interact?
Here are my usage habits:
1. Determine one internal encoding.
2. Choose the internal encoding in this order of priority: the encoding the program requires, the encoding used by third-party packages, the encoding you like, and then Unicode.
3. Convert to a specific encoding only at output.
Remember to settle on the internal encoding before you start writing the program. Otherwise, a mess of encodings will produce many unnecessary bugs.
Do not be dogmatic about using Unicode internally. For example, when developing against Evernote, the internal encoding should be UTF-8, because that is what the third-party package uses.
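The habits above can be sketched as a pair of boundary helpers. This is only an illustration under my own assumptions: the names INTERNAL, to_internal, and to_output are hypothetical, and UTF-8 is picked as the internal encoding purely to mirror the Evernote example.

```python
# Sketch: convert at the edges, keep one encoding inside the program.
# INTERNAL, to_internal and to_output are hypothetical names.

INTERNAL = 'utf-8'   # assumed internal encoding for this sketch


def to_internal(data, source_encoding):
    """Re-encode incoming bytes into the internal encoding."""
    return data.decode(source_encoding).encode(INTERNAL)


def to_output(data, target_encoding):
    """Re-encode internal bytes into whatever the output requires."""
    return data.decode(INTERNAL).encode(target_encoding)


incoming = b'\xd6\xd0\xce\xc4'             # sample input encoded as cp936
internal = to_internal(incoming, 'cp936')  # everything inside is UTF-8
outgoing = to_output(internal, 'cp936')    # converted back only at output

print(repr(internal))
print(repr(outgoing))
```

Both helpers pass through Unicode on the way, which is the only place the two binary encodings can meet.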
Troubleshooting
Encoding Recognition
We need to determine the encoding, but how do we do that when all we have is a string of binary data?
The simplest method is chardet (installation required):
python -m pip install chardet
Easy to use:
# coding=utf8
from chardet import detect
print(detect('this is a string of utf8 test characters'))
# result (for a UTF-8 string containing non-ASCII characters): {'confidence': 0.99, 'encoding': 'utf-8'}
Also, if you are scraping a website, the response headers may tell you how to decode the content. Do not forget to check them.
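Reading the charset out of a header can be sketched with the standard library alone. The helper name charset_from_content_type and the sample header values are my assumptions; Message.get_content_charset comes from the standard email package, which exists in both Python 2.7 and Python 3.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset named in a Content-Type header value, if any.

    Hypothetical helper for illustration; it only parses the header
    string and does not fetch anything.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


# Sample header value, assumed for the sketch.
charset = charset_from_content_type('text/html; charset=GBK')
body = b'\xd6\xd0\xce\xc4'
decoded = body.decode(charset)

print(charset)
```

When the header names no charset, the helper returns the default, and a detector such as chardet is the fallback.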
Encoding conversion
Even when the encoding type is correct, the string may well contain strange things that make decoding fail.
I know I covered this earlier, but some readers will have jumped straight here for an answer.
You can use the second parameter of decode:
# coding=utf8
# A utf8 string containing \x00
rubbishuf8string = 'utf-8 characters\x00 character string'
print(repr(rubbishuf8string.decode('utf8', 'replace')))
print(repr(rubbishuf8string.decode('utf8', 'ignore')))
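One caveat: \x00 on its own is a valid UTF-8 byte, so it decodes without complaint. To actually see the two policies at work, the sketch below uses \xff, which can never appear in valid UTF-8. The sample bytes and variable names are my assumptions, and the explicit b'' prefix keeps the block runnable on Python 2.7 and Python 3.

```python
# Sketch: strict decoding fails on an invalid byte; 'replace' and
# 'ignore' recover in different ways. Sample bytes are assumptions.

broken = b'utf-8 \xff string'   # \xff is never valid in UTF-8

try:
    broken.decode('utf-8')      # the default policy is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for the byte that could not be decoded.
replaced = broken.decode('utf-8', 'replace')

# 'ignore' silently drops the byte that could not be decoded.
ignored = broken.decode('utf-8', 'ignore')

# By contrast, a NUL byte is valid UTF-8 and decodes as is.
nul_ok = b'a\x00b'.decode('utf-8')

print(strict_failed)
```

Use 'replace' when you want the damage to stay visible in the result, and 'ignore' when you only care about the text that survived.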
Encoding on special platforms
Many people say that Windows is full of pitfalls, even under Python 3,
because Chinese file names come out garbled.
Here is a trick: however unusual the platform encoding is, reading from the command line and creating folders at least do not produce garbled characters, so we can borrow the encoding the command line uses.
import sys, os
for folder in os.walk('.').next()[1]:
    print(folder.decode(sys.stdin.encoding))
Input and output can be wrapped in the same way:
import sys
def sys_print(msg):
    print(msg.encode(sys.stdin.encoding))
def sys_input(msg):
    return raw_input(msg.encode(sys.stdin.encoding)).decode(sys.stdin.encoding)
File writing
What if you do not know how to decode the content you fetched, but still want to write it to a file?
When writing the file, use the binary command:
# coding=utf8
import urllib
with open('Utf8.txt', 'wb') as f:
    f.write('utf8 test')
# For example, if you fetch a web page and do not know its encoding, you can still write it to a file for later processing
content = urllib.urlopen('http://www.baidu.com').read()
with open('baidu.txt', 'wb') as f:
    f.write(content)
Bare Unicode characters
What if a Unicode character has been saved as six ASCII characters (such as \u0000)? You can simply decode it:
# coding=utf8
# This is an ordinary Unicode string
s = u'test'
for i in s: print(i)
print(repr(s))
# This is bare Unicode: each non-ASCII character is actually saved as six ASCII characters
s = repr(s)[2:-1]
for i in s: print(i)
print(repr(s))
# The conversion is actually quite simple
s = s.decode('unicode-escape')
for i in s: print(i)
print(repr(s))
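The six-ASCII-characters effect only appears with non-ASCII text, so here is a short sketch that makes it visible. The sample string and variable names are my assumptions. The unicode-escape codec is part of the standard library in both Python 2.7 and Python 3.

```python
# Sketch: round trip between real Unicode and "bare" escaped ASCII.
# The sample text is an illustrative assumption.

text = u'\u4e2d\u6587'                    # two real Unicode characters

# Each character becomes a backslash, a 'u', and four hex digits.
bare = text.encode('unicode-escape')

# Decoding with the same codec turns the escapes back into characters.
restored = bare.decode('unicode-escape')

print(repr(bare))
print(len(bare))
```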
Summary
That covers everything in this walkthrough of Python character encoding. I hope reading this article helps you. If you find any shortcomings, corrections are welcome.