(Reposted) Character encoding, and converting between UTF-8, GBK and GB2312 in Python with encode and decode
(http://www.cnblogs.com/jxzheng/p/5186490.html)
ASCII code
The standard ASCII code uses a 7-bit binary number to represent uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.
Because standard ASCII needs only 7 bits, the highest bit (b7) of a byte can be used as a parity bit. Parity checking is a way of detecting errors that occur while a code is being transmitted, and it generally comes in two kinds: odd parity and even parity. Odd parity rule: the number of 1s in a correct byte must be odd; if it is not, the highest bit b7 is set to 1. Even parity rule: the number of 1s in a correct byte must be even; if it is not, the highest bit b7 is set to 1.
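The two parity rules can be sketched in Python roughly as follows. This is only an illustration of the rules as stated above; the function name add_parity_bit is hypothetical and does not come from any library.

```python
def add_parity_bit(code, odd=True):
    """Return a byte value: a 7-bit ASCII code with the parity bit placed in b7."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit ASCII code")
    ones = bin(code).count("1")
    if odd:
        # Odd parity: set b7 only when the count of 1s is currently even.
        need_bit = ones % 2 == 0
    else:
        # Even parity: set b7 only when the count of 1s is currently odd.
        need_bit = ones % 2 == 1
    return code | 0x80 if need_bit else code


# 'a' is 0x61 = 0b1100001, which contains three 1 bits.
odd_byte = add_parity_bit(ord("a"), odd=True)
even_byte = add_parity_bit(ord("a"), odd=False)
print(hex(odd_byte), hex(even_byte))
```

With odd parity the byte for 'a' is left unchanged because it already has an odd number of 1s; with even parity b7 is set.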
The 128 codes that follow the standard set are called extended ASCII codes. Many x86-based systems support extended (or "high") ASCII. Extended ASCII uses the 8th bit of each character to define an additional 128 special symbols, foreign letters, and graphic symbols.
Common ASCII code values:
Line feed LF is 0x0a, carriage return CR is 0x0d, space is 0x20, '0' is 0x30, 'A' is 0x41, 'a' is 0x61
A trick for looking up ASCII characters: create a new text document, hold down ALT and type the code value on the numeric keypad (note that the value is decimal), then release ALT to display the corresponding character. For example, holding ALT and typing 97 displays 'a'.
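The same lookups can be done in Python with the built-in ord and chr functions. A small sketch (the variable names are only for illustration):

```python
samples = {"LF": "\n", "CR": "\r", "space": " ", "'0'": "0", "'A'": "A", "'a'": "a"}

# ord() maps a character to its code value.
codes = {name: ord(ch) for name, ch in samples.items()}
for name, value in codes.items():
    print(f"{name:>5}: decimal {value:3d} = hex 0x{value:02x}")

# chr() goes the other way, much like typing ALT+97.
letter = chr(97)
print(letter)
```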
Extended ASCII Code
The extended ASCII codes are the characters numbered 128-255.
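One practical consequence is that a byte in the range 128-255 has no single meaning: it depends on which extended character set is assumed. The sketch below decodes the same byte under two such sets. The choice of latin-1 and cp437 is an assumption made for illustration; the article does not name specific code pages.

```python
high_byte = b"\xe9"  # value 233, outside the 7-bit standard ASCII range

as_latin1 = high_byte.decode("latin-1")  # ISO 8859-1
as_cp437 = high_byte.decode("cp437")     # original IBM PC code page

# Strict ASCII rejects any byte with the 8th bit set.
try:
    high_byte.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(as_latin1), repr(as_cp437), ascii_ok)
```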
Unicode encoding
Note: Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that binary code is stored.
The so-called "Unicode encoding" refers to the UCS encoding method, which stores a symbol's Unicode binary code directly.
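The difference between a symbol's code and the way it is stored can be seen in Python 3. In this sketch, utf-16-be is used as a stand-in for storing the code directly in two bytes, which is an assumption that only holds for characters in the range 0000-FFFF.

```python
ch = "中"

code_point = ord(ch)               # the Unicode code of the symbol
stored_be = ch.encode("utf-16-be") # the code written directly, high byte first
stored_le = ch.encode("utf-16-le") # the same code, low byte first
stored_utf8 = ch.encode("utf-8")   # a different storage scheme altogether

print(hex(code_point))    # 0x4e2d
print(stored_be.hex())    # 4e2d
print(stored_le.hex())    # 2d4e
print(stored_utf8.hex())  # e4b8ad
```

One code, three different byte sequences: Unicode fixes the number, and the encoding decides the bytes.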
UTF-8 encoding
UTF-8 is the most widely used form of Unicode implementation on the Internet.
UTF-8 is a variable-length encoding that uses 1-4 bytes to represent a symbol, choosing a byte representation of different lengths depending on the symbol.
The encoding rules of UTF-8 are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code. So for English letters, the UTF-8 encoding is identical to the ASCII code.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.
Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
With the table above, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character on its own. If the first bit is 1, the number of consecutive leading 1s is the number of bytes the current character occupies.
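The two rules and the reading procedure can be sketched by hand in Python and compared against the built-in codec. The function names utf8_encode and sequence_length are hypothetical, and this sketch skips validation that a real codec performs (for example, it does not reject surrogate code points or stray continuation bytes).

```python
def utf8_encode(code_point):
    """Encode one Unicode code as UTF-8 by following the two rules."""
    if code_point <= 0x7F:
        return bytes([code_point])          # rule 1: 0xxxxxxx
    if code_point <= 0x7FF:
        n = 2
    elif code_point <= 0xFFFF:
        n = 3
    elif code_point <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(n - 1):
        # Each following byte is 10xxxxxx and carries 6 bits of the code.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # First byte: n leading 1 bits, then a 0, then the remaining high bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([lead_prefix | code_point] + tail[::-1])


def sequence_length(first_byte):
    """Count leading 1 bits of a first byte to get the character's byte length."""
    if first_byte & 0x80 == 0:
        return 1
    n = 0
    while first_byte & (0x80 >> n):
        n += 1
    return n


for ch in "Aé中😀":
    encoded = utf8_encode(ord(ch))
    print(ch, encoded.hex(), sequence_length(encoded[0]), encoded == ch.encode("utf-8"))
```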
How Unicode and UTF-8 are converted:
The simplest way to do this on Windows is to open the document in Notepad, choose Save As, and pick the target encoding in the dialog.
Solving garbled-text problems in Python
Python represents strings internally as Unicode, so converting between encodings goes through Unicode as the intermediate form: first decode the string from its original encoding to Unicode, then encode it from Unicode to the target encoding, such as UTF-8. An encoding is a way of representing abstract characters as binary data, and UTF-8 is one such encoding.
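The decode-then-encode route can be sketched in Python 3 like this. The sample text and the choice of GBK as the starting encoding are assumptions for illustration.

```python
# Bytes as they might arrive from a GBK-encoded source.
gbk_bytes = "中文".encode("gbk")

# Step 1: decode from the original encoding to Unicode text.
text = gbk_bytes.decode("gbk")

# Step 2: encode from Unicode text to the target encoding.
utf8_bytes = text.encode("utf-8")

print(gbk_bytes.hex())   # d6d0cec4
print(utf8_bytes.hex())  # e4b8ade69687
```

The same two characters take four bytes in GBK and six in UTF-8, which is why bytes cannot simply be relabelled from one encoding to another.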
By default, a string literal in source code has the same encoding as the source file itself.
unicode in Python 2 is equivalent to str in Python 3. You can inspect s.__class__: if it is <class 'str'>, s is Unicode text data; if it is <class 'bytes'>, s is encoded binary data, for example UTF-8. str(s, 'utf8') and s.decode('utf8') are equivalent.
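A short Python 3 check of the two classes and of the equivalence between str(s, 'utf8') and s.decode('utf8'). The sample text is an assumption for illustration.

```python
raw = "中文".encode("utf8")        # encoded binary data

via_constructor = str(raw, "utf8")  # decode through the str constructor
via_method = raw.decode("utf8")     # decode through the bytes method

print(raw.__class__)         # <class 'bytes'>
print(via_method.__class__)  # <class 'str'>
print(via_constructor == via_method)
```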
If a string is defined in code as s = u'中文', then s is already in Python's internal Unicode representation.
Calling decode on such a Unicode value raises an error.
To determine whether a string is Unicode in Python 2, use isinstance(s, unicode). Since unicode in Python 2 is equivalent to str in Python 3, the Python 3 check is isinstance(s, str).
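In Python 3 the isinstance check can be wrapped in a small helper that accepts either kind of value. The name to_text and its default encoding are hypothetical choices for this sketch.

```python
def to_text(value, encoding="utf8"):
    """Return Unicode text whether value is already str or still encoded bytes."""
    if isinstance(value, str):
        return value                   # already Unicode, nothing to decode
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)  # encoded data, decode it first
    raise TypeError("expected str or bytes")


print(to_text("中文"))
print(to_text("中文".encode("utf8")))

# In Python 3, str has no decode method at all, so decoding text is an error.
str_has_decode = hasattr("中文", "decode")
print(str_has_decode)
```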
Get system default encoding:
import sys
print(sys.getdefaultencoding())
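The default encoding is only one of several encodings in play. As a rough Python 3 sketch, the following also prints the encoding of standard output and the locale's preferred encoding, which are usually the ones that matter for garbled console output. The values printed depend on the machine.

```python
import locale
import sys

# Encoding Python uses for implicit str/bytes conversions.
default_encoding = sys.getdefaultencoding()

# Encoding of the console or pipe that print() writes to (may be missing).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the operating system locale prefers for text files.
preferred_encoding = locale.getpreferredencoding(False)

print(default_encoding, stdout_encoding, preferred_encoding)
```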
Some IDEs show garbled output because the console cannot display the string's encoding; this is not a problem with the program itself. For example, if the Windows console uses gb2312, output encoded as UTF-8 will not be displayed correctly.
One way to avoid garbled output is to convert the string to gb2312:
# coding=utf-8

s = '中文'

if isinstance(s, str):
    # s is Unicode text (u'中文'): encode it directly
    data = s.encode('gb2312')
else:
    # s is UTF-8 encoded bytes: decode to Unicode first, then encode
    data = s.decode('utf8').encode('gb2312')
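To see why a mismatch between the output encoding and the console encoding produces garbled text, the sketch below decodes UTF-8 bytes as if they were gb2312, which is roughly what such a console does. The exact garbled characters depend on the text, so none are shown here.

```python
text = "中文"
utf8_bytes = text.encode("utf8")
gb_bytes = text.encode("gb2312")

# A console that assumes gb2312 interprets UTF-8 output like this.
# errors="replace" substitutes a marker for byte sequences gb2312 cannot map.
garbled = utf8_bytes.decode("gb2312", errors="replace")

# Bytes produced in the console's own encoding decode back cleanly.
restored = gb_bytes.decode("gb2312")

print(repr(garbled), repr(restored))
```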
Using the standard library codecs module
codecs.open(filename, mode='r', encoding=None, errors='strict', buffering=1)
import codecs
f = codecs.open(filename, encoding='utf-8')
Reading a UTF-8 file this way automatically converts its contents to Unicode. However, you must know for certain that the file is UTF-8 encoded. If the file contains a Chinese character, a read does not return a single byte of it; all the bytes of the whole character are read and converted to Unicode (presumably this is related to how UTF-8 encodes Chinese characters).
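The behaviour described above, where a read returns whole characters rather than single bytes, can be checked with a temporary file. The file name and contents are assumptions for this sketch; note that recent Python 3 releases also accept an encoding argument in the built-in open, which serves the same purpose.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file: two Chinese characters (3 bytes each) plus "abc".
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")
with open(path, "wb") as f:
    f.write("中文abc".encode("utf-8"))

with codecs.open(path, encoding="utf-8") as f:
    first = f.read(1)  # asks for one character, not one byte
    rest = f.read()

print(os.path.getsize(path))  # size on disk in bytes
print(repr(first), repr(rest))
```

The first read returns the complete character even though it occupies three bytes in the file.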
The code below is another way to read and write using codecs:
# coding=utf-8
import codecs

# Open in binary mode: the codecs reader and writer do the decoding and encoding
fin = open("test.txt", 'rb')
fout = open("utf8.txt", 'wb')
reader = codecs.getreader('GBK')(fin)
writer = codecs.getwriter('GBK')(fout)

# 10 is the maximum number of bytes to read; the default is -1, which means
# as much as possible. Reading in small chunks avoids processing a large
# amount of data at once.
data = reader.read(10)
while data:
    writer.write(data)
    data = reader.read(10)

fin.close()
fout.close()
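The same reader and writer pattern can be used to convert between encodings by giving the writer a different codec. The sketch below converts GBK to UTF-8 entirely in memory using io.BytesIO, so it runs without any files; the sample text and the chunk size of 10 are assumptions for illustration.

```python
import codecs
import io

original = "编码转换 encode/decode"

source = io.BytesIO(original.encode("gbk"))  # stands in for a GBK file
target = io.BytesIO()                        # stands in for the output file

reader = codecs.getreader("gbk")(source)     # bytes in, Unicode text out
writer = codecs.getwriter("utf-8")(target)   # Unicode text in, bytes out

# Copy in small chunks; the reader keeps any incomplete multi-byte
# character pending until the rest of its bytes arrive.
chunk = reader.read(10)
while chunk:
    writer.write(chunk)
    chunk = reader.read(10)

result = target.getvalue().decode("utf-8")
print(result == original)
```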
Understanding Python string encoding (reposted)