Transferred from: http://www.cnblogs.com/evening/archive/2012/04/19/2457440.html
Common string encodings include UTF-8, GB2312, CP936, GBK, and so on.
In Python, we use decode() and encode() to decode and encode strings.
In Python 2, the unicode type serves as the underlying type for encoding; that is:
        decode                encode
str -----------> unicode -----------> str
u = u'中文'                  # explicitly create a unicode object u
str = u.encode('gb2312')    # encode the unicode object u with gb2312
str1 = u.encode('gbk')      # encode the unicode object u with gbk
str2 = u.encode('utf-8')    # encode the unicode object u with utf-8
u1 = str.decode('gb2312')   # decode the string str with gb2312 to get unicode back
u2 = str.decode('utf-8')    # decoding str with utf-8 cannot restore the original unicode object
As the code above shows, str, str1, and str2 are all of the string type (str) despite holding different encodings, which makes string manipulation considerably more complex.
The good news is Python 3. In Python 3, the unicode type is removed and replaced by a string type (str) that holds Unicode characters; the string type (str) becomes the underlying type, as shown below, and encoded results become the bytes type. The usage of the two functions does not change:
          decode                         encode
bytes -----------> str (unicode) -----------> bytes
u = '中文'                  # a str object
str = u.encode('gb2312')   # encode u with gb2312, getting a bytes object
u1 = str.decode('gb2312')  # decode the bytes object str with gb2312, getting a str object u1
u2 = str.decode('utf-8')   # decoding str with utf-8 will not restore the original string content
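To see concretely what succeeds and what fails, here is a minimal runnable Python 3 sketch; the sample text is chosen only for illustration:

# Round-trip through gb2312, then try the wrong codec.
u = '中文'
data = u.encode('gb2312')            # a bytes object
print(u == data.decode('gb2312'))    # True: the matching codec round-trips cleanly
try:
    data.decode('utf-8')             # wrong codec for these bytes
except UnicodeDecodeError as e:
    print('decoding failed:', e)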
An unavoidable problem is reading files: the encoding a file was saved with determines how we must decode the content we read from it. For example, create a new text file test.txt in Notepad and edit its content; when saving, note that the encoding format is selectable. Suppose we choose gb2312. We can then read the file's contents in Python as follows:
f = open('test.txt', 'r')
s = f.read()               # read the file content; if the encoding is unrecognizable
                           # (what is recognizable depends on the system), the read fails here
# assuming the file was saved with gb2312 encoding
u = s.decode('gb2312')     # decode the content with the format the file was saved in, getting a unicode string
# the content can now be converted into various encodings
str = u.encode('utf-8')    # convert to a utf-8 encoded string str
str1 = u.encode('gbk')     # convert to a gbk encoded string str1
str2 = u.encode('utf-16')  # convert to a utf-16 encoded string str2
Python also provides the codecs package for reading files; the open() function in this package lets you specify the encoding type:
import codecs
f = codecs.open('text.text', 'r+', encoding='utf-8')  # you must know the file's encoding in advance; here it is utf-8
content = f.read()  # if the encoding passed to open() does not match the file's own encoding, an error is raised
f.write('The message you want to write')
f.close()
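In Python 3 the codecs module is no longer needed for this, because the built-in open() accepts an encoding argument directly. A minimal sketch, assuming a file test.txt saved as UTF-8 (out.txt is a hypothetical output file):

# Built-in open() with an explicit encoding (Python 3).
with open('test.txt', 'r', encoding='utf-8') as f:  # must still match the file's real encoding
    content = f.read()
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(content)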
In addition, to solve the problem of Chinese characters appearing in regular expressions:
1. Open the file:
myfile = codecs.open("right.html", "r")
There is no need to set an encoding here.
2. Set the encoding format:
str = myfile.read()
content = str.replace("\n", "")
content = content.decode('utf-8', 'ignore')  # decode into unicode with utf-8; if a character cannot be converted, ignore it instead of forcing it
3. The regular expression:
regex3 = regex3.decode('utf-8', 'ignore')  # the regex is likewise decoded into unicode with utf-8
Then we can compile and call the regex:
p = re.compile(regex3)
results = p.findall(content)
(A Python 3 version of the same task is sketched below.)
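The steps above are Python 2 style. In Python 3 the decoding dance disappears, because both the pattern and the text are already unicode str objects. A minimal sketch; the sample text and pattern are illustrative assumptions, not from the original post:

import re

content = '编码问题 encoding 字符集'        # already unicode in Python 3
pattern = re.compile('[\u4e00-\u9fa5]+')  # match runs of common Chinese characters
print(pattern.findall(content))           # ['编码问题', '字符集']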
Knowledge point: encoding formats. Below is an introduction to file encoding formats (also transferred):
File encoding format
In terms of encoding, files can be divided into two kinds: ASCII files and binary files.
ASCII files are also called text files. They are stored on disk with one byte per character, holding that character's ASCII code. For example, the number 5678 is stored in the following form:

ASCII code:  00110101 00110110 00110111 00111000
                 ↓        ↓        ↓        ↓
Decimal:         5        6        7        8

for a total of 4 bytes. An ASCII file can be displayed on screen character by character; a source program file, for instance, is an ASCII file, and the DOS command type can display its contents. Because it is displayed character by character, a person can read the file's contents.
Binary files store data in its binary encoded form. For example, the number 5678 is stored as 00010110 00101110, only two bytes. Binary files can also be displayed on screen, but their contents are unreadable. When processing these files, the C system does not distinguish between types; everything is treated as a character stream and processed in bytes. The start and end of an input or output character stream are controlled only by the program, not by physical markers such as carriage returns, so such files are also called "stream files".
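The size difference is easy to reproduce; a small Python sketch (not from the original article) storing the number 5678 both ways:

import struct

text_form = b'5678'                         # text-file form: one byte per digit, 4 bytes
binary_form = struct.pack('>H', 5678)       # binary form: one 16-bit unsigned integer, 2 bytes
print(len(text_form), text_form)            # 4 b'5678'
print(len(binary_form), binary_form.hex())  # 2 162e, i.e. 00010110 00101110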
-
Question one:
-
Using "Save As" in Windows Notepad, you can convert among four encodings: GBK, Unicode, Unicode big endian, and UTF-8. For the same TXT file, how does Windows recognize which encoding was used?

I discovered some time ago that TXT files saved as Unicode, Unicode big endian, and UTF-8 have a few extra bytes at the beginning: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these marks based on?
-
Question two:
-
I recently saw ConvertUTF.c on the internet, which implements conversion among the three encodings UTF-32, UTF-16, and UTF-8. I already knew about the encodings Unicode (UCS-2), GBK, and UTF-8, but this program left me somewhat confused: I could not remember what relation UTF-16 has to UCS-2.
After looking up the relevant material I finally cleared up these questions, and along the way learned some details of Unicode. I wrote it up to share with friends who have similar questions. I have tried to make this article easy to understand, but readers are expected to know what a byte is and what hexadecimal is.
0. Big endian and Little endian
Big endian and little endian are different ways for a CPU to handle multi-byte numbers. For example, the Unicode encoding of the character "汉" is 6C49. When writing it to a file, do you write 6C first or 49 first? If 6C is written first, it is big endian; if 49 is written first, it is little endian.
The word "endian" is derived from Gulliver's Travels. The civil war in the small country stems from eating eggs is whether from the Big Head (Big-endian) or from the head (Little-endian) knocked Open, which has happened six times rebellion, one of the Emperor sent life, the other lost the throne.
We generally translate "endian" as "byte order", and call big endian and little endian the "big-tail" and "little-tail" orders.
1. Character encodings, internal codes, and, in passing, Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII encoding; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7,445 characters in total, including 6,763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area runs from B0 to F7 in the high byte and from A1 to FE in the low byte, occupying 72*94 = 6,768 code points, of which 5 (D7FA-D7FE) are vacant.
GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21,003 characters. GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters as well as the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur. PC platforms are now required to support GB18030, while embedded products are exempt, so mobile phones and MP3 players generally support only GB2312.
From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly. The way to distinguish a Chinese-character encoding is that the highest bit of the high byte is not 0. As programmers put it, GB2312, GBK, and GB18030 all belong to the double-byte character sets (DBCS).
The default internal code of some Chinese versions of Windows is still GBK; it can be upgraded to GB18030 through the GB18030 upgrade package. But the characters GB18030 adds relative to GBK are ones ordinary people rarely use, so we usually still use GBK to refer to the internal code of Chinese Windows.
Here are some details:
The original form of GB2312 is the region-position (qu-wei) code; to get from the region-position code to the internal code, add A0 to the high byte and to the low byte respectively.
In DBCS, the storage format of GB internal codes is always big endian, i.e., the high byte comes first.
Both bytes of GB2312 have their highest bit set to 1, but only 128*128 = 16,384 code points satisfy that condition, so the highest bit of the low byte in GBK and GB18030 may not be 1. This does not affect the parsing of a DBCS character stream: when reading one, as soon as you encounter a byte whose high bit is 1, you can decode the next two bytes as one double-byte character, without having to check what the low byte's high bit is.
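A small Python sketch (not from the original article) of the region-position rule; the character 啊, at region 16 position 1, is the first character of the GB2312 hanzi area:

# Region-position (qu-wei) code -> GB2312 internal code: add 0xA0 to each byte.
qu, wei = 16, 1                    # 啊 sits at region 16, position 1
internal = bytes((qu + 0xA0, wei + 0xA0))
print(internal.hex())              # b0a1
print(internal.decode('gb2312'))   # 啊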
2. Unicode, UCS, and UTF
As mentioned before, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is incompatible with the GB codes. For example, the Unicode encoding of the character "汉" is 6C49, while its GB code is BABA.
Unicode is also a character encoding method, but it was designed by international organizations and can accommodate encoding schemes for all the languages of the world. The formal name of the Unicode encoding scheme is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be seen as an abbreviation of "Unicode Character Set".
According to Wikipedia (http://zh.wikipedia.org/wiki/), historically two organizations tried independently to design a universal character set: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991, both sides recognized that the world did not need two incompatible character sets, so they began to merge their work and cooperate toward a single code table. Starting with Unicode 2.0, the Unicode project uses the same character repertoire and code points as ISO 10646-1.
Both projects still exist today and publish their standards independently. The Unicode Consortium's current version is Unicode 4.1.0 (2005); the newest ISO standard is ISO 10646-3:2003.
UCS specifies how multiple bytes are used to represent the various characters; how these encodings are transmitted is specified by the UTF (UCS Transformation Format) specifications. Common UTF specifications include UTF-8, UTF-7, and UTF-16.
The IETF's RFC 2781 and RFC 3629, in the consistent style of RFCs, describe the UTF-16 and UTF-8 encodings clearly, crisply, yet rigorously. I can never quite remember that IETF stands for Internet Engineering Task Force, but the RFCs the IETF maintains are the basis of all specifications on the Internet.
3. UCS-2, UCS-4, and the BMP
UCS comes in two forms: UCS-2 and UCS-4. As the names imply, UCS-2 encodes with two bytes and UCS-4 with four bytes (actually only 31 bits are used; the highest bit must be 0). Let's play some simple number games:
UCS-2 has 2^16 = 65,536 code points; UCS-4 has 2^31 = 2,147,483,648 code points.
UCS-4 is divided into 2^7 = 128 groups according to its highest byte (whose most significant bit is 0). Each group is divided into 256 planes according to the next-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Of course, cells in the same row differ only in their last byte; the rest is identical.
Plane 0 of group 0 is known as the Basic Multilingual Plane, or BMP. In other words, the code points of UCS-4 whose top two bytes are 0 constitute the BMP.
Removing the two leading zero bytes from the BMP of UCS-4 yields UCS-2; adding two zero bytes in front of a UCS-2 code yields the corresponding BMP code point of UCS-4. No characters are allocated outside the BMP in the current UCS-4 specification.
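Python's UTF-32 and UTF-16 codecs make this zero-padding visible for BMP characters; a quick illustrative check with 汉 (U+6C49):

ch = '\u6c49'                        # the character 汉
print(ch.encode('utf-32-be').hex())  # 00006c49: the UCS-4/BMP form, two leading zero bytes
print(ch.encode('utf-16-be').hex())  # 6c49: the UCS-2 form, the same two low bytes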
4. UTF encodings
UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 encoding (hexadecimal) | UTF-8 byte stream (binary)
0000-007F                    | 0xxxxxxx
0080-07FF                    | 110xxxxx 10xxxxxx
0800-FFFF                    | 1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode encoding of the character "汉" is 6C49. Since 6C49 falls between 0800 and FFFF, the three-byte template 1110xxxx 10xxxxxx 10xxxxxx must be used. Writing 6C49 in binary gives 0110 110001 001001 (split 4-6-6 to match the template); substituting this bit stream for the x's of the template in order gives 11100110 10110001 10001001, that is, E6 B1 89.
Readers can use Notepad to test whether our encoding is correct.
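Besides Notepad, Python 3 can confirm the result directly:

print('\u6c49')                        # 汉
print('\u6c49'.encode('utf-8').hex())  # e6b189, i.e. E6 B1 89 as computed above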
UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding equals the 16-bit unsigned integer corresponding to the UCS code. For UCS codes at or above 0x10000 an algorithm is defined; however, since the codes actually used in UCS-2, i.e. the BMP of UCS-4, are necessarily below 0x10000, UTF-16 and UCS-2 can for now be considered basically the same. But UCS-2 is only an encoding scheme, while UTF-16 is used for actual transmission, so the question of byte order has to be considered.
5. UTF byte order and BOM
UTF-8 uses the byte as its encoding unit and has no byte-order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting UTF-16 text it is necessary to establish the byte order of each encoding unit. For example, the Unicode encoding of "奎" (kui) is 594E and the Unicode encoding of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?
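The ambiguity is easy to demonstrate in Python: the same two bytes decode to different characters depending on the assumed byte order:

data = b'\x59\x4e'               # the byte stream "59 4E"
print(data.decode('utf-16-be'))  # 奎 (U+594E) when read as big endian
print(data.decode('utf-16-le'))  # 乙 (U+4E59) when read as little endian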
The Unicode specification's recommended method of marking byte order is the BOM. BOM here is not the "Bill of Material" BOM but the Byte Order Mark. The BOM is a rather clever idea:
The UCS encoding includes a character called "ZERO WIDTH NO-BREAK SPACE", whose encoding is FEFF. FFFE, on the other hand, is not a character in UCS, so it should not appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.
Thus, if a receiver receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. This is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described above). So if a receiver receives a byte stream beginning with EF BB BF, it knows the stream is UTF-8 encoded.
Windows uses a BOM to mark the way a text file is encoded.
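A minimal BOM sniffer along these lines can be written with the constants in Python's codecs module; the file name below is a placeholder:

import codecs

def sniff_bom(path):
    # Compare the first bytes of the file against the known BOMs.
    with open(path, 'rb') as f:
        head = f.read(3)
    if head.startswith(codecs.BOM_UTF8):      # EF BB BF
        return 'UTF-8 with BOM'
    if head.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return 'UTF-16 big endian'
    if head.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return 'UTF-16 little endian'
    return 'no BOM'

print(sniff_bom('test.txt'))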
6. Further references
The main reference for this article is "A short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).
I also found two materials that looked good, but because I had already obtained the answers to my initial questions, I did not read them:
- "Understanding Unicode A General Introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?) SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER04A)
- "Character Set Encoding Basics Understanding Character Set encodings and Legacy encodings" (HTTP://SCRIPTS.SIL.ORG/CMS/SC Ripts/page.php?site_id=nrsi&item_id=iws-chapter03