In the Python language, Unicode string handling has always been a confusing problem, and many Python enthusiasts have trouble figuring out the difference between Unicode, UTF-8, and the many other encodings. I used to be a member of this "nerve-wracked" crowd myself, but after more than half a year of effort I have finally sorted out some of the relationships. I have organized them below to share with fellow Python users, and I also hope this short essay will attract more real experts to join in and improve the Chinese-language environment of Python together.
Some of the points in this article are based on material I have read, and some were obtained by "guess and verify". My learning is shallow, and I am afraid many mistakes are hidden in the text. If any reader who knows better finds an error, I sincerely hope you will be generous enough to point it out. My own embarrassment is a small matter; misleading others with a wrong viewpoint is a big one, so please do not spare my feelings when questioning anything here.
Section I: Text encoding and the Unicode standard
To understand Unicode strings, we must first start with what Unicode encoding is. As we all know, displaying text has always been a basic problem that a computer's display functions must solve. A computer is not literate; it really treats text as a collection of "pictures", with each "picture" corresponding to one character. When a program displays text, it must use a collection that records how each "picture" is drawn, look up the "picture" data for each character, and draw that "picture" onto the screen. Such a "picture" is called a glyph (font bitmap), and the collection that records the glyph data is called a character set. To make programs easier to write, the glyph data of every character in a character set must be arranged in order, and every character is assigned a unique ID; this ID is the character's encoding. When the computer processes character data, it always uses this encoding to stand for the character it represents. A character set therefore specifies the set of character data a computer can handle. Obviously, different countries define character sets of different sizes, and the corresponding character encodings differ as well.
In the history of computing, the most widely used standardized character set is the ASCII character set. It is actually a standard developed in the United States for North American users. It uses a 7-bit encoding and can represent 128 characters. This character set was eventually adopted formally by ISO as an international standard and is used in all kinds of computer systems. Even today the BIOS of every PC contains a font for the ASCII character set, which shows how deeply rooted it is.
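To make the idea of "a character is just a number" concrete, here is a tiny Python illustration of my own: in the ASCII character set the letter "A" is assigned the code 65, and Python can move between a character and its code directly:

print ord('A')    # the ASCII code of 'A' is 65
print chr(65)     # code 65 maps back to the character 'A'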
However, the limitations of ASCII were exposed as computers became widespread in other countries: its character space is limited and cannot hold more characters, yet most languages need far more than 128 characters. To handle native languages correctly, official bodies and private groups in various countries began designing character encodings for their own scripts, and in the end a large number of language-specific character encodings emerged, such as the ISO-8859-1 encoding for Western European characters, the GB series of encodings for Simplified Chinese, the Shift-JIS encoding for Japanese, and so on. At the same time, to guarantee that each new character set stays compatible with existing ASCII text, most character sets use the ASCII characters as their first 128 characters and keep their ASCII codes unchanged.
In this way the display of national scripts was solved, but a new problem appeared: garbled text. Character sets used in different countries and regions were usually not constrained by a common specification, so the encodings of different character sets are often incompatible. The same character has different codes in two different character sets, and the same code stands for different characters in different character sets. A piece of text written in encoding A is often displayed as a jumble of characters on a system that only supports encoding B. Worse still, different character sets use different encoding lengths, and programs that can only handle single-byte encodings often suffer from the notorious "half character" problem when they meet double-byte or multi-byte text. This made an already chaotic situation even messier.
To solve these problems once and for all, many large companies and organizations in the industry jointly proposed a standard: Unicode. Unicode is in fact a new character encoding system. It assigns every character in the character set an ID number two bytes long, which provides an encoding space that can hold up to 65,536 characters and covers the characters of all the national encodings commonly used around the world today. Because the encoding was designed carefully, Unicode solves the garbled-text and "half character" problems that other character sets suffer from during data exchange. At the same time, considering that a huge amount of glyph data still uses the encodings defined by individual countries, the Unicode designers put forward the design idea of "using Unicode as the internal code". In other words, character display programs keep using the original encodings and glyph data, while the internal logic of applications uses Unicode; when text is to be displayed, the program converts the Unicode string back into its original encoding for display. In this way, Unicode can be used without redesigning the existing glyph data systems. To distinguish Unicode from the encodings already defined by individual countries, the Unicode designers call Unicode a "wide-character encoding", while the encodings defined by individual countries are called "multi-byte encodings". Today the Unicode system has introduced a four-byte extended encoding and is gradually merging with UCS-4, i.e. the ISO 10646 encoding specification, in the hope that one day all the world's text encodings can be unified under the ISO 10646 system.
The Unicode system was born carrying great expectations and was quickly accepted as an ISO-recognized international standard. However, during its adoption Unicode met with opposition from European and American users. The reason for their objection is very simple: the encodings they had been using were one byte long, a double-byte Unicode processing engine cannot handle that existing single-byte data, and converting all existing single-byte text to Unicode would be an enormous amount of work. Furthermore, if all single-byte text were converted to double-byte Unicode encoding, all of their text data would take up twice the space, and all processing programs would have to be rewritten. They could not accept that cost.
Although Unicode is an internationally recognized standard, the standardization organization could not ignore the demands of the largest group of computer users, those in Europe and America. After consultation among the parties, a variant encoding of Unicode was produced: UTF-8. UTF-8 is a multi-byte encoding system, with the following encoding rules:
1. UTF-8 encoding is divided into six zones:

Zone 1 is the single-byte encoding;
encoding format: 0xxxxxxx;
corresponds to Unicode: 0x0000 - 0x007F

Zone 2 is the double-byte encoding;
encoding format: 110xxxxx 10xxxxxx;
corresponds to Unicode: 0x0080 - 0x07FF

Zone 3 is the three-byte encoding;
encoding format: 1110xxxx 10xxxxxx 10xxxxxx;
corresponds to Unicode: 0x0800 - 0xFFFF

Zone 4 is the four-byte encoding;
encoding format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx;
corresponds to Unicode: 0x00010000 - 0x001FFFFF

Zone 5 is the five-byte encoding;
encoding format: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx;
corresponds to Unicode: 0x00200000 - 0x03FFFFFF

Zone 6 is the six-byte encoding;
encoding format: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx;
corresponds to Unicode: 0x04000000 - 0x7FFFFFFF
Of these, zones 2 and 3 correspond to Unicode's double-byte coding area, while zone 4 serves Unicode's four-byte extension (by this definition UTF-8 also has zones 5 and 6, but the author did not find them in the GNU glibc library and does not know why);
2. The zones are arranged in the order 1, 2, 3, 4, 5, 6, and within them each character keeps the same relative position as in Unicode;
3. Non-displayable Unicode characters are encoded as zero bytes; in other words, they are not included in UTF-8 (this statement comes from comments in the GNU C Library source and may not match the actual situation);
From the UTF-8 encoding rules it is easy to see that the 128 codes in zone 1 are in fact just ASCII. A UTF-8 processing engine can therefore handle ASCII text directly. However, UTF-8's compatibility with ASCII comes at the expense of other encodings. For example, Chinese, Japanese, and Korean characters were originally, for the most part, double-byte encodings, but their Unicode code points fall in zone 3 of UTF-8, so each of these characters requires three bytes. In other words, if we convert all existing non-ASCII text in Chinese, Japanese, and Korean encodings to UTF-8, it becomes 1.5 times its original size.
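As a quick illustration (a small Python 2 sketch of my own, assuming a GB2312 codec is installed on the system), the same Chinese character occupies two bytes in GB2312 but three bytes in UTF-8:

u = u'\u4e2d'                     # the character "中"
print len(u.encode('gb2312'))     # 2 bytes in GB2312
print len(u.encode('utf-8'))      # 3 bytes in UTF-8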
Although I personally feel that UTF-8 is a little unfair, it does solve the problem of migrating ASCII text into the Unicode world, so it has still won wide acceptance. Typical examples: UTF-8 is the default encoding of XML, and Java source code can actually be written in UTF-8 characters (JBuilder users will be familiar with this). In the open-source world there is also the famous GTK 2.0, which uses UTF-8 characters as its internal code.
Having said so much, it may seem that we have drifted from the topic, and many Python enthusiasts may be starting to wonder: "What does all this have to do with Python?" All right, let us now turn our eyes to the world of Python.
Section II: Python's Unicode encoding system
To handle multilingual text correctly, Python introduced Unicode strings after version 2.0. Since then, there have been two kinds of strings in the Python language: the traditional Python string that had been in use long before version 2.0, and the new Unicode string. In Python we use the unicode() built-in function to "decode" a traditional Python string and obtain a Unicode string, and then use the Unicode string's encode() method to "encode" that Unicode string back into a traditional Python string. All of this is no doubt familiar to every Python user. But did you know that Python's Unicode string is not a "Unicode-encoded string" in the strict sense, but follows its own unique rule? The content of this rule is very simple:
1. The Python Unicode encoding of ASCII characters is the same as their ASCII encoding; in other words, the ASCII text inside a Python Unicode string is still a single-byte-length encoding;
2. For characters other than ASCII characters, the encoding is the double-byte (or four-byte) encoding of the Unicode standard. (I suspect the Python community chose such a quirky standard to guarantee the universality of ASCII strings.)
Typically, a Python application uses Unicode strings for internal processing, while the work of displaying on the terminal is done with traditional Python strings (in fact, Python's print statement simply cannot print double-byte Unicode-encoded characters). In the Python language, the traditional Python string is a so-called "multi-byte encoded" string, used to represent strings that have been "encoded" into a concrete character set encoding (such as GB, BIG5, KOI8-R, JIS, ISO-8859-1, and of course UTF-8), while a Python Unicode string is a "wide-character encoded" string, representing Unicode data that has been "decoded" out of a concrete character set encoding. So typically, a Python application that uses Unicode tends to handle string data in the following way:
def foo(string, encoding="gb2312"):
    # 1. convert the multi-byte string into a wide-character (Unicode) string
    u_string = unicode(string, encoding)
    # 2. do something ...
    # 3. convert the wide-character string back into a printable multi-byte string
    return u_string.encode(encoding)
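A hypothetical round trip through foo() (again assuming a GB2312 codec is installed) shows that the data comes back out exactly as it went in:

s = u'\u4e2d\u6587'.encode('gb2312')   # the GB2312 bytes for the word "中文"
print foo(s) == s                      # True: decoding then re-encoding round-trips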
Let us take an example: Python users who often do X Window programming with PyGTK2 in a Red Hat Linux environment may already have run into the following situation. If we write the statements below directly:
import pygtk
pygtk.require('2.0')
import gtk

main = gtk.Window()        # create a window
main.set_title('你好')     # notice! the title is a GB-encoded Chinese greeting ("hello")
then the following warning will appear on the terminal when the program runs:
Error converting from UTF-8 to 'GB18030': Invalid character sequence in conversion input
and the program's window title will not be set to "你好". But if the user has a Chinese codec installed and changes the last statement above to:
u_string = unicode('你好', 'gb2312')
main.set_title(u_string)
then the window title will be correctly set to "你好". Why is that?
The reason is simple. The gtk.Window.set_title() method always treats the title string it receives as a Unicode string. When the PyGTK system receives the user's main.set_title() request, at some point it processes the string it has obtained like this:
class Window(gtk.Widget):
    ...
    def set_title(self, title_unicode_string):
        ...
        # notice! Unicode -> multi-byte UTF-8
        real_title_string = title_unicode_string.encode('utf-8')
        ...
        # pass real_title_string to the GTK2 C API to draw the title
        ...
We see that the string title_unicode_string is "encoded" inside the program into a new string, real_title_string. Clearly this real_title_string is a traditional Python string, and its encoding is UTF-8. As mentioned in the previous section, the strings that GTK2 uses internally are all UTF-8 encoded, so after receiving real_title_string the GTK2 core system can display the title correctly.
So what if the title entered by the user is an ASCII string (for example "Hello World")? Recalling the definition of the Python Unicode string, it is not hard to see that if the user's input is an ASCII string, then re-encoding it yields the string itself. That is, if the value of title_unicode_string is an ASCII string, then real_title_string and title_unicode_string will have exactly the same value. An ASCII string is also a valid UTF-8 string, so passing it to the GTK2 system causes no problem at all.
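A quick check of this claim (a one-line Python 2 illustration of my own): encoding an ASCII-only Unicode string into UTF-8 returns byte-for-byte the same data:

print u'Hello World'.encode('utf-8') == 'Hello World'   # True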
The example above concerns PyGTK2 under Linux, but similar problems are not limited to PyGTK. Besides PyGTK, the various graphics toolkit bindings for Python available today, such as PyQt, Tkinter, and others, all touch on Unicode handling to a greater or lesser extent.
Now that we've got a sense of Python's Unicode string encoding mechanism, the question we're most interested in is still unresolved: how do we get Python to support Unicode processing of Chinese? This is a question we will explain in the next section.
Section III: How to enable Python Unicode strings to support Chinese
After reading the title of this section, some Python users may disagree: "Why must we use Unicode to handle Chinese? Aren't the traditional Python strings we normally use good enough?" It is true that for operations such as string concatenation and substring matching, traditional Python strings are sufficient. However, once more advanced string operations are involved, such as regular-expression matching over multilingual text, text editing, expression parsing, and so on, operations on large amounts of text mixing single-byte and multi-byte characters become very cumbersome with traditional strings. Besides, traditional strings never solve that damned "half character" problem. If we can use Unicode instead, these problems can all be solved. Therefore, we must face the problem of Chinese Unicode handling squarely and try to solve it.
From the description in the previous section we know that if we want to use Python's Unicode mechanism to process strings, all we need is an encoding/decoding module capable of converting in both directions between multi-byte Chinese encodings (including the GB series and the BIG5 series) and Unicode. In Python terminology, such an encoding/decoding module is called a codec. So the next question becomes: how do we write such a codec?
If Python's Unicode mechanism were hard-coded into the Python core, adding a new codec to Python would be a hard and arduous task. Fortunately Python's designers were not that stupid: they provide an extension mechanism that makes it very easy to add new codecs to Python.
Python's Unicode handling has three most important components: the codecs.py file, the encodings directory, and the aliases.py file. The first two are located in the installation directory of the Python standard library (for the Win32 distribution this is the $PYTHON_HOME/lib/ directory; for Red Hat Linux it is the /usr/lib/python-version/ directory; users of other systems can search for it accordingly), and the last one is located inside the encodings directory. Next, we discuss each of the three in turn.
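If you are not sure where these files live on your own system, a short Python session (a simple illustration of my own, usable on any installation) will print their locations:

import codecs, encodings, encodings.aliases
print codecs.__file__              # where the codecs module lives
print encodings.__path__           # where the encodings directory lives
print encodings.aliases.__file__   # where aliases.py lives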
Let us first look at the codecs.py file. This file defines the interface that a standard codec module should have. Its full content can be found in your own Python distribution, so I will not repeat it here. As defined by codecs.py, a complete codec should provide at least three classes and one standard function:
1. The Codec class

Purpose:
Used to treat the user's buffer data (a buffer) as a traditional Python string and "decode" it into the corresponding Unicode string. A complete Codec class definition must provide two methods, Codec.decode() and Codec.encode():
Codec.decode(input, errors = "strict")
Used to interpret the input data as a traditional Python string and "decode" it into the corresponding Unicode string.
Parameters:
input: the input buffer (can be a string or any object that can be converted to a string representation)
errors: how to handle conversion errors. One of the following three values can be chosen:
strict (default): raise a UnicodeError exception when an error occurs;
replace: when an error occurs, substitute a default replacement character;
ignore: when an error occurs, skip the offending character and continue parsing the remaining characters.
Return value:
A tuple: the first element is the converted Unicode string, and the second element is the length of the input data.
Codec.encode(input, errors = "strict")
Used to treat the input data as a Unicode string and "encode" it into the corresponding traditional Python string.
Parameters:
input: the input buffer (usually a Unicode string)
errors: how to handle conversion errors; the allowed values are the same as for the Codec.decode() method.
Return value:
A tuple: the first element is the converted traditional Python string, and the second element is the length of the input data.
2. The StreamReader class (usually should inherit from the Codec class)
Used to analyze the file input stream. It provides all the read operations of a file object, such as the readline() method.
3. The StreamWriter class (usually should inherit from the Codec class)
Used to analyze the file output stream. It provides all the write operations of a file object, such as the writelines() method.
4. The getregentry() function
"getregentry" is short for "GET REGistry ENTRY", i.e. obtain the four key objects defined in each codec file. Its function body always looks like this:
def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)
Of the four items listed above, only the Codec class and the getregentry() function are strictly required. The former must be provided because it is the part that actually performs the conversion; the latter is the standard interface through which the Python system obtains the codec's definition, so it must exist as well. As for StreamReader and StreamWriter, in theory it should be possible to inherit from the StreamReader and StreamWriter classes in codecs.py and use their default implementations. Of course, many codecs rewrite these two classes to implement some special custom functionality.
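To tie the pieces together, here is a minimal sketch of what such a codec module might look like. The module name mycodec.py and the identity mapping tables are placeholder assumptions of my own for illustration; a real Chinese codec would of course need full GB or BIG5 mapping data (or a conversion formula) instead:

# mycodec.py -- a minimal codec sketch (hypothetical charset, for illustration only)
import codecs

# placeholder mapping tables: identity mapping for the ASCII range;
# a real codec would map every code of the target character set
decoding_map = codecs.make_identity_dict(range(128))
encoding_map = codecs.make_encoding_map(decoding_map)

class Codec(codecs.Codec):
    def decode(self, input, errors='strict'):
        # traditional string -> (Unicode string, length consumed)
        return codecs.charmap_decode(input, errors, decoding_map)
    def encode(self, input, errors='strict'):
        # Unicode string -> (traditional string, length consumed)
        return codecs.charmap_encode(input, errors, encoding_map)

class StreamReader(Codec, codecs.StreamReader):
    pass

class StreamWriter(Codec, codecs.StreamWriter):
    pass

def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)

This structure deliberately mirrors the single-byte codecs shipped in the encodings directory (cp1252.py and friends), which are a good reference when writing your own.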
Next let us talk about the encodings directory. As the name implies, the encodings directory is where the Python system stores all installed codecs by default. Here we can find all the codecs that come with the Python distribution, and in practice every newly written codec is usually installed here as well. Note, however, that the Python system does not actually require every codec to be installed here. Users can place a new codec anywhere they like, as long as it is on the Python system's search path.
However, simply installing our own codec somewhere Python can find it is not enough: for the Python system to find the corresponding codec, it must also be registered in Python. To register a new codec we must use the aliases.py file in the encodings directory. This file defines a single hash table, aliases. Each of its keys is the name of a codec as it is used, i.e. the value of the second parameter of the unicode() built-in function, and the value corresponding to each key is a string, namely the module name of the file that implements that codec. For example, Python's default codec for UTF-8 is utf_8.py, stored in the encodings subdirectory, and the aliases hash table contains an entry indicating this correspondence:
'utf-8' : 'utf_8', # the module `utf_8' is the codec for UTF-8
Similarly, if we write a new codec that parses a 'mycharset' character set, and suppose its code lives in mycodec.py, stored in the $PYTHON_HOME/lib/site-packages/mycharset/ directory, then we would have to add the following line to the aliases hash table:
'mycharset' : 'mycharset.mycodec',
There is no need to write out the full path of mycodec.py, because the site-packages directory is normally on the Python system's search path.
The Python interpreter automatically loads the aliases.py file in the encodings directory whenever it needs to parse a Unicode string. If mycharset has been registered in the system, we can then use our own codec just like any built-in encoding. For instance, with mycodec.py registered as above, we can write:
my_unicode_string = unicode(a_multi_byte_string, 'mycharset')
print my_unicode_string.encode('mycharset')
Now we can summarize the steps needed to write a new codec:
First, write the encoding/decoding module of our codec;
Second, put this module file somewhere the Python interpreter can find it;
Finally, register it in the encodings/aliases.py file.
In theory, with these three steps we can install our own codec into the system. But this is not quite the end of the story; there is still one small problem. Sometimes, for various reasons, we do not want to modify system files ourselves (for example, a user working on a centrally managed system whose administrator does not allow others to modify system files). In the steps described above we need to modify the contents of aliases.py, which is a system file. If we cannot modify it, can we not add a new codec at all? Of course not; there is still a way.
The idea is to modify the contents of the encodings.aliases.aliases hash table at run time.
Using the same assumptions as above, if the administrator of the user's system does not allow the registration information for mycodec.py to be written into aliases.py, we can do the following:
1. Put mycodec.py in a directory of its own, for example the /home/myname/mycharset/ directory;
2. Write the /home/myname/mycharset/__init__.py file like this:
import encodings.aliases
# update the aliases hash map
encodings.aliases.aliases.update({
    'mycharset' : 'mycharset.mycodec',
})
From then on, every time we use Python we just add /home/myname/ to the search path and execute the following in advance before using our codec:
import mycharset   # executes the script in mycharset/__init__.py
This way we can use the new codec without touching any of the original system files. Moreover, with the help of Python's site mechanism, we can even automate this import process. If you do not know what site is, run the following in your own interactive Python session:
import site
print site.__doc__
and browse the documentation of the site module to learn the tricks involved. If you have a Red Hat Linux v8 or v9 machine at hand, you can also look at the Japanese codec shipped with Red Hat's Python distribution to see how it is loaded automatically. Many readers may not know where to find this Japanese codec, so here are its locations:
Red Hat Linux v8: in the /usr/lib/python2.2/site-packages/japanese/ directory;
Red Hat Linux v9: in the /usr/lib/python2.2/lib-dynload/japanese/ directory.
Tip: Red Hat users should take note of the japanese.pth file in the site-packages directory; combined with the documentation of the site module, I believe everything will immediately become clear.
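For reference, such a path configuration file could look like the hypothetical sketch below (the file name mycharset.pth and its contents are my own assumption, not copied from the Red Hat package). Each line of a .pth file placed in site-packages is either a directory to add to sys.path or, if it starts with "import", a statement executed at interpreter start-up:

/home/myname
import mycharset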
Conclusion
I remember bragging on the Dohao forum: "If need be, I could certainly write one (a Chinese module) myself." Looking back now, I cannot help feeling ashamed of my arrogance. Someone who devotes all his time to study, takes only seven courses in a semester and still manages to fail two of them, has no business being so cocky in front of everyone. Now, with the workload of the new semester suddenly much heavier because of those two courses (ten courses!), and my parents eagerly hoping their son will do them credit, my time is very limited: I have to keep up with my studies and with my duties (I am responsible for the tutoring of my supervisor's course, and there is also a university teaching-reform project I have to carry), and I am already struggling to cope. Adding a Chinese module on top of that... Alas, please forgive me for not persisting; I have to break my promise.
Therefore I venture to share here the experience I have gathered over the past half year, in the hope of finding a group, or even just one, of fellow enthusiasts interested in this project who can take the knowledge I have collected and turn it into a complete Chinese module (it should at least cover GB and BIG5, and in my personal opinion ideally the HZ code as well), and contribute it to everybody (whether paid or for free). That would be a blessing for the many Python enthusiasts among us. Besides, Python's distribution does not yet include any Chinese support module. Since I rely on Python anyway, if our work could contribute something to Python's development, why not do it?
Appendix: a few small tips
1. LUO Jian has already written a very good Chinese module (there is a link on Dohao; the file name is Showfile.zip, and the module is much faster than the draft version I wrote). It supports both the GB2312 and GB18030 encodings, but unfortunately it does not support BIG5. If you are interested, you can download this module and study it;
2. Compared with the codecs for other character sets, a Chinese module has one peculiarity: its enormous number of characters. Relatively small character sets are manageable; GB2312, for example, can be looked up with a hash table. But for a huge encoding like GB18030, if you simply turn all the data into one oversized mapping table, lookups will be intolerably slow (this is the point that gave me the biggest headache when writing my own module). If you want a codec whose speed is satisfactory, you have to consider designing a formula that can derive one encoding from the other with a simple calculation, or at least compute its approximate range. This requires analyzing the whole encoding scheme statistically and trying to find regularities. I believe this is the biggest difficulty in writing a Chinese module. Perhaps because my mathematical foundation is too weak, I have not yet been able to find such a rule; I hope someone well versed in mathematics will be generous enough to offer advice;
3. Chinese encodings are divided into two big families: GB and BIG5. The GB family is further divided into the GB2312, GBK, and GB18030 encodings, while the BIG5 family is divided into BIG5 and BIG5-HKSCS (corresponding to the original BIG5 and the Hong Kong extended version respectively). Although encodings within the same family are backward compatible, considering their huge numbers of characters, I personally think it is more reasonable to build separate tables for them in order to speed up lookups. Of course, if you can find conversion formulas between the corresponding character sets, then this separation becomes unnecessary;