Unicode and Python Chinese Processing

Source: Internet
Author: User
Tags: gtk


http://blog.csdn.net/tingsking18/archive/2009/03/29/4033645.aspx

Unicode string processing has always been a confusing topic in the Python language. Many Python enthusiasts are troubled by the differences between Unicode, UTF-8, and the many other encodings. The author used to be one of this "brain-wrenching" crowd, but after more than half a year of effort I have finally sorted out some of these relationships. They are organized below and shared with fellow readers. I also hope this essay can serve as a starting point and attract more real experts to join in improving our Python Chinese-processing environment.

The views in this article come partly from consulting references and partly from the author's own "guess and verify" experiments with various existing encoded data. The author claims only a little self-taught knowledge, so there are bound to be mistakes. There is no shortage of experts among the readers; if anyone finds something wrong, please let me know. The author losing face is a small matter, while misleading others would be a far bigger one, so do not worry about sparing my feelings.

Section 1: Text encoding and the Unicode standard

To explain Unicode strings we must start with what Unicode encoding is. As everyone knows, displaying text has always been a basic problem that computers must solve. A computer is not literate; it actually treats text as a series of "pictures", each "picture" corresponding to one character. When a program displays text, it must consult a collection that records how each character "picture" is drawn, find the drawing data for each character, and render it to the screen accordingly. Such a "picture" is called a glyph, and the set of recorded glyph data is called a character set. To make lookup convenient, the glyph data in a character set must be ordered, and each character is assigned a unique ID; this ID is the character's encoding. When a computer processes character data, it always uses this encoding to represent the character. A character set therefore defines the set of characters a computer can handle. Obviously, different countries specify character sets of different sizes, and the corresponding encodings also differ.

In the history of computing, the earliest and most widely used standardized character set was ASCII. It was a standard developed in the United States for North American users. It uses 7 bits per character and can represent 128 characters. This character set was eventually adopted as an international standard by ISO and is used in all kinds of computer systems. To this day, the BIOS of every PC contains a font for the ASCII character set, which shows how deeply rooted it has become.

However, as computers spread to other countries, the limitations of ASCII became apparent: its character space is simply too small to hold more characters, while most languages need far more than 128. To handle their national scripts correctly, official and unofficial organizations in various countries began designing their own character sets, and many national encodings emerged, such as the ISO-8859-1 encoding for Western European characters, the GB series for Simplified Chinese, and SHIFT-JIS for Japanese. At the same time, to remain compatible with existing ASCII text, most of these character sets use the ASCII characters as their first 128 characters and make their codes correspond one-to-one with ASCII.

This solved the problem of displaying each country's own characters, but it created a new one: garbled text. The character sets used in different countries and regions were not governed by any common specification, so their encodings are often mutually incompatible: the same character generally has different codes in two different character sets, and the same code represents different characters in different sets. A piece of text written in encoding A often shows up as a jumble of characters on a system that only supports encoding B. Worse still, different character sets often use different encoding lengths; programs that can only handle single-byte encodings mangle double-byte or multi-byte text when they fail to process it correctly, producing the infamous "half character" problem. This made an already chaotic situation even messier.

To solve these problems once and for all, many large companies and organizations in the industry jointly proposed a standard: Unicode. Unicode is essentially a new character encoding system. It assigns each character in its character set a two-byte ID, providing a code space of up to 65,536 characters, and it covers all the characters in the commonly used national encodings. Thanks to careful design, Unicode eliminates the garbled-text and "half character" problems that arise when exchanging data between other character sets. The Unicode designers also recognized that a huge amount of glyph data still uses the various national encodings, so they put forward the design principle of "using Unicode as the internal encoding". That is, display programs keep using the original encodings and glyph data, while an application's internal logic uses Unicode; when text is displayed, the program converts the Unicode string back to the original encoding. This way, glyph data systems do not have to be redesigned in order to use Unicode. To distinguish Unicode from the existing national encodings, its designers call Unicode a "wide character encoding", while the national encodings are traditionally called "multi-byte encodings". Today the Unicode system has introduced a four-byte extension and is gradually merging with UCS-4, i.e. the ISO 10646 specification, in the hope that one day all of the world's text encodings can be unified under the ISO 10646 system.

Great hopes were placed on Unicode from the moment it was born, and it was quickly accepted as a recognized ISO international standard. During its adoption, however, Unicode first met resistance from users in Europe and America. Their objection was simple: the encodings they already used are single-byte, and a double-byte Unicode processing engine cannot handle the existing single-byte data; converting all existing single-byte text to Unicode would be an enormous amount of work. Furthermore, if all single-byte text were converted to double-byte Unicode, their text data would double in size, and all processing programs would have to be rewritten. That cost was unacceptable to them.

Although Unicode is an internationally recognized standard, the standards organization could not ignore the demands of the largest group of computer users, those in Europe and America. So, after negotiation among all parties, a transformation format of Unicode was produced: UTF-8. UTF-8 is a multi-byte encoding system, and its encoding rules are as follows:

1. UTF-8 encoding is divided into zones:
Zone one is single-byte encoding; the format is 0xxxxxxx, corresponding to Unicode 0x0000-0x007f.
Zone two is double-byte encoding; the format is 110xxxxx 10xxxxxx, corresponding to Unicode 0x0080-0x07ff.
Zone three is three-byte encoding; the format is 1110xxxx 10xxxxxx 10xxxxxx, corresponding to Unicode 0x0800-0xffff.
Zone four is four-byte encoding; the format is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, corresponding to Unicode 0x00010000-0x001fffff.
Zone five is five-byte encoding; the format is 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx, corresponding to Unicode 0x00200000-0x03ffffff.
Zone six is six-byte encoding; the format is 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx, corresponding to Unicode 0x04000000-0x7fffffff.

Among these, zones one, two, and three cover the Unicode double-byte encoding range, and zone four serves Unicode's four-byte extension (by this definition UTF-8 also has zones five and six, but the author has not found them used in the GNU glibc library, for whatever reason);

2. The zones are arranged in the order one, two, three, four, five, six, and the characters at the corresponding positions keep the same order as in Unicode;

3. Non-displayable Unicode characters are encoded as zero bytes; in other words, they are not included in UTF-8 (this is what the author gathered from comments in the GNU C library and may not match the actual situation);

From these rules it is not hard to see that the 128 codes in zone one are exactly the ASCII codes, so a UTF-8 processing engine can handle ASCII text directly. However, UTF-8's compatibility with ASCII comes at the expense of other encodings. For example, Chinese, Japanese, and Korean characters were originally mostly double-byte encodings, but their Unicode code points fall into zone three of UTF-8, where each character takes three bytes. In other words, if we converted all existing non-ASCII CJK text to UTF-8, it would grow to 1.5 times its original size.
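As a quick check of these rules, the following short snippet (Python 2 syntax, matching the rest of this article, and using only the built-in utf-8 codec) encodes an ASCII character and a Chinese character and inspects the byte lengths and bit patterns:

print len(u'A'.encode('utf-8'))         # -> 1: ASCII stays single-byte under UTF-8
print len(u'\u4e2d'.encode('utf-8'))    # -> 3: U+4E2D lies in the 0x0800-0xffff zone

for byte in u'\u4e2d'.encode('utf-8'):  # the three bytes follow the 1110xxxx 10xxxxxx 10xxxxxx pattern
    print format(ord(byte), '08b'),     # -> 11100100 10111000 10101101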

Although the author personally feels that UTF-8 is somewhat unfair, it solved the problem of transitioning from ASCII text to the Unicode world and has therefore won wide acceptance. Typical examples are XML and Java: the default encoding of XML text is UTF-8, and Java source code can even be written in UTF-8 (JBuilder users will remember this). There is also the famous GTK 2.0 in the open-source world, which uses UTF-8 as its internal string encoding.

Having said all this, the topic may seem to have drifted, and many Python enthusiasts may be getting impatient: "What does this have to do with Python?" Fine, let us now turn our attention to the world of Python.

Section 2: Python's Unicode encoding system

To handle multilingual text correctly, Python introduced Unicode strings in version 2.0. Since then, strings in the Python language have come in two kinds: the traditional Python strings that existed long before version 2.0, and the new Unicode strings. In Python we use the unicode() built-in function to "decode" a traditional Python string into a Unicode string, and then use the Unicode string's encode() method to "encode" it back into a traditional Python string (a short round-trip example follows the two rules below). Every Python user will be familiar with this. But it is worth knowing that Python's Unicode strings are not "strings of Unicode encoding" in the strict sense; they follow their own peculiar rules. These rules are very simple:

1. The Python Unicode encoding of ASCII characters is the same as their ASCII encoding; in other words, ASCII text in a Python Unicode string is still stored with single-byte-length encodings;

2. For characters other than ASCII, the encoding is the double-byte (or four-byte) encoding of the Unicode standard.
(The author conjectures that the Python community adopted such a peculiar standard in order to preserve the universality of ASCII strings.)
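Here is the round trip promised above, in minimal form (Python 2 syntax, as used throughout this article). The built-in utf-8 codec is used only because it ships with Python; a GB codec, once installed, is used in exactly the same way:

byte_string = '\xe4\xbd\xa0\xe5\xa5\xbd'        # the characters "你好" as UTF-8 bytes
u_string = unicode(byte_string, 'utf-8')        # "decode" into a Unicode string
print repr(u_string)                            # -> u'\u4f60\u597d'
print u_string.encode('utf-8') == byte_string   # "encode" back -> True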

In a typical Python application, Unicode strings are used for internal processing, while terminal display is done with traditional Python strings (in fact, Python's print statement cannot print double-byte Unicode characters at all). In Python terminology, traditional Python strings are the so-called "multi-byte encoded" strings, used to represent strings that have been "encoded" into a concrete character set encoding (such as GB, BIG5, KOI8-R, JIS, ISO-8859-1, or indeed UTF-8); Python Unicode strings are "wide character encoded" strings, representing Unicode data "decoded" from one of those concrete encodings. So a Python application that uses Unicode usually processes its string data like this:

def foo(string, encoding="gb2312"):
    # 1. convert the multi-byte string to a wide character (Unicode) string
    u_string = unicode(string, encoding)

    # 2. do something
    ...

    # 3. convert the wide character string back to a printable multi-byte string
    return u_string.encode(encoding)
 

Let us give an example: under Red Hat Linux, Python users doing X Window programming with PyGTK2 may already have run into the following situation. Suppose we write:

import pygtk
pygtk.require('2.0')
import gtk

main = gtk.Window()        # create a window
main.set_title("你好")     # NOTICE! the title is a Chinese string ("Hello")
 

When this code is executed, a warning like the following appears on the terminal:

Error converting from UTF-8 to 'GB18030': An invalid character sequence appears in the conversion input

and the program window title will not be set to "你好".

But if the user has a Chinese codec installed and changes the last line above to:

u_string = unicode('你好', 'gb2312')
main.set_title(u_string)

then the title of the program window will be correctly set to "你好". Why is this?

The reason is simple. The gtk.Window.set_title() method always treats the title string it receives as a Unicode string. When the PyGTK system receives the user's main.set_title() request, somewhere it processes the string it was given roughly like this:

class Window(gtk.Widget):
    ...
    def set_title(self, title_unicode_string):
        ...
        # NOTICE! Unicode -> multi-byte utf-8
        real_title_string = title_unicode_string.encode('utf-8')
        ...
        # pass real_title_string to the GTK2 C API to draw the title
        ...
 

We can see that the string title_unicode_string is "encoded" inside the program into a new string, real_title_string. This real_title_string is a traditional Python string, and its encoding is UTF-8. As mentioned in the previous section, the strings GTK2 uses internally are all UTF-8 encoded, so the GTK2 core can display the title correctly once it receives real_title_string.

So what happens if the title the user passes in is an ASCII string (for example "hello world")? Recalling the definition rules of Python Unicode strings, it is not hard to see that if the user's input is ASCII, re-encoding it gives back exactly the same thing. In other words, if the value of title_unicode_string is an ASCII string, real_title_string and title_unicode_string will be exactly equal. And an ASCII string is also a valid UTF-8 string, so passing it to the GTK2 system causes no problem.
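A two-line check of this claim (Python 2):

title_unicode_string = unicode('hello world', 'ascii')
print title_unicode_string.encode('utf-8') == 'hello world'   # -> True: ASCII survives the round trip unchanged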

This PyGTK2 example is specific to Linux, but similar problems do not appear only in PyGTK. Besides PyGTK, today's various Python GUI bindings, such as PyQt and Tkinter, all run into Unicode-handling issues to some degree.

Now we have figured out Python's Unicode string encoding mechanism, but the question we most want answered is still unresolved: how can we make Python's Unicode support Chinese? We will explain this in the next section.

Section 3: How to make Python's Unicode strings support Chinese

After reading this section's title, some Python colleagues may be a little dismissive: "Why must we use Unicode to handle Chinese? Aren't traditional Python strings good enough?" Indeed, for ordinary operations such as string concatenation and substring matching, traditional Python strings are sufficient. However, for more advanced string operations, such as regular expression matching, text editing, and expression parsing over text that mixes several languages, handling large amounts of intermixed single-byte and multi-byte text with traditional strings is very troublesome. Moreover, traditional strings have never solved that damned "half character" problem. If we can use Unicode, these problems disappear. So we must face the problem squarely and try to solve Chinese Unicode handling.

From the previous section we know that to use Python's Unicode mechanism to process strings, all we need is an encode/decode module that converts bidirectionally between a multi-byte Chinese encoding (including the GB series and BIG5) and Unicode. In Python terminology, such an encode/decode module is called a codec. So the next question becomes: how do we write such a codec?

If Python's Unicode machinery were hard-coded into the Python core, adding a new codec would be an arduous job. Fortunately Python's designers were not that short-sighted: they provide a highly extensible mechanism through which new codecs can be added to Python very easily.

Python's Unicode handling has three most important components: the codecs.py file, the encodings directory, and the aliases.py file. The first two live in the installation directory of the Python standard library (for a Win32 distribution, $PYTHON_HOME/lib/; for Red Hat Linux, /usr/lib/python<version>/; other systems are similar), and the last one lives inside the encodings directory. Let us look at each in turn.
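If you want to see where these components live on your own system, a quick interactive check (Python 2) is:

import codecs, encodings, encodings.aliases
print codecs.__file__                  # location of the codecs.py interface definitions
print encodings.__path__               # location of the encodings package directory
print len(encodings.aliases.aliases)   # number of codec aliases currently registered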

First, the codecs.py file. This file defines the interface that a standard codec module should provide. Its full contents can be found in your own Python distribution, so they are not repeated here. According to codecs.py, a complete codec must provide at least three classes and one standard function:

1. Codec

Usage:
Used to convert between buffer data (a traditional Python string) passed in by the user and the corresponding Unicode string. A complete Codec class definition must provide two methods, Codec.decode() and Codec.encode():

Codec.decode(input, errors="strict")

Treats the input data as a traditional Python string and "decodes" it into the corresponding Unicode string.

Parameters:

input: the input buffer (a string, or any object that can be converted to a string representation)
errors: how to handle conversion errors. One of three values may be chosen:
strict (default): raise a UnicodeError exception when an error occurs;
replace: when an error occurs, substitute a default replacement character (U+FFFD) for the offending data;
ignore: when an error occurs, skip the offending character and continue with the remaining characters.
(The behaviour of these three modes is illustrated in the short example following the Codec.encode description below.)

Return value:
A tuple: the first element is the converted Unicode string, and the last element is the length of the input data consumed.

Codec.encode(input, errors="strict")

Treats the input data as a Unicode string and "encodes" it into the corresponding traditional Python string.

Parameters:

input: the input buffer (usually a Unicode string)
errors: how to handle conversion errors; the accepted values are the same as for the Codec.decode() method.

Return value:
A tuple: the first element is the converted traditional Python string, and the last element is the length of the input data consumed.
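As promised above, here is how the three errors modes behave, demonstrated with the built-in utf-8 codec (Python 2); a Chinese codec written against this interface should behave the same way:

bad_bytes = 'abc\xff'                              # \xff is not a valid UTF-8 sequence
print repr(bad_bytes.decode('utf-8', 'ignore'))    # -> u'abc'        (the bad byte is dropped)
print repr(bad_bytes.decode('utf-8', 'replace'))   # -> u'abc\ufffd'  (replacement character U+FFFD)
bad_bytes.decode('utf-8', 'strict')                # raises UnicodeDecodeError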

2. StreamReader class (should usually inherit from the Codec class)

Used for reading from a file input stream. It provides all the read operations on file objects, such as the readline() method.

3. StreamWriter class (should usually inherit from the Codec class)

Used for writing to a file output stream. It provides all the write operations on file objects, such as the writelines() method.

4. getregentry() function

The name stands for "GET REGistry ENTRY"; it returns the four key entry points defined in each codec file. Its body is conventionally:

def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)

Of the four components listed above, only the Codec class and the getregentry() function are strictly required. The former must be provided because it is the part that actually performs the conversions; the latter is the standard interface through which the Python system obtains the codec, so it must exist as well. As for StreamReader and StreamWriter, in theory it should be possible simply to inherit from the StreamReader and StreamWriter classes in codecs.py and use their default implementations; of course, many codecs override these two classes to implement special custom behaviour.
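Putting the pieces together, here is a minimal sketch of what such a codec module might look like, in the charmap style used by many of the codecs shipped in the encodings directory. The module name mycodec.py, the charset, and the toy mapping table are hypothetical; a real Chinese codec would of course need full GB or BIG5 mapping tables:

# mycodec.py -- hypothetical skeleton of a codec module (Python 2 era tuple interface)
import codecs

# Toy tables: identity mapping for the ASCII range only.  A real Chinese codec
# would fill these with the complete multi-byte <-> Unicode correspondence.
decoding_map = codecs.make_identity_dict(range(128))
encoding_map = codecs.make_encoding_map(decoding_map)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        # Unicode string -> traditional (multi-byte) string
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(self, input, errors='strict'):
        # traditional (multi-byte) string -> Unicode string
        return codecs.charmap_decode(input, errors, decoding_map)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)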

Next, the encodings directory. As the name suggests, this directory is where the Python system stores all installed codecs by default. The codecs bundled with every Python distribution can be found here, and traditionally every newly installed codec places itself here as well. Note, however, that the Python system does not actually require all codecs to be installed here: users may put a new codec anywhere they like, as long as it can be found on Python's search path.

Merely installing your codec somewhere Python can find it is not enough; it must also be registered with Python before the system can locate it. Registering a new codec uses the aliases.py file in the encodings directory. This file defines a single dictionary named aliases: each of its keys is the name of a codec as used in practice, that is, the value passed as the second argument to the unicode() built-in function, and the corresponding value is a string naming the module that implements that codec. For example, Python's default codec for UTF-8 is utf_8.py, stored in the encodings subdirectory, and the aliases dictionary contains an entry recording this correspondence:

'utf-8': 'utf_8',    # the module utf_8 is the codec for UTF-8
 

Similarly, if we write a new codec that handles a 'mycharset' character set, with its implementation in mycodec.py stored in the $PYTHON_HOME/lib/site-packages/mycharset/ directory, we would need to add a line like this to the aliases table:

'mycharset': 'mycharset.mycodec',

It is not necessary to write out the full path of mycodec.py here, because the site-packages directory is normally already on Python's search path.

When the Python interpreter needs to work with Unicode strings, it automatically loads the aliases.py file from the encodings directory. If mycharset is registered in the system, we can then use our own codec just like any other built-in encoding. For example, once mycodec.py is registered as above, we can write:

my_unicode_string = unicode(a_multi_byte_string, 'mycharset')
print my_unicode_string.encode('mycharset')

We can now summarize the steps required to write and install a new codec:

First, we need to write our own codec encoding / decoding module;

Second, we want to put this module file in a place where a Python interpreter can find it;

Finally, we need to register it in the encodings / aliases.py file.

In theory, these three steps are enough to install our own codec into the system. But we are not quite done; there is still one small problem. Sometimes, for various reasons, we do not want to modify system files ourselves (for example, a user works on a centrally managed system and the administrator does not allow others to modify system files). The steps described above require modifying the contents of aliases.py, which is a system file. If we cannot modify it, can we still add a new codec? Of course we can; there is a way.

The method is to modify the contents of the encodings.aliases.aliases dictionary at runtime.

Continuing with the assumption above: if the administrator of the user's system does not allow the registration entry for mycodec.py to be written into aliases.py, we can do the following:

1. Put mycodec.py into a directory of our own, for example the /home/myname/mycharset/ directory;

2. Write the /home/myname/mycharset/__init__.py file like this:

import encodings.aliases

# update the aliases dictionary at runtime
encodings.aliases.aliases.update({
    'mycodec': 'mycharset.mycodec',
})
From then on, whenever we want to use Python this way, we add /home/myname/ to the search path and, before using our own codec, execute:

import mycharset    # executes the script in mycharset/__init__.py
 

In this way we can use new codecs without changing the original system files. Furthermore, if we use Python's site mechanism, we can make this import automatic. If you do not know what site is, run the following in your interactive Python environment:

import site
print site.__doc__

Browse the site module's documentation and you will understand the trick. If you have Red Hat Linux v8 or v9 at hand, you can also look at the Japanese codec included in Red Hat's Python distribution to see how it is loaded automatically. Many of you may not find where this Japanese codec lives, so here is the list:

Red Hat Linux v8: in the /usr/lib/python2.2/site-packages/japanese/ directory;
Red Hat Linux v9: in the /usr/lib/python2.2/lib-dynload/japanese/ directory;
 

Tip: Red Hat users should pay attention to the Japanese.pth file in the site-packages directory; combined with the site module documentation, everything should suddenly become clear.
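For reference, what such a .pth file does boils down to one line. Here is a hypothetical mycharset.pth continuing the earlier example (the filename and package name are assumptions), relying on the documented site behaviour that lines in a .pth file beginning with "import" are executed at interpreter startup:

import mycharset    # runs mycharset/__init__.py, which registers the codec alias at startup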

Conclusion

I remember once boasting on the Dohao forum: "If I could, I would write a Chinese module for everyone." Looking back now, I cannot help feeling ashamed of my naivety then. A struggling graduate student who, with only seven courses in a semester, still managed to fail two of them, what right did I have to talk so big in front of everyone? Now, in the second semester, the course load has increased sharply (ten courses!), and my parents back home are still waiting for their son to do them proud. If I am to keep up with both study and work in the limited time available (I have to take on tutoring work for my advisor, and there is also a university teaching-reform project I am expected to carry), I am already stretched thin; adding a Chinese module on top of that... Alas, please forgive the author for going back on his word and having to say so.

Therefore, the author summons the courage to share this half year of experience, hoping only to find a group of people, or even just one, a fellow Chinese developer interested in this work, who can pick up the knowledge the author has organized, write a complete Chinese module (it should at least cover GB and BIG5, and in my personal opinion ideally the HZ code as well), and contribute it to everyone, whether paid or unpaid. That would be a blessing for all of us Python enthusiasts. Besides, the Python distribution still does not ship with any Chinese support module. Since we love Python, if our work can contribute something to Python's development, why not do it?

Appendix: A few tips

1. Brother LUO Jian has written a very good Chinese module (linked on Dohao; the file name is showfile.zip, and it is much faster than the draft version I wrote myself). It supports the GB2312 and GB18030 encodings, but unfortunately not BIG5. If you are interested, you can download this module and study it;

2. Compared with codecs for other character sets, a Chinese module has one special characteristic: its enormous number of characters. For relatively small character sets such as GB2312, a hash-table lookup is still workable; but for the huge GB18030 encoding, simply turning all the data into one giant lookup table makes queries unacceptably slow (this was the biggest headache when I wrote my draft module). To write a codec with satisfactory speed, you have to design a formula that derives one encoding from the other, or at least narrows it down to an approximate range. This requires studying the entire encoding scheme statistically and trying to find regularities. The author believes this is the greatest difficulty in writing a Chinese module. Perhaps because my mathematics is so poor, I struggled and failed to find such a pattern; I hope some mathematically minded reader can enlighten me;
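As a purely illustrative toy (all names hypothetical, and only two characters mapped), the hash-table approach mentioned in this tip looks roughly like the sketch below; the real difficulty the author describes is filling and optimizing such tables for the thousands of GB18030 code points:

# A toy sketch of table-lookup decoding for a GB-style double-byte charset (Python 2).
decoding_table = {
    '\xc4\xe3': u'\u4f60',   # GB code 0xC4E3 -> U+4F60
    '\xba\xc3': u'\u597d',   # GB code 0xBAC3 -> U+597D
}

def toy_decode(byte_string):
    result = []
    i = 0
    while i < len(byte_string):
        if ord(byte_string[i]) < 0x80:            # ASCII bytes pass through unchanged
            result.append(unicode(byte_string[i]))
            i += 1
        else:                                     # otherwise treat it as a two-byte code
            result.append(decoding_table[byte_string[i:i+2]])
            i += 2
    return u''.join(result)

print repr(toy_decode('\xc4\xe3\xba\xc3'))        # -> u'\u4f60\u597d'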
