First, a historical note. Before Unicode, there was a separate character encoding system for each language, and each system used the same numbers (0-255) to represent that language's characters. Some languages (like Russian) had several conflicting standards for how to represent the same characters, and others (like Japanese) have so many characters that they required multiple character sets. Exchanging documents between systems was difficult, because there was no way for a computer to tell which encoding scheme the document's author had used; the computer sees only numbers, and the same numbers can mean different things. Then think about trying to store those documents in the same place (for instance, in the same database table): you would need to store the character encoding alongside each piece of text, and make sure it was passed along with the text. Then think about multilingual documents, which use characters from several languages within the same document. (These typically used escape codes to switch modes: pop, we're in Russian koi8-r mode, so character 241 means this; pop, now we're in Mac Greek mode, so character 241 means something else.) These are the problems that Unicode was designed to solve.
To solve these problems, Unicode represents each character as a 2-byte number, from 0 to 65535. [5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used in multiple languages share the same numeric code.) There is exactly one number per character, and exactly one character per number. Unicode data is never ambiguous.
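The one-number-per-character idea is easy to see interactively. A minimal sketch, using modern Python 3 syntax (where string literals are Unicode by default), with ord and chr mapping between characters and their code points:

```python
# Every character maps to exactly one number (its Unicode code point),
# and every number maps back to exactly one character.
assert ord('A') == 65          # the same value ASCII has always used
assert ord('ñ') == 0xF1        # 241, shared with iso-8859-1
assert ord('П') == 0x041F      # Cyrillic capital Pe, beyond any single-byte set
assert chr(0x041F) == 'П'      # the mapping is unambiguous in both directions
```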
Of course, all those legacy encoding systems still exist. 7-bit ASCII, for instance, can store English characters as values from 0 to 127. (65 is the capital letter "A", 97 is the lowercase letter "a", and so forth.) English has a very simple alphabet, so it can be expressed entirely in 7-bit ASCII. Western European languages like French, Spanish, and German use an encoding system called ISO-8859-1 (also known as "latin-1"), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde (ñ, 241) and u-with-two-dots (ü, 252). Unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, then extends from there, using the remaining numbers, 256 through 65535, for characters from other languages.
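This layering (ASCII inside latin-1 inside Unicode) can be verified directly. A quick sketch in modern Python 3 terms, where decoding bytes with a codec produces a Unicode string:

```python
# The first 128 code points are ASCII, so decoding the same byte
# values with ascii or latin-1 agrees everywhere they overlap.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode('ascii') == ascii_bytes.decode('latin-1')

# latin-1 byte 241 is ñ, byte 252 is ü, and the Unicode code
# points carry exactly the same values.
assert b'\xf1'.decode('latin-1') == 'ñ'
assert b'\xfc'.decode('latin-1') == 'ü'
assert ord('ñ') == 241 and ord('ü') == 252
```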
When working with Unicode data, you may still need to convert the data back into one of these legacy encoding systems in places: to integrate with some other computer system that expects its data in a specific single-byte encoding scheme, to print to a non-Unicode-aware terminal or printer, or to store the data in an XML document that explicitly specifies a different encoding scheme.
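Converting back for a legacy system is just an encode, and as long as every character exists in the target encoding, the round trip is lossless. A small sketch in modern Python 3 terms:

```python
text = 'La Peña'

# Produce the single-byte representation a latin-1-only system expects.
legacy = text.encode('latin-1')
assert legacy == b'La Pe\xf1a'     # ñ becomes the single byte 0xf1

# Decoding with the same scheme recovers the original Unicode string.
assert legacy.decode('latin-1') == text
```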
With that historical note out of the way, let's get back to Python.
Starting with version 2.0, Python has supported Unicode throughout the language. The XML package uses Unicode to hold all parsed XML data, and you can use Unicode anywhere. Example 9.13. Introducing Unicode
>>> s = u'Dive in'
>>> s
u'Dive in'
>>> print s
Dive in
To create a Unicode string instead of a regular ASCII string, prefix the string with the letter "u". Note that this particular string doesn't have any non-ASCII characters. That's fine; Unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as Unicode.
When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since the characters that make up this Unicode string are all ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a Unicode string, you would never notice the difference.
Example 9.14. Storing non-ASCII characters
>>> s = u'La Pe\xf1a'
>>> print s
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print s.encode('latin-1')
La Peña
The real advantage of Unicode, of course, is its ability to store non-ASCII characters, like the Spanish "ñ" (n with a tilde over it). The Unicode character code for the tilde-n is hexadecimal 0xf1 (decimal 241), which you can type like this: \xf1.
Remember I said that print attempts to convert a Unicode string to ASCII so it can print it? Well, that's not going to work here, because your Unicode string contains non-ASCII characters, so Python raises a UnicodeError.
Here's where the conversion-from-Unicode-to-other-encoding-schemes comes in. s is a Unicode string, but print can only print a regular string. To solve this problem, we call the encode method, available on every Unicode string, to convert the Unicode string to a regular string in the given encoding scheme, which we pass as a parameter. In this case, we use latin-1 (also known as iso-8859-1), which includes the tilde-n. (The default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127.)
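The encode method also takes an optional second argument naming an error-handling policy, which is useful when the target encoding can't represent every character. A quick sketch, using modern Python 3 syntax (the error-handler names are the same in both Python versions):

```python
s = 'La Pe\xf1a'

# latin-1 contains ñ, so encoding succeeds.
assert s.encode('latin-1') == b'La Pe\xf1a'

# ascii stops at 127, so the strict default raises UnicodeEncodeError...
try:
    s.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# ...but the 'replace' handler substitutes a question mark instead.
assert s.encode('ascii', 'replace') == b'La Pe?a'
```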
Remember I said that Python usually converts Unicode to ASCII whenever it needs to get a regular string out of a Unicode string? Well, this default encoding scheme is an option that you can customize. Example 9.15. sitecustomize.py
# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('iso-8859-1')
sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory.
The setdefaultencoding function sets, well, the default encoding. This is the encoding scheme that Python uses wherever it needs to auto-coerce a Unicode string into a regular string.
Example 9.16. Effects of setting the default encoding
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'
>>> s = u'La Pe\xf1a'
>>> print s
La Peña
This example assumes that you have made the changes to your sitecustomize.py file shown in the previous example, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; it can't be changed afterwards. (Due to some wacky programming tricks that I won't dive into right now, you can't even call sys.setdefaultencoding once Python has started up. Dig into site.py and search for "setdefaultencoding" to find out why.)
Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.
Example 9.17. Specifying the encoding of a .py file
If you are going to store non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Now, what about encodings in XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages, and KOI8-R is a popular encoding for Russian texts. The encoding, if specified, is in the header of the XML document. Example 9.18. russiansample.xml
<?xml version="1.0" encoding="koi8-r"?>
<preface>
<title>Предисловие</title>
</preface>
This is a sample extracted from a real Russian XML document; it is part of the Russian translation of this very book. Note the encoding, koi8-r, specified in the header.
These are Cyrillic characters which, as far as I know, spell the Russian word for "Preface". If you open this file in a regular text editor, the characters will most likely look like gibberish, because they are encoded using the koi8-r encoding scheme, but the editor is displaying them as if they were iso-8859-1.
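You can reproduce that editor gibberish exactly: encode the word with one scheme and decode the resulting bytes with the other. A short sketch in modern Python 3 terms, using the same title word and the same byte values that appear in the next example:

```python
title = 'Предисловие'

# The koi8-r byte values for the Russian word "Preface".
raw = title.encode('koi8-r')
assert raw == b'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'

# A latin-1 editor interprets those same bytes as accented Latin letters.
garbled = raw.decode('latin-1')
assert garbled == 'ðÒÅÄÉÓÌÏ×ÉÅ'

# Reversing the mistake recovers the original text unharmed.
assert garbled.encode('latin-1').decode('koi8-r') == title
```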
Example 9.19. Parsing russiansample.xml
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('russiansample.xml')
>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
>>> title
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
>>> print title
Traceback (innermost last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> convertedtitle = title.encode('koi8-r')
>>> convertedtitle
'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
>>> print convertedtitle
Предисловие
I'm assuming here that you saved the previous example as russiansample.xml in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back to 'ascii', either by removing the sitecustomize.py file or at least by commenting out the setdefaultencoding line.
Note that the text data of the title tag (now in the title variable, thanks to that long concatenation of Python functions, which I hastily skipped over and will not explain until the next section) is stored in Unicode.

Printing the title is not possible, because this Unicode string contains non-ASCII characters, so Python can't convert it to ASCII, since that doesn't make sense.
You can, however, explicitly convert it to koi8-r, in which case we get a (regular, not Unicode) string of single-byte characters (f0, d2, c5, and so forth) that is the koi8-r-encoded version of the characters in the original Unicode string.
Printing the koi8-r-encoded string will probably show up as gibberish on your screen, because your Python IDE is interpreting those characters as iso-8859-1 rather than koi8-r. But at least they do print. (And, if you look carefully, it is the same gibberish you saw when you opened the original XML document in a non-Unicode-aware text editor. Python converted it from koi8-r into Unicode when it parsed the XML document, and you've just converted it back.)
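The same parse-and-convert round trip can be sketched without a file on disk, by feeding minidom the raw bytes directly. This is a sketch in modern Python 3 terms, assuming your Python's expat parser can look up the koi8-r codec (as CPython's can for single-byte encodings):

```python
from xml.dom import minidom

# Build the koi8-r document as raw bytes, as it would exist on disk.
doc = (b'<?xml version="1.0" encoding="koi8-r"?>\n'
       b'<preface>\n<title>'
       + 'Предисловие'.encode('koi8-r')
       + b'</title>\n</preface>')

xmldoc = minidom.parseString(doc)
title = xmldoc.getElementsByTagName('title')[0].firstChild.data

# The parser honored the declared encoding and handed back Unicode.
assert title == 'Предисловие'

# Converting back to koi8-r reproduces the document's original bytes.
assert title.encode('koi8-r') == b'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
```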
To sum it all up, Unicode itself is a bit intimidating if you've never seen it before, but Unicode data is really easy to handle in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about Unicode. Python converts the ASCII data in the XML documents to Unicode while parsing, and auto-coerces it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with data in other languages, Python is ready.

Further reading

Unicode.org is the home page of the Unicode standard, and includes a brief technical introduction. The Unicode Tutorial has some more examples of how to use Python's Unicode functions, including how to force Python to coerce Unicode into ASCII even when it doesn't really want to. PEP 263 goes into more detail about how and when to define a character encoding in your .py files.

Footnotes
[5] This, sadly, is still an oversimplification. Unicode has since been extended to handle ancient Chinese, Korean, and Japanese texts, which have so many different characters that the 2-byte Unicode system could not represent them all. But Python as it stood then did not support those out-of-range encodings, and I don't know of any projects that were planned to fix it. You've reached the limits of my experience, sorry.
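For what it's worth, modern Python versions do handle those supplementary-plane characters; the footnote reflects the state of Python 2.0. A quick sketch of one such character in modern Python 3:

```python
# A character outside the 16-bit range, from the supplementary planes
# (CJK Unified Ideographs Extension B starts at U+20000).
han = '\U00020000'
assert ord(han) == 0x20000              # code point is larger than 65535
assert len(han) == 1                    # still a single character
assert len(han.encode('utf-8')) == 4    # needs four bytes in UTF-8
```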