A Collation of Documents on Chinese Text Processing in Python


1. Python Chinese Processing

From: http://bbs.chinaunix.net/thread-1431029-1-1.html

1. Encoding declarations in source files

If a Chinese character appears in Python source code, an error occurs at run time. The solution is to add a character-encoding declaration at the beginning of the source file, for example:

    #!/usr/bin/env python
    # -*- coding: cp936 -*-

The Python tutorial notes that Python source files can be written in character sets other than ASCII. The best practice is to follow the #! line with a special comment line that declares the character set:

    # -*- coding: encoding -*-

Based on this declaration, Python will interpret the characters in the file as the given encoding and do its best to translate them into Unicode text. Note that the coding comment only tells Python which encoding the file claims to use; the editor may still save the .py file in whatever encoding it likes. You must therefore make sure the editor actually saves the file in the declared encoding.

2. How Chinese strings are stored

    >>> s = u"中文"
    >>> s
    u'\xd6\xd0\xce\xc4'
    >>> s = "中文"
    >>> s
    '\xd6\xd0\xce\xc4'

The u"" prefix only declares the string to be Unicode; the actual bytes have not changed. This is what changes them:

    >>> s = "中文"
    >>> s
    '\xd6\xd0\xce\xc4'
    >>> s = s.decode("gb2312")
    >>> s
    u'\u4e2d\u6587'

Going further:

    >>> s = '中文'
    >>> s.decode('gb2312')
    u'\u4e2d\u6587'
    >>> len(s)
    4
    >>> len(s.decode('gb2312'))
    2
    >>> s = u'中文'
    >>> len(s)
    4
    >>> s = '中文test'
    >>> len(s)
    8
    >>> len(s.decode('gb2312'))
    6
    >>> s = '中文test，'
    >>> len(s)
    10
    >>> len(s.decode('gb2312'))
    7

(In the u'中文' case above, the literal was typed without a correct encoding declaration, so each gb2312 byte became one bogus character, giving length 4, exactly as in the first example.) As you can see, once Python is told the encoding, it correctly identifies the Chinese characters and Chinese punctuation stored in a non-ASCII encoding. The prefix u means "the string that follows is Unicode". This is just a declaration; it does not make the value genuinely Unicode, just as a person may claim to be 18 years old while his real age remains anyone's guess (age fraud in the sports world is hardly rare these days!). So what does u actually do? For Python, once you declare a string to be Unicode, it handles the string with its Unicode machinery: Unicode-aware functions are used for string operations, characters are stored as double-byte Unicode, and so on. Obviously, applying Unicode operations to a string that is not actually Unicode can cause problems, so the u prefix is only appropriate when the string constant really is Unicode.

3. I/O operations on Chinese text

Handling strings in Python is easy, but some issues need attention when the strings contain Chinese characters. For example:

    a = "我们是python爱好者"
    print a[0]

This prints only the first half of the character 我. To print the whole character you also need:

    b = a[0:2]
    print b

which is inconvenient. And what if the text mixes Chinese and English? The best approach is to convert to Unicode:

    c = unicode(a, "gb2312")
    print c[0]

Now each subscript of c corresponds to one character rather than one byte, and len(c) gives the number of characters. Conversion to other encodings, such as UTF-8, is just as easy:

    d = c.encode("utf-8")

4. <type 'str'> and <type 'unicode'>

<type 'str'> treats a string as a sequence of bytes, while <type 'unicode'> treats it as a sequence of characters. A single character may occupy multiple bytes; compared with characters, bytes are the lower storage level.
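To tie sections 2-4 together, here is a minimal Python 2 sketch (mine, not from the original article) of a helper that decodes byte strings only when necessary; the gb2312 default and the byte literal (the same '\xd6\xd0\xce\xc4' bytes shown above) are illustrative assumptions:

    def to_unicode(s, encoding="gb2312"):
        # pass unicode objects through unchanged; decode byte strings
        if isinstance(s, unicode):
            return s
        return s.decode(encoding)

    a = "\xd6\xd0\xce\xc4"        # gb2312 bytes of a two-character word
    u = to_unicode(a)
    print len(a), len(u)          # 4 2: bytes vs. characters
    print repr(u)                 # u'\u4e2d\u6587'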
To convert str to unicode, decode is required: the byte sequence is the low-level storage form and must be interpreted (decoded) into the higher-level characters before use. Conversely, converting unicode to str requires encode, just as information is encoded before being stored:

    s.decode(encoding)    # <type 'str'> to <type 'unicode'>
    u.encode(encoding)    # <type 'unicode'> to <type 'str'>

For example:

    >>> s = 'str'
    >>> type(s)
    <type 'str'>
    >>> type(s.decode())
    <type 'unicode'>
    >>> s = u'str'
    >>> type(s)
    <type 'unicode'>
    >>> type(s.encode())
    <type 'str'>

When processing Chinese data, the following approach is recommended:

1. Decode early: decode as early as possible, converting file content to Unicode before the next step.
2. Unicode everywhere: use Unicode for all internal processing in the program.
3. Encode late: encode only at the last moment, for example when writing the final result to the output file.

Here is a simple demonstration that searches a Chinese string with the re library and prints the match:

    >>> import re
    >>> p = re.compile(unicode("测试(.*)", "gb2312"))
    >>> s = unicode("测试一二三", "gb2312")
    >>> for i in p.findall(s):
    ...     print i.encode("gb2312")
    一二三

5. Cross-platform processing techniques

If a project must be developed on several platforms, the program should use one encoding throughout, for example by requiring all files to be UTF-8. If that cannot be unified (generally to satisfy the so-called unknowable requirements of many experts and scholars), the fallback is to use the current system encoding to decide how to decode the files:

    import locale
    import re

    # construct the needed encoding value from the current system locale
    lang = locale.setlocale(locale.LC_ALL, "").upper()
    textencoding = None
    # check whether the locale string names an encoding we can handle
    # (re.search rather than re.match, so the pattern may occur anywhere)
    if re.search("UTF-8", lang) != None:
        # a UTF-8 locale
        textencoding = "utf-8"
    elif re.search(r"CHINESE|CP936", lang):
        # the GB encodings on Windows
        textencoding = "gb18030"
    elif re.search(r"GB2312|GBK|GB18030", lang):
        # the GB encodings on Linux
        textencoding = "gb18030"
    else:
        # anything else: raise an error
        raise UnicodeError

    fd = open(filename, "r")      # filename is assumed to be defined earlier
    fulltextlist = fd.readlines()
    fd.close()
    # convert each line to Unicode
    for i in range(len(fulltextlist)):
        fulltextlist[i] = unicode(fulltextlist[i], textencoding)
    # to print the text afterwards, encode it back with textencoding

To sum up the use of encode and decode: convert the string to be processed into Unicode with the unicode function (or the decode method) using the correct encoding, operate on Unicode strings uniformly inside the program, and at the end encode the Unicode back to the required target encoding.
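The three rules above condense into a short Python 2 sketch (an illustration of mine, not code from the article; the file names and encodings are assumptions):

    import codecs

    def process(line):
        # all internal work happens on unicode objects
        return line.upper()

    src = codecs.open("input.txt", "r", "gb18030")    # decode early
    lines = [process(line) for line in src]           # unicode everywhere
    src.close()

    dst = codecs.open("output.txt", "w", "utf-8")     # encode late
    dst.writelines(lines)                             # encoded on write
    dst.close()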
Two points deserve attention:

* The "correct" encoding means that the specified encoding must agree with the actual encoding of the byte string itself, which is in fact not always easy to determine. In general, the simplified Chinese text we type directly is in one of two encodings: GB (gb2312/GBK/gb18030) or UTF-8.
* When encoding to a target encoding, you must make sure the target encoding can represent every character being converted. An encode operation is generally performed through a conversion table between Unicode and the local encoding, and each local encoding maps to only a part of Unicode, with different encodings covering different regions. For example, the range of Unicode covered by Big5 differs from that covered by GBK (the two do overlap in part). Therefore, some Unicode characters (for example, ones obtained from gb2312) can be mapped into GBK but may not be mappable into Big5; trying to convert them to Big5 is very likely to fail because no code point can be found. UTF-8, on the other hand, covers exactly the same range as Unicode itself (only the encoded form differs), so in theory any locally encoded character can be converted to UTF-8. Finally, gb2312, GBK, and gb18030 are essentially the same standard, each later one only extending the character set of the one before, while UTF-8 and the GB encodings are not compatible with each other.
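The mapping caveat can be observed directly. The Python 2 sketch below is my illustration, assuming that U+6C49 (a simplified character) is present in GBK but absent from the standard Big5 table:

    u = u'\u6c49'                             # assumed: in GBK, not in Big5
    print repr(u.encode('gbk'))
    try:
        u.encode('big5')
    except UnicodeEncodeError:
        print 'not representable in big5'
    print repr(u.encode('big5', 'replace'))   # standard fallback: '?'
    print repr(u.encode('utf-8'))             # utf-8 can always encode it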

2. The codecs Module in Python: Natural Language Encoding Conversion

Transferred from: http://blog.csdn.net/zhaoweikid/archive/2007/06/07/1642015.aspx

Python supports many languages and can process arbitrary characters. Here I take a closer look at how Python handles different languages. One thing to note is that when Python performs an encoding conversion, it goes through its internal encoding. The conversion process is:

    source encoding -> internal encoding -> destination encoding

Internally Python uses Unicode, but Unicode comes in two build formats: UCS-2, which has 65536 code points, and UCS-4, which has 2147483648 code points. Python supports both formats; the choice is made at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. How can you tell which format a default Python installation uses? One way is to check the value of sys.maxunicode:

    import sys
    print sys.maxunicode

If the output is 65535, the build is UCS-2; if it is 1114111, the build is UCS-4.

We also need to realize that when a string is converted to the internal encoding, it is no longer of type str but of type unicode:

    a = "风卷残云"
    print type(a)
    b = unicode(a, "gb2312")
    print type(b)

Output:

    <type 'str'>
    <type 'unicode'>

At this point b can easily be converted to another encoding, for example UTF-8:

    c = b.encode("utf-8")
    print c

The output of c looks garbled, and that is correct: it is a UTF-8 byte string printed on a GB console.

Now let's talk about the codecs module, which is closely related to the concepts above. codecs is used for encoding conversion; its interface can also be extended to other kinds of transformations, which are not covered here.

    # -*- encoding: gb2312 -*-
    import codecs, sys

    print '-' * 60
    # obtain a gb2312 codec
    look = codecs.lookup("gb2312")
    # obtain a utf-8 codec
    look2 = codecs.lookup("utf-8")

    a = "我爱北京天安门"
    print len(a), a
    # decode a into internal Unicode; the method is named decode because,
    # as I understand it, the gb2312 byte string is decoded into Unicode
    b = look.decode(a)
    # the returned b[0] is the data, b[1] is the length consumed;
    # the type is now unicode
    print b[1], b[0], type(b[0])
    # convert the internal Unicode back into a gb2312-encoded byte string;
    # the encode method returns a str
    b2 = look.encode(b[0])
    # notice the difference: the reported length changes from 14 to 7!
    # the returned count is the number of characters, not bytes
    print b2[1], b2[0], type(b2[0])
    # although 7 was returned above, len(b2[0]) is still 14;
    # codecs merely reports the character count
    print len(b2[0])

The code above shows the most common usage of codecs. Another question remains: what if the files we process are stored in some other encoding? Reading them also requires special handling, and codecs provides a method for that too:

    # -*- encoding: gb2312 -*-
    import codecs, sys

    # the open method provided by codecs lets you specify the language
    # encoding of the file being opened; content is automatically
    # converted to internal Unicode during reads
    bfile = codecs.open("dddd.txt", 'r', "big5")
    # bfile = open("dddd.txt", 'r')

    ss = bfile.read()
    bfile.close()
    # print the converted result; if the file had been opened with the
    # built-in open function instead, the output would be garbled
    print ss, type(ss)

To test the Big5 handling above, try to find a Big5-encoded file.
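As a companion to codecs.open, the standard library also offers codecs.EncodedFile, which wraps an already-open byte stream and transcodes between two byte encodings. A small Python 2 sketch of mine (the file name follows the article's example; the output encoding is an assumption):

    import codecs

    raw = open("dddd.txt", "rb")                        # Big5 bytes on disk
    wrapped = codecs.EncodedFile(raw, "utf-8", "big5")  # transcodes on read
    data = wrapped.read()                               # a UTF-8 byte string
    print type(data)                                    # <type 'str'>
    raw.close()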

3. Research on the Chinese Problem in Python

Transferred from: http://blog.chinaunix.net/u2/60332/showart_2109290.html

I studied the Chinese problem in Java in my "Java Chinese problem" series, and by now the Chinese problem no longer trips me up anywhere in the Java world. Recently I became very interested in Python, and who would have guessed that this old friend would turn up again unexpectedly. It seems that in the world of code, the Chinese problem will stay with us for a long time yet. No wonder: we Chinese did not invent the computer. Had we done so, computers all over the world would now have to support GBK, it would not be me writing this article but some blond programmer on the other side of the ocean, and the title would be "Research on the English Problem in Python". Enough daydreaming; let's face the real problem.

Compared with Java, the Chinese problem manifests more fiercely in Python. "Fiercely" does not mean more serious or harder to solve, but rather that Python uses strict as the default error handling for decode and encode, that is, it reports an error outright, while Java handles such errors with replace, which is why Java output is littered with "??". In addition, Python's default encoding is ASCII, while Java's default encoding follows the operating system's. On this point I think Java is more reasonable: it is friendlier to programmers and reduces the frustration of beginners, which helps the language spread. But Python has its own principle: after all, ASCII is the only character set supported on every platform in the world, and since the problem always surfaces sooner or later, facing it early beats escaping from it.

Okay, now let's look at the symptoms of the Chinese problem in Python. Before that, we should understand that Python has two kinds of strings: ordinary strings (each character stored in 8 bits) and Unicode strings (each character stored in one or more bytes), and the two can be converted into each other. As for Unicode itself, Joel Spolsky covers the background in "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and Jason Orendorff's "Unicode for Programmers" gives a more comprehensive description, so I will not repeat any of that here. Consider the following code:

    s = u"中文"
    print s

Running it, Python gives the following error message:

    SyntaxError: Non-ASCII character '\xd6' in file G:\workspace\chinese_problem\src\test.py on line 1,
    but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

It says a non-ASCII character was encountered and points us to PEP 263. PEP 263 (Python Enhancement Proposal 263) makes it clear that Python is aware of the internationalization problem and provides a solution. Following its requirements, we write the following code:

    # -*- coding: gb2312 -*-

    print "------------- code 1 ------------------"
    a = "中文a我爱你"
    print a
    print a.find("我")
    b = a.replace("爱", "喜欢")
    print b
    print "-------------- code 2 ----------------"
    x = "中文a我爱你"
    y = unicode(x, "gb2312")
    print y.encode("gb2312")
    print y.find(u"我")
    z = y.replace(u"爱", u"喜欢")
    print z.encode("gb2312")
    print "--------------- code 3 ----------------"
    print y

The program runs as follows:

    ------------- code 1 ------------------
    中文a我爱你
    5
    中文a我喜欢你
    -------------- code 2 ----------------
    中文a我爱你
    3
    中文a我喜欢你
    --------------- code 3 ----------------
    Traceback (most recent call last):
      File "G:\downloads\eclipse\workspace\p\src\hello.py", line 16, in <module>
        print y
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
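The 'ascii' codec named in the traceback is exactly the default encoding mentioned earlier; you can confirm it on any stock Python 2 installation:

    import sys
    print sys.getdefaultencoding()   # prints: ascii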
By introducing the encoding declaration, we can use Chinese characters normally, and in code 1 and code 2 the console prints the Chinese correctly. Still, the code above clearly raises several questions:

1. Code 1 and code 2 print differently: code 1 prints directly, while code 2 encodes before printing.
2. Code 1 and code 2 search the same string for the same character 我, yet get different results (5 and 3 respectively).
3. Printing the Unicode string y directly in code 3 raises an error (which is why code 2 has to encode first).

Why? We can replay in our heads what happens when we use Python. First, we write the source code in an editor and save it to a file. If the source contains an encoding declaration and the editor supports that syntax, the file is stored on disk in the corresponding encoding. Note that the encoding declaration and the actual encoding of the source file are not necessarily the same: you could declare UTF-8 in the declaration yet save the source file as gb2312. Of course nobody asks for trouble and writes such an error deliberately, and a good IDE keeps the two consistent; but if we write code in Notepad, EditPlus, or similar editors, the problem can slip in accidentally.

Once we have a .py file, we run it, handing the code to the Python parser. When the parser reads the file, it first parses the encoding declaration. If the file is declared as gb2312, it converts the file content from gb2312 to Unicode and then re-encodes those Unicode characters as UTF-8 byte strings. Only after this step does the parser tokenize and parse the UTF-8 byte strings. If the program uses a Unicode string literal, the parser builds a Unicode string from the corresponding UTF-8 bytes; if it uses an ordinary string literal, the parser first converts the UTF-8 bytes back to the declared encoding (gb2312 here) and builds an ordinary string object from the result. In other words, the two kinds of strings are stored differently in memory: the ordinary string keeps its gb2312 bytes, while the Unicode string is stored in Python's internal Unicode form.

Knowing how strings are stored in memory, we next need to know how print works. print is only responsible for handing the relevant bytes in memory to the operating system, so that the corresponding program of the operating system (the cmd window, for example) can display them. There are two cases:

1. If the string is an ordinary string, print simply pushes its bytes to the operating system as they are. This is code 1.
2. If the string is a Unicode string, print encodes it first: we can encode explicitly with the Unicode string's encode method (code 2 in this example); otherwise Python uses the default encoding, namely ASCII (code 3 in this example). ASCII of course cannot encode Chinese characters, so Python reports an error.

This settles questions 1 and 3. As for question 2, it arises because Python has two string types, ordinary strings and Unicode strings, each with its own character-handling methods. For the former, the methods work in bytes, and in gb2312 each Chinese character occupies two bytes, so the result is 5; for the latter, the Unicode string, all characters are treated uniformly as single characters, so 3 is returned.
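Given the print mechanics just described, one common workaround (my sketch, not from the article; the console encoding gb2312 is an assumption, use whatever your terminal actually uses) is to wrap sys.stdout so that Unicode strings are encoded explicitly on the way out:

    import sys, codecs

    # replace stdout with a writer that encodes unicode before output
    sys.stdout = codecs.getwriter("gb2312")(sys.stdout)
    print u'\u4e2d\u6587'            # no UnicodeEncodeError now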
Although the discussion above concerns Chinese characters in console programs, the Chinese-character problems in file reading/writing and in network transmission are similar in principle. The emergence of Unicode solves the internationalization of software to a large extent, and Python provides excellent support for Unicode. So when writing a Python program, use Unicode throughout and save the files in the UTF-8 encoding; the article "How to Use UTF-8 with Python" describes this in detail and is worth consulting. There are still many Chinese-related problems in Python, such as file reading/writing and network data transmission; I hope everyone communicates more and solves these problems together.
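A minimal Python 2 sketch of that closing advice (assumptions: this file is itself saved as UTF-8, and the terminal encoding is discoverable at run time):

    # -*- coding: utf-8 -*-
    import sys

    greeting = u'中文'                                # unicode literal
    out_encoding = sys.stdout.encoding or 'utf-8'     # None when piped
    print greeting.encode(out_encoding)               # encode late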
