[Python] Chinese Encoding Problems: Converting between str, unicode, and UTF-8 in raw_input, File Reading, and Variable Comparison

Source: Internet
Author: User

Recently, while working on search engines, knowledge graphs, and Python crawlers, I kept running into garbled Chinese text. There are countless articles about Chinese encoding, and I have previously written about garbled Chinese characters in PHP on database servers, but I still want to take a few notes here for future reference and study.
The core of Chinese encoding is keeping every encoding consistent, including those of the compiler, the database, and the browser. In Python 2, unicode serves as the intermediate form. First convert the string to be processed into a unicode object by calling the unicode function with the correct encoding, then use unicode strings for all operations inside the program. Finally, call the encode method on output to convert unicode to the required encoding, and make sure the editor and the server agree on that encoding.
PS: None of this applies to Python 3, of course! The article is long-winded, since it is a set of online notes and personal experience; I hope you will bear with it ~
Before explaining the concepts in detail, let me describe two character-encoding problems I ran into recently and how I solved them. They are among the most common encoding problems. For reference, this is the built-in help for the unicode type:

    >>> help(unicode)
    Help on class unicode in module __builtin__:

    class unicode(basestring)
     |  unicode(object='') -> unicode object
     |  unicode(string[, encoding[, errors]]) -> unicode object
     |
     |  Create a new Unicode object from the given encoded string.
     |  encoding defaults to the current default string encoding.
     |  errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.

For example, you need to determine whether the search term key is in the title.

    # coding=utf-8
    import sys

    def getTitle(key, url):
        # title = driver.find_element_by_xpath()
        title = u'famous female broadcaster Miss and livestream LOL'
        print key, type(key)
        print title, type(title)
        if key in title:
            print 'yes'
        else:
            print 'no'

    key = raw_input("Please input a key:")
    print key, type(key)
    url = 'http://www.baidu.com/'
    getTitle(key, url)

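The output shows that key is a str (the raw bytes typed at the terminal) while title is a unicode object, so for Chinese input the comparison does not work. I first experimented with transcoding by hand, for example:

    s = 'broadcaster'.decode('utf-8').encode('gb18030')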

In the end I found the solution on Stack Overflow. On the one hand this shows that I had not studied the topic deeply enough; on the other, that forum really is powerful. See:
Python raw-input odd behavior with accents containing strings
The fix is to decode the terminal input into unicode using the terminal's own encoding:
    key = raw_input("Please input a key:").decode(sys.stdin.encoding)
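
As a rough illustration of why decoding with the terminal encoding helps, the sketch below simulates the bytes a GBK console would deliver and decodes them before comparing. It is written for Python 3, where bytes plays the role of Python 2's str. The helper decode_terminal_input, the UTF-8 fallback, and the sample strings are assumptions made for this example, not part of the original script.

```python
def decode_terminal_input(raw_bytes, encoding):
    """Turn bytes read from a terminal into text.

    `encoding` stands for what sys.stdin.encoding reports. It can be None
    when input is piped, so this sketch falls back to UTF-8 (an assumption).
    """
    return raw_bytes.decode(encoding or "utf-8")


title = u"\u4e2d\u6587\u7f16\u7801"        # the text "中文编码"
raw_key = u"\u4e2d\u6587".encode("gbk")    # bytes a GBK console would send for "中文"

key = decode_terminal_input(raw_key, "gbk")
found = key in title                        # text is compared with text
print(found)
```

Once both sides are text, the membership test behaves as expected regardless of which encoding the terminal happened to use.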

    # coding=utf-8
    import sys
    import os
    import urllib
    import time
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import selenium.webdriver.support.ui as ui
    from selenium.webdriver.common.action_chains import ActionChains

    # driver = webdriver.PhantomJS(executable_path="G:\phantomjs-1.9.1-windows\phantomjs.exe")
    driver = webdriver.Firefox()
    wait = ui.WebDriverWait(driver, 10)

    def getTitle(line, info):
        # line arrives as unicode; search Baidu Baike for it
        print 'fun: ' + line, type(line)
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchform']/input")
        elem_inp.send_keys(line)
        elem_inp.send_keys(Keys.RETURN)
        # .text returns unicode
        elem_value = driver.find_element_by_xpath("//div[@class='lemma-summary']/div[1]").text
        print 'summary ', type(elem_value)
        print elem_value, '\n'
        # encode unicode to UTF-8 bytes before writing to the file
        info.write(line.encode('utf-8') + '\n' + elem_value.encode('utf-8') + '\n')
        time.sleep(5)

    def main():
        source = open("E:\\Baidu.txt", 'r')
        info = open("E:\\BaiduSpider.txt", 'w')
        for line in source:
            line = line.rstrip('\n')
            print 'main: ' + line, type(line)
            # decode the UTF-8 bytes read from the file into unicode
            line = unicode(line, "utf-8")
            getTitle(line, info)
        else:
            info.close()

    main()

TXT files are generally saved in ANSI encoding by default (on a Chinese Windows system this means GBK). The code works as follows:
1. Save Baidu.txt as UTF-8, and convert each str line read from it to unicode with unicode(line, 'utf-8');
2. Selenium searches Baidu Baike for each keyword (for example "Forbidden City") and grabs the first paragraph of the summary through find_element_by_xpath. The text it returns is unicode;
3. The final file write converts unicode to UTF-8 with line.encode('utf-8'). Otherwise Python falls back to the 'ascii' codec and raises a UnicodeEncodeError (or a UnicodeDecodeError when str and unicode are mixed).
In short, the pipeline is: source encoding -> unicode -> processing -> UTF-8 or GBK
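
The three steps can be sketched without Selenium as follows. This is only an illustration in Python 3 syntax (bytes stands in for Python 2's str); the temporary file names and the "summary placeholder" string are hypothetical stand-ins for Baidu.txt, BaiduSpider.txt, and the scraped summary.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "keywords.txt")   # hypothetical input file
result_path = os.path.join(workdir, "result.txt")     # hypothetical output file

# Prepare a UTF-8 input file holding the keyword "故宫" (Forbidden City).
with open(source_path, "wb") as f:
    f.write(u"\u6545\u5bab\n".encode("utf-8"))

with open(source_path, "rb") as source, open(result_path, "wb") as info:
    for raw_line in source:
        # Step 1: bytes from the file -> text
        line = raw_line.rstrip(b"\r\n").decode("utf-8")
        # Step 2: all processing happens on text (placeholder for the scrape)
        summary = line + u" (summary placeholder)"
        # Step 3: text -> UTF-8 bytes on the way out
        info.write((line + u"\n" + summary + u"\n").encode("utf-8"))

with open(result_path, "rb") as f:
    written = f.read().decode("utf-8")
print(len(written.splitlines()))
```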

    import codecs
    # Use the open method provided by codecs to specify the encoding of the opened file.
    # Content is converted to and from unicode automatically.
    info = codecs.open(baiduFile, 'w', 'utf-8')
    # This is not io.open, so no newline translation happens; write '\r\n' explicitly.
    info.writelines(key.text + ":" + elem_dic[key].text + '\r\n')

 

III. Unicode Explained

PS: This part draws mainly on "Core Python Programming (Second Edition)" by Wesley J. Chun.
What is Unicode
A Unicode string is declared with the prefix u, which turns a standard string, or a string containing Unicode characters, into a full Unicode string object. Unicode string support was introduced in Python 1.6 and is used to convert between the many double-byte character formats and encodings.
Unicode is the secret weapon that lets computers support the many languages on the planet. Before Unicode there was only ASCII: each English character is stored as a 7-bit number, with printable characters in the range 32 to 126. When you type A in a file, the computer writes the ASCII code of A, 65, to disk; when it reads the file back, it converts 65 to the character A and displays it on screen.
The disadvantage is obvious: ASCII is far too small for the many thousands of characters in use. Unicode represents each character with one or more bytes and can therefore cover more than 90,000 characters.
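
To make the difference concrete, here is a small sketch (Python 3 syntax) that looks up the code of A and then encodes the single character "中" in several encodings. The choice of encodings is only for illustration.

```python
# "A" is ASCII code 65 and fits in a single byte.
code_a = ord(u"A")
ascii_bytes = u"A".encode("ascii")
print(code_a, len(ascii_bytes))

# "中" is code point U+4E2D; how many bytes it needs depends on the encoding.
zhong = u"\u4e2d"
sizes = {}
for name in ("utf-8", "gbk", "utf-16-le"):
    sizes[name] = len(zhong.encode(name))
print(hex(ord(zhong)), sizes["utf-8"], sizes["gbk"], sizes["utf-16-le"])
```

The code point is the same in every case; only the byte representation changes, which is why the encoding has to be known whenever bytes are read or written.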

    >>> s1 = "中文"
    >>> s1
    '\xd6\xd0\xce\xc4'
    >>> print s1, type(s1)
    中文 <type 'str'>
    >>> s2 = u"中文"
    >>> s2
    u'\xd6\xd0\xce\xc4'
    >>> print s2, type(s2)
    ÖÐÎÄ <type 'unicode'>
    >>>

The u prefix only declares the literal as a Unicode string. In this console session the underlying GBK bytes were not actually decoded, so each byte became a separate character.
Encoding and Decoding
Unicode supports multiple encoding formats, which puts an extra burden on programmers. Whenever you write a string to a file, you must specify an encoding (the encoding parameter), which the encode() function uses to convert the Unicode content to that format. Correspondingly, when reading data back from the file, you must decode it to turn it into a Unicode string object.
str1.decode('gb2312') decodes: it converts a gb2312-encoded string into a unicode string.
str2.encode('gb2312') encodes: it converts a unicode string into a gb2312-encoded string.

    >>> s = '中文'
    >>> s
    '\xd6\xd0\xce\xc4'
    >>> print s, type(s)
    中文 <type 'str'>
    >>> s.decode('gb2312')
    u'\u4e2d\u6587'
    >>> print s.decode('gb2312'), type(s.decode('gb2312'))
    中文 <type 'unicode'>
    >>> len(s)
    4
    >>> len(s.decode('gb2312'))
    2
    >>> t = u'中文'
    >>> t
    u'\xd6\xd0\xce\xc4'
    >>> len(t)
    4
    >>> print t, type(t)
    ÖÐÎÄ <type 'unicode'>
    >>>

 

The u prefix indicates that the string is a Unicode string; it is only a declaration and does not by itself decode the bytes correctly.
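
The decode and encode calls also accept the errors argument listed in the help text for unicode ('strict', 'replace', or 'ignore'). The sketch below, in Python 3 syntax, decodes the same four GBK bytes with the right and the wrong codec; the variable names are made up for this example.

```python
gbk_bytes = u"\u4e2d\u6587".encode("gb2312")   # b'\xd6\xd0\xce\xc4', the text "中文"

# Correct codec: 4 bytes become 2 characters.
text = gbk_bytes.decode("gb2312")
print(len(gbk_bytes), len(text))

# Wrong codec with the default errors='strict': decoding fails loudly.
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)

# 'replace' substitutes U+FFFD for undecodable bytes, 'ignore' drops them.
replaced = gbk_bytes.decode("utf-8", "replace")
ignored = gbk_bytes.decode("utf-8", "ignore")
print(u"\ufffd" in replaced, len(ignored))
```

'replace' and 'ignore' hide the symptom rather than fix it, so they are best kept for data whose encoding really cannot be determined.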
Unicode Application
1. Always prefix string literals in the program with u.
2. Do not use the str() function; use unicode() instead.
3. Do not use the obsolete string module. It messes everything up as soon as it meets a non-ASCII character.
4. Do not encode Unicode strings inside the program unless necessary. Call encode() only when writing data to a file, database, or network, and call decode() only when reading that data back.
5. Because the text-based pickle format only supports ASCII strings, avoid text-based pickle operations where possible.
6. A Web application that reads and writes Unicode data through a database needs Unicode support in each of the following:
· Database server (MySQL, PostgreSQL, SQL Server, etc.)
· Database adapter (MySQLdb, etc.)
· Web development framework (mod_python, cgi, Zope, Django, etc.)
On the database side, make sure every table is encoded in UTF-8. If the adapter does not return Unicode by default, as with MySQLdb, pass the special keyword use_unicode to connect() so that query results come back as Unicode strings. Enabling Unicode support in mod_python only requires setting text-encoding to UTF-8 in the request object. The browser encoding needs attention as well.
Conclusion: making an application fully support Unicode and work with other languages is a project in itself. It requires careful consideration and planning. Check all the software and systems involved, including the Python standard library and any third-party extension modules you use. You may even need to assemble an experienced team to take charge of I18N issues.

 

IV. Summary of Common Handling Methods

Source: http://xianglong.me/article/learn-python-1-chinese-encoding/
Based on the two problems I encountered, I summarized the following points. Solutions to common Chinese encoding problems include:

1. Follow PEP 0263 and declare the source encoding
PEP 0263 -- Defining Python Source Code Encodings proposes the most basic solution to Python encoding problems: declare the encoding in the Python source file. The most common form of the declaration is:

    #!/usr/bin/python
    # -*- coding: <encoding name> -*-

With this declaration, Python interprets the characters in the file using <encoding name>, which can be any encoding Python supports; UTF-8 and GBK are the usual choices. Text in Unicode literals is then converted from the declared encoding directly into Unicode.
Note: the coding line only tells Python which encoding to use when reading the file; the editor may still save the .py file in some other encoding of its own. When saving the file, make sure the editor uses the same encoding that the declaration names.
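
As an illustration of how such a declaration is recognised, the sketch below applies the regular expression given in PEP 263 to the first two lines of a source text. The function name declared_encoding is hypothetical, and this is not how the interpreter itself is implemented; it only mirrors the rule that the comment must appear on line 1 or 2.

```python
import re

# Regular expression given in PEP 263 for recognising the declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_text):
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


with_shebang = "#!/usr/bin/python\n# -*- coding: utf-8 -*-\nprint('hi')\n"
print(declared_encoding(with_shebang))
print(declared_encoding("# coding=gbk\n"))
print(declared_encoding("import sys\n"))
```

Note that the pattern allows no space between coding and the : or = sign, so a line such as "# coding = utf-8" would not count as a declaration.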

2. When assigning string variables, add the prefix u: write u'中文' rather than '中文'

    str1 = '中文'
    str2 = u'中文'

In Python 2 there are two ways to write a string literal, and the main difference is the encoding: str1 is a byte string in the encoding declared for the Python file, while str2 is a Unicode string.
If the string you declare contains non-ASCII characters, prefer the str2 form, so that you can operate on the string directly without calling decode first, which avoids exceptions.

3. Reset the default encoding
The root cause of so many encoding problems in Python is that the default encoding of Python 2.x is ASCII (sys.getdefaultencoding() returns 'ascii' by default). You can change the default encoding as follows:

    # Set the default encoding to UTF-8
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Display the current default encoding
    print sys.getdefaultencoding()

This can make some encoding problems go away, but it introduces many others and is not worth the candle. We do not recommend it.
Principle: this is fundamentally a problem of the Python language itself. In Python 2.x, the default str is not really a string in the usual sense but a byte array (or, if you like, a string of pure ASCII characters); it corresponds to the bytes type in Python 3. The true string type is unicode, which corresponds to str in Python 3. In other words, a type that should have served as a byte array was used as the string type. Python 2 has long been criticized for this odd design, but it had to stay that way to remain compatible with existing programs.
With two string types in Python 2, str and unicode need to be converted into each other. The explicit way is the encode and decode methods, whose directions are easily confused. The correct usage is:
str --- decode ---> unicode
unicode --- encode ---> str

4. Ultimate principle: decode early, unicode everywhere, encode late
Decode early: decode as early as possible, converting the content of the file into unicode before doing anything else.
Unicode everywhere: use unicode for all internal processing, such as string concatenation, replacement, and comparison.
Encode late: encode back to the required encoding only at the end, for example when writing the final result to the result file.
Handling Python strings according to this principle solves basically all encoding problems (as long as your code and Python environment are otherwise correct). The solutions to my two problems above follow the same principle, just in a slightly roundabout way.
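
A minimal sketch of the principle, in Python 3 syntax, might look like the function below. The name highlight, its parameters, and the bracket-marking behaviour are invented for illustration; the point is only where decode and encode sit relative to the processing.

```python
def highlight(raw_title, raw_key, in_encoding="gbk", out_encoding="utf-8"):
    """Mark every occurrence of the key in the title with brackets."""
    # Decode early: bytes become text at the boundary.
    title = raw_title.decode(in_encoding)
    key = raw_key.decode(in_encoding)
    # Unicode everywhere: comparison, replacement, concatenation on text.
    if key in title:
        title = title.replace(key, u"[" + key + u"]")
    # Encode late: text becomes bytes only on the way out.
    return title.encode(out_encoding)


raw_title = u"\u4e2d\u6587\u7f16\u7801".encode("gbk")   # "中文编码" as GBK bytes
raw_key = u"\u7f16\u7801".encode("gbk")                 # "编码" as GBK bytes
result = highlight(raw_title, raw_key)
print(result == u"\u4e2d\u6587[\u7f16\u7801]".encode("utf-8"))
```

Because the body of the function never touches bytes, changing the input or output encoding only affects the two boundary lines.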

5. Use the decode().encode() method
When scraping web pages with a script that declares # coding: utf-8, a page that is encoded in GBK needs to be converted as follows:
    html = html.decode('gbk').encode('utf-8')
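
The same conversion can be wrapped in a small helper, sketched here in Python 3 syntax. The name gbk_to_utf8 and the use of errors='replace' are assumptions for this example: real pages sometimes contain stray bytes, and 'replace' keeps the conversion from aborting on them at the cost of losing those characters.

```python
def gbk_to_utf8(html_bytes):
    """Transcode a GBK-encoded page to UTF-8 bytes."""
    text = html_bytes.decode("gbk", "replace")   # bytes -> text
    return text.encode("utf-8")                   # text -> bytes


page = u"<title>\u4e2d\u6587</title>".encode("gbk")
converted = gbk_to_utf8(page)
print(len(page), len(converted))
```

The byte length changes because each of the two Chinese characters takes 2 bytes in GBK and 3 bytes in UTF-8.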

6. Chinese encoding of raw_input values
Decode the str read from the terminal into unicode, then use unicode for all processing:
    key = raw_input("Please input a key:").decode(sys.stdin.encoding)

7. File read/write operations
Because TXT files are ANSI-encoded by default, decode to unicode when reading, following the order "source encoding -> unicode -> processing -> UTF-8 or GBK", and call encode('utf-8') on output so that the resulting TXT file is UTF-8. The ultimate version of the code:
    info = codecs.open(baiduFile, 'w', 'utf-8')
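
A short round trip with codecs.open, sketched in Python 3 syntax, shows what the call does on both the write and the read side. The temporary path and the sample text are hypothetical.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "BaiduSpider.txt")   # hypothetical location

# Writing: text goes in, codecs encodes it to UTF-8 bytes on the way out.
info = codecs.open(path, "w", "utf-8")
info.write(u"\u6545\u5bab:" + u"summary text" + u"\r\n")
info.close()

# Reading: bytes are decoded back to text automatically.
reader = codecs.open(path, "r", "utf-8")
content = reader.read()
reader.close()

# Inspect the raw bytes that actually ended up on disk.
with open(path, "rb") as f:
    raw = f.read()
print(len(content), len(raw))
```

codecs.open works on the underlying file in binary mode, so the '\r\n' written here is stored and read back unchanged.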

8. Upgrade Python 2.x to 3.x
The last method: move from Python 2.x to Python 3.x. This is aimed squarely at the encoding design of Python 2.x. Upgrading to Python 3.x will certainly resolve most exceptions caused by encoding, since Python 3.x greatly improves string handling.
From Python 3.0 onward, all strings are sequences of Unicode characters, and the following improvements were also made:
· The default encoding is UTF-8 (sys.getdefaultencoding() returns 'utf-8').
· All Python built-in modules support unicode.
· The u'中文' prefix is no longer needed (it was removed in Python 3.0 and accepted again from 3.3 purely for compatibility).
Therefore, in Python 3.x encoding is no longer a big problem, and the exceptions described above are rarely encountered.
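
For comparison, this small Python 3 sketch shows the two types side by side. The variable names are made up for the example.

```python
text = "\u4e2d\u6587"            # str is Unicode text in Python 3: "中文"
data = text.encode("utf-8")      # bytes, the counterpart of Python 2's str

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# The u prefix is accepted again since Python 3.3 and changes nothing.
same_literal = (u"\u4e2d\u6587" == text)
print(same_literal)

# Text and bytes never mix implicitly; Python 3 raises instead of guessing.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

The TypeError at the end is the key change: Python 2 would try an implicit ASCII conversion here, which is where many of the errors in this article come from.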

Summary

Finally, I hope this article helps you, especially if you have run into these problems yourself. It is somewhat disorganized because it grew out of recent work, but if you need exactly this, it should solve your problem.
Gibran once said: "You cannot have youth and the knowledge of it at the same time; for youth is too busy living to know, and knowledge is too busy seeking itself to live."
I am looking for a job right now, and I cannot build solid fundamentals and deep project understanding at the same time. Still, I prefer to share knowledge, because that is seeking myself, and that is enjoying life and the joy of programming ~

(By: Eastmount http://blog.csdn.net/eastmount)
