How to fix garbled Chinese text when reading and writing CSV files in Python

Last Update:2018-02-08 Source: Internet

Author: User

https://www.cnblogs.com/shengulong/p/7097869.html

CSV stands for comma-separated values. As the name implies, a CSV file consists of columns of data delimited by commas, and it can be opened in Excel or in a text editor. It is an easy-to-edit, easy-to-read way to store data.

1. Reading, writing, and appending CSV files in Python:

'r': read-only (the default; raises an error if the file does not exist)
'w': write-only (creates the file if it does not exist, and overwrites it if it does)
'a': append to the end of the file (creates the file if it does not exist)
'r+': read and write (raises an error if the file does not exist)

import csv, os

if os.path.isfile('Test.csv'):
    with open("Test.csv", "r") as csvfile:
        reader = csv.reader(csvfile)
        # readlines() is not needed here
        for line in reader:
            print(line)

import csv

# In Python 2, open can be replaced with file.
# The file is created if it does not exist.
with open("Test.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    # Write the column names first
    writer.writerow(["Index", "A_name", "B_name"])
    # Use writerows to write several rows at once
    writer.writerows([[0, 1, 3], [1, 2, 3], [2, 3, 4]])

import csv

# In Python 2, open can be replaced with file.
# The file is created if it does not exist.
with open("Test.csv", "a") as csvfile:
    writer = csv.writer(csvfile)
    # Write the column names first
    writer.writerow(["Index", "A_name", "B_name"])
    # Use writerows to write several rows at once
    writer.writerows([[0, 1, 3], [1, 2, 3], [2, 3, 4]])
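The three modes above can be combined into one round trip. The following is a minimal sketch for Python 3, not the original author's code: it assumes a temporary file path (csv_path is a name made up for this example) and passes encoding and newline explicitly, which the Python 3 csv documentation recommends.

```python
import csv
import os
import tempfile

# Hypothetical location, used only for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "Test.csv")

# "w" creates (or truncates) the file. newline="" lets the csv module
# control line endings; encoding makes the byte format explicit.
with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Index", "A_name", "B_name"])
    writer.writerows([[0, "中", "文"], [1, 2, 3]])

# "a" appends to the end of the existing file.
with open(csv_path, "a", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerow([2, 3, 4])

# "r" reads the file back; csv.reader yields each row as a list of strings.
with open(csv_path, "r", newline="", encoding="utf-8") as csvfile:
    rows = list(csv.reader(csvfile))
```

Note that numbers come back as strings, because CSV stores no type information.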

2. When Excel opens a CSV file it recognizes the GB2312 encoding, but it does not reliably recognize UTF-8 (unless the file starts with a byte order mark), whereas strings in the database are encoded as UTF-8. Therefore, in Python 2:

When loading data from a CSV file into the database, first decode the GB2312 bytes to Unicode, then encode the Unicode text as UTF-8: data.decode('gb2312').encode('utf-8')

When exporting data from the database to a CSV file, first decode the UTF-8 bytes to Unicode, then encode the Unicode text as GB2312: data.decode('utf-8').encode('gb2312')
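The two conversions can be sketched in Python 3, where the same decode and encode methods exist on bytes and str. The helper name transcode is hypothetical and is used only for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)


# Stand-in for bytes read from a GB2312-encoded CSV file.
gb_bytes = "中文".encode("gb2312")

# CSV -> database: GB2312 bytes become UTF-8 bytes.
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

# Database -> CSV: UTF-8 bytes become GB2312 bytes again.
round_trip = transcode(utf8_bytes, "utf-8", "gb2312")
```

The intermediate value inside transcode is Unicode text, which is why the conversion always takes two steps.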

3. decode('utf-8') converts UTF-8 encoded bytes to Unicode; encode('utf-8') converts Unicode to UTF-8 encoded bytes.

4. Unicode is only a character set: it assigns each symbol a numeric code, but it does not specify how that code is stored as bytes.
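The difference between a character's Unicode number and its stored bytes can be seen in a short Python 3 sketch. The variable names are made up for this example.

```python
ch = "中"

# The number Unicode assigns to the character (its code point).
code_point = ord(ch)

# The same character stored under three different encodings.
utf8 = ch.encode("utf-8")        # three bytes
gbk = ch.encode("gbk")           # two bytes
utf16 = ch.encode("utf-16-le")   # two bytes, in a different layout from GBK

# Each byte sequence decodes back to the same character,
# provided the matching codec is used.
same = (utf8.decode("utf-8") == gbk.decode("gbk") == utf16.decode("utf-16-le") == ch)
```

One code point, three different byte sequences: garbled text appears when bytes written with one of these encodings are read with another.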

5. You can use Python's encoding conversion module, codecs:

Reading and writing Unicode files in Python 2:

#coding=gbk
import codecs

# Open intimate.txt through the UTF-8 codec so it can be read and written as Unicode
f = codecs.open('c:/intimate.txt', 'a', 'utf-8')
f.write(u'中文')  # write Unicode directly
s = '中文'
f.write(s.decode('gbk'))  # decode the GBK-encoded s to Unicode, then write it to the file
f.close()

f = codecs.open('c:/intimate.txt', 'r', 'utf-8')
s = f.readlines()
f.close()
for line in s:
    print line.encode('gbk')
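In Python 3 the built-in open accepts an encoding argument, so the same idea can be sketched without codecs. This is an illustrative equivalent rather than the original code, and the temporary path is an assumption made for the example.

```python
import os
import tempfile

# Hypothetical temporary path standing in for c:/intimate.txt.
path = os.path.join(tempfile.mkdtemp(), "intimate.txt")

with open(path, "a", encoding="utf-8") as f:
    f.write("中文\n")                          # str is already Unicode text
    gbk_bytes = "中文".encode("gbk")           # stand-in for bytes from a GBK source
    f.write(gbk_bytes.decode("gbk") + "\n")    # decode to text before writing

# Reading with the same encoding returns Unicode text again.
with open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Both lines come back identical, because each was Unicode text by the time it was written.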

6. Encoding of Python source files

In Python 2, a .py file is assumed to be ASCII-encoded by default. Chinese characters in the file cannot be handled under that assumption, and Python reports an error: SyntaxError: Non-ASCII character. To avoid this, add an encoding declaration on the first or second line of the source file:

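# coding=utf-8   # store Chinese characters in UTF-8
print '中文'

A string literal typed directly, as above, is handled according to the encoding of the source file. To get a Unicode string instead, there are two ways:
1. s1 = u'中文'  # the u prefix stores the text as Unicode
2. s2 = unicode('中文', 'gbk')  # correct only if the source file is saved as GBK; with the coding=utf-8 declaration above, pass 'utf-8'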

unicode is a built-in function; its second parameter gives the encoding of the source string.

decode is a method available on any string. It converts the string to Unicode, and its parameter gives the encoding of the source string.

encode is also a method available on any string. It converts the string to the encoding given by its parameter.

Encoding of Python strings

A literal written as u'汉字' is of type unicode; without the u prefix, the literal is of type str.

The encoding of a str depends on the system environment; it is generally the value returned by sys.getfilesystemencoding().

So to go from unicode to str, use the encode method.

To go from str to unicode, use decode.

For example:

# coding=utf-8   # the default encoding is UTF-8
s = u'中文'  # Unicode text
print s.encode('utf-8')  # convert to UTF-8 for output
print s  # same result as above; it appears to be converted to the specified encoding by default

My summary:

u = u'中文'  # Unicode text
g = u.encode('gbk')  # convert to GBK
print g  # garbled here, because the current environment is UTF-8 and the text is GBK-encoded
s = g.decode('gbk').encode('utf-8')  # decode g as GBK (because it is GBK-encoded) and convert it to UTF-8
print s  # the Chinese text displays correctly
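The same effect can be reproduced in Python 3 without depending on the terminal. In this sketch, latin-1 is used only as an example of a wrong codec that never raises an error; the variable names are made up.

```python
text = "中文"
gbk_bytes = text.encode("gbk")

# Decoding GBK bytes with the wrong codec gives garbled text...
wrong = gbk_bytes.decode("latin-1")

# ...or fails outright, as it does here with UTF-8.
try:
    gbk_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the codec that produced the bytes recovers the text,
# which can then be encoded as UTF-8.
fixed = gbk_bytes.decode("gbk").encode("utf-8")
```

The bytes never change in the first two steps; only the interpretation does.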

A safer approach:

s.decode('gbk', 'ignore').encode('utf-8')  # decode as GBK (the input must of course be GBK-encoded), ignoring invalid bytes, then encode as UTF-8

Because the prototype of decode is decode([encoding], [errors='strict']), the second parameter controls the error handling policy. The default, strict, raises an exception when an illegal character is encountered;

If set to ignore, illegal characters are skipped;
If set to replace, illegal characters are replaced with a substitute character;
If set to xmlcharrefreplace, XML character references are used (this handler applies only when encoding).
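The policies can be compared on one deliberately broken input. This is a Python 3 sketch; the test bytes are an assumption chosen for illustration (valid GBK followed by one byte that cannot start a GBK character).

```python
# Valid GBK for "中文" followed by an invalid trailing byte.
bad = "中文".encode("gbk") + b"\xff"

# strict (the default) raises an exception.
try:
    bad.decode("gbk")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

ignored = bad.decode("gbk", "ignore")     # the invalid byte is dropped
replaced = bad.decode("gbk", "replace")   # the invalid byte becomes U+FFFD

# xmlcharrefreplace is used when encoding text the target codec cannot represent.
xml_ref = "中".encode("ascii", "xmlcharrefreplace")
```

ignore and replace both lose information, so they are best kept for data where a few damaged characters are acceptable.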

unicode(str, 'gb2312') is the same as str.decode('gb2312'): both convert a GB2312-encoded str to Unicode.

7. Source file encoding:

Writing # -*- coding: utf-8 -*- at the beginning of a .py file declares that the source file is encoded as UTF-8, and that string literals written in the file are UTF-8 encoded.

8. Get the system default encoding:

import sys
print sys.getdefaultencoding()
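Python 3 exposes several related settings, and they are easy to confuse. The sketch below simply reads three of them; the variable names are made up, and the values other than the first depend on the platform.

```python
import locale
import sys

# Encoding used for implicit str/bytes conversions; 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding used to convert file names to and from bytes.
filesystem_encoding = sys.getfilesystemencoding()

# Locale-based encoding, typically what open() uses in text mode
# when no encoding argument is given.
preferred_encoding = locale.getpreferredencoding(False)
```

Because the last value varies between machines, passing encoding explicitly to open is the more predictable choice.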

9. sys.setdefaultencoding('utf-8') tells the system to decode automatically, that is, to perform the UTF-8 to Unicode conversion implicitly. In Python 2 this function is removed from the sys namespace at startup, so reload(sys) must be called before it can be used.



s = '中文'  # this is a UTF-8 encoded string
s.encode('gb18030')  # convert to GB18030; decoding happens automatically, so there is no need to write s.decode('utf-8').encode('gb18030')

10. Determining the type and encoding of a string:

Method 1:
isinstance(s, str) checks whether s is an ordinary string
isinstance(s, unicode) checks whether s is Unicode
Or:
if type(s).__name__ != "unicode":
    s = unicode(s, "utf-8")
else:
    pass
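Python 3 has no unicode type: text is str and encoded data is bytes, so the same check is written against bytes. The helper name to_text is hypothetical, and UTF-8 as the default encoding is an assumption of this sketch.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


from_utf8 = to_text(b"\xe4\xb8\xad")             # UTF-8 bytes for "中"
from_gbk = to_text("中".encode("gbk"), "gbk")    # GBK bytes need the GBK codec
unchanged = to_text("中")                        # str passes through untouched
```

The check only tells text from bytes; it cannot tell which encoding the bytes use, which is the problem the second method addresses.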
Method 2:
Detecting character encoding with Python's chardet
chardet makes it easy to detect the encoding of a string or a file. This matters especially for Chinese web pages: some use GBK/GB2312 and some use UTF-8, so when crawling pages it is important to know each page's encoding. An HTML page may declare a charset, but the declaration is sometimes wrong, and this is where chardet can help.

chardet example
>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
chardet's detect function detects the encoding of the given data directly. It returns a dictionary with two entries: the confidence of the detection and the detected encoding.

Installing chardet
pip install chardet
