Convert Unicode-encoded TXT files to UTF-8 encoding

Last Update:2018-12-06 Source: Internet

Author: User

The following Python 2 code performs the conversion:

# coding=utf-8
import codecs

def changecode():
    # 111.txt is a Unicode-encoded file; open it as UTF-16 (here "Unicode" means UTF-16).
    # Raw strings keep the backslashes in the Windows paths from being read as escapes.
    tt = codecs.open(r'C:\111.txt', 'rb', 'utf-16')
    mm = open(r'C:\123.txt', 'wb')
    ff = tt.readlines()

    for i in ff:
        print i
        mm.write(i.encode('utf-8'))
    # Append a marker so that checkyes() can confirm the write finished.
    mm.write('123')
    tt.close()
    mm.close()

def checkyes():
    nn = open(r'C:\123.txt', 'rb')
    nnff = nn.readlines()
    nn.close()
    if nnff[-1] == '123':
        print "Finish"

changecode()
checkyes()
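
For readers on Python 3, the same conversion can be sketched with the built-in open function, which accepts an encoding argument. This is a minimal sketch rather than the author's code: the function name convert_utf16_to_utf8 and the temporary file names are placeholders chosen for illustration.

```python
import os
import tempfile


def convert_utf16_to_utf8(src_path, dst_path):
    """Read src_path as UTF-16 text and write it to dst_path as UTF-8."""
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical demo: build a small UTF-16 file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "111.txt")
    dst_file = os.path.join(tmp, "123.txt")
    with open(src_file, "w", encoding="utf-16") as f:
        f.write("first line\nsecond line\n")
    convert_utf16_to_utf8(src_file, dst_file)
    with open(dst_file, "rb") as f:
        print(f.read())
```

The "utf-16" codec consumes the byte order mark when reading, and the "utf-8" codec does not write one, so the output file holds plain UTF-8.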

About codecs (the following explanation is reproduced from http://blog.csdn.net/zhaoweikid/article/details/1642015):
Python supports many languages and can process arbitrary characters. Here, I take a closer look at how Python handles text in different languages.
One thing to note is that when Python performs an encoding conversion, it goes through its internal encoding. The conversion process is as follows:
Original encoding -> internal encoding -> destination encoding
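
The two-step conversion can be sketched in Python 3, where the internal representation is the str type. This is an illustrative sketch only: the helper name transcode and the sample text are assumptions, not part of the original article.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by passing through Python's internal Unicode text."""
    text = data.decode(source_encoding)    # original encoding -> internal Unicode
    return text.encode(target_encoding)    # internal Unicode -> destination encoding


gb_bytes = "中文".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(len(gb_bytes), len(utf8_bytes))      # prints: 4 6
```

The two characters take 2 bytes each in GB2312 and 3 bytes each in UTF-8, which is why the byte counts differ even though the text is the same.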
Python uses Unicode internally, but Unicode has two encoding forms to consider: UCS-2, which has 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both. The form is chosen at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. How can we tell which one a default Python installation uses? One way is to check the value of sys.maxunicode:

import sys
print sys.maxunicode

If the output is 65535, the build uses UCS-2; if the output is 1114111, it uses UCS-4.
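
The same check can be written as a small Python 3 function. Note that since Python 3.3 the compile-time choice no longer exists and sys.maxunicode is always 1114111, so the first branch only matters for older builds. The function name describe_unicode_build and its return strings are assumptions made for this sketch.

```python
import sys


def describe_unicode_build(max_code_point=sys.maxunicode):
    """Map a sys.maxunicode value to the build type described in the text."""
    if max_code_point == 0xFFFF:        # 65535
        return "UCS-2 (narrow build)"
    if max_code_point == 0x10FFFF:      # 1114111
        return "UCS-4 (full Unicode range)"
    return "unknown"


print(sys.maxunicode, describe_unicode_build())
```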

Note that once a string has been converted to the internal encoding, it is no longer of type str. It is of type unicode:

a = "Disaster recovery cloud"
print type(a)
b = unicode(a, "gb2312")
print type(b)

Output:

<type 'str'>
<type 'unicode'>

At this point, b can easily be converted to other encodings, such as UTF-8:

c = b.encode("utf-8")
print c

For non-ASCII text, the printed value of c looks garbled on a terminal that does not use UTF-8. That is expected, because c is a UTF-8 encoded string.
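
The garbling comes from reading UTF-8 bytes with a different codec. The Python 3 sketch below imitates that by decoding the bytes as Latin-1, which is an assumption chosen because Latin-1 maps every byte to some character and therefore never raises an error. The sample text is also illustrative.

```python
text = "中文"
utf8_bytes = text.encode("utf-8")

# Interpreting the UTF-8 bytes with the wrong codec yields unrelated characters.
garbled = utf8_bytes.decode("latin-1")
# Decoding with the codec that produced the bytes restores the text.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled), restored == text)
```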

Now, let's talk about the codecs module, which is closely related to the concepts mentioned above. codecs is used for encoding conversion. Its interface can also be extended to other kinds of codecs, but that is not covered here.

# -*- encoding: gb2312 -*-
import codecs, sys

print '-' * 60
# Look up the gb2312 codec
look = codecs.lookup("gb2312")
# Look up the UTF-8 codec
look2 = codecs.lookup("utf-8")

# Seven Chinese characters ("I love Beijing Tiananmen"), 14 bytes in gb2312
a = "我爱北京天安门"

print len(a),
# Convert a to internal Unicode. The method is named decode because it decodes the gb2312 string into Unicode.
b = look.decode(a)
# b[0] is the data and b[1] is the length consumed. At this point the type is unicode.
print b[1], b[0], type(b[0])
# Convert the internal Unicode back to a gb2312-encoded string. The encode method returns a str.
b2 = look.encode(b[0])
# Notice the difference: the reported length changed from 14 to 7. The length returned now is the number of characters; the earlier one was the number of bytes.
print b2[1], b2[0], type(b2[0])
# Although the count returned above is in characters, len(b2[0]) is not 7. It is still 14. Only the count returned by the codec's encode is in characters.
print len(b2[0])
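
The byte and character counts can be checked in Python 3 as well, where codecs.lookup returns a CodecInfo object with encode and decode functions. This is a sketch under stated assumptions: the seven-character sample string is chosen so that the counts match the 14 and 7 mentioned in the comments above, and the variable names are illustrative.

```python
import codecs

codec = codecs.lookup("gb2312")
text = "我爱北京天安门"                   # 7 characters

# encode returns (bytes, number of characters consumed)
encoded, chars_consumed = codec.encode(text)
# decode returns (str, number of bytes consumed)
decoded, bytes_consumed = codec.decode(encoded)

print(chars_consumed, len(encoded))    # prints: 7 14
print(bytes_consumed, len(decoded))    # prints: 14 7
```

In both directions the second tuple element counts units of the input, not of the output, which explains why the two calls report different numbers for the same text.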

The code above shows the most common way to use codecs. Another question is what to do when the file being processed uses a different character encoding. Reading such a file also needs special handling, and codecs provides a method for that as well.

# -*- encoding: gb2312 -*-
import codecs, sys

# Use the open method provided by codecs to specify the encoding of the file. The content is converted to internal Unicode automatically when it is read.
bfile = codecs.open("dddd.txt", 'r', "big5")
# bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# Print the converted result. If the file is opened with the built-in open function instead, the output will most likely be garbled.
print ss, type(ss)

The example above handles Big5. To try it, find a Big5-encoded file.
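
If no Big5 file is at hand, one can be generated for testing. The Python 3 sketch below writes a small Big5 file into a temporary directory and reads it back. In Python 3 the built-in open function accepts an encoding argument and plays the role that codecs.open plays in the text. The helper name read_text and the sample content are assumptions for illustration.

```python
import os
import tempfile


def read_text(path, encoding):
    """Return the file's contents decoded from the given encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Hypothetical demo: create a small Big5 file, then read it back as text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dddd.txt")
    with open(path, "wb") as f:
        f.write("中文".encode("big5"))
    content = read_text(path, "big5")
    print(type(content), len(content))   # prints: <class 'str'> 2
```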

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
