How to solve character set conversion problems when Python captures webpages

Source: Internet
Author: User

Question:

Sometimes we capture webpages and, after processing, save the strings to files or write them into a database. At that point we need to specify the encoding of the strings. If the webpage is encoded in gb2312 and our database uses UTF-8, inserting the data directly without any processing may produce garbled characters (I have not tested this, so I do not know whether the database transcodes the data automatically). In that case we need to convert gb2312 to UTF-8 manually.

First of all, we know that in Python 2, which this article uses, a plain string is a sequence of bytes and the default encoding is ASCII. That is no problem for English, but as soon as Chinese is involved it falls over.
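A minimal sketch of why ASCII is enough for English but not for Chinese. The sample text and the variable names are assumptions made for illustration, and the snippet is written so that it runs on both Python 2.7 and Python 3.

```python
# Bytes for two Chinese characters, as a gb2312 page would deliver them.
raw = u"\u4e2d\u6587".encode("gb2312")

try:
    raw.decode("ascii")        # roughly what an implicit ASCII conversion does
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False           # bytes above 0x7F are not valid ASCII

english = b"hello".decode("ascii")   # plain English bytes decode without trouble

print(ascii_ok)
print(english)
```

The Chinese bytes are rejected because every byte is outside the ASCII range, while the English bytes pass through unchanged.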

You may remember that when printing Chinese characters in Python, you need to add u before the string:

print u"中文"

Only then are the Chinese characters displayed correctly. The u prefix makes the string that follows a unicode string, so that Chinese characters can be displayed correctly.
There is also a related unicode() function. Its usage is as follows:

str = "la"
str = unicode(str, "UTF-8")
print str

The difference from the u prefix is that here str is converted to a unicode string by calling unicode(), and you need to specify the second parameter correctly. Here UTF-8 is the file character set of my test .py script; on your system the default may be ANSI.
Unicode is the key.
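To illustrate why the second parameter matters, here is a small sketch. It uses the method form raw.decode(encoding), which in Python 2 does the same job as unicode(raw, encoding) and also runs on Python 3. The sample text and variable names are assumptions made for illustration.

```python
text = u"\u4e2d\u6587"           # two Chinese characters, written as escapes
raw = text.encode("utf-8")       # the bytes a UTF-8 encoded script would hold

right = raw.decode("utf-8")      # correct encoding: the text is recovered
wrong = raw.decode("latin-1")    # wrong encoding: no error, but garbled text

print(right == text)
print(wrong == text)
```

A wrong encoding does not always raise an error. With latin-1 every byte maps to some character, so the result is silently garbled, which is why the second parameter has to match the real encoding of the bytes.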

Now let's capture the Baidu homepage. If you visit the Baidu homepage and view the source code of the webpage, you will see that its charset is gb2312.

import urllib2

def main():
    f = urllib2.urlopen("http://www.baidu.com")
    str = f.read()
    str = unicode(str, "gb2312")
    fp = open("baidu.html", "w")
    fp.write(str.encode("utf-8"))
    fp.close()

if __name__ == '__main__':
    main()

Explanation:
We first use the urllib2.urlopen() method to capture the Baidu homepage. f is the handle, and str = f.read() reads all of the source code into str.

Clearly, str now contains the HTML source code we captured. Because the character set of the webpage is gb2312, if we save it directly to a file, the file encoding will be ANSI.

For most people, this is enough, but sometimes I want to convert gb2312 to UTF-8. What should I do?

First:
str = unicode(str, "gb2312")  # gb2312 is the actual character set of str; this converts it to unicode

Then:
str = str.encode("UTF-8")  # re-encode the unicode string as UTF-8

Finally:

Write str to the file. Open the file and check its encoding properties: the encoding is now UTF-8. Change <meta charset="gb2312" to <meta charset="UTF-8" and it is a UTF-8 webpage. With that, we have completed a gb2312 -> UTF-8 transcoding.
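The three steps above, including the change to the charset declaration, could be sketched as one function. This is only an illustration under stated assumptions: to_utf8_page is a hypothetical helper, the sample page is made up, and the regular expression handles only simple declarations like the one above. It uses raw.decode() in place of unicode() so that it runs on both Python 2.7 and Python 3.

```python
import re

# Matches 'charset=' plus an optional opening quote, then the charset name.
_CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?\s*)[\w-]+', re.IGNORECASE)

def to_utf8_page(raw, source_encoding="gb2312"):
    """Hypothetical helper: transcode a page and update its charset tag."""
    text = raw.decode(source_encoding)      # first: bytes -> unicode string
    # Rewrite only the first charset declaration found in the page.
    text = _CHARSET_RE.sub(lambda m: m.group(1) + "utf-8", text, count=1)
    return text.encode("utf-8")             # then: unicode string -> UTF-8

# Stand-in for a downloaded gb2312 page.
sample = u'<meta charset="gb2312"><p>\u4e2d\u6587</p>'.encode("gb2312")
converted = to_utf8_page(sample)

expected = u'<meta charset="utf-8"><p>\u4e2d\u6587</p>'
print(converted.decode("utf-8") == expected)
```

The substitution is done on the unicode string, between decoding and encoding, so it does not depend on either byte encoding.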


Summary:

To save a string according to the specified character set, perform the following steps:

1: Use unicode(str, "original encoding") to decode str into a unicode string.

2: Use str.encode("specified character set") to convert the unicode string str to the character set you specified.

3: Save str to a file or write it to the database. Of course, you have already specified the encoding, haven't you?
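As a rough end-to-end sketch of these three steps with a file as the destination: save_transcoded is a hypothetical helper, the sample bytes stand in for a downloaded page, and the temporary path is chosen only for the demonstration. Opening the file in binary mode ("wb") is a choice made here so that the encoded bytes are written exactly as produced.

```python
import io
import os
import tempfile

def save_transcoded(raw, source_encoding, path, target_encoding="utf-8"):
    """Hypothetical helper following the three steps in the summary."""
    text = raw.decode(source_encoding)      # step 1: bytes -> unicode string
    data = text.encode(target_encoding)     # step 2: unicode -> target bytes
    with open(path, "wb") as fp:            # step 3: write the bytes unchanged
        fp.write(data)

# Stand-in for a downloaded gb2312 page.
sample = u"<p>\u4e2d\u6587</p>".encode("gb2312")
out_path = os.path.join(tempfile.mkdtemp(), "page.html")
save_transcoded(sample, "gb2312", out_path)

# Reading the file back as UTF-8 should give the original text.
with io.open(out_path, encoding="utf-8") as fp:
    round_trip = fp.read()

print(round_trip == u"<p>\u4e2d\u6587</p>")
```

Reading the file back with the target encoding is a quick way to confirm that the saved bytes really are in the character set you specified.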
