Workaround One:
When I used Python's BeautifulSoup to crawl a page and print its title, the output was always garbled. It took a long time to find a solution, so I am sharing it here.
First, the code:
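from bs4 import BeautifulSoup
import urllib2  # Python 2 standard library

url = 'http://www.jb51.net/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, from_encoding="utf8")
print soup.original_encoding
# Encode explicitly so that a GBK console can display the title
print soup.title.encode('gb18030')
# str() on a tag returns a byte string under Python 2; write it as-is
f = open("title.txt", "w")
f.write(str(soup.title))
f.close()
# Print the target of every link that has an href attribute
for link in soup.find_all('a', href=True):
    print link['href']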
During testing I found that although the printed output was garbled, the text written to the file was fine. After searching online, I found the explanation.
How print handles an object: it calls the object's __str__ method to get the corresponding string, in this case the __str__ of soup. Internally, soup holds its text as Unicode, so the encoding can be chosen at output time; encoding the output as GBK lets the Chinese text display correctly instead of garbled.
The cmd console on a Chinese-language system uses GBK, so re-encoding the output as GB18030 (a superset of GBK) is enough for it to display normally.
This is the key line:
print soup.title.encode('gb18030')
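The console behaviour described above can be reproduced without any network access. The following is a minimal sketch, written in Python 3 syntax and using only the standard library (the article's own code is Python 2); the sample text and the variable names are assumptions made for illustration. It simulates a GBK console by decoding the bytes it would receive as GBK.

```python
# Python 3 sketch, standard library only. The sample text is an assumption.
text = "中文"

utf8_bytes = text.encode("utf-8")    # what a UTF-8 byte string would contain
gb_bytes = text.encode("gb18030")    # what the workaround sends instead

# A GBK console interprets whatever bytes it receives as GBK.
seen_from_utf8 = utf8_bytes.decode("gbk", errors="replace")
seen_from_gb = gb_bytes.decode("gbk")

print(seen_from_utf8 == text)  # False: UTF-8 bytes read as GBK are garbled
print(seen_from_gb == text)    # True: GB18030 bytes display correctly
```

For common Chinese characters such as these, GB18030 and GBK produce identical bytes, which is why a GBK console can show GB18030 output.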
Workaround Two:
When BeautifulSoup parses a UTF-8-encoded web page, Chinese characters come out garbled if fromEncoding is not specified or is set to utf-8. (This workaround uses the older BeautifulSoup 3 package, where the parameter is spelled fromEncoding.)
The workaround is to set the fromEncoding parameter of the BeautifulSoup constructor to GB18030.
import urllib2  # Python 2 standard library
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

page = urllib2.urlopen('http://www.jb51.net/')
# Tell the parser to treat the raw bytes as GB18030
soup = BeautifulSoup(page, fromEncoding="gb18030")
print soup.originalEncoding
print soup.prettify()
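To see why the encoding handed to the parser matters, here is a small sketch in Python 3 syntax using only the standard library. It is not how BeautifulSoup detects encodings internally; decode_with_fallback is a hypothetical helper, and the sample pages are made up for illustration. It tries strict UTF-8 first and falls back to GB18030 only if the bytes are not valid UTF-8.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the bytes without error."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up sample pages, one per encoding.
gb_page = "<title>中文</title>".encode("gb18030")
utf8_page = "<title>中文</title>".encode("utf-8")

print(decode_with_fallback(gb_page))    # decoded via the gb18030 fallback
print(decode_with_fallback(utf8_page))  # decoded as utf-8 on the first try
```

UTF-8 is tried first because it rejects most byte sequences that are not really UTF-8, whereas GB18030 accepts almost anything and would hide a wrong guess.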