Workaround One:
I was using Python's BeautifulSoup to crawl a page and print its title, but the output was always garbled. It took me a long time to find a fix, so I am sharing the solutions below.
First, the code.
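from bs4 import BeautifulSoup
import urllib2

url = 'http://www.jb51.net/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, from_encoding="utf8")
# Encoding that BeautifulSoup ended up using to decode the page
print soup.original_encoding
# Encode explicitly for a GBK console before printing
print soup.title.encode('GB18030')
# Write the title to a file (str() gives an encoded byte string in Python 2)
title_file = open("title.txt", "w")
title_file.write(str(soup.title))
title_file.close()
# Print every link target; href=True skips anchors without an href attribute
for link in soup.find_all('a', href=True):
    print link['href']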
In my first tests I found that, although the printed output was garbled, the text written to the file was fine. So I went looking for a solution online.
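How print handles an object: internally it calls the object's __str__ method to get the corresponding string, which here means the soup's __str__. As far as I can tell, under Python 2 that method returns UTF-8 bytes by default, which would explain why the file looked fine while the console did not. The soup itself already holds Unicode, so you can choose the output encoding yourself and encode it as GBK, which makes the Chinese text display correctly instead of as garbage.
As for cmd: on a Chinese-language system the console is encoded as GBK, so encoding the output as GB18030 (a superset of GBK) is enough for it to display normally.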
This is the line that does it:
print soup.title.encode('GB18030')
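To see why the console garbles UTF-8 output while GB18030 output looks fine, here is a minimal sketch that needs no network access and no BeautifulSoup. It is written for Python 3 (unlike the Python 2 listings in this article), and the sample title string is a made-up stand-in for whatever the real page returns. It simulates a GBK console by decoding the bytes as GBK.

```python
# Hypothetical sample title; the real page title may differ.
title = "脚本之家"

# Case 1: UTF-8 bytes are handed to a console that interprets them as GBK.
utf8_bytes = title.encode("utf-8")
garbled = utf8_bytes.decode("gbk", errors="replace")

# Case 2: the text is encoded as GB18030 first, as in Workaround One.
gb_bytes = title.encode("gb18030")
shown = gb_bytes.decode("gbk")

print(garbled == title)  # False: the console shows the wrong characters
print(shown == title)    # True: the console shows the original text
```

For these common characters GB18030 and GBK produce the same bytes, which is why a GBK console can display GB18030-encoded output.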
Workaround Two:
When BeautifulSoup parses a UTF-8-encoded web page, Chinese text comes out garbled if you do not specify fromEncoding, or if you specify fromEncoding as utf-8.
The workaround is to pass GB18030 as the value of the fromEncoding parameter to the BeautifulSoup constructor (this example uses the older BeautifulSoup 3 package, where the parameter is spelled fromEncoding rather than from_encoding):
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.jb51.net/')
# Tell the parser to decode the page as GB18030
soup = BeautifulSoup(page, fromEncoding="GB18030")
print soup.originalEncoding
print soup.prettify()
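One plausible reason this workaround helps is that the bytes the server sends are really in a GB-family encoding, so decoding them as UTF-8 fails. That is an assumption on my part, not something verified against the site. The sketch below (Python 3, standard library only, with a made-up sample string standing in for the page content) shows the effect of choosing the wrong and the right decoder for such bytes.

```python
# Hypothetical page content, encoded the way an older Chinese site might send it.
title = "脚本之家"
page_bytes = title.encode("gb2312")

# Decoding with the wrong codec: invalid sequences become U+FFFD markers.
wrong = page_bytes.decode("utf-8", errors="replace")

# Decoding as GB18030, which also covers GB2312 and GBK, recovers the text.
right = page_bytes.decode("gb18030")

print("\ufffd" in wrong)  # True: replacement characters appear
print(right == title)     # True: the original text is recovered
```

Because GB18030 covers both GB2312 and GBK, naming it is a safe choice when you know a page uses one of the GB encodings but not which one.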