Python modules to research and learn for crawling with Python
1. Use the built-in urllib and urllib2 libraries to crawl data.
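As a minimal sketch, assuming Python 2 (where urllib2 lives; Python 3 merged it into urllib.request) and using http://example.com/ as a placeholder URL:

import urllib2

# Fetch a page; the URL and User-Agent here are placeholders
request = urllib2.Request("http://example.com/",
                          headers={"User-Agent": "Mozilla/5.0"})
html = urllib2.urlopen(request).read()
print len(html)  # size of the raw response body in bytes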
2. Use BeautifulSoup for parsing and data cleanup.
http://www.crummy.com/software/BeautifulSoup/
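For instance, a minimal sketch with Beautiful Soup 3 (the version hosted at the URL above); the markup here is invented for illustration:

from BeautifulSoup import BeautifulSoup

html = "<html><body><a href='http://example.com/'>Example</a></body></html>"
soup = BeautifulSoup(html)
# Extract every link's target URL and anchor text
for link in soup.findAll('a'):
    print link['href'], link.string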
Encoding Rules
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode (a short sketch follows the list):
1 An encoding you pass in as the fromEncoding argument to the soup constructor.
2 An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding and that encoding actually worked: then it ignores any encoding it finds in the document.
3 An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
4 An encoding sniffed by the chardet library, if you have it installed.
5 UTF-8
6 Windows-1252
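As a sketch of how this plays out, Beautiful Soup 3 records the encoding it settled on in its originalEncoding attribute (the markup below is invented for illustration):

from BeautifulSoup import BeautifulSoup

# Rule 2 above: the document declares its encoding in an http-equiv META tag
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        '<body>hello</body></html>')
soup = BeautifulSoup(html)
print soup.originalEncoding  # the encoding Beautiful Soup decided on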
You can pass the fromEncoding argument when constructing the soup:
soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
3. Use the Python chardet library to detect character encodings.
http://chardet.feedparser.org/download/
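A minimal sketch: detect the encoding of raw bytes, then decode them (the URL is a placeholder, and this assumes Python 2 to match the urllib2 example above):

import urllib2
import chardet

raw = urllib2.urlopen("http://example.com/").read()
guess = chardet.detect(raw)  # returns a dict with 'encoding' and 'confidence'
print guess                  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess['encoding'])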
4. For heavier-duty crawling, the more powerful Selenium: it drives a real browser, so pages rendered by JavaScript work too.
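A minimal sketch, assuming a local Firefox installation and using http://example.com/ as a placeholder:

from selenium import webdriver

driver = webdriver.Firefox()       # assumes Firefox is installed locally
driver.get("http://example.com/")  # placeholder URL
print driver.page_source[:200]     # the HTML after JavaScript has run
driver.quit()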
Author: Zhang Dapeng