Problem background: implement a parallel crawler in Python, with a configurable crawl depth and number of threads.
Idea: write a single-threaded crawler class, Fetcher, and then drive multiple Fetcher instances in parallel with threading.Thread.
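To make that idea concrete, here is a minimal sketch of the structure: a pool of threading.Thread workers pulling URLs from a Queue and handing each one to a Fetcher. The names Fetcher, crawl, num_threads and max_depth are my own illustration, not necessarily what spider.py uses:

import threading
import Queue            # Python 2 standard-library queue
import socket
import urllib2

class Fetcher(object):
    def __init__(self, url, timeout=5):
        self.url = url
        self.timeout = timeout

    def fetch(self):
        # Single-threaded fetch of one page; gzip and charset handling are added later in the post.
        try:
            response = urllib2.urlopen(self.url, timeout=self.timeout)
            return response.read()
        except socket.timeout:
            return None

def crawl(seed_urls, num_threads=4, max_depth=2):
    tasks = Queue.Queue()
    for url in seed_urls:
        tasks.put((url, 0))                  # (url, depth)

    def worker():
        while True:
            try:
                url, depth = tasks.get(timeout=1)
            except Queue.Empty:
                return                       # no more work
            content = Fetcher(url).fetch()
            # ...extract links from content and, if depth < max_depth,
            # tasks.put((link, depth + 1)) for each new link...
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()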
Method: in Fetcher, open the given URL with urllib.urlopen and read the content:
response = urllib.urlopen(self.url)
content = response.read()
But this is problematic. For www.sina.com, for example, the content that is read back is garbled:
>>> content[0:100]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xbdk\x93\x1c\xd7u\xf8\x99\x8c\xd0\x7fH\x14w\xe8*t=2\xeb\xd5\xd5]h\[email protected]\x88\x97\x00\xf0%\x10\xea\xc8\xaa\xca\xeen\xa0\xba\xb2x\x99\x85\x06x\xa8\x1fcj\x1c\xb6ly-\x92\x06\xf5%\xca"e\xf1!R\x94\xa8\x87c3\x9e\xf1\xd8#\x87\xbd;\x8e\xd8\x99\x8d\xb1\x1d\xf2'
Using Python's third-party library chardet, we can detect the character set of the content with chardet.detect(content):
>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}
OK, the problem is solved:
>>> import urllib
>>> import chardet
>>> url = 'http://www.sina.com'
>>> response = urllib.urlopen(url)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}
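With the encoding detected, the bytes can be decoded to a unicode string, for example (continuing the session above; the 'ignore' error handler is my own defensive choice, not from the original post):

>>> encoding = chardet.detect(content)['encoding']
>>> text = content.decode(encoding, 'ignore')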
But for an efficient crawler we also need to set a timeout on urlopen. urllib does not support this, while urllib2 does:
response = urllib2.urlopen(self.url, timeout=self.timeout)
But now chardet reports a different character set than before:
>>> import urllib2
>>> import chardet
>>> url = 'http://www.sina.com'
>>> response = urllib2.urlopen(url, timeout=1)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.0, 'encoding': None}
What is going on here? It turns out to be an encoding issue with this page: the response is gzip-compressed. See <python urllib2 returns garbage - Stack Overflow>.
In fact, for every page we should check whether the 'Content-Encoding' response header is 'gzip'.
urllib apparently decompresses gzip pages automatically, while urllib2 does not. So for this page, decompress first and then read:
# Inside Fetcher's fetch method; requires: import urllib2, gzip, StringIO, socket
try:
    response = urllib2.urlopen(self.url, timeout=self.timeout)
    if response.info().get('content-encoding', "") == 'gzip':  # e.g. www.sina.com.cn
        buf = StringIO.StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        content = f.read()
    else:
        content = response.read()
    content = self.enc_dec(content)
    return content
except socket.timeout:
    log.warn("Timeout in fetching %s" % self.url)
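The enc_dec call above is not defined in this excerpt; presumably it wraps the chardet-based detection shown earlier. A plausible sketch under that assumption (the method body and the fallback to raw bytes are my guess, not the author's code from spider.py):

def enc_dec(self, content):
    # Hypothetical helper: detect the charset with chardet and decode to unicode,
    # returning the raw bytes unchanged if detection or decoding fails.
    guess = chardet.detect(content)
    encoding = guess.get('encoding')
    if encoding:
        try:
            return content.decode(encoding, 'ignore')
        except LookupError:  # chardet named a codec Python does not know
            pass
    return content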
At this point, do you all still think the title was just clickbait...?
*******************************************************************************
So, I am sharing the whole spider program for reference.
It supports multi-threaded crawling; the main file is spider.py, and testspider.py contains the unit tests (coverage not guaranteed).
Program Address: http://download.csdn.net/detail/abcjennifer/9086751
Copyright notice: this is an original post by the author; please do not reproduce it without permission.