[Step-by-step] Implementing a parallel crawler in Python
Problem background: given a crawl depth and a number of threads, implement a parallel crawler in Python.
Idea: a single-threaded crawler class, Fetcher.
Multi-threading: use threading.Thread to run Fetcher instances in parallel; see the sketch below.
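As a preview of that structure, a minimal sketch (the Fetcher interface here is an assumption; the real one is in spider.py, shared at the end):

import threading

class Fetcher(object):
    def __init__(self, url, timeout=1):
        self.url = url
        self.timeout = timeout

    def fetch(self):
        pass  # open self.url and read the content -- built up step by step below

threads = []
for url in ['http://www.sina.com', 'http://www.sina.com.cn']:
    f = Fetcher(url)
    t = threading.Thread(target=f.fetch)
    threads.append(t)
    t.start()
for t in threads:
    t.join()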
Method: in Fetcher, use urllib.urlopen to open the specified URL and read its content:
response = urllib.urlopen(self.url)
content = response.read()
But there is a problem. For some pages, e.g. www.sina.com, the content read this way is garbled:
>>> content[0:100]
'?ì½k?×u ø??ÐHWè*t=2ëÕÕ]H`4@4??ð%êȪÊîN º²X??X¨Cj¶ly-?õ %ÊEñ!R?¨?C3?ñØ#?½;?Ø??±ò'
Therefore, we turn to chardet, a third-party Python library, and call

chardet.detect(content)

to detect the character set of content:
>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}
Okay. The problem is solved:
>>> import urllib
>>> url = 'http://www.sina.com'
>>> response = urllib.urlopen(url)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}
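With the character set known, the raw bytes can be decoded into text. A minimal sketch of that step (the 'ignore' error handler and the fallback are my additions, not from the session above):

import urllib
import chardet

url = 'http://www.sina.com'
content = urllib.urlopen(url).read()
guess = chardet.detect(content)  # e.g. {'confidence': 0.99, 'encoding': 'GB2312'}
if guess['encoding']:
    # 'ignore' drops any bytes the detected codec cannot decode
    text = content.decode(guess['encoding'], 'ignore')
else:
    text = content  # detection failed; keep the raw bytes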
However, for efficient crawling we need to set a timeout on urlopen. urllib does not support this, but urllib2 does:
response = urllib2.urlopen(self.url, timeout=self.timeout)
However, now chardet reports a different character set than before:
>>> import urllib2
>>> url = 'http://www.sina.com'
>>> response = urllib2.urlopen(url, timeout=1)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.0, 'encoding': None}
What's going on? It turns out to be the page's encoding: this page is returned gzip-compressed.
In fact, you should check whether the response's 'Content-Encoding' header is 'gzip' on every fetch.
urllib decompresses gzip pages automatically, but urllib2 does not, so for such pages you must decompress the content yourself before using it:
# requires gzip, socket, StringIO and urllib2 imported at the top of the file
try:
    response = urllib2.urlopen(self.url, timeout=self.timeout)
    if response.info().get('Content-Encoding') == 'gzip':  # e.g. www.sina.com.cn
        # wrap the compressed bytes in a file-like object and gunzip them
        buf = StringIO.StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        content = f.read()
    else:
        content = response.read()
    content = self.enc_dec(content)
    return content
except socket.timeout:
    log.warn('Timeout in fetching %s' % self.url)
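The enc_dec call above is the character-set handling from earlier; its body is not shown in the snippet. A minimal sketch of what such a method might look like, assuming it uses chardet to decode the raw bytes to Unicode (the 'ignore' error handler and the LookupError fallback are my assumptions, not the actual spider.py code):

import chardet

def enc_dec(self, content):
    # detect the character set, then decode the raw bytes to unicode
    guess = chardet.detect(content)
    if guess['encoding']:
        try:
            return content.decode(guess['encoding'], 'ignore')
        except LookupError:
            # chardet may report a codec name Python does not know
            pass
    return content  # detection failed; fall back to the raw bytes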
At this point, do you still think this post is just clickbait ...?
*******************************************************************************
So here is the entire spider, shared as a file.
The program supports multi-threaded crawling. The main file is spider.py, and testSpider.py is a unit test (coverage is not guaranteed).
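The full spider.py is not reproduced here, but as a rough guide to its shape, here is a minimal sketch of a depth-limited worker pool built from the pieces above. Everything in it is my own assumption rather than the actual file: the crawl and worker names, the (url, depth) queue items, and the extract_links helper, which stands in for whatever link parser spider.py uses.

import threading
import Queue  # 'queue' in Python 3

def crawl(seed_url, max_depth, thread_num):
    # each task is a (url, depth) pair; workers pull until the queue drains
    tasks = Queue.Queue()
    tasks.put((seed_url, 0))

    def worker():
        while True:
            url, depth = tasks.get()
            try:
                content = Fetcher(url, timeout=1).fetch()  # Fetcher as above
                if content and depth < max_depth:
                    for link in extract_links(content):  # hypothetical link parser
                        tasks.put((link, depth + 1))
            except Exception:
                pass  # one bad page should not kill a worker
            finally:
                tasks.task_done()

    for _ in range(thread_num):
        t = threading.Thread(target=worker)
        t.daemon = True  # daemon threads let the process exit after tasks.join()
        t.start()
    tasks.join()

A real implementation would also track URLs already visited so the same page is not crawled twice; presumably spider.py handles that.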