[Step-by-step] Implementing a parallel crawler in Python


Problem background: given a crawl depth and a number of threads, implement a parallel crawler in Python.
Idea: implement a single-threaded crawler, Fetcher, then use threading.Thread to run multiple Fetcher instances in parallel.
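
As a rough sketch of this structure (the names below are illustrative, not the actual attached spider.py; the body of fetch() is developed step by step in the rest of the post):

# Skeleton only: one Fetcher per URL, each run in its own thread.
import threading

class Fetcher(object):
    def __init__(self, url, timeout=10):
        self.url = url
        self.timeout = timeout

    def fetch(self):
        pass  # urlopen / gzip / chardet handling is added below

urls = ['http://www.sina.com', 'http://www.baidu.com']
threads = [threading.Thread(target=Fetcher(u).fetch) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()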

Method: in Fetcher, use urllib.urlopen to open the specified URL and read its content:

 

response = urllib.urlopen(self.url)
content = response.read()
But there is a problem: for some pages, such as www.sina.com, the content read back looks garbled:

 

 

>>> content[0:100]
'?ì½k?×u ø??ÐHWè*t=2ëÕÕ]H`4@4??ð%êȪÊîN º²X??X¨Cj¶ly-?õ %ÊEñ!R?¨?C3?ñØ#?½;?Ø??±ò'


 

Therefore, use chardet, a third-party Python library, to detect the character set of content:

chardet.detect(content)

 

 

>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}

Okay. The problem is solved:

 

 

>>> import urllib
>>> import chardet
>>> url = 'http://www.sina.com'
>>> response = urllib.urlopen(url)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.99, 'encoding': 'GB2312'}
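
In the fetcher code later in this post, this decoding step is wrapped in a helper called enc_dec(); its implementation is not included here, but a minimal version built on chardet might look like the following (an assumption, not the original code):

import chardet

def enc_dec(content):
    # Decode raw bytes to unicode using the charset detected by chardet;
    # fall back to UTF-8 when detection fails (assumed behaviour).
    encoding = chardet.detect(content).get('encoding') or 'utf-8'
    try:
        return content.decode(encoding, 'ignore')
    except LookupError:
        return content.decode('utf-8', 'ignore')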


 

 

However, for efficient crawling we need to set a timeout on urlopen. This is not supported by urllib, but it is by urllib2:

 

response = urllib2.urlopen(self.url, timeout=self.timeout)

 

However, chardet now reports a different result than before:

 

>>> import urllib2
>>> import chardet
>>> url = 'http://www.sina.com'
>>> response = urllib2.urlopen(url, timeout=1)
>>> content = response.read()
>>> chardet.detect(content)
{'confidence': 0.0, 'encoding': None}

 

What's going on? It turns out to be the encoding of this page: the server returns it gzip-compressed (Content-Encoding: gzip).

 

In fact, you should check each time whether the response's 'Content-Encoding' header is 'gzip'.

urllib handles gzip-compressed pages automatically, but urllib2 does not. Therefore, for such pages, decompress the content before using it:

 

 

 
# Requires: import gzip, socket, StringIO, urllib2 (Python 2)
try:
    response = urllib2.urlopen(self.url, timeout=self.timeout)
    if response.info().get('Content-Encoding') == 'gzip':  # e.g. www.sina.com.cn
        buf = StringIO.StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        content = f.read()
    else:
        content = response.read()
    content = self.enc_dec(content)
    return content
except socket.timeout:
    log.warn('Timeout in fetching %s' % self.url)

 

 

 

At this point, do you still think the title is just clickbait...?

 

*******************************************************************************

 

So I am sharing the entire spider file.

 

The program supports multi-threaded crawling. The main file is spider.py, and testSpider.py contains the unit tests (coverage is not guaranteed).
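
The attached files are not reproduced on this page, so below is a minimal sketch of how a spider.py of this shape might tie everything together: a fixed number of worker threads consuming (url, depth) pairs from a queue, each using a Fetcher like the one built above. All names and details here are assumptions, not the original file:

import gzip
import socket
import threading
import urllib2
import Queue
import StringIO

import chardet


class Fetcher(object):
    def __init__(self, url, timeout=10):
        self.url = url
        self.timeout = timeout

    def fetch(self):
        # Download the page, decompress gzip responses and decode via chardet.
        try:
            response = urllib2.urlopen(self.url, timeout=self.timeout)
            data = response.read()
            if response.info().get('Content-Encoding') == 'gzip':
                data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
            encoding = chardet.detect(data).get('encoding') or 'utf-8'
            return data.decode(encoding, 'ignore')
        except (socket.timeout, urllib2.URLError):
            return None


def crawl(seed_urls, depth=2, thread_num=4):
    tasks = Queue.Queue()
    for url in seed_urls:
        tasks.put((url, 0))

    def worker():
        while True:
            try:
                url, d = tasks.get(timeout=1)
            except Queue.Empty:
                return  # no work left
            content = Fetcher(url).fetch()
            if content is not None and d < depth:
                # Link extraction omitted; discovered URLs would be queued
                # here as (link, d + 1).
                pass
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    crawl(['http://www.sina.com'], depth=1, thread_num=2)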

 
