"Original" to write multi-threaded Python crawler to filter the eight-ring online publishing mission

Source: Internet
Author: User
Tags: webhost

Goal:

Use specific language technologies as keywords to crawl task information published under the Website Design and Development section of the Zhubajie (witmart.com) network.

Requirements:

Users filter the information by supplying their own keywords or regular expressions of interest.

My choice is to use specific language technologies as keywords: PHP, Java, and Python.

Note: if you match "java" without a regular expression, JavaScript postings will also be crawled in, and there is a lot of front-end work posted.
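For illustration, here is a minimal sketch of such a keyword filter (the names filters and matches are mine; the full listing below keeps an equivalent list of compiled patterns, using [Jj]ava[^Ss] instead of a lookahead to skip JavaScript):

import re

# Keyword filters; the lookahead after 'java' keeps JavaScript postings out.
filters = [
    re.compile(r'php'),
    re.compile(r'[Pp]ython'),
    re.compile(r'[Jj]ava(?![Ss]cript)'),
]

def matches(text):
    # True if the posting mentions any of the chosen technologies
    return any(p.search(text) for p in filters)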

Why use Multithreading:

The network is flaky, so reading a page blocks easily, and all of the work behind it would have to wait;

Saving a page requires hard-disk I/O; if that blocks, everything after it would have to wait as well.

Implementation:

0. Three threads. Thread A reads the web pages, thread B parses each returned page and extracts the required data, and thread C writes the extracted data to disk.

1. Thread A communicates with B through one list, and B communicates with C through another. A is a pure producer; B is a consumer with respect to A and a producer with respect to C; C is a pure consumer. You can think of the three threads as a chain, A -> B -> C, where A must finish first, then B, and finally C. Note, however, that even after the previous thread has finished, if its list still holds data, the following thread must consume that data before it ends (see the sketch after this list).

2. Since the threads access shared buffers, mutual-exclusion locks are naturally required.

3. How exactly to parse the page is not covered here; it is fairly simple. On the Zhubajie site every job listing sits inside an <li></li> tag, which makes it easy to pick out. For the output I chose an HTML file, so the result can be viewed directly as a web page.
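As a minimal sketch of the hand-off and shutdown rule from points 1 and 2 (the names buf, buf_lock, producer_done, and handle_item are illustrative; the full listing below applies the same pattern twice, once per buffer):

from threading import Lock

buf = []               # shared buffer between a producer and a consumer
buf_lock = Lock()      # mutual exclusion for the buffer
producer_done = False  # set by the producer when it has finished

def consumer_loop(handle_item):
    # Keep running while the producer is alive OR data is still queued,
    # so leftover items are drained before the consumer exits.
    while not producer_done or len(buf) != 0:
        buf_lock.acquire()
        item = buf.pop(0) if len(buf) != 0 else None
        buf_lock.release()
        if item is not None:
            handle_item(item)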

All code:

# @author SHADOWMYDX

import urllib2
import re
from threading import Thread, Lock

listpage = []    # buffer between the page-reading thread (A) and the parsing thread (B)
listresu = []    # buffer between the parsing thread (B) and the output thread (C)
listfilter = []
listfilter.append(re.compile(r'php'))
listfilter.append(re.compile(r'[Pp]ython'))
listfilter.append(re.compile(r'[Jj]ava[^Ss]'))  # prevents matching JavaScript

pagelock = Lock()   # lock for the A/B buffer
writlock = Lock()   # lock for the B/C buffer

openend = False     # has thread A finished?
analend = False     # has thread B finished?

target = r'http://www.witmart.com/cn/web-design/jobs'
webhost = r'http://www.witmart.com/cn/web-design/jobs'
numpages = 22

class ReadPageThread(Thread):
    def run(self):
        global listpage
        global target
        global numpages
        global pagelock
        global openend
        self.nextpage = 1
        while numpages != 0:
            f = self.openPage(target)
            pagelock.acquire()
            listpage.append(f)
            print target + ' is finished.'
            pagelock.release()
            target = self.findNext(f)
            numpages -= 1
        openend = True

    def openPage(self, target):
        # keep retrying until the page is fetched successfully
        tmp = True
        while tmp:
            try:
                print 'open page...'
                f = urllib2.urlopen(target).read()
                print 'open succeeded!'
                break
            except:
                tmp = True
        return f

    def findNext(self, target):
        global webhost
        self.nextpage += 1
        return webhost + '?p=' + str(self.nextpage)

class AnalsPageThread(Thread):
    def run(self):
        global listpage
        global pagelock
        global openend
        global analend
        f = False
        while not openend or len(listpage) != 0:
            pagelock.acquire()
            if len(listpage) != 0:
                f = listpage.pop(0)
            else:
                f = False
            pagelock.release()
            if f != False:
                self.analsPage(f)
        analend = True

    def analsPage(self, target):
        global listresu
        global writlock
        global listfilter
        ul = r'<ul class="joblist"'
        liitem = re.compile(r'<li.*?</li>', re.DOTALL)
        ulstart = target.find(ul)
        target = target[ulstart:]
        lilist = liitem.findall(target)
        for item in lilist:
            # check whether the item mentions one of the keywords
            for key in listfilter:
                if key.search(item):
                    writlock.acquire()
                    item = self.replaceHref(item)
                    listresu.append(item)
                    print 'Analysis one item success!'
                    writlock.release()
                    break

    def replaceHref(self, item):
        return item.replace('/cn', 'http://www.witmart.com/cn')

class WritePageThread(Thread):
    def __init__(self, pathto):
        Thread.__init__(self)
        self.pathto = pathto

    def run(self):
        global listresu
        global writlock
        global analend
        f = open(self.pathto + '/' + 'res.html', 'wb')
        f.write(r'
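The listing above breaks off inside WritePageThread.run, right after the opening f.write call. A hedged sketch of how the remainder might look, following the same names and the shutdown rule described earlier (the HTML wrapper, the output path '.', and the start-up code are my assumptions, not from the source):

    # Sketch only: the original body is cut off in the source.
    def run(self):
        global listresu
        global writlock
        global analend
        f = open(self.pathto + '/' + 'res.html', 'wb')
        f.write(r'<html><head><meta charset="utf-8"></head><body><ul>')  # assumed wrapper
        while not analend or len(listresu) != 0:
            writlock.acquire()
            item = listresu.pop(0) if len(listresu) != 0 else None
            writlock.release()
            if item is not None:
                f.write(item)  # each item is already an <li>...</li> fragment
        f.write(r'</ul></body></html>')
        f.close()

if __name__ == '__main__':
    a = ReadPageThread()
    b = AnalsPageThread()
    c = WritePageThread('.')  # assumed output directory
    a.start(); b.start(); c.start()
    a.join(); b.join(); c.join()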

"Original" to write multi-threaded Python crawler to filter the eight-ring online publishing mission

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.