"Original" to write multi-threaded Python crawler to filter the eight-ring online publishing mission

Source: Internet
Author: User
Tags: webhost

Goal:

Use specific language technologies as keywords to crawl task information published under the Website Design and Development section of the Zhubajie (witmart.com) network.

Requirements:

Users filter the information by supplying their own keywords or regular expressions of interest.

My choice is to use specific language technologies as keywords: PHP, Java, and Python.

Note: if you match "java" without a regular expression, JavaScript postings will also be crawled in, and there is a lot of front-end work posted.
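For illustration, here is a minimal sketch of such a keyword filter (the names filters and matches are mine; the full listing below keeps an equivalent list of compiled patterns, using [Jj]ava[^Ss] instead of a lookahead to skip JavaScript):

import re

# Keyword filters; the lookahead after 'java' keeps JavaScript postings out.
filters = [
    re.compile(r'php'),
    re.compile(r'[Pp]ython'),
    re.compile(r'[Jj]ava(?![Ss]cript)'),
]

def matches(text):
    # True if the posting mentions any of the chosen technologies
    return any(p.search(text) for p in filters)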

Why use Multithreading:

The network is flaky, so reading a page blocks easily, and all of the work behind it would have to wait;

Saving a page requires hard-disk I/O; if that blocks, everything after it would have to wait as well.

Implementation:

0. Three threads. Thread A reads the web pages, thread B parses each returned page and extracts the required data, and thread C writes the extracted data to disk.

1. Thread A communicates with B through one list, and B communicates with C through another. A is a pure producer; B is a consumer with respect to A and a producer with respect to C; C is a pure consumer. You can think of the three threads as a chain, A -> B -> C, where A must finish first, then B, and finally C. Note, however, that even after the previous thread has finished, if its list still holds data, the following thread must consume that data before it ends (see the sketch after this list).

2. Since the threads access shared buffers, mutual-exclusion locks are naturally required.

3. How exactly to parse the page is not covered here; it is fairly simple. On the Zhubajie site every job listing sits inside an <li></li> tag, which makes it easy to pick out. For the output I chose an HTML file, so the result can be viewed directly as a web page.
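As a minimal sketch of the hand-off and shutdown rule from points 1 and 2 (the names buf, buf_lock, producer_done, and handle_item are illustrative; the full listing below applies the same pattern twice, once per buffer):

from threading import Lock

buf = []               # shared buffer between a producer and a consumer
buf_lock = Lock()      # mutual exclusion for the buffer
producer_done = False  # set by the producer when it has finished

def consumer_loop(handle_item):
    # Keep running while the producer is alive OR data is still queued,
    # so leftover items are drained before the consumer exits.
    while not producer_done or len(buf) != 0:
        buf_lock.acquire()
        item = buf.pop(0) if len(buf) != 0 else None
        buf_lock.release()
        if item is not None:
            handle_item(item)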

All code:

# @author SHADOWMYDX

import urllib2
import re
from threading import Thread, Lock

listpage = []    # buffer between the page-reading thread (A) and the parsing thread (B)
listresu = []    # buffer between the parsing thread (B) and the output thread (C)
listfilter = []
listfilter.append(re.compile(r'php'))
listfilter.append(re.compile(r'[Pp]ython'))
listfilter.append(re.compile(r'[Jj]ava[^Ss]'))  # prevents matching JavaScript

pagelock = Lock()   # lock for the A/B buffer
writlock = Lock()   # lock for the B/C buffer

openend = False     # has thread A finished?
analend = False     # has thread B finished?

target = r'http://www.witmart.com/cn/web-design/jobs'
webhost = r'http://www.witmart.com/cn/web-design/jobs'
numpages = 22

class ReadPageThread(Thread):
    def run(self):
        global listpage
        global target
        global numpages
        global pagelock
        global openend
        self.nextpage = 1
        while numpages != 0:
            f = self.openPage(target)
            pagelock.acquire()
            listpage.append(f)
            print target + ' is finished.'
            pagelock.release()
            target = self.findNext(f)
            numpages -= 1
        openend = True

    def openPage(self, target):
        # keep retrying until the page is fetched successfully
        tmp = True
        while tmp:
            try:
                print 'open page...'
                f = urllib2.urlopen(target).read()
                print 'open succeeded!'
                break
            except:
                tmp = True
        return f

    def findNext(self, target):
        global webhost
        self.nextpage += 1
        return webhost + '?p=' + str(self.nextpage)

class AnalsPageThread(Thread):
    def run(self):
        global listpage
        global pagelock
        global openend
        global analend
        f = False
        while not openend or len(listpage) != 0:
            pagelock.acquire()
            if len(listpage) != 0:
                f = listpage.pop(0)
            else:
                f = False
            pagelock.release()
            if f != False:
                self.analsPage(f)
        analend = True

    def analsPage(self, target):
        global listresu
        global writlock
        global listfilter
        ul = r'<ul class="joblist"'
        liitem = re.compile(r'<li.*?</li>', re.DOTALL)
        ulstart = target.find(ul)
        target = target[ulstart:]
        lilist = liitem.findall(target)
        for item in lilist:
            # check whether the item mentions one of the keywords
            for key in listfilter:
                if key.search(item):
                    writlock.acquire()
                    item = self.replaceHref(item)
                    listresu.append(item)
                    print 'Analysis one item success!'
                    writlock.release()
                    break

    def replaceHref(self, item):
        return item.replace('/cn', 'http://www.witmart.com/cn')

class WritePageThread(Thread):
    def __init__(self, pathto):
        Thread.__init__(self)
        self.pathto = pathto

    def run(self):
        global listresu
        global writlock
        global analend
        f = open(self.pathto + '/' + 'res.html', 'wb')
        f.write(r'
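The listing above breaks off inside WritePageThread.run, right after the opening f.write call. A hedged sketch of how the remainder might look, following the same names and the shutdown rule described earlier (the HTML wrapper, the output path '.', and the start-up code are my assumptions, not from the source):

    # Sketch only: the original body is cut off in the source.
    def run(self):
        global listresu
        global writlock
        global analend
        f = open(self.pathto + '/' + 'res.html', 'wb')
        f.write(r'<html><head><meta charset="utf-8"></head><body><ul>')  # assumed wrapper
        while not analend or len(listresu) != 0:
            writlock.acquire()
            item = listresu.pop(0) if len(listresu) != 0 else None
            writlock.release()
            if item is not None:
                f.write(item)  # each item is already an <li>...</li> fragment
        f.write(r'</ul></body></html>')
        f.close()

if __name__ == '__main__':
    a = ReadPageThread()
    b = AnalsPageThread()
    c = WritePageThread('.')  # assumed output directory
    a.start(); b.start(); c.start()
    a.join(); b.join(); c.join()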

"Original" to write multi-threaded Python crawler to filter the eight-ring online publishing mission

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.