A basic multi-threaded web crawler in Python

In general, there are two ways to use threads in Python: create a function for the thread to execute and pass it to a Thread object, or inherit from threading.Thread directly, create a new class, and put the thread's code in that class's run() method. Both styles are sketched below.
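A minimal sketch of the two styles (worker and MyThread are illustrative names, not part of the crawler code further down; the crawler itself uses the second style):

#!/usr/bin/env python
#coding=utf-8
import threading

# Style 1: pass a function into a Thread object
def worker(tid):
    print 'function-based thread %d running' % tid

t1 = threading.Thread(target=worker, args=(1,))
t1.start()
t1.join()

# Style 2: inherit from threading.Thread and override run()
class MyThread(threading.Thread):
    def __init__(self, tid):
        threading.Thread.__init__(self)
        self.tid = tid

    def run(self):
        print 'class-based thread %d running' % self.tid

t2 = MyThread(2)
t2.start()
t2.join()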

The crawler below uses multiple threads plus a lock mechanism to implement a breadth-first crawl of the web.
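Because all download threads append to the same shared lists, every update to them has to happen under a lock. A minimal sketch of that pattern (mutex, shared_pages, and save_page are illustrative names; the full code below guards its globals the same way with a threading.Condition, which also supports acquire()/release()):

import threading

mutex = threading.Lock()
shared_pages = []   # list shared by all download threads

def save_page(html):
    # only one thread at a time may modify the shared list
    mutex.acquire()
    try:
        shared_pages.append(html)
    finally:
        mutex.release()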

First, a brief outline of the idea behind the implementation:

For a web crawler, downloading breadth-first works like this:

1. Download the first page from a given portal URL

2. Extract all new page addresses from the first page and put them in the download list

3. Download all the new pages at the addresses in the download list

4. From all the new pages, collect the addresses that have not been downloaded yet and use them to update the download list

5. Repeat steps 3 and 4, and stop when the updated download list is empty (a condensed sketch of this loop follows)
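A condensed, single-threaded sketch of steps 1-5 (bfs_crawl, download_page, and extract_links are illustrative helpers, not the article's code; the regex mirrors the one the full crawler uses):

#!/usr/bin/env python
#coding=utf-8
import re
import urllib

def download_page(url):
    # fetch one page; return an empty string on failure
    try:
        return urllib.urlopen(url).read()
    except Exception:
        return ''

def extract_links(html):
    # collect every quoted http:// link in the page
    return re.findall(r'"(http://.+?)"', html)

def bfs_crawl(start_url):
    queue = [start_url]   # the download list, seeded in step 1
    seen = set()          # addresses already downloaded
    while len(queue) != 0:                  # step 5: stop on an empty list
        new_links = []
        for url in queue:                   # step 3: download every listed page
            seen.add(url)
            new_links += extract_links(download_page(url))   # steps 2 and 4
        queue = list(set(new_links) - seen) # step 4: keep only new addresses

The full code below does the same thing, but parallelizes step 3 across a pool of threads.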

The Python code is as follows:

#!/usr/bin/env python
#coding=utf-8
import threading
import urllib
import re
import time

g_mutex = threading.Condition()
g_pages = []      # downloaded pages, from which all URL links are parsed
g_queueURL = []   # list of URLs waiting to be crawled
g_existURL = []   # list of URLs already crawled
g_failedURL = []  # list of URLs that failed to download
g_totalcount = 0  # number of pages downloaded

class Crawler:
    def __init__(self, crawlername, url, threadnum):
        self.crawlername = crawlername
        self.url = url
        self.threadnum = threadnum
        self.threadpool = []
        self.logfile = file("log.txt", 'w')

    def craw(self):
        global g_queueURL
        g_queueURL.append(self.url)
        depth = 0
        print self.crawlername + " start..."
        while len(g_queueURL) != 0:
            depth += 1
            print 'Searching depth ', depth, '...\n\n'
            self.logfile.write("URL:" + g_queueURL[0] + "........")
            self.downloadAll()
            self.updateQueueURL()
            content = '\n>>>Depth ' + str(depth) + ':\n'
            self.logfile.write(content)
            i = 0
            while i < len(g_queueURL):
                content = str(g_totalcount + i) + '->' + g_queueURL[i] + '\n'
                self.logfile.write(content)
                i += 1

    def downloadAll(self):
        # download every URL in the queue, threadnum pages at a time
        global g_queueURL
        global g_totalcount
        i = 0
        while i < len(g_queueURL):
            j = 0
            while j < self.threadnum and i + j < len(g_queueURL):
                g_totalcount += 1
                threadresult = self.download(g_queueURL[i + j], str(g_totalcount) + '.html', j)
                if threadresult != None:
                    print 'Thread started:', i + j, '--File number =', g_totalcount
                j += 1
            i += j
            for thread in self.threadpool:
                thread.join(30)
            self.threadpool = []
        g_queueURL = []

    def download(self, url, filename, tid):
        # spawn one crawler thread for a single page
        crawthread = CrawlerThread(url, filename, tid)
        self.threadpool.append(crawthread)
        crawthread.start()
        return crawthread

    def updateQueueURL(self):
        # rebuild the queue with the links that have not been crawled yet
        global g_queueURL
        global g_existURL
        newUrlList = []
        for content in g_pages:
            newUrlList += self.getUrl(content)
        g_queueURL = list(set(newUrlList) - set(g_existURL))

    def getUrl(self, content):
        # extract every quoted http:// link from a page
        reg = r'"(http://.+?)"'
        regob = re.compile(reg, re.DOTALL)
        return regob.findall(content)

class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.url = url
        self.filename = filename
        self.tid = tid

    def run(self):
        # download one page, save it to disk, and record the result
        global g_mutex
        global g_failedURL
        global g_queueURL
        try:
            page = urllib.urlopen(self.url)
            html = page.read()
            fout = file(self.filename, 'w')
            fout.write(html)
            fout.close()
        except Exception, e:
            g_mutex.acquire()
            g_existURL.append(self.url)
            g_failedURL.append(self.url)
            g_mutex.release()
            print 'Failed downloading and saving', self.url
            print e
            return None
        g_mutex.acquire()
        g_pages.append(html)
        g_existURL.append(self.url)
        g_mutex.release()

if __name__ == "__main__":
    url = raw_input("Enter the entry URL:\n")
    threadnum = int(raw_input("Set the number of threads: "))
    crawlername = "little crawler"
    crawler = Crawler(crawlername, url, threadnum)
    crawler.craw()

That is the basic implementation of a multi-threaded web crawler in Python. I hope you find it useful.
