In general, there are two ways to use threads. One is to write a function for the thread to execute and pass it to a Thread object, which runs it. The other is to subclass Thread directly, create a new class, and put the thread's code inside that class.
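A minimal sketch of both modes (the function and class names here are illustrative, not from the article):

```python
import threading

results = [0, 0]

# Mode 1: pass a target function into a Thread object.
def work(results, idx):
    results[idx] = idx * 2  # placeholder workload

t1 = threading.Thread(target=work, args=(results, 0))

# Mode 2: subclass threading.Thread and override run().
class Worker(threading.Thread):
    def __init__(self, results, idx):
        super().__init__()
        self.results = results
        self.idx = idx

    def run(self):
        self.results[self.idx] = self.idx * 2

t2 = Worker(results, 1)

for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(results)  # [0, 2]
```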
Here we implement a multi-threaded web crawler, using multiple threads and a lock mechanism to realize a breadth-first crawl.
First, a brief outline of the idea. For a crawler to download pages breadth-first, the process looks like this:
1. Download the first page from the given entry URL
2. Extract all new page addresses from the first page and put them in the download list
3. Download all the new pages at the addresses in the download list
4. From all the new pages, find the addresses that have not yet been downloaded and update the download list
5. Repeat steps 3 and 4 until the updated download list is empty, then stop
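The steps above can first be sketched as a single-threaded loop. The `fetch` and `extract_links` callables here are hypothetical stand-ins for real download and parsing code, not part of the article's listing:

```python
def crawl_bfs(start_url, fetch, extract_links, max_depth=3):
    """Breadth-first crawl: download one level, then collect its new links."""
    downloaded = set()
    queue = [start_url]                 # step 1: start from the entry URL
    depth = 0
    while queue and depth < max_depth:  # step 5: stop when the list is empty
        depth += 1
        next_queue = []
        for url in queue:               # step 3: download everything listed
            if url in downloaded:
                continue
            page = fetch(url)
            downloaded.add(url)
            for link in extract_links(page):  # steps 2/4: find new addresses
                if link not in downloaded and link not in next_queue:
                    next_queue.append(link)
        queue = next_queue              # step 4: update the download list
    return downloaded

# Usage with an in-memory "web" instead of real HTTP:
site = {'/a': ['/b', '/c'], '/b': ['/c'], '/c': []}
found = crawl_bfs('/a', fetch=lambda u: u, extract_links=lambda p: site[p])
print(sorted(found))  # ['/a', '/b', '/c']
```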
The Python code is as follows:
#!/usr/bin/env python
# coding=utf-8
import threading
import urllib.request
import re
import time

g_mutex = threading.Condition()
g_pages = []      # downloaded pages, from which all URL links are parsed
g_queueURL = []   # list of URLs waiting to be crawled
g_existURL = []   # list of URLs already crawled
g_failedURL = []  # list of URLs that failed to download
g_totalcount = 0  # number of pages downloaded

class Crawler:
    def __init__(self, crawlername, url, threadnum):
        self.crawlername = crawlername
        self.url = url
        self.threadnum = threadnum
        self.threadpool = []
        self.logfile = open("log.txt", 'w')

    def craw(self):
        global g_queueURL
        g_queueURL.append(self.url)
        depth = 0
        print(self.crawlername + " start...")
        while len(g_queueURL) != 0:
            depth += 1
            print('Searching depth', depth, '...\n')
            self.logfile.write("URL:" + g_queueURL[0] + "........")
            self.downloadAll()
            self.updateQueueURL()
            content = '\n>>>Depth ' + str(depth) + ':\n'
            self.logfile.write(content)
            i = 0
            while i < len(g_queueURL):
                content = str(g_totalcount + i) + '->' + g_queueURL[i] + '\n'
                self.logfile.write(content)
                i += 1

    def downloadAll(self):
        # Download every URL in g_queueURL, self.threadnum pages at a time.
        global g_queueURL
        global g_totalcount
        i = 0
        while i < len(g_queueURL):
            ...
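Since the listing is cut off before the worker threads appear, here is a minimal sketch of the lock mechanism the article describes: several worker threads share one URL list and one result dict, both guarded by a lock. All names here are illustrative, and `fetch` is a stand-in for the real download call:

```python
import threading

def worker(queue, results, lock, fetch):
    """Repeatedly pop a URL under the lock; do the download outside it."""
    while True:
        with lock:            # guard the shared URL list
            if not queue:
                return
            url = queue.pop(0)
        page = fetch(url)     # network I/O happens outside the lock
        with lock:            # guard the shared result dict
            results[url] = page

queue = ['/a', '/b', '/c', '/d']
results = {}
lock = threading.Lock()
threads = [threading.Thread(target=worker,
                            args=(queue, results, lock, lambda u: 'page' + u))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # ['/a', '/b', '/c', '/d']
```

Holding the lock only around the shared-state updates, not around the download itself, is what lets the threads actually overlap their network waits.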
The code above shares the basics of implementing a multi-threaded web crawler in Python. I hope you find it useful.