Crawling web pages with multiple threads in Python

This article describes a Python implementation of multi-threaded web page crawling, along with the relevant multi-threaded programming techniques and caveats, illustrated with a working demo. It is shared here for reference.

Recently I have been working on web crawlers. I studied Larbin, an open-source crawler written in C++, and read through its design ideas and the implementation of some key techniques:

1. Larbin de-duplicates URLs very efficiently with a Bloom filter (a minimal sketch follows this list);
2. DNS lookups are handled with adns, an asynchronous open-source resolver;
3. The URL queue is partly cached in memory and partly written out to files;
4. Larbin does a great deal of work around file handling for its data;
5. Larbin keeps a connection pool: it creates sockets, sends HTTP GET requests to the target sites, fetches the content, and then parses the headers and so on;
6. It drives a large number of file descriptors through poll-based I/O multiplexing, which is very efficient;
7. Larbin is highly configurable;
8. Most of the data structures the author uses are written from scratch, with essentially no use of the STL or similar libraries.
......
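For point 1, the idea of a Bloom filter is easy to show in a few lines of Python. The sketch below is my own illustration, not Larbin's C++ code: the bit-array size, the number of hash functions, and the use of md5 are arbitrary choices for the example.

import hashlib

class BloomFilter(object):
    '''Minimal Bloom filter for URL de-duplication (illustration only).'''
    def __init__(self, size_in_bits=8 * 1024 * 1024, num_hashes=4):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)

    def _positions(self, url):
        # Derive several bit positions from salted md5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(str(i) + url).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
if 'http://example.com/' not in seen:
    seen.add('http://example.com/')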

When I have more time I will write a proper article summarizing all of this.

Over the past two days I wrote a program in Python that downloads pages with multiple threads; multi-threading is clearly a good fit for an I/O-bound task like this. The thread pool I wrote recently also turned out to be handy here. Fetching a page in Python is actually very easy: the urllib2 module is simple to use, and two or three lines of code are enough (see the short sketch below). Although third-party modules solve the problem conveniently, they do little for your own technical growth, because the key algorithms are implemented by someone else rather than by you, and you never get to understand many of the details. As engineers we should not blindly rely on modules or APIs written by others; implementing things ourselves is how we learn more.
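For reference, this is roughly what those "two or three lines" with urllib2 look like; the URL and output filename are placeholders of my own:

import urllib2

# Fetch one page and save it; urllib2 handles all the HTTP details.
page = urllib2.urlopen('http://example.com/').read()
open('example.html', 'w').write(page)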

I decided to write my own fetcher starting from the socket level: encapsulate the GET request, parse the response headers, and handle DNS resolution separately so that things like a DNS cache become possible. Writing it myself makes the code more controllable and easier to extend. For timeouts I use a global 5-second timeout, and for redirects (301 or 302) I allow at most 3 attempts, because during testing I found many sites that redirect to themselves and loop forever, so I set an upper limit. The idea is fairly simple; the sketch just below shows the core, and the full code follows.
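Stripped of the DNS cache, redirect handling, and error accounting that the full program below adds, the core looks roughly like this (my own minimal sketch; the host and path are placeholders):

import socket

socket.setdefaulttimeout(5)        # the global 5-second timeout mentioned above
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('example.com', 80))     # placeholder host, default HTTP port
# Hand-built GET request; HTTP/1.0 so the server closes the connection when done.
s.sendall('GET / HTTP/1.0\r\nHost: example.com\r\nUser-Agent: demo\r\n\r\n')
response = ''
while True:
    chunk = s.recv(40960)
    if not chunk:
        break
    response += chunk
s.close()
header, _, body = response.partition('\r\n\r\n')  # split the headers from the body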

After finishing it I compared performance against urllib2. My own version is somewhat more efficient, and urllib2's error rate is slightly higher; I do not know why. Some people online say urllib2 has some small problems when used from multiple threads, but I am not clear on the details.

First, the code:

fetchpage.py downloads a page with an HTTP GET request and stores it in a file:


"' Created on 2012-3-13get Page using GET methoddefault using HTTP Protocol, http port 80@author:xiaojay ' ' Import socket Import Statisticsimport datetimeimport threadingsocket.setdefaulttimeout (Statistics.timeout) class Error404 (  Exception): "Can not find the page." Passclass Errorother (Exception): ' Some other Exception ' ' Def __init__ (self,code): #print ' Code: ', Code Passclas    S Errortrytoomanytimes (Exception): ' Try too many times ' passdef downpage (hostname, filename, trytimes=0): try: #To avoid too many tries. Try times can not is more than Max_try_times if Trytimes >= statistics.max_try_times:raise errortrytoomanytim ES except Errortrytoomanytimes:return statistics. Resulttrytoomany,hostname+filename try:s = Socket.socket (socket.af_inet,socket. SOCK_STREAM) #DNS the cache if statistics. Dnscache.has_key (hostname): addr = statistics. Dnscache[hostname] else:addr = socket.gethostbyname (hostname) statistics. Dnscache[hostnAME] = addr #connect to HTTP server, default port S.connect ((addr,80)) msg = ' GET ' +filename+ ' http/1.0\r\n ' msg + = ' Host: ' +hostname+ ' \ r \ n ' msg + = ' user-agent:xiaojay\r\n\r\n ' code = ' F = None s.sendall (msg) Fir St = True While true:msg = S.RECV (40960) if not Len (msg): If F!=none:f.flush () F.C Lose () Break # Head information must is in the first recv buffer if First:first = False He Adpos = Msg.index ("\r\n\r\n") Code,other = Dealwithhead (Msg[:headpos]) if code== ' $ ': #statistics.        Fetched_url + = 1 f = open (' pages/' +str (ABS (hash (hostname+filename))), ' W ') F.writelines (msg[headpos+4:])          elif code== ' 301 ' or code== ' 302 ': #if code is 301 or 302, try-down again using redirect location           If Other.startswith ("http"): hname, fname = Parse (other) downpage (hname,fname,trytimes+1) #try again        else:    Downpage (hostname,other,trytimes+1) elif code== ' 404 ': Raise Error404 else:raise Erro Rother (Code) else:if f!=none:f.writelines (msg) S.shutdown (socket. SHUT_RDWR) S.close () return statistics. Resultfetched,hostname+filename except Error404:return statistics. Resultcannotfind,hostname+filename except Errorother:return statistics. Resultother,hostname+filename except Socket.timeout:return statistics. Resulttimeout,hostname+filename except Exception, E:return statistics. Resultother,hostname+filenamedef Dealwithhead (head): "Deal with HTTP head" lines = Head.splitlines () Fstline = line S[0] Code =fstline.split () [1] If code = = ' 404 ': Return (code,none) if code = = ' $ ': return (code,none) if code = = '        301 ' or Code = = ' 302 ': For line in lines[1:]: p = line.index (': ') key = line[:p] If key== ' location ': Return (code,line[p+2:]) return (Code,none) def parse (URL): ' ' Parse a URL to HoStname+filename "try:u = Url.strip (). strip (' \ n '). Strip (' \ R '). Strip (' \ t ') if u.startswith ('/http '): U = u    [7:] elif u.startswith (' https://'): U = u[8:] If U.find (': ') >0:p = U.index (': ') P2 = p + 3 Else:if u.find ('/') >0:p = U.index ('/') P2 = p else:p = len (u) P2 = 1 hos Tname = u[:p] If p2>0:filename = u[p2:] else:filename = '/' return hostname, filename except Excepti On, E:print "Parse wrong:", url print edef printdnscache (): ' Print DNS dict ' ' n = 1 for hostname in STATIS Tics. Dnscache.keys (): print n, ' \ t ', hostname, ' \ t ', statistics. Dnscache[hostname] N+=1def Dealwithresult (res,url): ' Deal with the result of Downpage ' ' Statistics.total_url+=1 I F Res==statistics. 
Resultfetched:statistics.fetched_url+=1 print Statistics.total_url, ' \ t fetched: ', url if res==statistics. resultcannotfind:statistics.failed_url+=1 print "Error 404 at : ", url if res==statistics. RESULTOTHER:statistics.other_url +=1 print "Error Undefined at:", url if res==statistics. RESULTTIMEOUT:statistics.timeout_url +=1 print "timeout", url if res==statistics. Resulttrytoomany:statistics.trytoomany_url+=1 print E, "Try too many times at", Urlif __name__== ' __main__ ': print ' Get Page using Get method '
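Assuming a statistics helper module like the one referenced above (it only holds the timeout, the retry limit, the result constants, the DNS cache dict, and the counters; it is not shown in the article), fetchpage.py can be exercised on its own roughly like this (the URL is a placeholder, and the pages/ directory must already exist):

import fetchpage

hostname, filename = fetchpage.parse('http://example.com/index.html')
res, url = fetchpage.downPage(hostname, filename, 0)   # returns (result code, url)
fetchpage.dealwithResult(res, url)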

Below, I use the thread pool from my previous article to implement parallel crawling with multiple threads, and compare the performance of the page downloader above against urllib2.


"' Created on 2012-3-16@author:xiaojay '" Import fetchpageimport threadpoolimport datetimeimport Statisticsimport Urllib2 "One Thread" def usingonethread (limit): Urlset = open ("Input.txt", "r") Start = Datetime.datetime.now () for U In Urlset:if limit <= 0:break limit-=1 hostname, filename = Parse (u) res= fetchpage.downpage (hostname,f ilename,0) Fetchpage.dealwithresult (res) end = Datetime.datetime.now () print "start at: \ t", start print "end at: \ T ", end print" Total cost: \ t ", end-start print ' total fetched: ', Statistics.fetched_url ' Threadpoll and GET Metho d ' def callbackfunc (request,result): Fetchpage.dealwithresult (result[0],result[1]) def usingthreadpool (limit,num_ Thread): Urlset = open ("Input.txt", "r") Start = Datetime.datetime.now () main = ThreadPool. ThreadPool (num_thread) for URL in urlset:try:hostname, filename = fetchpage.parse (URL) req = ThreadPool . Workrequest (Fetchpage.downpage,args=[hostname,filename],kwds={},callBack=callbackfunc) main.putrequest (req) except Exception:print exception.message while True:try:m Ain.poll () if Statistics.total_url >= limit:break except ThreadPool. Noresultspending:print "No pending results" break except Exception, e:print e end = Datetime.datetime . Now () print "start at: \ t", start print "end at: \ t", end print "Total cost: \ t", end-start print "Total url: ', Statistics.total_url print ' Total fetched: ', statistics.fetched_url print ' Lost URL: ', Statistics.total_url-statisti Cs.fetched_url print ' Error 404: ', statistics.failed_url print ' Error timeout: ', statistics.timeout_url print ' Error T  Ry too many times ', statistics.trytoomany_url print ' Error other faults ', Statistics.other_url main.stop () ' ThreadPool and Urllib2 "def downPageUsingUrlib2 (URL): try:req = Urllib2. Request (URL) fd = Urllib2.urlopen (req) f = open ("pages3/" +STR (ABS (hash (URL))), ' W ') F.write (Fd.read ()) F.FLUsh () f.close () return URL, ' success ' except Exception:return URL, nonedef writefile (request,result): statist Ics.total_url + = 1 if result[1]!=none:statistics.fetched_url + = 1 print statistics.total_url, ' \tfetched: ', Resul T[0], Else:statistics.failed_url + = 1 print statistics.total_url, ' \tlost: ', Result[0],def usingThreadpoolUrllib2 (l Imit,num_thread): Urlset = open ("Input.txt", "r") Start = Datetime.datetime.now () main = ThreadPool. ThreadPool (num_thread) for the URL in urlset:try:req = ThreadPool.      Workrequest (Downpageusingurlib2,args=[url],kwds={},callback=writefile) main.putrequest (req) except Exception, E: Print e while True:try:main.poll () if Statistics.total_url >= limit:break except ThreadPool. Noresultspending:print "No pending results" break except Exception, e:print e end = Datetime.datetime . Now () print "start at: \ t", start print "end at: \ t", end print "Total cost: \ t", end-Start print ' Total URL: ', statistics.total_url print ' total fetched: ', statistics.fetched_url print ' Lost URL: ', st Atistics.total_url-statistics.fetched_url main.stop () if __name__ = = ' __main__ ': ' Too slow ' #usingOneThread (100) ' ' Use Get method ' #usingThreadpool (3000,50) ' "Use Urllib2" ' UsingThreadpoolUrllib2 (3000,50)

Experimental Analysis:

Experimental data: 3,000 URLs crawled by Larbin and then processed through a Mercator-style queue model (which I implemented in C++; I may blog about it when I get the chance), so the URL set is reasonably random and representative. A thread pool of 50 threads is used.
Experimental environment: Ubuntu 10.04, good network connectivity, Python 2.6
Storage: small files, one file per downloaded page
PS: Because the campus network is billed by traffic, running a web crawler really eats into the data allowance! In a few days I may run a larger-scale download experiment with a few hundred thousand URLs.

Experimental results:

Using urllib2: usingThreadpoolUrllib2(3000, 50)

Start at:2012-03-16 22:18:20.956054
End at:2012-03-16 22:22:15.203018
Total cost:0:03:54.246964
Total url:3001
Total fetched:2442
Lost url:559

Physical storage size of the downloaded pages: 84,088 KB

Using my own GET-based downloader: usingThreadpool(3000, 50)

Start at:2012-03-16 22:23:40.206730
End at:2012-03-16 22:26:26.843563
Total cost:0:02:46.636833
Total url:3002
Total fetched:2484
Lost url:518
Error 404:94
Error timeout:312
Error Try Too many times 0
Error other faults 112

Physical storage size of the downloaded pages: 87,168 KB

Summary: the page downloader I wrote myself performs well and loses fewer pages. Thinking about it, though, there is still a lot that could be optimized. For example, the files are very scattered: creating and releasing so many small files carries a performance cost, and naming them by hash adds a fair amount of computation; with a better strategy these costs could be avoided. For DNS, we also do not have to rely on Python's built-in resolution, which is synchronous by default; DNS lookups are generally time-consuming, so resolving asynchronously across threads, combined with a suitable DNS cache, would improve efficiency considerably (a small sketch follows below). Moreover, in a real crawl there are huge numbers of URLs that cannot all be held in memory at once; they have to be distributed according to some strategy or algorithm. In short, there is still a great deal to build and optimize in page collection.
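As one illustration of the DNS point, a resolver cache shared between worker threads should at least be guarded by a lock. The sketch below is my own and is not part of the program above:

import socket
import threading

# Minimal thread-safe DNS cache: each hostname is resolved once, later
# lookups hit the dict. Two threads may race to resolve the same name;
# the duplicate write is harmless.
_dns_cache = {}
_dns_lock = threading.Lock()

def resolve(hostname):
    with _dns_lock:
        if hostname in _dns_cache:
            return _dns_cache[hostname]
    # Resolve outside the lock so a slow lookup does not block other threads.
    addr = socket.gethostbyname(hostname)
    with _dns_lock:
        _dns_cache[hostname] = addr
    return addr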
