This article continues the study begun in the first article of the series.
I. Classification of data
- Correct data: the ID, gender, and last activity time are all present.
Stored in file1 = 'ruisi\\correct%s-%s.txt' % (startNum, endNum)
Data format: 293001 male 2015-5-1 19:17
- No time: the ID and gender exist, but there is no last activity time.
Stored in file2 = 'ruisi\\errTime%s-%s.txt' % (startNum, endNum)
Data format: 2566 female notime
- The user does not exist: the ID has no corresponding user.
Stored in file3 = 'ruisi\\notexist%s-%s.txt' % (startNum, endNum)
Data format: 29005 notexist
- Unknown gender: the ID exists, but the gender cannot be determined from the web page (I checked: in this case there is no activity time either).
Stored in file4 = 'ruisi\\unkownsex%s-%s.txt' % (startNum, endNum)
Data format: 221794 unkownsex
- Network error: the connection dropped or the server failed; these IDs need to be re-checked.
Stored in file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
Data format: 271004 httperror
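To make these formats concrete, here is a minimal sketch (my own illustration, not part of the original crawler; the file name is hypothetical) that tallies the gender field of one "correct" output file. The actual gender strings are whatever getInfo extracted from the profile page.

# coding=utf-8
# illustration only: tally genders in one "correct" output file
# (hypothetical file name; gender strings are whatever the crawler stored)
from collections import defaultdict

counts = defaultdict(int)
for line in open('ruisi\\correct1-1001.txt'):
    parts = line.split()
    if len(parts) >= 2:
        counts[parts[1]] += 1   # the 2nd field is the gender
for sex, n in counts.items():
    print sex, n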
II. How to keep the crawler running without interruption
- One consideration in this project is to crawl without interruption: if the crawler quits because the network drops or the BBS server fails, we would have to re-crawl from where it stopped, or, even more troublesome, start over from the beginning.
- So my approach is: whenever a failure is encountered, record the IDs of those exceptions. After one full traversal, re-crawl the gender information for just those exception IDs.
- Article (i) of this series provided getInfo(myurl, seword), which crawls information from a given link using a given regular expression.
- This function is used to fetch the gender and the last-activity-time information.
- We then define a "safe" crawl function that does not abort mid-run, using try/except exception handling.
Here the code tries getInfo(myurl, seword) twice; if the second attempt also throws an exception, the ID is recorded in file5.
If the information is obtained, it is returned.
file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)

def safeget(myID, myurl, seword):
    try:
        return getInfo(myurl, seword)
    except:
        try:
            return getInfo(myurl, seword)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myID, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'
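Following that plan, after a full pass one could re-crawl just the recorded exception IDs. Here is a minimal sketch of my own (not in the original code), reusing the searchWeb traversal function defined in the next section and assuming the '%d httperror' line format shown above:

# sketch: re-crawl only the IDs that were recorded as httperror
def retryHttpErrors(errpath):
    ids = []
    for line in open(errpath):        # lines look like: "251538 httperror"
        parts = line.split()
        if parts:
            ids.append(int(parts[0]))
    searchWeb(ids)                    # searchWeb accepts any iterable of IDs

retryHttpErrors('ruisi\\httperror1-1001.txt')   # hypothetical file name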
III. Loop over IDs in [1, 300000] to fetch each user's information
We define a function whose logic is: fetch the gender and the time; if the gender exists, go on to check whether the time exists; if the gender does not exist, determine whether the user does not exist at all or the gender simply cannot be crawled.
It also has to handle network faults and BBS server failures.
url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'

def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)      # substitute the id into %s
        timeUrl = url2 % (id)
        sex = safeget(id, sexUrl, sexRe)
        if not sex:               # if no gender is found at sexUrl, try timeUrl
            sex = safeget(id, timeUrl, sexRe)
        time = safeget(id, timeUrl, timeRe)
        # if an httperror occurred, the ID must be re-crawled later
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                # gender is None: determine whether the user does not exist
                # when the network is down, this extra call can leave 4 duplicate httperror records
                # the gender may simply be unknowable because the user never filled it in
                notexist = safeget(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()
A later inspection revealed a problem here:

sex = safeget(id, sexUrl, sexRe)
if not sex:
    sex = safeget(id, timeUrl, sexRe)
time = safeget(id, timeUrl, timeRe)

When the network is down, this code can call safeget three times for a single ID, and every failing call appends the same ID to the httperror file, so one ID gets written multiple times:

251538 httperror
251538 httperror
251538 httperror
251538 httperror
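One possible fix (my own sketch, not from the original article) is to remember which IDs have already been recorded and write each one only once:

# sketch: de-duplicate httperror records with an in-memory set
recordedErrors = set()

def safeget(myID, myurl, seword):
    try:
        return getInfo(myurl, seword)
    except:
        try:
            return getInfo(myurl, seword)
        except:
            if myID not in recordedErrors:   # write each failing ID only once
                recordedErrors.add(myID)
                httperrorfile = open(file5, 'a')
                httperrorfile.write('%d %s\n' % (myID, 'httperror'))
                httperrorfile.close()
            return 'httperror'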
IV. Can we crawl the information with multiple threads?
Yes: the crawl can be split across threads, because each ID range writes to its own independent set of text files.
1, Popen Introduction
Popen lets you customize standard input, standard output, and standard error. When I interned at SAP, our project team used Popen heavily on the Linux platform, presumably because it makes redirecting output easy.
The code below borrows from that project team's approach. Popen can invoke system commands. Placing the three communicate() calls one after another means waiting until all three crawls have finished.
Doubts?
From my testing, the three communicate() calls must come directly one after another: this ensures the three child processes are all started before we begin waiting, and at the end we wait for all three to finish.
p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
p1.communicate()
p2.communicate()
p3.communicate()
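The reason this pattern works: Popen starts the child process as soon as it is constructed, so all three crawlers run concurrently; communicate() merely waits for one of them to exit (and drains its output). A tiny stand-alone demonstration of that behavior (my own example, not from the article):

# three 2-second children started via Popen finish in about 2 s total,
# not 6 s, because each child starts running as soon as Popen returns;
# communicate() only waits for it to exit
from subprocess import Popen
import time

start = time.time()
procs = [Popen(['python', '-c', 'import time; time.sleep(2)']) for _ in range(3)]
for p in procs:
    p.communicate()
print 'elapsed: %.1f s' % (time.time() - start)   # roughly 2 s, not 6 s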
2, Define a single-threaded crawler
Usage: python ruisi.py <startNum> <endNum>
This program crawls the information for IDs in [startNum, endNum) and writes it to the corresponding text files. It is single-threaded; the multi-threading is implemented externally, at the point where it is invoked.
# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the link to fetch
# seword: the regular expression, in unicode notation
# returns the information matched by the regular expression, or None
def getInfo(myurl, seword):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url=myurl, headers=headers)
    time.sleep(0.3)
    response = urllib2.urlopen(req)
    html = response.read()
    html = unicode(html, 'utf-8')
    timeMatch = seword.search(html)
    if timeMatch:
        s = timeMatch.groups()
        return s[0]
    else:
        return None

# try getInfo() twice
# after the 2nd failure, mark this ID as httperror
def safeget(myID, myurl, seword):
    try:
        return getInfo(myurl, seword)
    except:
        try:
            return getInfo(myurl, seword)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myID, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'

# process an idArr range, such as [1, 1001)
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)
        timeUrl = url2 % (id)
        sex = safeget(id, sexUrl, sexRe)
        if not sex:
            sex = safeget(id, timeUrl, sexRe)
        time = safeget(id, timeUrl, timeRe)
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                notexist = safeget(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 3:
        print 'Usage: python ruisi.py <startNum> <endNum>'
        sys.exit(-1)
    global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
    startNum = int(sys.argv[1])
    endNum = int(sys.argv[2])
    # patterns for gender, last activity time, and the "user does not exist" notice
    # (closing-tag fragments follow the Discuz profile page markup)
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = '.\\newRuisi\\correct%s-%s.txt' % (startNum, endNum)
    file2 = '.\\newRuisi\\errTime%s-%s.txt' % (startNum, endNum)
    file3 = '.\\newRuisi\\notexist%s-%s.txt' % (startNum, endNum)
    file4 = '.\\newRuisi\\unkownsex%s-%s.txt' % (startNum, endNum)
    file5 = '.\\newRuisi\\httperror%s-%s.txt' % (startNum, endNum)
    searchWeb(xrange(startNum, endNum))
    # numThread = 10
    # searchWeb(xrange(endNum))
    # total = 0
    # for i in xrange(numThread):
    #     data = xrange(1 + i, endNum, numThread)
    #     total += len(data)
    #     t = threading.Thread(target=searchWeb, args=(data,))
    #     t.start()
    # print total

main()
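For example, a (hypothetical) run over the first 1000 IDs:

python ruisi.py 1 1001

would produce .\newRuisi\correct1-1001.txt, .\newRuisi\errTime1-1001.txt, .\newRuisi\notexist1-1001.txt, .\newRuisi\unkownsex1-1001.txt, and .\newRuisi\httperror1-1001.txt (assuming the .\newRuisi directory already exists).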
3, Multi-threaded crawler
Code
# coding=utf-8
from subprocess import Popen
import subprocess
import threading, time

startn = 1
endn = 300001
step = 1000
total = (endn - startn + 1) / step
isotimeformat = '%Y-%m-%d %X'

# hard-code 3 worker processes
# (I did not investigate whether 3 workers, or 4 or more, would be better)
# output: a formatted date-time for each start/end, plus the elapsed time in seconds
for i in xrange(0, total, 3):
    startNumber = startn + step * i
    startTime = time.clock()
    s0 = startNumber
    s1 = startNumber + step
    s2 = startNumber + step * 2
    s3 = startNumber + step * 3
    p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
    p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
    p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
    startftime = '[' + time.strftime(isotimeformat, time.localtime()) + '] '
    print startftime + '%s-%s download start ...' % (s0, s1)
    print startftime + '%s-%s download start ...' % (s1, s2)
    print startftime + '%s-%s download start ...' % (s2, s3)
    p1.communicate()
    p2.communicate()
    p3.communicate()
    endftime = '[' + time.strftime(isotimeformat, time.localtime()) + '] '
    print endftime + '%s-%s download end!!!' % (s0, s1)
    print endftime + '%s-%s download end!!!' % (s1, s2)
    print endftime + '%s-%s download end!!!' % (s2, s3)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
    time.sleep(5)
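Since the number of workers is hard-coded to 3, a natural variation (my own sketch, not the original code) is to make it a parameter:

# sketch: the same driver with the number of concurrent worker
# processes as a parameter instead of a hard-coded 3
from subprocess import Popen
import subprocess

def crawl_range(startn, endn, step, nworkers):
    s = startn
    while s < endn:
        procs = []
        for _ in xrange(nworkers):
            if s >= endn:
                break
            procs.append(Popen(['python', 'ruisi.py', str(s), str(s + step)],
                               bufsize=10000, stdout=subprocess.PIPE))
            s += step
        for p in procs:
            p.communicate()   # wait for this whole batch to finish

crawl_range(1, 300001, 1000, 3)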
Here is the log with timestamps recorded:
From the multi-threaded log above we can see that, on average, every 1000 users take about 500 s, i.e. about 0.5 s per ID. 500 * 300 / 3600 = 41.67 hours, so the full crawl takes about two days.
For comparison, I also timed the single-threaded crawler; the record is as follows:
We found that a single thread needs about 1500 s to crawl 1000 users, whereas the multi-threaded version crawls 3 * 1000 users in the same 1500 s.
So multithreading really does save a lot of time compared with a single thread.
Note:
In getInfo(myurl, seword) there is a time.sleep(0.3); it keeps the crawler from hitting the BBS too frequently and being denied access. This 0.3 s affects the timing statistics above for both the multi-threaded and single-threaded runs: each ID triggers two or three requests, so the sleeps alone account for roughly 0.6 to 0.9 s of the ~1.5 s per ID measured single-threaded.
Finally, attached below is the original record without timestamps. (With timestamps added, you can tell when the crawler started each range, which helps deal with threads that hang.)
"D:\Program files\python27\python.exe" e:/pythonproject/webcrawler/sum.py1-1001 download start ... 1001-2001 Download Start ... 2001-3001 Download Start ... 1-1001 Download END!!! 1001-2001 Download END!!! 2001-3001 Download END!!! Cost time 1532.74102812 s3001-4001 download start ... 4001-5001 Download Start ... 5001-6001 Download Start ... 3001-4001 Download END!!! 4001-5001 Download END!!! 5001-6001 Download END!!! Cost time 2652.01624951 s6001-7001 download start ... 7001-8001 Download Start ... 8001-9001 Download Start ... 6001-7001 Download END!!! 7001-8001 Download END!!! 8001-9001 Download END!!! Cost time 1880.61513696 s9001-10001 download start ... 10001-11001 Download Start ... 11001-12001 Download Start ... 9001-10001 Download END!!! 10001-11001 Download END!!! 11001-12001 Download END!!! Cost time 1634.40575553 s12001-13001 download start ... 13001-14001 Download Start ... 14001-15001 Download Start ... 12001-13001 downlOad END!!! 13001-14001 Download END!!! 14001-15001 Download END!!! Cost time 1403.62795496 s15001-16001 download start ... 16001-17001 Download Start ... 17001-18001 Download Start ... 15001-16001 Download END!!! 16001-17001 Download END!!! 17001-18001 Download END!!! Cost time 1271.42177906 s18001-19001 download start ... 19001-20001 Download Start ... 20001-21001 Download Start ... 18001-19001 Download END!!! 19001-20001 Download END!!! 20001-21001 Download END!!! Cost time 1476.04122024 s21001-22001 download start ... 22001-23001 Download Start ... 23001-24001 Download Start ... 21001-22001 Download END!!! 22001-23001 Download END!!! 23001-24001 Download END!!! Cost time 1431.37074164 s24001-25001 download start ... 25001-26001 Download Start ... 26001-27001 Download Start ... 24001-25001 Download END!!! 25001-26001 Download END!!! 26001-27001 Download END!!! Cost time 1411.45186874 s27001-28001 download start ... 28001-29001 Download start ... 29001-30001 Download Start ... 27001-28001 Download END!!! 28001-29001 Download END!!! 29001-30001 Download END!!! Cost time 1396.88837788 s30001-31001 download start ... 31001-32001 Download Start ... 32001-33001 Download Start ... 30001-31001 Download END!!! 31001-32001 Download END!!! 32001-33001 Download END!!! Cost time 1389.01316718 s33001-34001 download start ... 34001-35001 Download Start ... 35001-36001 Download Start ... 33001-34001 Download END!!! 34001-35001 Download END!!! 35001-36001 Download END!!! Cost time 1318.16040825 s36001-37001 download start ... 37001-38001 Download Start ... 38001-39001 Download Start ... 36001-37001 Download END!!! 37001-38001 Download END!!! 38001-39001 Download END!!! Cost time 1362.59222822 s39001-40001 download start ... 40001-41001 Download Start ... 41001-42001 Download Start ... 39001-40001 Download END!!! 40001-41001 Download END!!! 41001-42001 Download END!!!Cost time 1253.62498539 s42001-43001 download start ... 43001-44001 Download Start ... 44001-45001 Download Start ... 42001-43001 Download END!!! 43001-44001 Download END!!! 44001-45001 Download END!!! Cost time 1313.50461988 s45001-46001 download start ... 46001-47001 Download Start ... 47001-48001 Download Start ... 45001-46001 Download END!!! 46001-47001 Download END!!! 47001-48001 Download END!!! Cost time 1322.32317331 s48001-49001 download start ... 49001-50001 Download Start ... 50001-51001 Download Start ... 48001-49001 Download END!!! 49001-50001 Download END!!! 50001-51001 Download END!!! Cost time 1381.58027296 s51001-52001 download start ... 52001-53001 Download Start ... 53001-54001 Download Start ... 
51001-52001 Download END!!! 52001-53001 Download END!!! 53001-54001 Download END!!! Cost time 1357.78699459 s54001-55001 download start ... 55001-56001 Download Start ... 56001-57001 Download Start ... 54001-55001 DownloadEnd!!! 55001-56001 Download END!!! 56001-57001 Download END!!! Cost time 1359.76377246 s57001-58001 download start ... 58001-59001 Download Start ... 59001-60001 Download Start ... 57001-58001 Download END!!! 58001-59001 Download END!!! 59001-60001 Download END!!! Cost time 1335.47829775 s60001-61001 download start ... 61001-62001 Download Start ... 62001-63001 Download Start ... 60001-61001 Download END!!! 61001-62001 Download END!!! 62001-63001 Download END!!! Cost time 1354.82727645 s63001-64001 download start ... 64001-65001 Download Start ... 65001-66001 Download Start ... 63001-64001 Download END!!! 64001-65001 Download END!!! 65001-66001 Download END!!! Cost time 1260.54731607 s66001-67001 download start ... 67001-68001 Download Start ... 68001-69001 Download Start ... 66001-67001 Download END!!! 67001-68001 Download END!!! 68001-69001 Download END!!! Cost time 1363.58255686 s69001-70001 download start ... 70001-71001 DowNload start ... 71001-72001 Download Start ... 69001-70001 Download END!!! 70001-71001 Download END!!! 71001-72001 Download END!!! Cost time 1354.17163074 s72001-73001 download start ... 73001-74001 Download Start ... 74001-75001 Download Start ... 72001-73001 Download END!!! 73001-74001 Download END!!! 74001-75001 Download END!!! Cost time 1335.00425259 s75001-76001 download start ... 76001-77001 Download Start ... 77001-78001 Download Start ... 75001-76001 Download END!!! 76001-77001 Download END!!! 77001-78001 Download END!!! Cost time 1360.44054978 s78001-79001 download start ... 79001-80001 Download Start ... 80001-81001 Download Start ... 78001-79001 Download END!!! 79001-80001 Download END!!! 80001-81001 Download END!!! Cost time 1369.72662457 s81001-82001 download start ... 82001-83001 Download Start ... 83001-84001 Download Start ... 81001-82001 Download END!!! 82001-83001 Download END!!! 83001-84001 Download END!!! Cost Time 1369.95550676 s84001-85001 Download start ... 85001-86001 Download Start ... 86001-87001 Download Start ... 84001-85001 Download END!!! 85001-86001 Download END!!! 86001-87001 Download END!!! Cost time 1482.53886433 s87001-88001 download start ... 88001-89001 Download Start ... 89001-90001 Download Start ...
The above is the second article on using a Python crawler to estimate the male/female ratio of the school BBS, focusing on the multi-threaded crawler. I hope it helps with everyone's study.