This article shows how to implement a multi-threaded crawler in Python to count the ratio of men to women on a school BBS. It is written as a reference for interested readers.
I. Data Classification
- Correct data: the id, gender, and activity time are all available
Stored in: file1 = 'ruisi\\correct%s-%s.txt' % (startNum, endNum)
Data format: 293001 male
- No time: the id and gender are available, but there is no activity time
Stored in: file2 = 'ruisi\\errtime%s-%s.txt' % (startNum, endNum)
Data format: 2566 notime
- User does not exist: the id has no corresponding user
Stored in: file3 = 'ruisi\\notexist%s-%s.txt' % (startNum, endNum)
Data format: 29005 notexist
- Unknown gender: the id exists, but the gender cannot be read from the web page (on inspection, these users have no activity time either)
Stored in: file4 = 'ruisi\\unkownsex%s-%s.txt' % (startNum, endNum)
Data format: 221794 unkownsex
- Network error: the network is disconnected or the server is faulty. These ids need to be checked again.
Stored in: file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
Data format: 271004 httperror
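The five output files above share one naming pattern, so they can be generated from the category names. The following is a minimal Python 3 sketch; the helper name output_paths and the use of os.path.join are assumptions made for illustration and are not part of the original script.

```python
import os

# Category names follow the article; the helper itself is hypothetical.
CATEGORIES = ('correct', 'errtime', 'notexist', 'unkownsex', 'httperror')

def output_paths(start_num, end_num, directory='ruisi'):
    """Return {category: path} for one crawl range [start_num, end_num)."""
    return {
        name: os.path.join(directory, '%s%s-%s.txt' % (name, start_num, end_num))
        for name in CATEGORIES
    }

paths = output_paths(1, 1001)
print(paths['correct'])
```

Keeping the paths in one dictionary means a new category only has to be added in one place.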
How to continuously obtain crawler Information
- This project has one important consideration: the crawl runs continuously. If the network is disconnected or the BBS server fails, the crawler exits, and we would have to resume from the point of failure or, worse, start over from the beginning.
- Therefore, when a fault occurs, I record the ids that raised exceptions. After one full traversal, these ids are crawled again to obtain the gender.
- Part (1) of this series provides getInfo(myurl, seWord), which crawls information from a given link using a given regular expression.
- This function can be used to fetch both the gender and the last activity time.
- Let's define a safe crawling function that does not interrupt the running program. It uses try/except exception handling.
The code tries getInfo(myurl, seWord) twice. If the second attempt also throws an exception, the id is saved in file5.
If the information is obtained, it is returned.
file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)

def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'
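The same retry idea can be written as a loop. This is only a Python 3 sketch under assumptions: the fetch callable and the in-memory error_log list are stand-ins introduced here so the example runs without a network or a file, and they are not part of the original code.

```python
def safe_get(uid, fetch, attempts=2, error_log=None):
    """Call fetch() up to `attempts` times; if all fail, record uid once."""
    for _ in range(attempts):
        try:
            return fetch()
        except Exception:
            continue  # try again until the attempts are used up
    if error_log is not None:
        error_log.append('%d %s' % (uid, 'httperror'))
    return 'httperror'

# Simulated fetcher: fails on the first call, succeeds on the second.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] == 1:
        raise IOError('simulated network failure')
    return 'male'

def always_fail():
    raise IOError('simulated network failure')

log = []
first = safe_get(293001, flaky, error_log=log)
second = safe_get(271004, always_fail, error_log=log)
print(first, second, log)
```

Catching Exception rather than using a bare except leaves KeyboardInterrupt free to stop a long crawl.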
Traverse the ids in [1, 300000] in order to obtain each user's information
We define a function that fetches sex and time. If sex is found, we go on to check whether time is present. If sex is not found, we determine whether the user does not exist or the gender simply cannot be crawled.
Network disconnection and BBS server failure must also be taken into account.
url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'

def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)   # replace %s with the id
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            # if no sex is found in sexUrl, try again in timeUrl
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)

        # if httperror occurs, the id has to be crawled again later
        # (compare with ==, not "is": identity of equal strings is not guaranteed)
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                # sex is None here, so check whether the user does not exist
                # (with the network down, this extra call leads to four repeated httperror lines)
                # otherwise the gender is unknown because the user did not fill it in
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()
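The branching in searchWeb boils down to a small decision table. The sketch below isolates it as a pure function so it can be checked without crawling; classify is a hypothetical helper written for this illustration in Python 3, and it returns the category names used in the data formats above.

```python
def classify(sex, time, notexist):
    """Map the three crawl results to a category; None means 'no match'."""
    if 'httperror' in (sex, time, notexist):
        return 'httperror'      # must be crawled again later
    if sex:
        return 'correct' if time else 'notime'
    return 'notexist' if notexist else 'unkownsex'

print(classify('male', '2015-11-23 11:31', None))
print(classify(None, None, 'matched'))
```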
A problem was found during subsequent checks.
sex = safeGet(id, sexUrl, sexRe)
if not sex:
    sex = safeGet(id, timeUrl, sexRe)
time = safeGet(id, timeUrl, timeRe)
safeGet is called up to three times here. When the network is disconnected, every failing call writes its own httperror line, so the same id is recorded several times in the text file:
251538 httperror
251538 httperror
251538 httperror
251538 httperror
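One way to live with these duplicates is to collapse them when building the list of ids to crawl again. This is a Python 3 sketch, not the author's code; failed_ids is a hypothetical helper that takes the lines of an httperror file.

```python
def failed_ids(lines):
    """Return the distinct ids marked httperror, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] == 'httperror' and parts[0] not in seen:
            seen.add(parts[0])
            ordered.append(int(parts[0]))
    return ordered

sample = ['251538 httperror'] * 4 + ['271004 httperror']
retry = failed_ids(sample)
print(retry)
```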
Multi-thread crawling information?
Multiple threads can be used to collect the data, because each id range is written to its own independent text files.
1. Introduction to Popen
You can use Popen to customize standard input, standard output, and standard error output. During my internship at SAP, the project team often used Popen on the Linux platform, probably because it makes redirecting output easy.
The following code borrows the approach of that project team. Popen can run system commands. Strictly speaking, each Popen call starts a separate process rather than a thread, and each communicate() call blocks until its process has exited.
Confused?
In my tests, the three communicate() calls had to be placed next to each other, after all three Popen calls, so that the three processes run at the same time and are then all waited on.
p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
p1.communicate()
p2.communicate()
p3.communicate()
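The start-all-then-wait pattern can be tried without the crawler. In this Python 3 sketch the child is a one-line stand-in for ruisi.py that only echoes its two arguments; CHILD and run_ranges are names invented for the example.

```python
import subprocess
import sys

# Stand-in for ruisi.py: a tiny child that just echoes its range arguments.
CHILD = 'import sys; print(sys.argv[1] + "-" + sys.argv[2])'

def run_ranges(bounds):
    """Start one child per consecutive pair in bounds, then wait for all."""
    procs = [
        subprocess.Popen([sys.executable, '-c', CHILD, str(a), str(b)],
                         stdout=subprocess.PIPE)
        for a, b in zip(bounds, bounds[1:])
    ]
    # Every child is already running at this point; communicate() only waits.
    return [p.communicate()[0].decode().strip() for p in procs]

outputs = run_ranges([1, 1001, 2001, 3001])
print(outputs)
```

Because all Popen calls finish before the first communicate(), the children overlap in time; calling communicate() right after each Popen would run them one after another.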
2. Define a single-threaded Crawler
Usage: python ruisi.py <startNum> <endNum>
This code crawls the information for ids in [startNum, endNum) and writes it to the corresponding text files. It is a single-threaded program; to get multithreading, it is launched several times from an external script.
# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the link to fetch
# seWord: compiled regular expression (unicode)
# Returns the text matched by the regular expression, or None
def getInfo(myurl, seWord):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url=myurl, headers=headers)
    time.sleep(0.3)
    response = urllib2.urlopen(req)
    html = response.read()
    html = unicode(html, 'utf-8')
    timeMatch = seWord.search(html)
    if timeMatch:
        s = timeMatch.groups()
        return s[0]
    else:
        return None

# Try getInfo() twice
# After the 2nd failure, mark this id as httperror
def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'

# idArr is a half-open range of ids, [startNum, endNum)
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        # compare with ==, not "is": identity of equal strings is not guaranteed
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 3:
        print 'usage: python ruisi.py <startNum> <endNum>'
        sys.exit(-1)
    global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
    startNum = int(sys.argv[1])
    endNum = int(sys.argv[2])
    # Patterns for the gender label, the "last activity time" label, and the
    # "the user space you specified does not exist" message.
    # The HTML tags inside these patterns, and the capture group in front of the
    # message text in notexistRe, did not survive in this listing; restore them
    # from the page source before running.
    sexRe = re.compile(u'em>\u6027\u522b(.*?)')
    timeRe = re.compile(u'\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4(.*?)')
    notexistRe = re.compile(u'\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = '..\\newRuisi\\correct%s-%s.txt' % (startNum, endNum)
    file2 = '..\\newRuisi\\errtime%s-%s.txt' % (startNum, endNum)
    file3 = '..\\newRuisi\\notexist%s-%s.txt' % (startNum, endNum)
    file4 = '..\\newRuisi\\unkownsex%s-%s.txt' % (startNum, endNum)
    file5 = '..\\newRuisi\\httperror%s-%s.txt' % (startNum, endNum)
    searchWeb(xrange(startNum, endNum))
    # numThread = 10
    # searchWeb(xrange(endNum))
    # total = 0
    # for i in xrange(numThread):
    #     data = xrange(1 + i, endNum, numThread)
    #     total += len(data)
    #     t = threading.Thread(target=searchWeb, args=(data,))
    #     t.start()
    # print total

main()
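The matching step inside getInfo can be tried on its own. The Python 3 sketch below uses a made-up HTML fragment and pattern: SAMPLE_HTML and SEX_RE are assumptions for illustration only, and the markup of the real profile page may differ.

```python
import re

# Hypothetical markup for illustration only; the real page may differ.
# \u6027\u522b is the "gender" label, \u7537 is "male".
SAMPLE_HTML = '<li><em>\u6027\u522b</em>\u7537</li>'
SEX_RE = re.compile('<em>\u6027\u522b</em>(.*?)</li>')

def extract(html, pattern):
    """Return the first capture group of pattern in html, or None."""
    match = pattern.search(html)
    return match.group(1) if match else None

value = extract(SAMPLE_HTML, SEX_RE)
missing = extract('<li>no label here</li>', SEX_RE)
```

Returning None for a missing label is what lets the caller tell "no gender on the page" apart from a network failure.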
Multi-thread Crawler
Code
# coding=utf-8
from subprocess import Popen
import subprocess
import threading, time

startn = 1
endn = 300001
step = 1000
total = (endn - startn + 1) / step
ISOTIMEFORMAT = '%Y-%m-%d %X'

# hardcode 3 threads
# No further study on whether 3, 4, or more threads works best
# Prints a formatted year-month-day hour:minute:second timestamp
# Prints the time consumed by each batch (in seconds)
for i in xrange(0, total, 3):
    startNumber = startn + step * i
    startTime = time.clock()
    s0 = startNumber
    s1 = startNumber + step
    s2 = startNumber + step * 2
    s3 = startNumber + step * 3
    p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
    p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
    p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
    startftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print startftime + '%s - %s download start... ' % (s0, s1)
    print startftime + '%s - %s download start... ' % (s1, s2)
    print startftime + '%s - %s download start... ' % (s2, s3)
    p1.communicate()
    p2.communicate()
    p3.communicate()
    endftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print endftime + '%s - %s download end !!! ' % (s0, s1)
    print endftime + '%s - %s download end !!! ' % (s1, s2)
    print endftime + '%s - %s download end !!! ' % (s2, s3)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
    time.sleep(5)
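The s0 to s3 arithmetic in the loop can also be expressed as a generator of half-open ranges, which makes the worker count a parameter instead of a hardcoded 3. This is a Python 3 sketch; batches is a hypothetical helper, not part of the script above.

```python
def batches(start, end, step, workers=3):
    """Yield lists of (lo, hi) half-open ranges, `workers` ranges per batch."""
    ranges = [(lo, min(lo + step, end)) for lo in range(start, end, step)]
    for i in range(0, len(ranges), workers):
        yield ranges[i:i + workers]

all_batches = list(batches(1, 300001, 1000))
print(all_batches[0])
```

Each tuple maps directly onto the two arguments passed to ruisi.py.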
Here is the log that records the timestamp:
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
[ 2015-11-23 11:31:15 ] 1 - 1001 download start...
[ 2015-11-23 11:31:15 ] 1001 - 2001 download start...
[ 2015-11-23 11:31:15 ] 2001 - 3001 download start...
[ 2015-11-23 11:53:44 ] 1 - 1001 download end !!!
[ 2015-11-23 11:53:44 ] 1001 - 2001 download end !!!
[ 2015-11-23 11:53:44 ] 2001 - 3001 download end !!!
cost time 1348.99480677 s
[ 2015-11-23 11:53:50 ] 3001 - 4001 download start...
[ 2015-11-23 11:53:50 ] 4001 - 5001 download start...
[ 2015-11-23 11:53:50 ] 5001 - 6001 download start...
[ 2015-11-23 12:16:56 ] 3001 - 4001 download end !!!
[ 2015-11-23 12:16:56 ] 4001 - 5001 download end !!!
[ 2015-11-23 12:16:56 ] 5001 - 6001 download end !!!
cost time 1386.06407734 s
[ 2015-11-23 12:17:01 ] 6001 - 7001 download start...
[ 2015-11-23 12:17:01 ] 7001 - 8001 download start...
[ 2015-11-23 12:17:01 ] 8001 - 9001 download start...
The above is the multi-threaded log. It shows that 1000 users take roughly 500 s on average, or about 0.5 s per id. The 300,000 ids therefore take about 500*300/3600 = 41.67 hours, close to two days.
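The estimate above is a simple proportion, sketched here in Python 3 so other batch times can be plugged in. estimate_hours is a hypothetical helper; the 1500 s figure is the rounded batch time implied by 500 s per 1000 users, and 1348.99 s is the first batch in the log.

```python
def estimate_hours(batch_seconds, ids_per_batch, total_ids):
    """Scale one measured batch up to the whole id range, in hours."""
    return batch_seconds * (total_ids / ids_per_batch) / 3600.0

rounded = estimate_hours(1500, 3000, 300000)          # the article's round numbers
measured = estimate_hours(1348.99480677, 3000, 300000)  # first logged batch
print(round(rounded, 2), round(measured, 2))
```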
The time consumed by the single-threaded crawler is recorded as follows:
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1 - 1001 download end !!!
cost time 1583.65911889 s
1001 - 2001 download start...
1001 - 2001 download end !!!
cost time 1342.46874278 s
2001 - 3001 download start...
2001 - 3001 download end !!!
cost time 1327.10885725 s
3001 - 4001 download start...
We found that the single-threaded crawler takes about 1500 s to crawl 1000 users, while the multi-threaded program takes about 1500 s for 3*1000 users.
Therefore, multithreading saves a lot of time compared with a single thread.
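The comparison can be made concrete with the cost times from the two logs above. This Python 3 sketch only averages those logged values; seconds_per_id is a hypothetical helper, and with so few batches the result is a rough indication rather than a benchmark.

```python
# Cost times copied from the logs above.
single = [1583.65911889, 1342.46874278, 1327.10885725]  # seconds per 1000 ids
multi = [1348.99480677, 1386.06407734]                   # seconds per 3000 ids

def seconds_per_id(batch_times, ids_per_batch):
    """Average crawl time per id over the given batches."""
    return sum(batch_times) / (len(batch_times) * ids_per_batch)

speedup = seconds_per_id(single, 1000) / seconds_per_id(multi, 3000)
print(round(speedup, 2))
```

With three workers the ratio comes out close to 3, which matches the conclusion above.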
Note:
getInfo(myurl, seWord) contains the call time.sleep(0.3), which prevents overly frequent access to the BBS that would cause the BBS to refuse the requests. This 0.3 s sleep affects the measured times of both the multi-threaded and the single-threaded runs.
The original record with no timestamp is attached. (With the timestamp added, you can tell when the program started crawling, which helps when a thread freezes.)
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1001 - 2001 download start...
2001 - 3001 download start...
1 - 1001 download end !!!
1001 - 2001 download end !!!
2001 - 3001 download end !!!
cost time 1532.74102812 s
3001 - 4001 download start...
4001 - 5001 download start...
5001 - 6001 download start...
3001 - 4001 download end !!!
4001 - 5001 download end !!!
5001 - 6001 download end !!!
cost time 2652.01624951 s
6001 - 7001 download start...
7001 - 8001 download start...
8001 - 9001 download start...
6001 - 7001 download end !!!
7001 - 8001 download end !!!
8001 - 9001 download end !!!
cost time 1880.61513696 s
9001 - 10001 download start...
10001 - 11001 download start...
11001 - 12001 download start...
9001 - 10001 download end !!!
10001 - 11001 download end !!!
11001 - 12001 download end !!!
cost time 1634.40575553 s
12001 - 13001 download start...
13001 - 14001 download start...
14001 - 15001 download start...
12001 - 13001 download end !!!
13001 - 14001 download end !!!
14001 - 15001 download end !!!
cost time 1403.62795496 s
15001 - 16001 download start...
16001 - 17001 download start...
17001 - 18001 download start...
15001 - 16001 download end !!!
16001 - 17001 download end !!!
17001 - 18001 download end !!!
cost time 1271.42177906 s
18001 - 19001 download start...
19001 - 20001 download start...
20001 - 21001 download start...
18001 - 19001 download end !!!
19001 - 20001 download end !!!
20001 - 21001 download end !!!
cost time 1476.04122024 s
21001 - 22001 download start...
22001 - 23001 download start...
23001 - 24001 download start...
21001 - 22001 download end !!!
22001 - 23001 download end !!!
23001 - 24001 download end !!!
cost time 1431.37074164 s
24001 - 25001 download start...
25001 - 26001 download start...
26001 - 27001 download start...
24001 - 25001 download end !!!
25001 - 26001 download end !!!
26001 - 27001 download end !!!
cost time 1411.45186874 s
27001 - 28001 download start...
28001 - 29001 download start...
29001 - 30001 download start...
27001 - 28001 download end !!!
28001 - 29001 download end !!!
29001 - 30001 download end !!!
cost time 1396.88837788 s
30001 - 31001 download start...
31001 - 32001 download start...
32001 - 33001 download start...
30001 - 31001 download end !!!
31001 - 32001 download end !!!
32001 - 33001 download end !!!
cost time 1389.01316718 s
33001 - 34001 download start...
34001 - 35001 download start...
35001 - 36001 download start...
33001 - 34001 download end !!!
34001 - 35001 download end !!!
35001 - 36001 download end !!!
cost time 1318.16040825 s
36001 - 37001 download start...
37001 - 38001 download start...
38001 - 39001 download start...
36001 - 37001 download end !!!
37001 - 38001 download end !!!
38001 - 39001 download end !!!
cost time 1362.59222822 s
39001 - 40001 download start...
40001 - 41001 download start...
41001 - 42001 download start...
39001 - 40001 download end !!!
40001 - 41001 download end !!!
41001 - 42001 download end !!!
cost time 1253.62498539 s
42001 - 43001 download start...
43001 - 44001 download start...
44001 - 45001 download start...
42001 - 43001 download end !!!
43001 - 44001 download end !!!
44001 - 45001 download end !!!
cost time 1313.50461988 s
45001 - 46001 download start...
46001 - 47001 download start...
47001 - 48001 download start...
45001 - 46001 download end !!!
46001 - 47001 download end !!!
47001 - 48001 download end !!!
cost time 1322.32317331 s
48001 - 49001 download start...
49001 - 50001 download start...
50001 - 51001 download start...
48001 - 49001 download end !!!
49001 - 50001 download end !!!
50001 - 51001 download end !!!
cost time 1381.58027296 s
51001 - 52001 download start...
52001 - 53001 download start...
53001 - 54001 download start...
51001 - 52001 download end !!!
52001 - 53001 download end !!!
53001 - 54001 download end !!!
cost time 1357.78699459 s
54001 - 55001 download start...
55001 - 56001 download start...
56001 - 57001 download start...
54001 - 55001 download end !!!
55001 - 56001 download end !!!
56001 - 57001 download end !!!
cost time 1359.76377246 s
57001 - 58001 download start...
58001 - 59001 download start...
59001 - 60001 download start...
57001 - 58001 download end !!!
58001 - 59001 download end !!!
59001 - 60001 download end !!!
cost time 1335.47829775 s
60001 - 61001 download start...
61001 - 62001 download start...
62001 - 63001 download start...
60001 - 61001 download end !!!
61001 - 62001 download end !!!
62001 - 63001 download end !!!
cost time 1354.82727645 s
63001 - 64001 download start...
64001 - 65001 download start...
65001 - 66001 download start...
63001 - 64001 download end !!!
64001 - 65001 download end !!!
65001 - 66001 download end !!!
cost time 1260.54731607 s
66001 - 67001 download start...
67001 - 68001 download start...
68001 - 69001 download start...
66001 - 67001 download end !!!
67001 - 68001 download end !!!
68001 - 69001 download end !!!
cost time 1363.58255686 s
69001 - 70001 download start...
70001 - 71001 download start...
71001 - 72001 download start...
69001 - 70001 download end !!!
70001 - 71001 download end !!!
71001 - 72001 download end !!!
cost time 1354.17163074 s
72001 - 73001 download start...
73001 - 74001 download start...
74001 - 75001 download start...
72001 - 73001 download end !!!
73001 - 74001 download end !!!
74001 - 75001 download end !!!
cost time 1335.00425259 s
75001 - 76001 download start...
76001 - 77001 download start...
77001 - 78001 download start...
75001 - 76001 download end !!!
76001 - 77001 download end !!!
77001 - 78001 download end !!!
cost time 1360.44054978 s
78001 - 79001 download start...
79001 - 80001 download start...
80001 - 81001 download start...
78001 - 79001 download end !!!
79001 - 80001 download end !!!
80001 - 81001 download end !!!
cost time 1369.72662457 s
81001 - 82001 download start...
82001 - 83001 download start...
83001 - 84001 download start...
81001 - 82001 download end !!!
82001 - 83001 download end !!!
83001 - 84001 download end !!!
cost time 1369.95550676 s
84001 - 85001 download start...
85001 - 86001 download start...
86001 - 87001 download start...
84001 - 85001 download end !!!
85001 - 86001 download end !!!
86001 - 87001 download end !!!
cost time 1482.53886433 s
87001 - 88001 download start...
88001 - 89001 download start...
89001 - 90001 download start...
This is the second article on using Python to calculate the ratio of men to women on the school BBS. It focuses on the multi-threaded crawler, and I hope it helps with your learning.