Using Python to crawl a school BBS and compute its male-to-female ratio (3)

Last Update:2017-05-14 Source: Internet

Author: User

This article describes how to use Python to process the crawled data and compute the ratio of men to women on the school BBS.

I. Data Analysis

We have obtained text files whose names start with the following strings, and these files need to be processed.

II. Rollback

We need to reprocess the httperror data.

Because of how the code in part (II) of this series was written, the same id can appear several times in a row in the httperror records:

    //httperror265001_266001.txt
    265002 httperror
    265002 httperror
    265002 httperror
    265002 httperror
    265003 httperror
    265003 httperror
    265003 httperror
    265003 httperror

The code therefore has to account for this. We cannot simply process the id on every row; we must first check whether the id repeats the previous one.
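The duplicate check described above can be sketched in isolation. This is a minimal illustration, not the article's code: it targets Python 3, compares ids rather than whole lines, and the helper name `unique_consecutive_ids` is hypothetical. It relies on `itertools.groupby`, which groups runs of equal consecutive values.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of identical consecutive ids."""
    # Take the first whitespace-separated field of every non-blank line.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby collapses each run of equal consecutive ids into one group.
    for record_id, _run in groupby(ids):
        yield int(record_id)


# Sample rows in the format shown above.
sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]

if __name__ == "__main__":
    for record_id in unique_consecutive_ids(sample):
        print(record_id)
```

Only consecutive repeats are collapsed, which matches the situation in the httperror files; an id that reappears later in the file would be yielded again.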

Java has a buffering mechanism that avoids reading the file from the hard disk too frequently. Python has one as well; see this article.
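As a rough illustration of the buffering point, Python's built-in `open()` accepts a `buffering` argument. The sketch below assumes Python 3; the function name `count_lines_buffered` and the 64 KiB buffer size are arbitrary choices for the example, not values from the article.

```python
import os
import tempfile


def count_lines_buffered(path, buffer_size=64 * 1024):
    """Count lines while reading the file through a buffer of buffer_size bytes."""
    # A buffering value greater than 1 sets the size of the underlying buffer,
    # so iterating line by line does not hit the disk for every line.
    with open(path, "r", buffering=buffer_size, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than the article's data folder.
    with tempfile.TemporaryDirectory() as folder:
        demo_path = os.path.join(folder, "httperror_demo.txt")
        with open(demo_path, "w", encoding="utf-8") as handle:
            handle.write("265002 httperror\n265003 httperror\n")
        print(count_lines_buffered(demo_path))
```

Leaving `buffering` at its default already gives buffered reads; passing it explicitly only changes the buffer size.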

    import os, re, sys

    def main():
        reload(sys)
        sys.setdefaultencoding('utf-8')
        global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
        # The three patterns were fused and their HTML tag fragments lost in
        # extraction; only the surviving text of each pattern is shown here.
        sexRe = re.compile(u'em>\u6027\u522b(.*?)')
        timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4(.*?)')
        notexistRe = re.compile(u'\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
        url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
        url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
        file1 = 'ruisi\\correct_re.txt'
        file2 = 'ruisi\\errTime_re.txt'
        file3 = 'ruisi\\notexist_re.txt'
        file4 = 'ruisi\\unkownsex_re.txt'
        file5 = 'ruisi\\httperror_re.txt'
        # traverse the files in the folder whose names start with httperror
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            if filename.startswith('httperror'):
                count = 0
                newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
                readFile = open(newName, 'r')
                oldLine = '0'
                for line in readFile:
                    # newLine is compared with oldLine to detect a duplicate id
                    newLine = line
                    if (newLine != oldLine):
                        nu = newLine.split()[0]
                        oldLine = newLine
                        count += 1
                        # searchWeb() is the crawl function, not shown here
                        searchWeb(int(nu))
                print "%s deal %s lines" % (filename, count)

For convenience, this code does not further classify the reprocessed httperror ids; it writes the results to the following five files.

    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errTime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'

The output log shows how many httperror records were processed in each file.

    "D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
    httperror132001-133001.txt deal 21 lines
    httperror2001-3001.txt deal 4 lines
    httperror251001-252001.txt deal 5 lines
    httperror254001-255001.txt deal 1 lines

III. Count unkownsex data in a single thread

The code is simple. We use a single thread to count the unkownsex users, whose gender could not be obtained either because of permission restrictions or because the user did not fill it in. On inspection, users without a gender also have no activity time.

The data format is as follows:

    253042 unkownsex
    253087 unkownsex
    253102 unkownsex
    253118 unkownsex
    253125 unkownsex
    253136 unkownsex
    253161 unkownsex

The code is as follows:

    import os, time

    sumCount = 0
    startTime = time.clock()
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('unkownsex'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            # iterate over the handle already opened instead of opening the file twice
            for line in readFile:
                count += 1
                sumCount += 1
            readFile.close()
            print "%s deal %s lines" % (filename, count)
    print '%s unkowns sex' % (sumCount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

The processing speed is very fast and the output is as follows:

    unkownsex1-1001.txt deal 204 lines
    unkownsex100001-101001.txt deal 50 lines
    unkownsex10001-11001.txt deal 206 lines
    ... (intermediate output omitted)
    unkownsex99001-100001.txt deal 56 lines
    unkownsex_re.txt deal 1085 lines
    14223 unkowns sex
    cost time 0.0813142301261 s
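For readers on Python 3, where `time.clock()` is no longer available (it was removed in Python 3.8), the same count-and-time pattern might look like the sketch below. The helper names `count_lines_by_prefix` and `timed` are hypothetical, and the folder is passed in as a parameter rather than hard-coded.

```python
import os
import time


def count_lines_by_prefix(folder, prefix):
    """Return {filename: line_count} for files in folder whose names start with prefix."""
    counts = {}
    for filename in sorted(os.listdir(folder)):
        if filename.startswith(prefix):
            path = os.path.join(folder, filename)
            # The with-block closes each file as soon as it has been counted.
            with open(path, "r", encoding="utf-8") as handle:
                counts[filename] = sum(1 for _ in handle)
    return counts


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

A call such as `counts, seconds = timed(count_lines_by_prefix, folder, "unkownsex")` then gives both the per-file counts and the elapsed time; `sum(counts.values())` is the overall total.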

IV. Count correct data in a single thread

The data format is as follows:

    31024 male 2014-11-11 13:20
    31283 male 2013-3-25 19:41
    31340 confidentiality 2015-2-2 15:17
    31427 confidentiality 2014-8-10 09:17
    31475 confidentiality 2013-7-2 08:59
    31554 confidentiality 17:02
    31621 male 2015-5-16 19:27
    31872 confidentiality 16:49
    31915 confidentiality 11:01
    31997 confidentiality

The code is as follows. The idea is to read the data one line at a time and use line.split() to get the gender field. sumCount counts the total number of people; boycount, girlcount, and secretcount count the users marked male, female, and confidential, respectively. We still compare against unicode strings: u'\u7537' (男, male), u'\u5973' (女, female), and u'\u4fdd\u5bc6' (保密, confidential). The gender column in the sample above is shown in translation.

    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                # the second field of each row is the gender
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
    print "total is %s; %s boys; %s girls; %s secret;" % (sumCount, boycount, girlcount, secretcount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Note: after each file we print the running totals up to and including that file, not the statistics for that file alone. The output is as follows:

    until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
    until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
    #... omitted
    until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
    until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
    total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 3.60047888495 s
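The tallying step can also be expressed with `collections.Counter`. The following is a sketch under stated assumptions rather than the article's method: it targets Python 3, the names `LABELS` and `count_genders` are hypothetical, and unlike the code above it skips rows with fewer than two fields instead of raising an error. The first two sample rows follow the data format shown earlier (with the original Chinese gender values); the third row is made up for illustration.

```python
from collections import Counter

# Map the gender strings stored in the files to the labels used in the output.
LABELS = {"\u7537": "boys", "\u5973": "girls", "\u4fdd\u5bc6": "secret"}


def count_genders(lines):
    """Tally the second whitespace-separated field of each row."""
    tally = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: blank or malformed rows are ignored
        tally["total"] += 1
        label = LABELS.get(fields[1])
        if label is not None:
            tally[label] += 1
    return tally


sample = [
    "31024 \u7537 2014-11-11 13:20",
    "31340 \u4fdd\u5bc6 2015-2-2 15:17",
    "40001 \u5973",  # hypothetical row, not taken from the real data
]

if __name__ == "__main__":
    print(count_genders(sample))
```

A `Counter` returns 0 for labels it has never seen, so the reporting code does not need to initialise each counter separately.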

V. Multithreaded statistics

To make the counting faster, we can use multithreading.
For comparison, we also record the time taken by the single-threaded version.

    # encoding: UTF-8
    import threading
    import time, os, sys

    # global variables
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    NUM = 0

    # Inherit from threading.Thread, override run(), and start the thread with start().
    # This is similar to Java.
    class StaFileList(threading.Thread):
        # list of text file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            # A delay can be added here to make the multithreading more visible,
            # instead of thread-1, 2, 3 running in order
            # time.sleep(1)
            # acquire the lock
            if mutex.acquire(1):
                self.staFiles(self.fileList)
                # release the lock
                mutex.release()

        # Process the given list of files and count men and women.
        # Mind the data synchronization here: the counters are global variables.
        def staFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

    def test():
        # files holds several file names; the number of files per thread can be set
        files = []
        # keep all threads so that the main thread can wait for them to finish
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            # create a thread
            if filename.startswith('correct'):
                files.append(filename)
                i += 1
                # one thread processes 20 files
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # the remaining files, probably fewer than 20
        if files:
            staThreads.append(StaFileList(files))

        for t in staThreads:
            t.start()
        # the main thread waits for all child threads to exit;
        # would it be faster without this?
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" % (SUM, BOY, GIRL, SECRET)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"

Output

    Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 0.132137192794 s

We found that the time is similar to that of a single thread. Thread synchronization is involved here: acquiring and releasing the lock both cost time, and switching between threads, which means saving and restoring each thread's context, costs time as well. In this code each thread also holds the lock for the whole of its run(), so the threads effectively execute one after another.
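One way to shorten the time spent holding the lock is to let each thread count into a private tally and take the lock only to merge the result. This swaps in a different technique from the code above and is only a sketch: it assumes Python 3, works on in-memory lists of rows instead of files, and the name `tally_in_threads` is hypothetical. In standard CPython the global interpreter lock still limits the speedup for CPU-bound counting, so the main gain is less lock contention.

```python
import threading
from collections import Counter


def tally_in_threads(chunks):
    """Count the second field of every row, one thread per chunk of rows."""
    totals = Counter()
    lock = threading.Lock()

    def worker(lines):
        # Count into a thread-private Counter; no lock is needed here.
        local = Counter()
        for line in lines:
            fields = line.split()
            if len(fields) > 1:
                local[fields[1]] += 1
        # Hold the lock only for the short merge step.
        with lock:
            totals.update(local)

    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    # Wait for every worker so that totals is complete before returning.
    for thread in threads:
        thread.join()
    return totals
```

Each chunk plays the role of one thread's batch of files; reading the files inside `worker` instead of passing rows in would follow the same pattern.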

VI. Comparison between single-threaded and multithreaded code with more data

This time we process the correct, errTime, and unkownsex files together.
Single-thread code

    # coding=utf-8
    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    unkowncount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # no activity time, but there is a gender
        elif filename.startswith("errTime"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # no gender or time: count the rows directly
        elif filename.startswith("unkownsex"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            # count = len(open(newName, 'rU').readlines())
            # For large files, count in a loop. count starts at -1 so that an
            # empty file yields 0 rows after the final +1.
            count = -1
            for count, line in enumerate(open(newName, 'rU')):
                pass
            count += 1
            unkowncount += count
            sumCount += count
            # print "until %s, sum is %s unkownsex" % (filename, unkowncount)
    print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" % (sumCount, boycount, girlcount, secretcount, unkowncount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

The output is

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
Cost time 1.37444645628 s
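The row-counting idiom used for the unkownsex files (start `count` at -1, loop with `enumerate`, add 1 at the end) can be isolated as a small function. This is a Python 3 sketch with a hypothetical name, `count_lines`; it shows why the initial -1 matters for empty files.

```python
def count_lines(path):
    """Count the rows of a file without loading it into memory."""
    # If the file is empty the loop body never runs and count stays at -1,
    # so the final +1 correctly yields 0.
    count = -1
    with open(path, "r", encoding="utf-8") as handle:
        for count, _line in enumerate(handle):
            pass
    return count + 1
```

Because `enumerate` starts at 0, the last index seen is one less than the number of rows, which is what the final `+ 1` compensates for.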

Multi-threaded code

    __author__ = 'admin'
    # encoding: UTF-8
    # multithreaded file processing
    import threading
    import time, os, sys

    # global variables
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    UNKOWN = 0

    class StaFileList(threading.Thread):
        # list of text file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            if mutex.acquire(1):
                self.staManyFiles(self.fileList)
                mutex.release()

        # Process the given list of files and count male and female users.
        # Mind the data synchronization problem.
        def staCorrectFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

        def staManyFiles(self, files):
            global SUM, BOY, GIRL, SECRET, UNKOWN
            for name in files:
                if name.startswith('correct'):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # no activity time, but there is a gender
                elif name.startswith("errTime"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # no gender and no time: count the rows directly
                elif name.startswith("unkownsex"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    # count = len(open(newName, 'rU').readlines())
                    # For large files, count in a loop. count starts at -1 so that
                    # an empty file yields 0 rows after the final +1.
                    count = -1
                    for count, line in enumerate(open(newName, 'rU')):
                        pass
                    count += 1
                    UNKOWN += count
                    SUM += count
                    # print "thread %s, until %s, total is %s; %s unkownsex" % (self.name, name, SUM, UNKOWN)

    def test():
        files = []
        # keep all threads so that the main thread can wait for them to finish
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            # create a thread
            if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
                files.append(filename)
                i += 1
                # one thread processes 20 files
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # the remaining files, probably fewer than 20
        if files:
            staThreads.append(StaFileList(files))
        for t in staThreads:
            t.start()
        # the main thread waits for all child threads to exit
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" % (SUM, BOY, GIRL, SECRET, UNKOWN)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"

The output is

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s

We can see that the multithreaded version is slightly faster than the single-threaded one (1.23 s versus 1.37 s). Because access to the counters is synchronized, the totals are the same as in the single-threaded run.

Note that Python requires an explicit self inside class methods, which is quite different from Java.

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            # self must be used to call the class's own methods
            self.staFiles(self.fileList)
            mutex.release()
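The point about self can be seen in a stripped-down class. This is an illustrative Python 3 sketch; the class `FileBatch` and its methods are hypothetical and not part of the article's code.

```python
class FileBatch:
    """A batch of file names, used only to illustrate explicit self."""

    def __init__(self, names):
        # Instance attributes are always set and read through self.
        self.names = list(names)

    def count(self):
        return len(self.names)

    def describe(self):
        # Another method of the same object is reached through self;
        # a bare count() here would raise NameError, whereas Java would
        # resolve it against the current object implicitly.
        return "%d files" % self.count()


if __name__ == "__main__":
    print(FileBatch(["correct1-1001.txt", "correct_re.txt"]).describe())
```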

Total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
Cost time 1.25413238673 s

That is all for this article; we hope it helps with your learning.
