Crawling a school BBS with Python to compute the male-to-female ratio: data processing (III)

This article covers the data-processing part of the project; please read it carefully.

First, data analysis

The earlier crawl left us with text files whose names start with strings such as correct, errtime, unkownsex and httperror, and these now need to be processed.

Second, reprocessing (rollback)

We need to re-process the httperror data.

Because of how the crawler code was written (see part (ii) of this series), the same ID can appear in several consecutive httperror records within one file:

    httperror265001_266001.txt
    265002 httperror
    265002 httperror
    265002 httperror
    265002 httperror
    265003 httperror
    265003 httperror
    265003 httperror
    265003 httperror

So the code has to take this into account: instead of processing the ID on every row, it first checks whether the row repeats the previous ID.
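The duplicate check can be sketched as follows. This is a minimal Python 3 illustration, not the article's own code; the helper name unique_consecutive_ids and the use of itertools.groupby are assumptions made for the example. It yields each ID once per run of consecutive rows that share it:

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each ID once per run of consecutive lines that share it."""
    # Split the non-blank lines into fields, then group adjacent rows by ID.
    rows = (line.split() for line in lines if line.strip())
    for record_id, _run in groupby(rows, key=lambda fields: fields[0]):
        yield int(record_id)


sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Like the comparison against the previous line, this only collapses adjacent repeats; an ID that reappears later in the file would be processed again.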

Java has buffered reading to avoid frequent reads from the hard disk. Python actually has this as well; see the linked article.
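As a rough illustration of buffered reading in Python (a Python 3 sketch; the helper name read_ids, the sample file name and the 64 KiB buffer size are assumptions for the example), the built-in open() accepts a buffering argument, and iterating over the file object serves lines from that buffer rather than going to the disk for each line:

```python
import os
import tempfile


def read_ids(path, buffer_size=64 * 1024):
    """Return the first field of every non-blank line, read through a buffer."""
    ids = []
    # buffering=buffer_size asks open() for a buffer of roughly that many bytes.
    with open(path, "r", buffering=buffer_size, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.append(fields[0])
    return ids


def demo():
    # Write a small sample file into a temporary folder and read it back.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "httperror_sample.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("265002 httperror\n265003 httperror\n")
        return read_ids(path)


print(demo())  # ['265002', '265003']
```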

    import os
    import re
    import sys

    def main():
        reload(sys)
        sys.setdefaultencoding('utf-8')
        global sexre, timere, notexistre, url1, url2, file1, file2, file3, file4, startnum, endnum, file5
        # Patterns from part (ii); the HTML tags inside them did not survive in this copy
        sexre = re.compile(u'em>\u6027\u522b(.*?)\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4(.*?))\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
        url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
        url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
        file1 = 'ruisi\\correct_re.txt'
        file2 = 'ruisi\\errtime_re.txt'
        file3 = 'ruisi\\notexist_re.txt'
        file4 = 'ruisi\\unkownsex_re.txt'
        file5 = 'ruisi\\httperror_re.txt'
        # Walk the text files in the folder whose names start with httperror
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            if filename.startswith('httperror'):
                count = 0
                newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
                readFile = open(newName, 'r')
                oldLine = '0'
                for line in readFile:
                    # newLine is compared with the previous line to skip duplicate IDs
                    newLine = line
                    if (newLine != oldLine):
                        nu = newLine.split()[0]
                        oldLine = newLine
                        count += 1
                        # searchweb() is defined in part (ii)
                        searchweb((int(nu)))
                readFile.close()
                print "%s deal %s lines" % (filename, count)

For simplicity, this code no longer sorts the httperror IDs into separate categories; the results are stored directly in the following five files:

    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errtime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'

The output log shows how many httperror records were processed in total.

    "D:\Program Files\python27\python.exe" e:/pythonproject/webcrawler/reload.py
    httperror132001-133001.txt deal 21 lines
    httperror2001-3001.txt deal 4 lines
    httperror251001-252001.txt deal 5 lines
    httperror254001-255001.txt deal 1 lines

Third, single-threaded statistics for the unkownsex data

The code is simple: we use a single thread to count the unkownsex records (users whose gender could not be obtained, either because of permission restrictions or because the user did not fill it in). We also checked and found that users with no gender information have no activity time either.

The data format is as follows:

    253042 unkownsex
    253087 unkownsex
    253102 unkownsex
    253118 unkownsex
    253125 unkownsex
    253136 unkownsex
    253161 unkownsex

The code is as follows:

    import os, time

    sumCount = 0
    startTime = time.clock()
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('unkownsex'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            # One record per line, so counting lines counts users
            for line in readFile:
                count += 1
                sumCount += 1
            readFile.close()
            print "%s deal %s lines" % (filename, count)
    print '%s unkowns sex' % (sumCount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

The processing speed is very fast and the output is as follows:

    unkownsex1-1001.txt deal 204 lines
    unkownsex100001-101001.txt deal lines
    unkownsex10001-11001.txt deal 206 lines
    # ... intermediate output omitted
    unkownsex99001-100001.txt deal lines
    unkownsex_re.txt deal 1085 lines
    14223 unkowns sex
    cost time 0.0813142301261 s
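For readers on Python 3, the same per-prefix line count can be sketched as below. This is an illustration under stated assumptions rather than the article's code: the helper name count_lines_by_prefix and the temporary sample files are made up for the example, and time.perf_counter() stands in for time.clock(), which was removed in Python 3.8.

```python
import os
import tempfile
import time


def count_lines_by_prefix(folder, prefix):
    """Return {filename: line_count} for files in folder whose names start with prefix."""
    counts = {}
    for filename in sorted(os.listdir(folder)):
        if filename.startswith(prefix):
            path = os.path.join(folder, filename)
            with open(path, "r", encoding="utf-8") as handle:
                counts[filename] = sum(1 for _ in handle)
    return counts


def demo():
    # Build a tiny stand-in for the ruisi folder, then count and time it.
    with tempfile.TemporaryDirectory() as folder:
        samples = {
            "unkownsex1-1001.txt": "253042 unkownsex\n253087 unkownsex\n",
            "unkownsex_re.txt": "253102 unkownsex\n",
            "correct1-1001.txt": "31024 x 2014-11-11 13:20\n",
        }
        for name, text in samples.items():
            with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        start = time.perf_counter()
        counts = count_lines_by_prefix(folder, "unkownsex")
        elapsed = time.perf_counter() - start
    return counts, elapsed


counts, elapsed = demo()
for name, n in counts.items():
    print("%s deal %s lines" % (name, n))
print("%s unkowns sex" % sum(counts.values()))
```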

Fourth, single-threaded statistics for the correct data

The data format is as follows:

    31024 男 2014-11-11 13:20
    31283 男 2013-3-25 19:41
    31340 保密 2015-2-2 15:17
    31427 保密 2014-8-10 09:17
    31475 保密 2013-7-2 08:59
    31554 保密 2014-10-17 17:02
    31621 男 2015-5-16 19:27
    31872 保密 2015-1-11 16:49
    31915 保密 2014-5-4 11:01
    31997 保密 2015-5-16 20:14

Here 男 means male and 保密 means undisclosed (secret); female users are marked 女.

The code is as follows. The idea is to read the file one line at a time and use line.split() to get the gender field. sumCount counts the total number of users, while boycount, girlcount and secretcount count the male, female and undisclosed users respectively. We still match against Unicode strings.

    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                # The second field of each row is the gender
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
    print "total is %s; %s boys; %s girls; %s secret;" % (sumCount, boycount, girlcount, secretcount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to that file, not the count for that file alone. The output is as follows:

    until correct1-1001.txt, sum is boys; 7 girls; 414 secret;
    until correct100001-101001.txt, sum is boys; girls; 542 secret;
    # ... omitted
    until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
    until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
    total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 3.60047888495 s
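The gender tally can also be sketched in Python 3 with collections.Counter. This is only an illustration: the function name tally_gender and the English label names are assumptions, and the three sample rows are taken from the data format shown above.

```python
from collections import Counter

# Gender labels as they appear in the data, mapped to hypothetical English names.
LABELS = {"\u7537": "boys", "\u5973": "girls", "\u4fdd\u5bc6": "secret"}


def tally_gender(lines):
    """Count rows per gender label; the gender is the second field of each row."""
    tally = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        tally["total"] += 1
        label = LABELS.get(fields[1])
        if label is not None:
            tally[label] += 1
    return tally


sample = [
    "31024 \u7537 2014-11-11 13:20\n",
    "31340 \u4fdd\u5bc6 2015-2-2 15:17\n",
    "31621 \u7537 2015-5-16 19:27\n",
]
result = tally_gender(sample)
print(result["total"], result["boys"], result["girls"], result["secret"])  # 3 2 0 1
```

In Python 3 strings are Unicode by default, so the reload(sys) and sys.setdefaultencoding() calls are not needed; when reading real files, passing encoding="utf-8" to open() would be the assumed equivalent.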

Fifth, multi-threaded statistics

To speed up the statistics, we can take advantage of multithreading.
For comparison, we measure the time against the single-threaded run.

    # encoding: utf-8
    import threading
    import time, os, sys

    # Global counters
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    NUM = 0

    # Inherits from threading.Thread and overrides run(); the thread is started with start()
    # This is much like Java
    class StaFileList(threading.Thread):
        # List of text file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            # A delay can be added here so that the threads interleave visibly
            # instead of running in the order thread-1, 2, 3
            # time.sleep(1)
            # acquire the lock
            if mutex.acquire(1):
                self.staFiles(self.fileList)
                # release the lock
                mutex.release()

        # Process the given list of files and count males and females
        # Note the data synchronization issue here: global variables are used
        def staFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

    def test():
        # files holds a batch of file names; this sets how many files one thread handles
        files = []
        # Holds all threads so that the main thread can wait for every child thread to finish
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            if filename.startswith('correct'):
                files.append(filename)
                i += 1
                # One thread handles 20 files
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # The remaining files, possibly fewer than 20
        if files:
            staThreads.append(StaFileList(files))
        for t in staThreads:
            t.start()
        # The main thread waits for all child threads to exit; would it be faster without this?
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" % (SUM, BOY, GIRL, SECRET)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"

Output

    Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 0.132137192794 s

We find that the time is about the same as with a single thread. Because of the thread synchronization here, acquiring and releasing the lock takes time, and so does switching between threads (saving and restoring each thread's context).
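One way to shorten the time spent holding the lock is to let each thread count into its own private counter and take the lock only to merge the result. The following Python 3 sketch shows that variation; it is not the article's approach, and the names count_in_threads and worker are assumptions for the example. Whether it is actually faster on the real data would need to be measured.

```python
import threading
from collections import Counter

# Gender labels as they appear in the data, mapped to hypothetical English names.
LABELS = {"\u7537": "boys", "\u5973": "girls", "\u4fdd\u5bc6": "secret"}


def count_in_threads(batches):
    """Count gender labels in each batch on its own thread and merge the results."""
    totals = Counter()
    lock = threading.Lock()

    def worker(batch):
        local = Counter()  # private to this thread, so no lock is needed here
        for line in batch:
            fields = line.split()
            if len(fields) >= 2:
                local["total"] += 1
                if fields[1] in LABELS:
                    local[LABELS[fields[1]]] += 1
        with lock:  # held only while merging into the shared totals
            totals.update(local)

    threads = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals


batches = [
    ["31024 \u7537 2014-11-11 13:20\n", "31340 \u4fdd\u5bc6 2015-2-2 15:17\n"],
    ["31621 \u7537 2015-5-16 19:27\n"],
]
result = count_in_threads(batches)
print(result["total"], result["boys"], result["secret"])  # 3 2 1
```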

Sixth, comparing single-threaded and multi-threaded code on more data

This time we process the correct, errtime and unkownsex files together.
Single-threaded code

    # coding=utf-8
    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    unkowncount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Has gender and activity time
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # No activity time, but has gender
        elif filename.startswith("errtime"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # No gender and no time: just count the rows
        elif filename.startswith("unkownsex"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            # count = len(open(newName, 'rU').readlines())
            # For large files use a loop. count starts at -1 so that an empty
            # file gives 0 rows after the final +1
            count = -1
            for count, line in enumerate(open(newName, 'rU')):
                pass
            count += 1
            unkowncount += count
            sumCount += count
            # print "until %s, sum is %s unkownsex" % (filename, unkowncount)
    print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" % (sumCount, boycount, girlcount, secretcount, unkowncount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

    Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
    cost time 1.37444645628 s
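The loop-based line count used for the unkownsex files can be written a little more directly in Python 3 by letting enumerate() start at 1. A minimal sketch, in which the helper name count_lines and the temporary sample files are assumptions for the example:

```python
import os
import tempfile


def count_lines(path):
    """Count lines by looping, without loading the whole file into memory."""
    count = 0  # stays 0 when the file is empty and the loop body never runs
    with open(path, "r", encoding="utf-8") as handle:
        for count, _line in enumerate(handle, start=1):
            pass
    return count


def demo():
    # Compare an empty file with a three-line file.
    with tempfile.TemporaryDirectory() as folder:
        empty = os.path.join(folder, "empty.txt")
        three = os.path.join(folder, "three.txt")
        open(empty, "w", encoding="utf-8").close()
        with open(three, "w", encoding="utf-8") as handle:
            handle.write("253042 unkownsex\n253087 unkownsex\n253102 unkownsex\n")
        return count_lines(empty), count_lines(three)


print(demo())  # (0, 3)
```

Starting the enumeration at 1 removes the need for the initial value of -1 and the trailing +1, while the empty-file case still yields 0.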

Multithreaded code

    __author__ = 'Admin'
    # encoding: utf-8
    # Multithreaded handler
    import threading
    import time, os, sys

    # Global counters
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    UNKOWN = 0

    class StaFileList(threading.Thread):
        # List of text file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            if mutex.acquire(1):
                self.staManyFiles(self.fileList)
                mutex.release()

        # Process the given list of files and count males and females
        # Note the data synchronization issue here
        def staCorrectFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

        def staManyFiles(self, files):
            global SUM, BOY, GIRL, SECRET, UNKOWN
            for name in files:
                if name.startswith('correct'):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # No activity time, but has gender
                elif name.startswith("errtime"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #       " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # No gender and no time: just count the rows
                elif name.startswith("unkownsex"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    # count = len(open(newName, 'rU').readlines())
                    # For large files use a loop. count starts at -1 so that an
                    # empty file gives 0 rows after the final +1
                    count = -1
                    for count, line in enumerate(open(newName, 'rU')):
                        pass
                    count += 1
                    UNKOWN += count
                    SUM += count
                    # print "thread %s, until %s, total is %s; %s unkownsex" % (self.name, name, SUM, UNKOWN)

    def test():
        files = []
        # Holds all threads so that the main thread can wait for every child thread to finish
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            # Every 20 files, create a thread
            if filename.startswith("correct") or filename.startswith("errtime") or filename.startswith("unkownsex"):
                files.append(filename)
                i += 1
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # The remaining files, possibly fewer than 20
        if files:
            staThreads.append(StaFileList(files))
        for t in staThreads:
            t.start()
        # The main thread waits for all child threads to exit
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" % (SUM, BOY, GIRL, SECRET, UNKOWN)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"

Output:

    Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
    cost time 1.23049112201 s
You can see that the multithreaded version is slightly faster than the single-threaded one, and because access is synchronized the totals always come out correct.

Note that inside a Python class, methods and attributes usually have to be accessed through self, which differs considerably from Java.

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            # Calling a method of the same class requires self
            self.staFiles(self.fileList)
            mutex.release()

    total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
    cost time 1.25413238673 s
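The point about self can be seen in a small, self-contained Python 3 sketch. The class FileBatch and its methods are hypothetical and exist only for this illustration; the second method is deliberately wrong to show what happens when self is left out.

```python
class FileBatch:
    def __init__(self, file_list):
        self.file_list = file_list

    def size(self):
        return len(self.file_list)

    def describe(self):
        # Methods and attributes of the instance are reached through self
        return "%d files" % self.size()

    def describe_without_self(self):
        # Deliberately wrong: a bare size() is looked up as a global name
        return "%d files" % size()


batch = FileBatch(["correct1-1001.txt", "errtime1-1001.txt"])
print(batch.describe())  # 2 files
try:
    batch.describe_without_self()
except NameError as error:
    print("NameError:", error)
```

In Java the call size() inside another method would implicitly mean this.size(); in Python the lookup never falls back to the instance, so the explicit self is required.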

That is all for this article; I hope it helps with your studies.
