Using a Python Crawler to Measure the Gender Ratio on a School BBS: Data Processing (Part 3)

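This article mainly covers the data-processing stage; I hope you will read it closely.

1. Data analysis

The crawl left us with text files whose names begin with the prefixes used throughout this article (correct, errTime, notexist, unkownsex and httperror), and these now need to be processed.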

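2. Rollback

We need to reprocess the httperror data.

Because of how the crawler code works (see Part 2 of this series), the same id can show up in several consecutive httperror records in a file:

//httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror

So the code has to allow for this: instead of processing the id on every line, it first checks whether the id repeats the previous one.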
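The check described above can be sketched in a few lines. This is only an illustration written for Python 3 (the article's own code targets Python 2.7), and it compares the id field rather than the whole line; the function name unique_consecutive_ids is hypothetical.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive lines that share it."""
    # Take the first whitespace-separated field of every non-blank line.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby() groups adjacent equal items, so every run produces one key.
    for user_id, _run in groupby(ids):
        yield int(user_id)


sample = ["265002 httperror\n"] * 4 + ["265003 httperror\n"] * 4
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Only adjacent repeats are collapsed, which matches the situation in the httperror files; an id that reappears later in the file would be yielded again.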

Java has buffered readers that avoid reading from the disk too frequently; Python has the same facility, see this article.
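As a rough illustration of buffered reading in Python 3, the built-in open() accepts a buffering argument that sets the buffer size in bytes. The helper name count_lines_buffered, the 1 MiB default and the UTF-8 encoding are assumptions made for this sketch, not something taken from the article.

```python
import os
import tempfile


def count_lines_buffered(path, buffer_size=1024 * 1024):
    """Count the lines of a text file, reading it through a buffer of buffer_size bytes."""
    # A buffering value greater than 1 sets the size of the underlying buffer,
    # so the file object pulls large chunks from disk instead of many small reads.
    with open(path, "r", encoding="utf-8", buffering=buffer_size) as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs anywhere.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as tmp:
        tmp.write("265002 httperror\n" * 3)
    print(count_lines_buffered(tmp.name))  # 3
    os.remove(tmp.name)
```

Python file objects are buffered by default, so an explicit size mostly matters when the default proves too small for very large files.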

import os, re, sys

# searchWeb() is the crawler function from Part 2 of this series.

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
    # The sexRe, timeRe and notexistRe patterns are fused on the next line and
    # their HTML tag literals are missing; compare with the code in Part 2.
    sexRe = re.compile(u'em>\u6027\u522b(.*?)\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4(.*?))\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errTime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'
    # Walk through the files in the folder whose names start with httperror
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('httperror'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName,'r')
            oldLine = '0'
            for line in readFile:
                # newLine is compared with the previous line to detect a repeated id
                newLine = line
                if (newLine != oldLine):
                    nu = newLine.split()[0]
                    oldLine = newLine
                    count += 1
                    searchWeb((int(nu),))
            print "%s deal %s lines" %(filename, count)

For simplicity, this code does not classify the retried httperror ids any further; the results are saved directly into the following five files:

file1 = 'ruisi\\correct_re.txt'
file2 = 'ruisi\\errTime_re.txt'
file3 = 'ruisi\\notexist_re.txt'
file4 = 'ruisi\\unkownsex_re.txt'
file5 = 'ruisi\\httperror_re.txt'

The output log shows how many httperror records were processed:

"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines

3. Counting the unkownsex data in a single thread

The code is simple: a single thread counts the unkownsex users, that is, users whose gender could not be fetched because of permission restrictions or who never filled it in. We also checked and found that users without a gender have no last-activity time either.

The data format is as follows:

253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex

The code:

import os,time

sumCount = 0
startTime = time.clock()
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('unkownsex'):
        count = 0
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        # Iterate over the file that is already open instead of opening it twice
        for line in readFile:
            count += 1
            sumCount +=1
        readFile.close()
        print "%s deal %s lines" %(filename, count)
print '%s unkowns sex' %(sumCount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

It runs very quickly. The output is:

unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#... intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
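For readers on Python 3, the same per-file count can be sketched as below. This is an illustration, not the author's script: the function names are hypothetical, UTF-8 is an assumed file encoding, and time.perf_counter() stands in for time.clock(), which was removed in Python 3.8.

```python
import os
import time


def count_lines_by_prefix(folder, prefix):
    """Return {filename: line_count} for files in folder whose name starts with prefix."""
    counts = {}
    for filename in sorted(os.listdir(folder)):
        if filename.startswith(prefix):
            path = os.path.join(folder, filename)
            # The with-block closes each file as soon as it has been counted.
            with open(path, "r", encoding="utf-8") as handle:
                counts[filename] = sum(1 for _ in handle)
    return counts


def timed_total(folder, prefix):
    """Return (total_lines, elapsed_seconds) for one counting pass."""
    start = time.perf_counter()
    counts = count_lines_by_prefix(folder, prefix)
    return sum(counts.values()), time.perf_counter() - start
```

Calling timed_total(r'E:\pythonProject\ruisi', 'unkownsex') would correspond to the run shown above, assuming the same folder layout.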

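4. Counting the correct data in a single thread

The data format is as follows:

31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14

The code is below. The idea is to read the files line by line and use line.split() to pull out the gender field. sumCount is the total number of users, while boycount, girlcount and secretcount count the male (男), female (女) and undisclosed (保密) users respectively. As before, the gender values are compared against unicode literals.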

import os,sys,time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('correct'):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo = line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
print "total is %s; %s boys; %s girls; %s secret;" %(sumCount, boycount,girlcount,secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to and including that file, not the count for that file alone. The output is:

until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#... omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
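The line-by-line tally used in this section can also be sketched in Python 3 with collections.Counter, which removes the need for one variable per category. The function name tally_gender is hypothetical, and skipping lines with fewer than two fields is an added safety check that the article's code does not have.

```python
from collections import Counter


def tally_gender(lines):
    """Count the second whitespace-separated field (the gender) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            counts[fields[1]] += 1
    return counts


sample = [
    "31024 男 2014-11-11 13:20\n",
    "31283 男 2013-3-25 19:41\n",
    "31340 保密 2015-2-2 15:17\n",
]
counts = tally_gender(sample)
# A Counter returns 0 for keys it has never seen, such as 女 here.
print(counts["男"], counts["女"], counts["保密"])  # 2 0 1
```

In Python 3 every str is already unicode, so the reload(sys) and sys.setdefaultencoding('utf-8') calls from the Python 2 code are not needed.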

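5. Counting the data with multiple threads

To count faster, we can use multiple threads.
As a baseline for comparison, we use the time the single-threaded version needed.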

# encoding: UTF-8
import threading
import time,os,sys

# Global variables
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM =0

# Subclass threading.Thread, override run(), and launch the thread with start()
# This is much like Java
class StaFileList(threading.Thread):
    # List of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        # Adding a delay here makes the multithreading more visible,
        # instead of thread-1, 2, 3 appearing to run in order
        #time.sleep(1)
        # acquire the lock
        if mutex.acquire(1):
            self.staFiles(self.fileList)
            # release the lock
            mutex.release()

    # Process the given list of files and count the users by gender
    # Mind data synchronization here; global gives access to the module-level variables
    def staFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

def test():
    # files holds one batch of file names; the batch size sets how many files a thread handles
    files = []
    # Keep every thread so the main thread can wait for all of them to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith('correct'):
            files.append(filename)
            i+=1
            # One thread handles 20 files
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # The main thread waits for all child threads to exit. Would it be faster
    # without this? Perhaps, but the totals could then be printed too early.
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" %(SUM, BOY,GIRL,SECRET)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

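Output:

Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s

We found the running time to be about the same as the single-threaded version. Thread synchronization is the reason: acquiring and releasing the lock both cost time, and so does saving and restoring state whenever execution switches between threads. In this code each thread also holds the lock for its entire batch of files, so the batches are in effect processed one after another.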
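One way to shorten the time spent holding the lock is to let every thread count into its own local Counter and take the lock only once, when merging. The sketch below is written for Python 3 and is not the author's implementation: count_in_threads is a hypothetical name, and the batches are lists of lines rather than lists of file names so that the example runs without the data files.

```python
import threading
from collections import Counter


def count_in_threads(batches):
    """Tally the gender field of every line, using one thread per batch."""
    totals = Counter()
    lock = threading.Lock()

    def worker(lines):
        local = Counter()  # private to this thread, so no lock is needed here
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:
                local[fields[1]] += 1
        with lock:  # held only for the short merge step
            totals.update(local)

    threads = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every worker before reading the totals
    return totals


demo = [["1 男 2014-1-1 10:00\n", "2 女 2014-1-2 11:00\n"], ["3 男 2014-1-3 12:00\n"]]
result = count_in_threads(demo)
print(result["男"], result["女"])  # 2 1
```

Even with a short critical section, CPython's global interpreter lock keeps pure-Python counting from running in parallel, so any gain would come mainly from overlapping file reads.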

6. Single-threaded versus multithreaded on more data

We can process the correct, errTime and unkownsex files together.
Single-threaded code:

# coding=utf-8
import os,sys,time

reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # Has both gender and last-activity time
    if filename.startswith('correct') :
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # No last-activity time, but has gender
    elif filename.startswith("errTime"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        readFile = open(newName,'r')
        for line in readFile:
            sexInfo =line.split()[1]
            sumCount +=1
            if sexInfo == u'\u7537' :
                boycount += 1
            elif sexInfo == u'\u5973':
                girlcount +=1
            elif sexInfo == u'\u4fdd\u5bc6':
                secretcount +=1
        # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
    # No gender and no time: just count the lines
    elif filename.startswith("unkownsex"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
        # count = len(open(newName,'rU').readlines())
        # For large files use a loop. count starts at -1 to handle an empty file,
        # so that the final +1 gives 0 lines
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
            pass
        count += 1
        unkowncount += count
        sumCount += count
        # print "until %s, sum is %s unkownsex" %(filename, unkowncount)
print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" %(sumCount, boycount,girlcount,secretcount,unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Output:

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
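The enumerate-based line count used for the unkownsex files is worth isolating. The sketch below restates it for Python 3 with a hypothetical helper count_lines; it opens the file in plain "r" mode because the 'U' flag was removed in Python 3.11, and the UTF-8 encoding is an assumption.

```python
import os
import tempfile


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    count = -1  # stays -1 if the loop body never runs, i.e. the file is empty
    with open(path, "r", encoding="utf-8") as handle:
        for count, _line in enumerate(handle):
            pass  # enumerate leaves the index of the last line in count
    return count + 1


# Demo on throwaway files: one empty, one with two records.
folder = tempfile.mkdtemp()
empty_path = os.path.join(folder, "empty.txt")
full_path = os.path.join(folder, "unkownsex_demo.txt")
open(empty_path, "w").close()
with open(full_path, "w", encoding="utf-8") as handle:
    handle.write("253042 unkownsex\n253087 unkownsex\n")
print(count_lines(empty_path), count_lines(full_path))  # 0 2
```

Starting at -1 is what makes the empty file come out as 0 rather than raising a NameError or reporting 1.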

Multithreaded code:

__author__ = 'admin'
# encoding: UTF-8
# Multithreaded program
import threading
import time,os,sys

# Global variables
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
    # List of text file names
    fileList = []

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            self.staManyFiles(self.fileList)
            mutex.release()

    # Process the given list of files and count the users by gender
    # Mind data synchronization here
    # (kept from the previous section; run() does not call it in this version)
    def staCorrectFiles(self, files):
        global SUM, BOY, GIRL, SECRET
        for name in files:
            newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
            readFile = open(newName,'r')
            for line in readFile:
                sexInfo = line.split()[1]
                SUM +=1
                if sexInfo == u'\u7537' :
                    BOY += 1
                elif sexInfo == u'\u5973':
                    GIRL +=1
                elif sexInfo == u'\u4fdd\u5bc6':
                    SECRET +=1
            # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
            #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

    def staManyFiles(self, files):
        global SUM, BOY, GIRL, SECRET,UNKOWN
        for name in files:
            if name.startswith('correct') :
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # No last-activity time, but has gender
            elif name.startswith("errTime"):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName,'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM +=1
                    if sexInfo == u'\u7537' :
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL +=1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET +=1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)
            # No gender and no time: just count the lines
            elif name.startswith("unkownsex"):
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                # count = len(open(newName,'rU').readlines())
                # For large files use a loop. count starts at -1 to handle an empty file,
                # so that the final +1 gives 0 lines
                count = -1
                for count, line in enumerate(open(newName, 'rU')):
                    pass
                count += 1
                UNKOWN += count
                SUM += count
                # print "thread %s, until %s, total is %s; %s unkownsex" %(self.name, name, SUM, UNKOWN)

def test():
    files = []
    # Keep every thread so the main thread can wait for all of them to finish
    staThreads = []
    i = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Create one thread for every 20 files collected
        if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
            files.append(filename)
            i+=1
            if i == 20 :
                staThreads.append(StaFileList(files))
                files = []
                i = 0
    # The leftover files, most likely fewer than 20
    if files:
        staThreads.append(StaFileList(files))
    for t in staThreads:
        t.start()
    # The main thread waits for all child threads to exit
    for t in staThreads:
        t.join()

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    mutex = threading.Lock()
    test()
    print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" %(SUM, BOY,GIRL,SECRET,UNKOWN)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
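As the timings show, the multithreaded version is still slightly faster than the single-threaded one here (1.23 s versus 1.37 s). Because the threads are synchronized, the totals match the single-threaded results.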
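The same three-way split by file prefix can also be sketched without global variables or an explicit lock, using concurrent.futures from the Python 3 standard library. This is an alternative technique, not the author's code: count_file, count_folder, the max_workers value and the UTF-8 encoding are all assumptions made for the illustration.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

GENDER_PREFIXES = ("correct", "errTime")  # files whose lines carry a gender field
UNKNOWN_PREFIX = "unkownsex"              # files where only the line count matters


def count_file(path):
    """Return a Counter for one file, keyed by gender value or by 'unkownsex'."""
    name = os.path.basename(path)
    counts = Counter()
    with open(path, "r", encoding="utf-8") as handle:
        if name.startswith(GENDER_PREFIXES):
            for line in handle:
                fields = line.split()
                if len(fields) >= 2:
                    counts[fields[1]] += 1
        elif name.startswith(UNKNOWN_PREFIX):
            counts[UNKNOWN_PREFIX] = sum(1 for _ in handle)
    return counts


def count_folder(folder, max_workers=4):
    """Count every relevant file in folder and merge the per-file results."""
    wanted = GENDER_PREFIXES + (UNKNOWN_PREFIX,)
    paths = [os.path.join(folder, name)
             for name in sorted(os.listdir(folder))
             if name.startswith(wanted)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The merge happens only in the calling thread, so no lock is required.
        for counts in pool.map(count_file, paths):
            total.update(counts)
    return total
```

Each worker returns its own Counter, so the shared state that required the mutex in the code above disappears; sum(total.values()) then gives the overall user count.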

Note that inside a class, Python frequently requires an explicit self, which is quite different from Java.

    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            # Calling a method of the same class requires self
            self.staFiles(self.fileList)
            mutex.release()

total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s
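To make the point about self concrete, here is a tiny self-contained Python 3 class. The class LineTally and its methods are hypothetical and exist only for this illustration.

```python
class LineTally:
    """Counts non-blank lines; every attribute and method is reached through self."""

    def __init__(self):
        self.total = 0  # instance attribute, stored on self

    def _bump(self):
        self.total += 1

    def add_line(self, line):
        if line.strip():
            # A bare _bump() would raise NameError: unlike Java, Python does
            # not look up unqualified names on the current object.
            self._bump()


tally = LineTally()
for text in ["253042 unkownsex\n", "\n", "253087 unkownsex\n"]:
    tally.add_line(text)
print(tally.total)  # 2
```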

That is all for this article; I hope it helps with your studies.
