In Jiangbei, Collecting a Scanner Dictionary with Python Multithreading

Last updated: 2018-12-05 Source: Internet

Uploaded by: User


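Hu Ge gave me a task: slim down the scanner dictionary he handed me.

My plan is as follows:

1. From a large pile of files, filter out the URLs that were constructed by scanners.

2. Count and rank the filtered URLs, compare them with the dictionary Hu Ge gave me, and keep the entries that match well.

3. Extract URLs from third-party web applications as part of the dictionary. Many sites now run third-party web applications such as DedeCMS (織夢CMS) and WordPress, so scanning for their known paths is especially accurate.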
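Step 2 of the plan could be sketched roughly like this. It is only an illustration in Python 3: `rank_urls` and `keep_confirmed` are hypothetical helpers, the `min_hits` threshold is an assumption (the plan does not say how high the match rate has to be), and the sample paths are made up.

```python
from collections import Counter


def rank_urls(urls):
    """Return (url, count) pairs, most frequently requested first."""
    return Counter(urls).most_common()


def keep_confirmed(dictionary, ranked, min_hits=2):
    """Keep dictionary entries that were requested at least min_hits times.

    min_hits is an assumed threshold, not something the plan specifies.
    """
    counts = dict(ranked)
    return [entry for entry in dictionary if counts.get(entry, 0) >= min_hits]


# Made-up sample data, for illustration only.
scanned = ["/admin/", "/admin/", "/wp-login.php", "/admin/", "/wp-login.php", "/a.bak"]
given_dictionary = ["/admin/", "/wp-login.php", "/old/", "/a.bak"]

ranked = rank_urls(scanned)
trimmed = keep_confirmed(given_dictionary, ranked)
print(trimmed)
```

Entries that scanners never or rarely asked for drop out of the dictionary, while the order of the original dictionary is preserved.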

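Question 1: If the files Hu Ge gave me contain too little information, every filtered entry will match the dictionary poorly, and that would be a real trap. (Solved)

Good question. But at a glance there are about 60,000 records, so a shortage of data should not be a problem.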

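Question 2: How do I tell that a URL came from a scanner rather than from a normal visitor? (Solved)

The approach has the following parts:

1) Any request containing a keyword such as fuck, sql, or webshell (more keywords are needed, of course) is treated as scanner traffic, because normal users never request such links. Record the IP, then collect every URL that IP requested.

2) Compute the TOP 10 requesting IPs in the file (the n in TOP n can be set as needed). Add the URLs requested by these IPs to the dictionary, since normal users do not send requests this frequently.

3) Treat any IP that probes sensitive directories such as the admin back end as malicious, and record every URL it requested as dictionary entries. If the IP belongs to a normal user, it will have made very few requests, so the small redundancy it adds to the dictionary is acceptable;

if the IP is a scanner, we harvest its dictionary and add it to ours.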
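The three rules above can be combined in one pass over the records. The following is a minimal Python 3 sketch, not the final program: `collect_scanner_urls`, `KEYWORDS`, and `SENSITIVE_PREFIXES` are hypothetical names, the keyword and prefix values are placeholders for whatever keyword.txt would hold, and records are assumed to be already parsed into (ip, url) pairs.

```python
from collections import Counter, defaultdict

# Placeholder lists; the real keywords would be loaded from keyword.txt.
KEYWORDS = ("sql", "webshell")
SENSITIVE_PREFIXES = ("/admin", "/manage")


def collect_scanner_urls(records, top_n=10):
    """records is an iterable of (ip, url) pairs. Return the set of URLs
    requested by any IP that one of the three rules marks as a scanner."""
    records = list(records)
    urls_by_ip = defaultdict(set)
    hits = Counter()
    for ip, url in records:
        urls_by_ip[ip].add(url)
        hits[ip] += 1

    suspicious = set()
    for ip, url in records:
        lowered = url.lower()
        # Rule 1: the URL contains a keyword no normal visitor would request.
        if any(word in lowered for word in KEYWORDS):
            suspicious.add(ip)
        # Rule 3: the URL probes a sensitive directory such as an admin back end.
        if lowered.startswith(SENSITIVE_PREFIXES):
            suspicious.add(ip)
    # Rule 2: the top_n most active IPs are treated as scanners.
    suspicious.update(ip for ip, _ in hits.most_common(top_n))

    found = set()
    for ip in suspicious:
        found |= urls_by_ip[ip]
    return found
```

Note that once an IP is flagged, every URL it requested is collected, not only the one that triggered the rule; that is what lets a single suspicious request expose the scanner's whole dictionary.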

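Question 3: What if the scanner's own dictionary is redundant? Then I cannot reach my goal of slimming the dictionary! (Unsolved)

Hu Ge says to set this question aside for now. Personally, I think it is already beyond what automation can handle.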

=======================================================================

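Step 1: System architecture (well, it hardly counts as architecture). My work:

Read keyword.txt into keywords[]

Thread 1

    Check whether the Top N file exists; if it does, stop the thread
    Compute the Top 10 IPs

Thread 2

    Check whether the Top N file exists; if it does not, wait() for thread 1 to finish
    Load the Top N IPs into TopN[], then read in.txt
    while not EOF
        if the record matches an entry in TopN, it is a hit: append it to out.txt

Thread 3

    Read in.txt and analyse each record
    while not EOF
        if the record matches an entry in keywords, it is a hit
            if the record's IP is already in DirtyIP[]
                pass
            else
                DirtyIP.append(IP)
        if the record's IP is in DirtyIP
            write it to out.txt
        else
            pass

When all threads have finished

    Deduplicate out.txt (this has to be the very last step; it cannot be done while reading)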
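The "thread 2 waits for thread 1" step in the outline could be expressed with `threading.Event`. This is just one possible sketch in Python 3 and an assumption on my part; the names `top_n_ready`, `compute_top_n`, and `collect_urls` are hypothetical, and the sample records are made up.

```python
import threading

top_n_ready = threading.Event()  # set once thread 1 has filled the list
top_n = []
results = []


def compute_top_n(ips, n):
    """Thread 1: fill top_n with the n most frequent IPs, then signal readiness."""
    counts = {}
    for ip in ips:
        counts[ip] = counts.get(ip, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    top_n.extend(ranked[:n])
    top_n_ready.set()


def collect_urls(records):
    """Thread 2: block until thread 1 is done, then keep URLs from Top N IPs."""
    top_n_ready.wait()
    for ip, url in records:
        if ip in top_n:
            results.append(url)


# Made-up sample records, for illustration only.
records = [("10.0.0.1", "/a"), ("10.0.0.2", "/b"), ("10.0.0.1", "/c")]
t2 = threading.Thread(target=collect_urls, args=(records,))
t1 = threading.Thread(target=compute_top_n, args=([ip for ip, _ in records], 1))
t2.start()  # started first on purpose: it simply waits for the event
t1.start()
t1.join()
t2.join()
print(results)
```

A simpler alternative is to call `join()` on thread 1 before starting thread 2, at the cost of running the two one after the other.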

=======================================================================

My work (in more detail):

flag = whether the Top N file exists

Threads 2 and 3 can share one helper function:

    bool isHitTarget(array[], string record)
        hit = false
        for (element in array)
            if element is a substring of record
                hit = true
        if hit == true
            append to out.txt
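In Python the helper shrinks to a single expression. A minimal sketch, assuming Python 3; `is_hit_target` is a hypothetical name that mirrors the pseudocode, and writing to out.txt is left to the caller.

```python
def is_hit_target(needles, record):
    """Return True if any string in needles is a substring of record."""
    return any(needle in record for needle in needles)


print(is_hit_target(["sql", "webshell"], "GET /webshell.php"))
```

`any()` stops at the first match, so it does not keep scanning after a hit the way the pseudocode loop does. One caveat: an empty string is a substring of everything, so blank lines must be dropped from the keyword list before calling it.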

How do I deduplicate a large data set? I wrote the routine below myself. Can it withstand big data, and is it efficient? Only running it will tell...

    void getSingleRecord(filename)
        while not EOF
            record = read from out.txt
            if record in new_records
                pass
            else
                new_records.append(record)
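On the efficiency question: testing `record in new_records` against a list scans the whole list every time, so the routine slows down quadratically as the file grows. A set makes each membership test roughly constant time. Here is a small Python 3 sketch; `unique_lines` is a hypothetical helper, not part of the original program.

```python
def unique_lines(lines):
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for line in lines:
        record = line.rstrip("\r\n")
        if record and record not in seen:
            seen.add(record)
            yield record


print(list(unique_lines(["/a\n", "/b\r\n", "/a\n", "\n", "/b"])))
```

Because it accepts any iterable of strings, an open file object can be passed in directly, and the lines are processed one at a time; only the set of distinct records is held in memory.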

Because every thread writes to the same out.txt file, a mutex is needed. How is it written?

Create the lock: g_mutex = threading.Lock()

Acquire the lock: g_mutex.acquire() ...

Release the lock: g_mutex.release()
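The acquire/release pair can also be written as a `with` block, which releases the lock even if the write raises an exception. A minimal Python 3 sketch: `write_paths` is a hypothetical helper, and an in-memory `io.StringIO` stands in for out.txt so the example needs no real file.

```python
import io
import threading

g_mutex = threading.Lock()


def write_paths(out, paths):
    """Write each path on its own line, holding the lock for every write."""
    for path in paths:
        with g_mutex:  # acquired here, released when the block is left
            out.write(path + "\n")


out = io.StringIO()  # stands in for out.txt
workers = [
    threading.Thread(target=write_paths, args=(out, ["/a"] * 100)),
    threading.Thread(target=write_paths, args=(out, ["/b"] * 100)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
lines = out.getvalue().splitlines()
print(len(lines))
```

Both threads share one output object and one lock here. If each thread opened the file separately in "w" mode, they would truncate each other's output no matter how carefully the lock was used.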

The three threads are handled by three functions:

def getTopN_IP(n, filename) corresponds to th1 = threading.Thread(target=getTopN_IP, args=(n, filename))

def getURLFromTopN_IP(topN, filename) corresponds to th2 = threading.Thread(target=getURLFromTopN_IP, args=(topN, filename))

def getURLFromDirty_IP(filename) corresponds to th3 = threading.Thread(target=getURLFromDirty_IP, args=(filename,))

Wait for the threads to finish: th1.join() th2.join() th3.join()

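How do I extract the IP and the key URL from a record?

My first thought was a regular expression. Here is a regex for IP addresses (I will trust that it is correct for now):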

((?:(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d)))\.){3}(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d))))
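Rather than taking the pattern on faith, it can be tried against a sample log line. A minimal Python 3 sketch; `find_ips` is a hypothetical helper and the log line is a shortened version of an IIS record.

```python
import re

# The pattern quoted above, unchanged, split over two lines for readability.
IP_PATTERN = re.compile(
    r"((?:(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d)))\.){3}"
    r"(?:25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d))))"
)


def find_ips(text):
    """Return every substring of text that the pattern accepts as an IPv4 address."""
    # group(0) is the whole match. findall() would return tuples instead,
    # because the pattern contains several capturing groups.
    return [match.group(0) for match in IP_PATTERN.finditer(text)]


line = "2012-03-17 07:21:50 192.168.100.20 GET / - 80 - 49.94.46.156 Mozilla/5.0 200 0 0 0"
print(find_ips(line))
```

The pattern finds both the server IP and the client IP, so a regex alone does not say which one is the visitor; the position in the record still matters.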

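But I have a better way~~

The data Hu Ge gave me are all IIS server logs, so every record follows a fixed format.

We can exploit that format: split the record, and the i-th and j-th elements of the resulting array are the IP address and the URL we want. Not a bad method, right?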
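A sketch of the split approach in Python 3 follows. `parse_record` is a hypothetical helper, and the default indexes 8 and 4 are assumptions that only hold for a log layout like the sample line below; a server logging different fields would need different values.

```python
def parse_record(line, ip_index=8, url_index=4):
    """Return (client_ip, url) from one space-separated IIS log line, or None."""
    if line.startswith("#"):
        return None  # header lines such as "#Fields: ..." are not requests
    fields = line.split()
    if len(fields) <= max(ip_index, url_index):
        return None  # too short to be a complete record
    return fields[ip_index], fields[url_index]


sample = "2012-03-17 07:21:50 192.168.100.20 GET / - 80 - 49.94.46.156 Mozilla/5.0 200 0 0 0"
print(parse_record(sample))
```

Returning None for header and truncated lines keeps the caller from crashing with an IndexError halfway through a large log.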

==========================================================================================

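Step 2: Testing each feature

1. Start with sub-features such as the mutex. Python's threading module really is interesting. When you pass args=(a, b) and there is only one argument, you still have to add a trailing ',', otherwise you keep getting: TypeError: getURLFromDirty_IP() takes exactly 1 argument (6 given). (Without the comma, args is just the string "in.txt", and its 6 characters are passed as 6 separate arguments.) The Python developers clearly paid attention in language class: args is plural, so it cannot be singular... I spent ages chasing this error; even when comparing against other people's code, the problem is very hard to spot.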

    import threading

    def getURLFromDirty_IP(filename):
        print("3")

    if __name__ == "__main__":
        infile = "in.txt"
        # Leaving out the ',' in args here raises the TypeError
        th3 = threading.Thread(target=getURLFromDirty_IP, args=(infile,))
        th3.start()
        th3.join()
        print("Hello World")
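To see what the target function actually receives in each case, it helps to use a function that accepts any number of arguments. A small Python 3 sketch; `show_args` and `received` are hypothetical names used only for this demonstration.

```python
import threading

received = []


def show_args(*args):
    # Accept any number of arguments and record what Thread actually passed.
    received.append(args)


infile = "in.txt"

# (infile) is just a parenthesised string, so its characters are unpacked
# one by one as separate arguments.
th_wrong = threading.Thread(target=show_args, args=(infile))
th_wrong.start()
th_wrong.join()

# (infile,) is a one-element tuple, so the function gets the whole string.
th_right = threading.Thread(target=show_args, args=(infile,))
th_right.start()
th_right.join()

print(received)
```

The first call receives six one-character strings and the second receives the file name as a single argument, which is exactly the difference the trailing comma makes.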

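Along the way I hit a problem: the IIS log format is configurable, which makes my program far too limited. It has to change, and the way to change it is to match with regular expressions!!

Good grief, a day later it is regex after all! Fortunately, Python supports regular expressions well!!

2. Python thread locks: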

    mutex = threading.Lock()  # create the thread lock; the threads compete to write the same file
    mutex.acquire()           # take the lock (blocks until it is free)
    out.write(Path + "\n")
    mutex.release()           # release the lock

3. Checking whether a file or folder exists in Python:

    import os
    os.path.isfile(infile)     # False if infile is not an existing file, True if it is
    os.path.exists(directory)  # False if the directory does not exist

Step 3: Version 1.0 is finally done (regex still missing!), with more testing to come. The comments are reasonably good~~

    # To change this template, choose Tools | Templates
    # and open the template in the editor.
    __author__ = "Administrator"
    __date__ = "$2012-10-30 17:13:46$"

    import os
    import threading

    topN_IP = []
    n = 10  # the n in TOP N; defaults to 10
    infile = "../infile/"
    outfile = "../outfile/"
    topNFile = "../topNFile/"  # must name a file, not a directory, for the cache to be used
    dirtyFile = "../dirtyFile/dirtywords.txt"
    mutex = threading.Lock()  # thread lock: two threads write to the same output file


    def getTopN_IP(n, infile, outfile):
        # If the Top N file already exists, assume an earlier run produced it and reuse it.
        if os.path.isfile(topNFile):
            print(topNFile + " already exists, great~")
            with open(topNFile, "r") as f:
                for tmpLine in f:
                    if tmpLine.strip():
                        topN_IP.append(tmpLine.strip())
            if not topN_IP:
                print("The file exists but is empty; please add the TOP_N_IP entries again")
            return
        IPs = []
        with open(infile, "r") as f:
            for tmpLine in f:
                tmpList = tmpLine.split(' ')
                # The files we parse are IIS logs such as:
                # 2012-03-17 07:21:50 192.168.100.20 GET / - 80 - 49.94.46.156 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_7_3)+AppleWebKit/534.53.11+(KHTML,+like+Gecko)+Version/5.1.3+Safari/534.53.10 200 0 0 0
                # The structure is clear and regular, so no regex is needed here.
                # The ninth field is the client IP: tmpList[8]
                if tmpLine.startswith("#") or len(tmpList) < 9:
                    continue  # skip IIS header lines and incomplete records
                IPs.append(tmpList[8])
        # Count the requests made by each IP.
        IPDict = {}
        for tmp in IPs:
            IPDict[tmp] = IPDict.get(tmp, 0) + 1
        # Sort the dictionary: key=lambda e: e[1] sorts by value, key=lambda e: e[0] would sort by key.
        # IPDict.items() presents the dictionary as a collection of (IP, count) tuples.
        # A lambda is an anonymous function: parameters before the colon, return value after it.
        sortIP = sorted(IPDict.items(), key=lambda e: e[1], reverse=True)
        # Each element is an (IP, count) tuple; keep only the IPs of the first n entries.
        for tmp in sortIP[:n]:
            topN_IP.append(tmp[0])


    def getURLFromTopN_IP(topN_IP, infile, outfile):
        if not topN_IP:
            print("The Top N list is empty")
            return
        # Again rely on the IIS log layout to get the path (regex is too painful): tmpList[4]
        with open(infile, "r") as f, open(outfile, "a") as out:
            for tmpLine in f:
                tmpList = tmpLine.split(' ')
                if tmpLine.startswith("#") or len(tmpList) < 9:
                    continue
                IP = tmpList[8]
                Path = tmpList[4]
                if IP in topN_IP:
                    with mutex:  # hold the lock while writing and flushing one line
                        out.write(Path + "\n")
                        out.flush()


    def getURLFromDirty_IP(infile, outfile, dirtyFile):
        # Load the dirty keywords, dropping newlines and blank lines.
        with open(dirtyFile, "r") as dfile:
            dirtywords = [line.strip() for line in dfile if line.strip()]
        # Substring matching
        with open(infile, "r") as f, open(outfile, "a") as out:
            for tmpLine in f:
                tmpList = tmpLine.split(' ')
                if tmpLine.startswith("#") or len(tmpList) < 5:
                    continue
                if any(word in tmpLine for word in dirtywords):
                    with mutex:  # hold the lock while writing and flushing one line
                        out.write(tmpList[4] + "\n")
                        out.flush()


    def getSingleRecord(outfile):
        with open(outfile, "r") as f:
            allList = [line.strip() for line in f if line.strip()]
        singleList = dict.fromkeys(allList)  # deduplicate, keeping first-seen order
        with open(outfile, "w") as f2:
            for word in singleList:
                f2.write(word + "\n")


    if __name__ == "__main__":
        while True:
            tmpfile = input("Enter the file name (type 'hehe' to quit): ")
            if tmpfile == "hehe":
                print("See you next time~~")
                break
            infile = "../infile/" + tmpfile
            if not os.path.isfile(infile):
                print("The file you entered does not exist~~")
                continue
            outfile = "../outfile/" + tmpfile[0:len(tmpfile) - 4] + "_out.txt"
            print("Output file: " + outfile)
            del topN_IP[:]              # start each run with an empty Top N list
            open(outfile, "w").close()  # truncate the output once; the threads append to it
            th1 = threading.Thread(target=getTopN_IP, args=(n, infile, outfile))
            th2 = threading.Thread(target=getURLFromTopN_IP, args=(topN_IP, infile, outfile))
            th3 = threading.Thread(target=getURLFromDirty_IP, args=(infile, outfile, dirtyFile))
            th1.start()
            th1.join()  # th2 needs the Top N list, so wait for th1 first
            th2.start()
            th3.start()
            th2.join()
            th3.join()
            # Deduplicate the final result
            getSingleRecord(outfile)
            print("Done; check the output folder for the result: " + outfile)

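This article was originally written in Chinese and published on aliyun.com; an English version is also provided, for information purposes only. This website makes no representation or warranty, express or implied, as to the accuracy, completeness, or reliability of the article or any translation of it. If you have any concerns or complaints about the article, please email info-contact@alibabacloud.com with the details. Staff will contact you within 5 working days, and infringing content will be removed once verified.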
