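Tags: study notes
Log analysis
Prerequisites for analysis
Semi-structured data
- Logs are semi-structured data: organized and formatted. Because a log can be split into rows and columns, it can be treated as a table and its contents analyzed.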
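The rows-and-columns view needs some care in practice. As a small sketch (using the sample log line from these notes), splitting on whitespace alone breaks the bracketed timestamp and the quoted fields apart, which is why the notes go on to use a custom splitter and a regular expression:

```python
line = ('123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] '
        '"GET / HTTP/1.1" 200 8642 "-" '
        '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
        '+http://www.baidu.com/search/spider.html)"')

# A plain whitespace split gives 15 pieces instead of the 9 logical columns.
tokens = line.split()
print(len(tokens))            # 15
print(tokens[3], tokens[4])   # the timestamp is cut in two
```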
Text analysis
Extracting data
1. Splitting
    import datetime

    line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] "GET / HTTP/1.1" 200 8642 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''

    CHARS = set(" \t")   # field separators

    def makekey(line: str):
        """Yield the fields of a log line, keeping [...] and "..." sections whole."""
        start = 0
        skip = False     # True while inside brackets or quotes
        for i, c in enumerate(line):
            if not skip and c in '"[':
                start = i + 1
                skip = True
            elif skip and c in '"]':
                skip = False
                yield line[start:i]
                start = i + 1
                continue
            if skip:
                continue
            if c in CHARS:
                if start == i:        # separator right after a field that was already yielded
                    start = i + 1
                    continue
                yield line[start:i]
                start = i + 1
        else:
            if start < len(line):     # trailing field with no separator after it
                yield line[start:]

    # Fields named '' are placeholders for columns that are not used.
    names = ('remote', '', '', 'datetime', 'request', 'status', 'length', '', 'useragent')
    ops = (None, None, None,
           lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
           lambda request: dict(zip(['method', 'url', 'protocol'], request.split())),
           int, int, None, None)

    def extract(line: str):
        return dict(map(lambda item: (item[0], item[2](item[1]) if item[2] is not None else item[1]),
                        zip(names, makekey(line), ops)))

    print(extract(line))
2. Splitting with a regular expression
    import datetime
    import re

    PATTERN = r'''(?P<ip>[\d.]{7,})\s-\s-\s\[(?P<datetime>[^\[\]]+)\]\s"(?P<method>[^"\s]+)\s(?P<url>[^"\s]+)\s(?P<protocol>[^"\s]+)"\s(?P<status>\d{3})\s(?P<size>\d+)\s"(?:.+)"\s"(?P<useragent>[^"]+)"'''
    pattern = re.compile(PATTERN)

    # Converters for the named groups; groups not listed here stay as strings.
    ops = {
        'datetime': (lambda x: datetime.datetime.strptime(x, '%d/%b/%Y:%H:%M:%S %z')),
        'status': int,
        'size': int,
    }

    def extract(text):
        mat = pattern.match(text)
        return {k: ops.get(k, lambda x: x)(v) for k, v in mat.groupdict().items()}
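Exception handling
- Logs inevitably contain some lines that do not match the pattern, and these need to be handled.
- The match method is used here; it may fail to match and return None, so a check must be added.
- One approach is to raise an exception so that the caller receives it and decides how to handle it.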
    import datetime
    import re

    PATTERN = r'''(?P<ip>[\d.]{7,})\s-\s-\s\[(?P<datetime>[^\[\]]+)\]\s"(?P<method>[^"\s]+)\s(?P<url>[^"\s]+)\s(?P<protocol>[^"\s]+)"\s(?P<status>\d{3})\s(?P<size>\d+)\s"(?:.+)"\s"(?P<useragent>[^"]+)"'''
    pattern = re.compile(PATTERN)

    ops = {
        'datetime': (lambda x: datetime.datetime.strptime(x, '%d/%b/%Y:%H:%M:%S %z')),
        'status': int,
        'size': int,
    }

    def extract(text) -> dict:
        mat = pattern.match(text)
        if mat:
            return {k: ops.get(k, lambda x: x)(v) for k, v in mat.groupdict().items()}
        else:
            raise Exception('No match')
    # Alternative: return None for a non-matching line and let the caller skip it.
    def extract(text) -> dict:
        mat = pattern.match(text)
        if mat:
            return {k: ops.get(k, lambda x: x)(v) for k, v in mat.groupdict().items()}
        else:
            return None
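To illustrate what "the caller handles it" can look like with the raising variant, here is a minimal sketch. The pattern is a deliberately simplified stand-in, and the names SIMPLE, extract_or_raise and parse_all are hypothetical; ValueError is used instead of a bare Exception as an assumption about what a caller would want to catch.

```python
import re

# Hypothetical, much-simplified pattern: an IP followed by a 3-digit status.
SIMPLE = re.compile(r'(?P<ip>[\d.]{7,})\s(?P<status>\d{3})$')

def extract_or_raise(text: str) -> dict:
    mat = SIMPLE.match(text)
    if mat is None:
        raise ValueError('No match: {!r}'.format(text))
    return {'ip': mat.group('ip'), 'status': int(mat.group('status'))}

def parse_all(lines):
    """Return (parsed records, number of skipped lines)."""
    records, skipped = [], 0
    for text in lines:
        try:
            records.append(extract_or_raise(text.strip()))
        except ValueError:
            skipped += 1      # the caller decides: here bad lines are just counted
    return records, skipped

records, skipped = parse_all(['10.0.0.1 200', 'garbage', '10.0.0.2 404'])
print(records, skipped)
```

Returning None instead keeps the loop free of try/except, at the cost of every caller having to remember the check.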
Sliding window
Data loading
    def load(path):
        """Yield the parsed fields of every matching line; skip lines that do not match."""
        with open(path) as f:
            for line in f:
                fields = extract(line)
                if fields:
                    yield fields
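Time window analysis
Concepts
- Much data, logs for example, is time-related and is produced in chronological order.
- When such data is analyzed, values are computed over time.
- interval is the time between two consecutive evaluations.
- width is the width of the time window, that is, the span of data covered by one evaluation.
When width > interval: consecutive windows overlap, so data in the overlap is evaluated more than once.
When width = interval: consecutive windows are adjacent with no overlap, so each item is evaluated once.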
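The two cases can be made concrete with a small sketch that lists the span covered by each evaluation. The function name windows and the use of plain second offsets are assumptions made for illustration only.

```python
def windows(total: int, width: int, interval: int):
    """Yield (begin, end) second offsets of each evaluation window.

    An evaluation happens every `interval` seconds and looks back `width` seconds.
    """
    end = interval
    while end <= total:
        yield max(0, end - width), end
        end += interval

print(list(windows(20, 10, 5)))   # [(0, 5), (0, 10), (5, 15), (10, 20)]  overlapping
print(list(windows(20, 5, 5)))    # [(0, 5), (5, 10), (10, 15), (15, 20)] adjacent
```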
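Time-series data
- In production environments, data produced by logging, monitoring and the like is time-related: it is generated and recorded in chronological order, so it is generally analyzed by time.
Basic program structure for data analysis
- A function that produces random numbers endlessly simulates time-related data; it yields a dict holding the time and a random number.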
    import random
    import datetime
    import time

    def source():
        # Endless data source: one random value per second, stamped with the current time.
        while True:
            yield {'value': random.randint(1, 100), 'datetime': datetime.datetime.now()}
            time.sleep(1)

    s = source()
    items = [next(s) for _ in range(3)]

    def handler(iterable):
        # Average of the 'value' fields.
        return sum(map(lambda item: item['value'], iterable)) / len(iterable)

    print(items)
    print("{:.2f}".format(handler(items)))
Window function implementation
    import random
    import datetime
    import time

    def source(second=1):
        while True:
            yield {'value': random.randint(1, 100),
                   'datetime': datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=8)))}
            time.sleep(second)

    def window(iterator, handler, width: int, interval: int):
        # start lies far in the past, so the first item triggers an evaluation at once.
        start = datetime.datetime.strptime('20170101 000000 +0800', '%Y%m%d %H%M%S %z')
        current = datetime.datetime.strptime('20170101 010000 +0800', '%Y%m%d %H%M%S %z')
        buffer = []
        # Data newer than (current - delta) is carried over into the next window.
        delta = datetime.timedelta(seconds=width - interval)
        while True:
            data = next(iterator)
            if data:
                buffer.append(data)
                current = data['datetime']
            if (current - start).total_seconds() >= interval:
                ret = handler(buffer)
                print('{:.2f}'.format(ret))
                start = current
                buffer = [x for x in buffer if x['datetime'] > current - delta]

    def handler(iterable):
        return sum(map(lambda item: item['value'], iterable)) / len(iterable)

    window(source(), handler, 10, 5)   # runs until interrupted
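Dispatching
Producer-consumer model
The queue module: queues
queue.Queue(maxsize=0)
- Creates a FIFO queue and returns a Queue object.
- If maxsize is less than or equal to 0, the queue size is unlimited.
Queue.get(block=True, timeout=None)
- Removes an item from the queue and returns it.
- block controls blocking; timeout is the time limit in seconds.
- If block is True and timeout is None, the call blocks until an item is available.
- If block is True and timeout has a value, the call blocks for at most that many seconds and then raises the Empty exception.
- If block is False, the call does not block and timeout is ignored: it either returns an item immediately or raises the Empty exception.
Queue.get_nowait()
- Equivalent to get(block=False).
Queue.put(item, block=True, timeout=None)
- Adds an item to the queue.
- block=True, timeout=None: blocks until a free slot is available.
- block=True, timeout=5: blocks for up to 5 seconds and then raises the Full exception.
- block=False: timeout is ignored and the call returns immediately; it either adds the item or raises the Full exception.
Queue.put_nowait(item)
- Equivalent to put(item, block=False).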
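The blocking rules above can be tried out with a queue that holds a single item. This is only a sketch; the variable names are chosen for illustration.

```python
import queue

q = queue.Queue(maxsize=1)   # room for a single item
q.put('a')                   # succeeds immediately

try:
    q.put('b', block=True, timeout=0.1)   # full: waits 0.1 s, then raises Full
except queue.Full:
    put_result = 'Full'

try:
    q.put_nowait('b')        # same as put('b', block=False): raises at once
except queue.Full:
    put_nowait_result = 'Full'

first = q.get()              # returns 'a'

try:
    q.get(block=False)       # empty now: raises Empty at once
except queue.Empty:
    get_result = 'Empty'

print(put_result, put_nowait_result, first, get_result)
```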
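Dispatcher implementation
The producer (data source) produces data, which is buffered in a message queue.
Data processing flow:
Data loading -> extraction -> analysis (sliding window function)
- When processing large amounts of data, multiple consumers may be needed.
- A dispatcher (scheduler) is needed to distribute the data to the different consumers.
- Each consumer has its own handler function for the data it receives, so a registration mechanism is needed.
Data loading -> extraction -> dispatch -> analysis function 1 & analysis function 2
Dispatcher code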
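    import threading
    from queue import Queue

    def dispatcher(src):
        reg_handler = []   # one consumer thread per registered handler
        queues = []        # one queue per registered handler

        def reg(handler, width, interval):
            q = Queue()
            queues.append(q)
            # window must be the queue-based version that reads items with q.get()
            thrd = threading.Thread(target=window, args=(q, handler, width, interval))
            reg_handler.append(thrd)

        def run():
            for i in reg_handler:
                i.start()
            for item in src:
                for q in queues:   # every consumer receives every item
                    q.put(item)

        return reg, run

    reg, run = dispatcher(load('test.log'))
    reg(handler, 10, 5)
    run()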
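The dispatcher above depends on window and load and its consumer threads never finish, so here is a self-contained sketch of the same register-and-fan-out idea that terminates. The STOP sentinel, the consumer function and the results list are assumptions added for the example and are not part of the original design.

```python
import threading
from queue import Queue

STOP = object()   # hypothetical sentinel telling a consumer to finish

def consumer(q: Queue, handler, results: list):
    """Collect items from q until STOP arrives, then store the handler's result."""
    buffer = []
    while True:
        item = q.get()
        if item is STOP:
            break
        buffer.append(item)
    results.append(handler(buffer))

def dispatcher(src):
    threads, queues, results = [], [], []

    def reg(handler):
        q = Queue()
        queues.append(q)
        threads.append(threading.Thread(target=consumer, args=(q, handler, results)))

    def run():
        for t in threads:
            t.start()
        for item in src:
            for q in queues:      # every consumer sees every item
                q.put(item)
        for q in queues:
            q.put(STOP)
        for t in threads:
            t.join()
        return results

    return reg, run

reg, run = dispatcher(iter([1, 2, 3, 4]))
reg(sum)
reg(max)
out = sorted(run())   # sorted because thread completion order is not fixed
print(out)            # [4, 10]
```

Giving each handler its own queue means a slow consumer does not hold back the others; the price is that every item is stored once per consumer.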
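Putting it all together
- The load function is a generator that extracts the valid records from the log.
- It can serve as the data source for the dispatcher function.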
    import re
    from pathlib import Path
    import datetime
    import time
    import threading
    from queue import Queue
    from user_agents import parse

    PATTERN = r'''(?P<ip>[\d.]{7,})\s-\s-\s\[(?P<datetime>[^\[\]]+)\]\s"(?P<method>[^"\s]+)\s(?P<url>[^"\s]+)\s(?P<protocol>[^"\s]+)"\s(?P<status>\d{3})\s(?P<size>\d+)\s"(?:.+)"\s"(?P<useragent>[^"]+)"'''
    pattern = re.compile(PATTERN)

    def extract(text):
        ops = {
            'datetime': (lambda x: datetime.datetime.strptime(x, '%d/%b/%Y:%H:%M:%S %z')),
            'status': int,
            'size': int,
            'useragent': lambda x: parse(x),
        }
        mat = pattern.match(text)
        if mat is None:   # non-matching line: openfile skips it
            return None
        return {k: ops.get(k, lambda x: x)(v) for k, v in mat.groupdict().items()}

    def openfile(filename):
        with open(filename) as f:
            for text in f:
                fields = extract(text)
                time.sleep(2)   # throttle reading to 2 s per line
                if fields:
                    yield fields

    # producer
    def load(*pathnames):
        for path in pathnames:
            pathname = Path(path)
            if not pathname.exists():
                continue
            if pathname.is_file():
                yield from openfile(pathname)
            elif pathname.is_dir():
                for filename in pathname.iterdir():
                    if filename.is_file():
                        yield from openfile(filename)

    def sum_size_handler(iterable):
        return sum(map(lambda x: x['size'], iterable))

    def status_handler(iterable):
        status = {}
        for dic in iterable:
            key = dic['status']
            status[key] = status.get(key, 0) + 1
        return {k: v / len(iterable) for k, v in status.items()}

    d = {}   # cumulative browser counts across all windows

    def ua_handler(iterable):
        ua_family = {}   # browser counts for the current window only
        for item in iterable:
            val = item['useragent']
            key = (val.browser.family, val.browser.version_string)
            ua_family[key] = ua_family.get(key, 0) + 1
            d[key] = d.get(key, 0) + 1
        return ua_family, d

    # consumer
    def window(q: Queue, handler, width, interval):
        st_time = datetime.datetime.strptime('19700101 000000 +0800', '%Y%m%d %H%M%S %z')
        cur_time = datetime.datetime.strptime('19700101 010000 +0800', '%Y%m%d %H%M%S %z')
        buffer = []
        while True:
            src = q.get()
            print(src)
            buffer.append(src)
            cur_time = src['datetime']
            if (cur_time - st_time).total_seconds() > interval:
                val = handler(buffer)
                st_time = cur_time
                print(val)
                # ua_handler returns (window counts, cumulative counts);
                # other handlers return a single value, so only unpack tuples.
                if isinstance(val, tuple):
                    _, total = val
                    print(sorted(total.items(), key=lambda x: x[1], reverse=True))
                buffer = [x for x in buffer
                          if x['datetime'] > (cur_time - datetime.timedelta(seconds=width - interval))]

    def dispatcher(src):
        reg_handler = []
        queues = []

        def reg(handler, width, interval):
            q = Queue()
            queues.append(q)
            thrd = threading.Thread(target=window, args=(q, handler, width, interval))
            reg_handler.append(thrd)

        def run():
            for i in reg_handler:
                i.start()
            for item in src:
                for q in queues:
                    q.put(item)

        return reg, run

    if __name__ == '__main__':
        import sys
        # path = sys.argv[1]
        path = 'test.log'
        reg, run = dispatcher(load(path))
        # reg(sum_size_handler, 20, 5)
        # reg(status_handler, 20, 5)
        reg(ua_handler, 20, 5)
        run()
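Implementing the analyses
- Log analysis matters: analyzing large volumes of log data can reveal whether the site has been attacked, whether it is being crawled and when crawling peaks, whether its resources are being hotlinked, and so on.
Status code analysis
- Status codes carry a lot of information. For example:
- 304: the server received the request parameters submitted by the client and found that the resource has not changed, so it tells the browser to use its cached copy of the static resource.
- 404: the server cannot find the requested resource.
- A large share of 304 means static caching is working well. A large share of 404 means the site has broken links, or someone is probing for site resources.
- If the share of 4xx or 5xx status codes suddenly increases, the site almost certainly has a problem.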
    def status_handler(iterable):
        # Count each status code, then convert the counts to shares of the window.
        status = {}
        for dic in iterable:
            key = dic['status']
            status[key] = status.get(key, 0) + 1
        return {k: v / len(iterable) for k, v in status.items()}
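As a quick check of what such a handler returns, here is a sketch of the same calculation written with collections.Counter and applied to four made-up records. The name status_ratio and the sample data are hypothetical.

```python
from collections import Counter

def status_ratio(records):
    """Share of each status code among the records (empty input gives {})."""
    counts = Counter(r['status'] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}

sample = [{'status': 200}, {'status': 200}, {'status': 304}, {'status': 404}]
ratios = status_ratio(sample)
print(ratios)   # {200: 0.5, 304: 0.25, 404: 0.25}
```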
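Browser analysis
useragent
- This refers to a string in a defined format that software sends to a remote server to identify itself.
- In the HTTP protocol, this string is carried in the User-Agent header field.
- This setting can be changed in the browser's options.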
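Browsers are not the only software that sends this field. As a minimal sketch, the standard library's urllib lets a script set its own User-Agent on a request object; nothing is sent over the network here, and the URL and agent string are placeholders.

```python
from urllib.request import Request

# example.com and the agent string are placeholders for illustration.
req = Request('http://example.com/', headers={'User-Agent': 'my-log-tool/0.1'})

# urllib stores header names capitalised, so the key is 'User-agent'.
print(req.get_header('User-agent'))   # my-log-tool/0.1
```

This is also why the useragent column in a log cannot be fully trusted: the client chooses what to send.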
Information extraction
Installation
pip install pyyaml ua-parser user-agents
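Data analysis
    d = {}   # cumulative browser counts across all windows

    def ua_handler(iterable):
        # Each item's 'useragent' is expected to be the object returned by user_agents.parse.
        ua_family = {}   # browser counts for the current window only
        for item in iterable:
            val = item['useragent']
            key = (val.browser.family, val.browser.version_string)
            ua_family[key] = ua_family.get(key, 0) + 1
            d[key] = d.get(key, 0) + 1
        return ua_family, d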
Python Week 7 Study Notes (1)