GitHub open-source project introduction: use pygrok to easily parse strings (logs, events, ...)
Pygrok is an open-source Python string parsing library, hosted on GitHub at https://github.com/garyelephant/pygrok. As described on the project homepage, it parses logs and events in string form and extracts useful information from them. It provides many predefined patterns, so it combines the matching power of regular expressions with ease of use. Under the hood, pygrok is itself implemented with regular expressions.
To use pygrok, you only need to understand one simple interface, grok_match(). Here is a simple example:
Our task is to obtain the "name", "gender", "age", and "weight" information from strings such as 'gary is male, 25 years old and weighs 68.5 kilograms'.
>>> import pygrok
>>> text = 'gary is male, 25 years old and weighs 68.5 kilograms'
>>> pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
>>> print(pygrok.grok_match(text, pattern))
{'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}
pattern is the matching pattern defined for the string to be parsed, so that grok_match() knows how to perform the match.
WORD and NUMBER are pattern names. WORD matches a single word and is equivalent to the regular expression "\b\w+\b". NUMBER matches a number (an integer or a decimal) and is equivalent to the regular expression "(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))". That looks obscure and complicated, but you generally don't need to care about these details: to match a number, just use "NUMBER". Pygrok provides many such patterns, and their names tell you what they match, for example "IP", "QUOTEDSTRING", and "DATE". See, it's easy!
%{WORD:name} means: match a WORD and, once it is extracted, make it available under the key "name". The other fields work the same way. So the final result is: {'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}
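To make the %{PATTERN:field} idea more concrete, here is a minimal sketch of how such a pattern could be expanded into an ordinary regular expression with named groups. It is not pygrok's actual implementation: the function names expand and simple_grok_match are hypothetical, and the WORD and NUMBER definitions are simplified stand-ins for the library's predefined patterns.

```python
import re

# Simplified stand-ins for the predefined patterns (assumptions for this sketch;
# the library's real definitions are more thorough).
PATTERNS = {
    "WORD": r"\b\w+\b",
    "NUMBER": r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)",
}

def expand(grok_pattern):
    """Turn each %{NAME:field} into a named group and each bare %{NAME} into a plain group."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        regex = PATTERNS[name]
        if field:
            return "(?P<%s>%s)" % (field, regex)
        return "(?:%s)" % regex
    return re.sub(r"%\{(\w+)(?::(\w+))?\}", repl, grok_pattern)

def simple_grok_match(text, grok_pattern):
    """Hypothetical helper: return the named fields as a dict, or None if nothing matches."""
    m = re.search(expand(grok_pattern), text)
    return m.groupdict() if m else None

text = 'gary is male, 25 years old and weighs 68.5 kilograms'
pattern = ('%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old '
           'and weighs %{NUMBER:weight} kilograms')
result = simple_grok_match(text, pattern)
print(result)
```

The sketch only shows the naming mechanism: every field name after the colon becomes a named group, and the dictionary of named groups is the parse result.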
Is that example too simple? Here is how to get the domain, IP, timestamp, URI path, referrer, and web browser from an nginx log line:
>>> import pygrok
>>> text = 'edge.v.iask.com.edge.sinastorage.com 14.18.243.65 6.032s - [21/Jul/2014:16:00:02 +0800]' \
...     + ' "GET /edge.v.iask.com/125880034.hlv HTTP/1.0" 200 70528990 "-"' \
...     + ' "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)' \
...     + ' Chrome/36.0.1985.125 Safari/537.36"'
>>> pat = '%{HOST:host} %{IP:client_ip} %{NUMBER:delay}s - \[%{DATA:time_stamp}\]' \
...     + ' "%{WORD:verb} %{URIPATHPARAM:uri_path} HTTP/%{NUMBER:http_ver}" %{INT:http_status} %{INT:bytes} %{QS}' \
...     + ' %{QS:client}'
>>> m = pygrok.grok_match(text, pat)
>>> import pprint
>>> pprint.pprint(m, indent = 4)
{   'bytes': '70528990',
    'client': '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"',
    'client_ip': '14.18.243.65',
    'delay': '6.032',
    'host': 'edge.v.iask.com.edge.sinastorage.com',
    'http_status': '200',
    'http_ver': '1.0',
    'time_stamp': '21/Jul/2014:16:00:02 +0800',
    'uri_path': '/edge.v.iask.com/125880034.hlv',
    'verb': 'GET'}
This example uses several more patterns: INT matches an integer; IP matches an IPv4 or IPv6 address; HOST matches a domain name; URIPATHPARAM matches a URI path together with any query parameters. As mentioned at the beginning, pygrok combines the matching power of regular expressions with ease of use.
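Every value in the result is a string, so a typical next step is to convert the fields you care about. The following is a minimal sketch using only the standard library; the helper name to_typed is hypothetical, and the dictionary simply copies a few fields from the nginx result above so the snippet runs without pygrok installed.

```python
from datetime import datetime

# A few fields copied from the nginx example result above.
m = {
    'bytes': '70528990',
    'delay': '6.032',
    'http_status': '200',
    'time_stamp': '21/Jul/2014:16:00:02 +0800',
}

def to_typed(fields):
    """Hypothetical helper: return a copy with numeric and time fields converted."""
    typed = dict(fields)
    typed['bytes'] = int(fields['bytes'])
    typed['http_status'] = int(fields['http_status'])
    typed['delay'] = float(fields['delay'])
    # %b expects an English month abbreviation (true under the default C locale);
    # %z parses the +0800 UTC offset into a timezone-aware datetime.
    typed['time_stamp'] = datetime.strptime(fields['time_stamp'],
                                            '%d/%b/%Y:%H:%M:%S %z')
    return typed

record = to_typed(m)
print(record['bytes'], record['delay'], record['time_stamp'].isoformat())
```

Keeping the conversion in a separate step leaves the pattern focused on extraction, and it makes it obvious which fields are assumed to be numeric.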
If you repost this article, please credit the author and the source [Gary's influence] http://garyelephant.me, and do not use it for any commercial purpose!