GitHub open-source project introduction: use pygrok to easily parse strings (logs, events, ...)

Last Update:2014-08-01 Source: Internet

Author: User

Tags pprint


pygrok is an open-source Python library for parsing strings. GitHub address: https://github.com/garyelephant/pygrok. As described on the project homepage, it parses logs and events in string form and extracts useful information from them. It provides many predefined matching patterns, so it combines the matching power of regular expressions with ease of use. Under the hood, pygrok is itself implemented with regular expressions.

To use pygrok, you only need to understand one simple interface, grok_match(). Here is a simple example.

Our task is to extract the "name", "gender", "age", and "weight" information from a string such as 'gary is male, 25 years old and weighs 68.5 kilograms'.

>>> import pygrok
>>> text = 'gary is male, 25 years old and weighs 68.5 kilograms'
>>> pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
>>> print(pygrok.grok_match(text, pattern))
{'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}

pattern is the matching pattern defined for the string to be parsed; it tells grok_match() how to perform the match.

WORD and NUMBER are pattern names. WORD matches a word and is equivalent to the regular expression "\b\w+\b". NUMBER matches a number (an integer or a decimal) and is equivalent to the regular expression "(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))". That looks obscure and complicated, but you generally don't need to care about these details: just use "NUMBER" to match numbers. pygrok provides many patterns whose names tell you what they match, such as "IP", "QUOTEDSTRING", and "DATE". Look, it's easy!
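To make the two equivalences concrete, here is a small sketch that uses only Python's standard re module, not pygrok itself. WORD_RE is the regular expression quoted above. NUMBER_RE is a simplified stand-in and an assumption of this sketch: it drops the atomic group "(?>...)" from the full expression, because Python's re module only accepts that syntax from version 3.11 onward.

```python
import re

# WORD as quoted in the text: a run of word characters between word boundaries.
WORD_RE = re.compile(r"\b\w+\b")

# Simplified stand-in for NUMBER (assumption: atomic group removed).
# Optional sign, then either digits with an optional fraction, or a bare fraction.
# The lookbehind stops a match from starting in the middle of a longer number.
NUMBER_RE = re.compile(r"(?<![0-9.+-])[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)")

text = "gary is male, 25 years old and weighs 68.5 kilograms"

words = WORD_RE.findall(text)      # every word-like token, digits included
numbers = NUMBER_RE.findall(text)  # only the numeric tokens

print(words[:3])
print(numbers)
```

Note that WORD on its own splits "68.5" into two tokens, which is why the example pattern uses NUMBER for the age and weight fields.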

%{WORD:name} means: match a word and, once it is extracted, make it available under the key "name". The others work the same way. So the final result is: {'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}
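Since pygrok is built on regular expressions, the idea behind the %{NAME:field} syntax can be sketched in a few lines of standard-library Python. This is an illustration of the technique, not pygrok's actual implementation: the names PATTERNS, expand and mini_grok_match are hypothetical, the pattern table holds only two entries, and NUMBER is simplified.

```python
import re

# A tiny, hypothetical pattern table; pygrok ships a much larger one.
PATTERNS = {
    "WORD": r"\b\w+\b",
    "NUMBER": r"[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)",  # simplified
}

# Recognizes %{NAME} and %{NAME:field} tokens inside a pattern string.
GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(pattern):
    """Rewrite each %{NAME:field} token as a named regex group."""
    def replace(token):
        name, field = token.group(1), token.group(2)
        regex = PATTERNS[name]
        if field is None:
            # No field name: match the text but do not capture it.
            return "(?:" + regex + ")"
        return "(?P<" + field + ">" + regex + ")"
    return GROK_TOKEN.sub(replace, pattern)


def mini_grok_match(text, pattern):
    """Return a dict of captured fields, or None if the text does not match."""
    m = re.search(expand(pattern), text)
    return m.groupdict() if m else None


text = "gary is male, 25 years old and weighs 68.5 kilograms"
pattern = ("%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old "
           "and weighs %{NUMBER:weight} kilograms")
result = mini_grok_match(text, pattern)
print(result)
```

Everything outside the %{...} tokens is passed through to the regular expression unchanged, so literal characters that are special in regular expressions still need escaping in the pattern.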

Is this example too simple? Here is how to get the domain, IP, timestamp, URI path, and web browser from an nginx log line:

>>> import pygrok
>>> text = 'edge.v.iask.com.edge.sinastorage.com 14.18.243.65 6.032s - [21/Jul/2014:16:00:02 +0800]' \
...     + ' "GET /edge.v.iask.com/125880034.hlv HTTP/1.0" 200 70528990 "-"' \
...     + ' "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)' \
...     + ' Chrome/36.0.1985.125 Safari/537.36"'
>>> pat = '%{HOST:host} %{IP:client_ip} %{NUMBER:delay}s - \[%{DATA:time_stamp}\]' \
...     + ' "%{WORD:verb} %{URIPATHPARAM:uri_path} HTTP/%{NUMBER:http_ver}" %{INT:http_status} %{INT:bytes} %{QS}' \
...     + ' %{QS:client}'
>>> m = pygrok.grok_match(text, pat)
>>> import pprint
>>> pprint.pprint(m, indent = 4)
{   'bytes': '70528990',
    'client': '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"',
    'client_ip': '14.18.243.65',
    'delay': '6.032',
    'host': 'edge.v.iask.com.edge.sinastorage.com',
    'http_status': '200',
    'http_ver': '1.0',
    'time_stamp': '21/Jul/2014:16:00:02 +0800',
    'uri_path': '/edge.v.iask.com/125880034.hlv',
    'verb': 'GET'}

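Several more patterns appear here: INT matches an integer; IP matches an IPv4 or IPv6 address; HOST matches a domain name; URIPATHPARAM matches the path of a URL together with any parameters that follow it; QS is short for QUOTEDSTRING. A pattern used without a field name, such as the first %{QS} (the referrer), is matched but does not appear in the result. As mentioned at the beginning, you get both the matching power of regular expressions and ease of use.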
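Every value in the result above is a string, even for fields matched with INT or NUMBER. If typed values are needed, one option is a small post-processing step like the sketch below. It uses only the standard library; the CONVERTERS table and the convert helper are hypothetical names for this illustration, and the timestamp format string is an assumption based on the single log line shown.

```python
from datetime import datetime

# The result printed above, abbreviated to the fields converted here.
m = {
    "bytes": "70528990",
    "delay": "6.032",
    "http_status": "200",
    "time_stamp": "21/Jul/2014:16:00:02 +0800",
    "verb": "GET",
}

# Hypothetical field-to-converter table for this log format.
CONVERTERS = {
    "bytes": int,
    "delay": float,
    "http_status": int,
    # Assumed format, matching the one timestamp shown in the example.
    "time_stamp": lambda s: datetime.strptime(s, "%d/%b/%Y:%H:%M:%S %z"),
}


def convert(fields):
    """Return a copy of fields with known keys converted; others stay strings."""
    return {key: CONVERTERS.get(key, str)(value) for key, value in fields.items()}


typed = convert(m)
print(typed["http_status"], typed["bytes"], typed["delay"])
print(typed["time_stamp"].isoformat())
```

Keeping the conversion separate from the matching step means a line that matches but carries an unexpected value fails in one clearly identifiable place.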

If you repost this article, please credit the author and the source, [Gary's influence] http://garyelephant.me, and do not use it for any commercial purpose!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
