GitHub open-source project introduction: use pygrok to easily parse strings (logs, events, ..)

Source: Internet
Author: User
Tags pprint


pygrok is an open-source Python library for parsing strings. GitHub address: https://github.com/garyelephant/pygrok. As described on the project homepage, it can parse logs and events in string form and extract useful information from them. It supports regular expression matching and provides many predefined patterns, so it combines the matching power of regular expressions with ease of use. Under the hood, pygrok is itself implemented with regular expressions.

To use pygrok, you only need to understand one simple interface, grok_match(). A simple example:

Our task is to obtain the "name", "gender", "age", and "weight" information from a string such as 'gary is male, 25 years old and weighs 68.5 kilograms'.

>>> import pygrok
>>> text = 'gary is male, 25 years old and weighs 68.5 kilograms'
>>> pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
>>> print(pygrok.grok_match(text, pattern))
{'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}

pattern is the matching pattern defined for the string to be parsed, so that grok_match() knows how to perform the match.

WORD and NUMBER are pattern names. WORD matches a word and is equivalent to the regular expression "\b\w+\b". NUMBER matches a number (an integer or a decimal) and is equivalent to the regular expression "(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))". That looks obscure and complicated, but you generally don't need to care about these details: just use "NUMBER" to match numbers. pygrok provides many patterns whose names tell you what they match, such as "IP", "QUOTEDSTRING", and "DATE". Look, it's easy!
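To get a feel for what these two expressions match, they can be tried directly with Python's built-in `re` module. This is only a sketch: the names `WORD_RE` and `NUMBER_RE` are made up for illustration, and the NUMBER expression is simplified by dropping the atomic group `(?>...)`, which the `re` module only accepts from Python 3.11 onward.

```python
import re

# WORD: a run of word characters delimited by word boundaries.
WORD_RE = re.compile(r"\b\w+\b")

# NUMBER, simplified (assumption): same alternatives as the expression quoted
# above, but without the atomic group, so it compiles on older Python 3 too.
NUMBER_RE = re.compile(r"(?<![0-9.+-])[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)")

text = "gary is male, 25 years old and weighs 68.5 kilograms"

words = WORD_RE.findall(text)
numbers = NUMBER_RE.findall(text)

print(words[:3])   # first three words in the sentence
print(numbers)     # every integer or decimal in the sentence
```

Note that WORD on its own splits "68.5" at the dot, which is why the example pattern uses NUMBER for the age and weight fields.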

%{WORD:name} means: match a WORD and make the extracted value available under the key "name". The others work the same way. So the final result is: {'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}
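One way to picture the %{PATTERN:field} syntax is as shorthand for a named capture group in an ordinary regular expression. The sketch below is not pygrok's actual implementation; `PATTERNS`, `expand`, and `toy_grok_match` are hypothetical names, the pattern table holds only two entries, and NUMBER is a simplified version without the atomic group.

```python
import re

# Hypothetical two-entry pattern table; a real grok library ships many more.
PATTERNS = {
    "WORD": r"\b\w+\b",
    # Simplified NUMBER (assumption): no atomic group, for portability.
    "NUMBER": r"(?<![0-9.+-])[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)",
}

# Matches %{NAME} or %{NAME:field}.
GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(pattern):
    """Replace each %{NAME:field} token with a (named) regex group."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        body = PATTERNS[name]
        if field:
            return "(?P<%s>%s)" % (field, body)  # captured under `field`
        return "(?:%s)" % body                   # matched but not captured
    return GROK_TOKEN.sub(repl, pattern)


def toy_grok_match(text, pattern):
    """Return a dict of captured fields, or None if the text does not match."""
    m = re.search(expand(pattern), text)
    return m.groupdict() if m else None


text = "gary is male, 25 years old and weighs 68.5 kilograms"
pattern = ("%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old "
           "and weighs %{NUMBER:weight} kilograms")
result = toy_grok_match(text, pattern)
print(result)
```

The design choice here is that a token without a field name becomes a non-capturing group, so it still constrains the match but does not show up in the result.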

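Is this example too simple? Here is how to extract the domain, client IP, timestamp, URI path, and web browser from an nginx log line (the referrer is matched by an unnamed %{QS}, so it does not appear in the result):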

>>> import pygrok
>>> text = 'edge.v.iask.com.edge.sinastorage.com 14.18.243.65 6.032s - [21/Jul/2014:16:00:02 +0800]' \
...     + ' "GET /edge.v.iask.com/125880034.hlv HTTP/1.0" 200 70528990 "-"' \
...     + ' "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)' \
...     + ' Chrome/36.0.1985.125 Safari/537.36"'
>>> pat = '%{HOST:host} %{IP:client_ip} %{NUMBER:delay}s - \[%{DATA:time_stamp}\]' \
...     + ' "%{WORD:verb} %{URIPATHPARAM:uri_path} HTTP/%{NUMBER:http_ver}" %{INT:http_status} %{INT:bytes} %{QS}' \
...     + ' %{QS:client}'
>>> m = pygrok.grok_match(text, pat)
>>> import pprint
>>> pprint.pprint(m, indent = 4)
{   'bytes': '70528990',
    'client': '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"',
    'client_ip': '14.18.243.65',
    'delay': '6.032',
    'host': 'edge.v.iask.com.edge.sinastorage.com',
    'http_status': '200',
    'http_ver': '1.0',
    'time_stamp': '21/Jul/2014:16:00:02 +0800',
    'uri_path': '/edge.v.iask.com/125880034.hlv',
    'verb': 'GET'}

Several new patterns appear here: INT matches an integer; IP matches an IPv4 or IPv6 address; HOST matches a domain name; URIPATHPARAM matches a URL path together with any query parameters that follow it; QS matches a quoted string. As mentioned at the beginning, pygrok combines the matching power of regular expressions with ease of use.
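Every value in the result above is a string, so a common follow-up step is converting fields to richer types. The sketch below assumes a dict shaped like the output shown earlier and uses only the standard library; `convert` is a hypothetical helper, not part of pygrok, and only a few of the fields are copied in to keep it short.

```python
from datetime import datetime

# A few fields copied from the nginx example output above.
m = {
    'bytes': '70528990',
    'delay': '6.032',
    'http_status': '200',
    'time_stamp': '21/Jul/2014:16:00:02 +0800',
}


def convert(fields):
    """Return a copy of `fields` with numeric and time values parsed."""
    out = dict(fields)
    out['bytes'] = int(fields['bytes'])
    out['http_status'] = int(fields['http_status'])
    out['delay'] = float(fields['delay'])
    # %b parses the month abbreviation and depends on the current locale;
    # 'Jul' is assumed to be read under an English/C locale.
    out['time_stamp'] = datetime.strptime(
        fields['time_stamp'], '%d/%b/%Y:%H:%M:%S %z')
    return out


typed = convert(m)
print(typed['http_status'], typed['delay'], typed['time_stamp'].isoformat())
```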

When reposting this article, please credit the author and the source [Gary's influence] http://garyelephant.me, and do not use it for any commercial purpose!
