Using Python's re module for information filtering

Source: Internet
Author: User

Background

In day-to-day work we often deal with large amounts of raw data. A general-purpose text editor can only search for one keyword at a time, which makes it hard to analyze such data as a continuous whole. Product log files are a typical case: a log may contain many info-level entries that we rarely care about, while the entries we actually need are a few special debug-level ones. It is therefore useful to filter the log file against a set of criteria, so that the filtered output is both continuous and easy to read.

Solution

re is the regular expression module in Python's standard library. It makes matching and filtering strings very convenient, and this article uses it to filter the contents of log files. First, a brief look at the main functions in re:

1. compile(pattern, flags): compiles a regular expression or, more precisely, checks its syntax and returns a reusable pattern object. flags sets compilation options; only DOTALL is described here, which makes . match every character, including newlines.

>>> import re
>>> re.compile('[abc]+')
re.compile('[abc]+')
>>> re.compile(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'test' is not defined
>>>
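The DOTALL flag and the syntax checking described in item 1 can be tried out as follows. This is a minimal sketch: the sample text and the deliberately broken pattern are illustrative assumptions, not taken from the article. Note that an undefined name fails in Python itself before re is involved, whereas a malformed pattern is rejected by re.compile with re.error.

```python
import re

# Hypothetical two-line input, used only for illustration.
text = "first line\nsecond line"

# Without DOTALL, "." stops at the newline.
plain = re.compile(r"first.*")
# With DOTALL, "." also matches the newline, so the match runs to the end.
dotall = re.compile(r"first.*", re.DOTALL)

plain_match = plain.match(text).group()
dotall_match = dotall.match(text).group()

# An unterminated character set is a genuine pattern syntax error.
try:
    re.compile("[abc")
    pattern_error = None
except re.error as exc:
    pattern_error = exc
```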

2. match(): checks whether the regular expression matches at the beginning of the target string. If it does not, None is returned; otherwise a match object is returned that holds the start position, the end position, and the matched text.

>>> import re
>>> test = re.compile('[abc]+')
>>> test.match('dabc')
>>> test.match('babc')
<_sre.SRE_Match object; span=(0, 4), match='babc'>

test is a compiled regular expression that matches one or more of the characters a, b, or c. match always starts at the beginning of the target string, so the first target string, 'dabc', does not match and None is returned (the interactive prompt prints nothing for None). The second target string matches, and the match object is printed with its span and matched text. Because match always starts at the beginning of the target string, the start position of a successful match is always 0.
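The start position, end position, and matched text can be read from the match object through its start(), end(), and group() methods. A minimal sketch, reusing the same pattern and target strings; the variable names are illustrative:

```python
import re

test = re.compile('[abc]+')

# A failed match returns None, so check before reading any fields.
no_match = test.match('dabc')

m = test.match('babc')
if m is not None:
    start = m.start()    # index where the match begins
    end = m.end()        # index just past the last matched character
    matched = m.group()  # the matched text itself
```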

3. search(): similar to match, except that search scans the entire target string for the first position where the regular expression matches.

>>> import re
>>> test = re.compile('[abc]+')
>>> test.search('dabc')
<_sre.SRE_Match object; span=(1, 4), match='abc'>

search therefore finds the run of a, b, and c characters in 'dabc' even though the string does not begin with one.
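The difference between the two functions is easiest to see side by side on the same target string. A small sketch; the second target string 'xyz' is an added assumption to show that search also returns None when nothing matches anywhere:

```python
import re

test = re.compile('[abc]+')
target = 'dabc'

at_start = test.match(target)    # anchored at index 0, where 'd' fails
anywhere = test.search(target)   # scans forward until a match is found

span = anywhere.span()           # (start, end) of the first match
nothing = test.search('xyz')     # no a, b, or c anywhere in the string
```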

4. findall(): finds all non-overlapping matches in the target string and returns them as a list.

>>> test = re.compile('\[email protected]')
>>> test.findall(r"[email protected]@[email protected]")
['[email protected]'[email protected]
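As a separate illustration of findall, here is a minimal sketch with a hypothetical pattern and input (neither comes from the session above). It also shows that when the pattern contains more than one group, findall returns a list of tuples rather than a list of strings.

```python
import re

# Hypothetical log fragment, for illustration only.
digits = re.compile(r'\d+')
numbers = digits.findall('error 404 at line 17, retry 3')

# With two capturing groups, each list item is a (group1, group2) tuple.
pair = re.compile(r'(\w+)=(\d+)')
pairs = pair.findall('x=1 y=22')
```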

re offers a number of other functions as well; see the official Python documentation for details.

Next, some common regular expression symbols:

1. *: matches the preceding character 0 or more times

2. .: matches any character except a newline

3. |: alternation (or)

4. +: matches the preceding character 1 or more times

5. ?: matches the preceding character 0 or 1 times

Other regular expression syntax is covered in the official documentation.
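Each of the five symbols can be checked with a one-line experiment. A small sketch; the patterns and test strings are made up for illustration:

```python
import re

# *  : 'b' may occur zero times, so a bare 'a' matches in full.
star = re.fullmatch(r'ab*', 'a') is not None

# .  : does not match a newline unless DOTALL is set.
dot = re.fullmatch(r'a.c', 'a\nc') is not None

# |  : either alternative may match.
alt = re.findall(r'cat|dog', 'cat dog bird')

# +  : 'b' must occur at least once, so a bare 'a' does not match.
plus = re.fullmatch(r'ab+', 'a') is not None

# ?  : the 'u' is optional, so both spellings match.
opt = re.findall(r'colou?r', 'color colour')
```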

Finally, let's take a look at this simple filter program:

import re

source = 'GCM.txt'
target = 'G2s.txt'

# First-level screening: extract every complete g2sMessage element
raw_compile = re.compile(r"<g2s:g2sMessage.*?</g2s:g2sMessage>", re.DOTALL)
# Second-level screening: messages of interest
messagelevel_compile = re.compile(r"<igtlicensing.*|<g2s:idreader.*", re.DOTALL)
# Second-level screening: EGM of interest
egmlevel_compile = re.compile(r"igt_00012e2335aa.*", re.DOTALL)

def filterg2smessage():
    # Read the whole log so that messages spanning several lines can match
    with open(source) as fr:
        content = fr.read()
    g2sitems = raw_compile.findall(content)
    with open(target, 'w') as f:
        for g2s in g2sitems:
            iscaredg2s = messagelevel_compile.search(g2s)
            iscaredegm = egmlevel_compile.search(g2s)
            # Keep a message only if it passes both second-level checks
            if iscaredg2s and iscaredegm:
                f.write(g2s + '\n')

filterg2smessage()

The program is simple. Before filtering, work out the levels of filtering you need, then filter step by step.
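The same two-level idea can be tried without any files by running it on an in-memory string. This is a sketch under stated assumptions: the function name filter_messages, the sample log, and the element and attribute names inside it are hypothetical, and the trailing .* has been dropped from the second-level patterns because search only needs to know whether a match exists.

```python
import re

# First level: cut the log into complete g2sMessage elements.
message_re = re.compile(r"<g2s:g2sMessage.*?</g2s:g2sMessage>", re.DOTALL)
# Second level: the message kinds and the device we care about.
class_re = re.compile(r"<igtlicensing|<g2s:idreader")
device_re = re.compile(r"igt_00012e2335aa")

def filter_messages(content):
    """Return the messages that pass both second-level checks."""
    return [
        msg for msg in message_re.findall(content)
        if class_re.search(msg) and device_re.search(msg)
    ]

# Hypothetical log content, for illustration only.
sample = (
    "INFO startup\n"
    "<g2s:g2sMessage>\n<g2s:idreader egm='igt_00012e2335aa'/>\n</g2s:g2sMessage>\n"
    "INFO heartbeat\n"
    "<g2s:g2sMessage>\n<g2s:idreader egm='igt_other'/>\n</g2s:g2sMessage>\n"
    "<g2s:g2sMessage>\n<g2s:cabinet egm='igt_00012e2335aa'/>\n</g2s:g2sMessage>\n"
)

kept = filter_messages(sample)
```

Only the first message survives: the second belongs to a different device and the third is not one of the message kinds of interest.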

Summary

Besides matching, re provides several functions for batch processing of text, such as split, sub, and subn. These speed up the handling of file contents and save time.
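A brief sketch of split, sub, and subn follows. The sample strings are hypothetical; the calls themselves are the standard re module functions.

```python
import re

# Hypothetical log line, for illustration only.
line = 'DEBUG   cache miss  key=42'

# split: break on runs of whitespace, at most once, keeping the rest intact.
fields = re.split(r'\s+', line, maxsplit=1)

# sub: replace every run of digits with a placeholder.
masked = re.sub(r'\d+', '#', 'key=42 user=7')

# subn: same as sub, but also reports how many replacements were made.
masked_again, count = re.subn(r'\d+', '#', 'key=42 user=7')
```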
