(Data Science Learning Codex) A Detailed Introduction to the re Module in Python

Source: Internet
Author: User

I. Introduction

Regular expressions were introduced in detail in the previous article (Data Science Learning Codex 31). This article summarizes the commonly used functions of re, the regular expression module that ships with Python.

As Python's built-in module for regular expression functionality, re provides a set of methods that can handle almost every kind of text processing task, as described below:

II. re.compile()

We used this method in the previous article. It compiles a regular expression into a reusable pattern object, which improves efficiency when the same expression is applied repeatedly. Its main parameters are as follows:

pattern: the regular expression to compile, written as a string enclosed in quotes, such as 'aa*'

flags: compilation flags that modify how the regular expression is matched. Commonly used flags:

re.S: makes . match any character, including line breaks

re.I: makes the match case-insensitive

re.U: interprets characters according to Unicode rules, mainly used when matching Chinese text (in Python 3 this is already the default for str patterns)
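As a quick illustration of the flags listed above, here is a minimal sketch of how re.S changes the behaviour of the dot and how flags can be combined. The sample string and variable names are made up for this illustration.

```python
import re

# Hypothetical two-line sample string
sample = "first line\nSecond line"

# Without re.S, "." does not match the line break, so the match stays on line one
no_flag = re.compile("first.*line").findall(sample)

# With re.S, "." also matches "\n", so the greedy match spans both lines
with_s = re.compile("first.*line", flags=re.S).findall(sample)

# Flags are combined with "|"; re.I additionally ignores case
with_s_i = re.compile("FIRST.*LINE", flags=re.S | re.I).findall(sample)

print(no_flag)
print(with_s)
print(with_s_i)
```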

Here are a few simple examples:

    import re

    text = '''Even if you have never heard of Wikipedia's six degrees of separation theory, you have probably heard of the "Kevin Bacon six degrees of separation game". In both games, two unrelated topics (linked through entries on Wikipedia, or through actors who appeared in the same movie in the Kevin Bacon game) are connected by a chain of no more than six topics in total (including the original two).'''

    # Compile our regular expression; the rule is to find everything inside
    # double quotation marks (not including the quotation marks themselves)
    regex = re.compile('"(.*?)"')

    # Print the match results
    print(regex.findall(text))

Operation Result:

As you can see, all matched contents are returned as a list.

    import re

    text = '''Even if you have never heard of Wikipedia's six degrees of separation theory, you have probably heard of the "Kevin Bacon six degrees of separation game". In both games, two unrelated topics (linked through entries on Wikipedia, or through actors who appeared in the same movie in the Kevin Bacon game) are connected by a chain of no more than six topics in total (including the original two).'''

    # Compile our regular expression; the rule is one or more
    # consecutive English letters, upper or lower case
    regex = re.compile('[a-zA-Z]+')

    # Print the match results
    print(regex.findall(text))

Operation Result:

Next, we set the flags parameter to see how it changes the behaviour. First, without any flags:

    import re

    text = '''Even if you have never heard of Wikipedia's six degrees of separation theory, you have probably heard of the "Kevin Bacon six degrees of separation game". In both games, two unrelated topics (linked through entries on Wikipedia, or through actors who appeared in the same movie in the Kevin Bacon game) are connected by a chain of no more than six topics in total (including the original two).'''

    # Compile our regular expression; the rule is one or more
    # consecutive lowercase English letters
    regex = re.compile('[a-z]+')  # no flags, so case is not ignored

    # Print the match results
    print(regex.findall(text))

Operation Result:

Because the regular expression is [a-z]+, the uppercase letters fail to match. Now we leave the regular expression unchanged and add the flags parameter:

    import re

    text = '''Even if you have never heard of Wikipedia's six degrees of separation theory, you have probably heard of the "Kevin Bacon six degrees of separation game". In both games, two unrelated topics (linked through entries on Wikipedia, or through actors who appeared in the same movie in the Kevin Bacon game) are connected by a chain of no more than six topics in total (including the original two).'''

    # Compile our regular expression; the rule is one or more
    # consecutive lowercase English letters
    regex = re.compile('[a-z]+', flags=re.I)  # use re.I to ignore case

    # Print the match results
    print(regex.findall(text))

Operation Result:

With flags=re.I, case is ignored, so the same regular expression now matches uppercase letters as well.

III. re.match()

This method is used less often. It tries to match the regular expression only at the beginning of the target string; if the beginning does not match, nothing is matched. Here is a simple example:

    import re

    text = 'What is waiting for?'

    # The match at the beginning succeeds because the string starts with W
    print(re.match('w', text, re.I).group())

Operation Result:

    W

When the beginning of the string does not match, no match object is returned even if a later part of the string would match (this is what matching only at the beginning means):

    import re

    text = 'What is waiting for? Where is fucking from?'

    # The match fails because the string starts with "Wha", not "whe"
    print(re.match('whe', text, re.I))

Operation Result:

    None

IV. re.search()

re.search() is called in much the same way as re.match(), taking three parameters: pattern, string, and flags. Unlike match, however, search scans the whole string, returns the first part that satisfies the pattern, and does not look for any further matches. Here is a simple example:

    import re

    text = 'What is waiting for? Where is fucking from?'

    # search returns the first occurrence of the target content;
    # later occurrences are not matched
    print(re.search('a', text, re.I).group())

Operation Result:

    a

The text contains several occurrences of the letter a, but search stops at the first one it encounters and returns it.
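The difference between the two methods can be seen side by side in the following minimal sketch. The sample string and the variable names are chosen only for this illustration.

```python
import re

# Hypothetical sample string
text = "What is waiting for?"

# match() only tries at position 0, so a pattern that occurs later is not found
at_start = re.match("wait", text)
print(at_start)

# search() scans forward and returns the first occurrence anywhere in the string
anywhere = re.search("wait", text)
print(anywhere.group(), anywhere.span())
```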

Note that the group() method used in the previous examples belongs to the object returned by a successful match or search, which is called a match object. Its commonly used methods are as follows:

start(): returns the position where the match starts

end(): returns the position where the match ends

group(): returns the string matched by the regular expression

span(): returns a tuple of the form (start, end) marking the start and end positions of the match
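The four methods above can be tried together in a short sketch. The sample string is hypothetical; note that search returns None when nothing matches, so it is safer to check before calling these methods.

```python
import re

# Hypothetical sample string containing one run of digits
text = "order 1213 shipped"

m = re.search("[0-9]+", text)

# search() returns None on failure, so guard before using the match object
if m is not None:
    print(m.group())   # the matched substring
    print(m.start())   # index where the match begins
    print(m.end())     # index just past the end of the match
    print(m.span())    # (start, end) as a tuple
    # Slicing the original string with the span gives back the match
    print(text[m.start():m.end()] == m.group())
```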

In fact, although search returns only one match object, we can retrieve several pieces from it by writing the regular expression as a concatenation of parenthesized subexpressions (groups):

    import re

    text = '1213sdsdjAKNNK'

    # Match a compound expression; the match object is divided into
    # groups according to the subexpressions.
    # Print the 1st, 2nd and 3rd groups in turn.
    m = re.search('([1-9]+)*([a-z]+)*([A-Z]+)', text)
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))

Operation Result:

V. re.findall()

Note that the spelling differs from the findAll() method we used when parsing BeautifulSoup objects (although the two are functionally similar). Unlike match and search, findall extracts every part of the target string that matches the regular expression and returns the results as a list. Here is a simple example:

    import re

    text = '''Even if you have never heard of Wikipedia's six degrees of separation theory, you have probably heard of the "Kevin Bacon six degrees of separation game". In both games, two unrelated topics (linked through entries on Wikipedia, or through actors who appeared in the same movie in the Kevin Bacon game) are connected by a chain of no more than six topics in total (including the original two).'''

    # Match every occurrence of "hear" followed by exactly one more character
    print(re.findall('hear.', text))

Operation Result:

Unlike the earlier use of findall in the section on re.compile(), the format here is re.findall(regular expression, target string), whereas the earlier format was compiled pattern.findall(target string). The two formats are functionally equivalent.
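The equivalence of the two call formats can be checked with a small sketch. The short sample string here is made up for the illustration.

```python
import re

# Hypothetical short sample string
text = "abjijdianbdadjiji"
pattern = "a."

# Format 1: module-level function, pattern passed as a string
module_level = re.findall(pattern, text)

# Format 2: compile first, then call findall on the pattern object
compiled = re.compile(pattern).findall(text)

print(module_level)
print(module_level == compiled)
```

Compiling first is mainly worthwhile when the same pattern is applied many times; for a one-off call the module-level form is shorter.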

VI. re.finditer()

Sometimes the target string is very long (possibly a whole novel) and the regular expression matches a great many times. Extracting all the results into one huge list with re.findall(), as before, would consume a lot of memory. This is where Python's memory-saving iterators and generators come in handy.

re.finditer(pattern, string, flags=0) takes advantage of this mechanism. It builds an iterator from the regular expression pattern and the target string, so that each match is computed only when the loop reaches it. In each round, only the current position and the current match are held, which saves memory. Here is a simple example:

    import re

    text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

    # Construct our iterator
    obj = re.finditer('a.', text)

    # Iterate over obj, printing the content of the current match
    # and its start and end positions each time
    for i in obj:
        print(i.group())
        print(i.span())

Operation Result:
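To make the one-match-at-a-time behaviour more visible, the following minimal sketch pulls the first match by hand with next() and then consumes the rest. The shortened sample string and the variable names are assumptions made for this illustration.

```python
import re

# Hypothetical shortened sample string
text = "abjijdianbdadjiji"

matches = re.finditer("a.", text)

# Only the first match is computed at this point
first = next(matches)
print(first.group(), first.span())

# The remaining matches are produced one by one as the iterator advances
rest = [(m.group(), m.span()) for m in matches]
print(rest)

# The iterator is now exhausted; a second pass yields nothing
leftover = list(matches)
print(leftover)
```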

VII. re.sub()

This is similar to replace() for strings, except that replace() can only substitute fixed content, while re.sub(pattern, repl, string, count) performs flexible substitution based on regular expressions. pattern specifies the regular expression, repl specifies the replacement content, string specifies the target string, and count specifies the maximum number of substitutions (by default, all matches are replaced). In fact, at the end of the previous article we used this method to obtain a clean news report. Here is a simple example:

    import re

    text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

    # Construct our replacement rule
    obj = re.sub('a.', 'hehe', text)

    # Print the text after replacement
    print(obj)

Operation Result:
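The effect of the count parameter described above can be sketched as follows, using a shortened, hypothetical sample string. count is passed by keyword here, which keeps the call unambiguous.

```python
import re

# Hypothetical shortened sample string
text = "abjijdianbdadjiji"

# Default: every match of "a." is replaced
replaced_all = re.sub("a.", "hehe", text)

# count=1: only the first match is replaced
replaced_once = re.sub("a.", "hehe", text, count=1)

print(replaced_all)
print(replaced_once)
```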

VIII. re.split()

Similar to split() for strings, re.split() extends splitting with regular expressions. In re.split(pattern, string, maxsplit), pattern specifies the regular expression for the delimiter, string specifies the target string, and maxsplit specifies the maximum number of splits. Here is a simple example:

    import re

    text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

    # Construct our splitting rule
    obj = re.split('i.', text)

    # Print the split content
    print(obj)

Operation Result:
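A case where re.split() does something plain str.split() cannot is splitting on several different delimiters at once. The sketch below assumes a made-up string of fruit names and also shows the maxsplit parameter.

```python
import re

# Hypothetical sample string with mixed delimiters
text = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or whitespace
parts = re.split(r"[,;\s]+", text)

# maxsplit=2: stop after two splits and keep the remainder intact
parts_limited = re.split(r"[,;\s]+", text, maxsplit=2)

print(parts)
print(parts_limited)
```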

  

The above covers the commonly used functions of the re module. The next article will work through a hands-on example that details the web data collection process in a real business setting.
