Python -- Regular Expression (3)

Source: Internet
Author: User


4. More powerful functions

So far, we have covered only part of what regular expressions can do. In this section, we introduce some new metacharacters and show how to use groups to retrieve parts of the text that was matched.


4.1. More metacharacters

Some metacharacters are called "zero-width assertions". They do not consume any characters; they simply succeed or fail at the current position. For example, \b asserts that the current position is at a word boundary, but \b itself does not advance the position.

This means that zero-width assertions should never be repeated: if one matches once at a given position, it can obviously match there an infinite number of times.

|

The OR operator (alternation). If A and B are both regular expressions, A|B matches any string that matches either A or B. To make | work sensibly when alternating multi-character strings, its precedence is very low. For example, Crow|Servo matches either Crow or Servo, not Cro, followed by a 'w' or an 'S', followed by ervo.

To match a literal '|', use \|, or put it in a character class, as in [|].
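As a minimal sketch of the two points above (the sample strings are made up for illustration), the following shows that alternation applies to the whole of each side, and that a literal '|' can be matched either by escaping it or by using a character class:

```python
import re

# Alternation has very low precedence: the alternatives are the whole
# words "Crow" and "Servo", not "Cro" + ("w" or "S") + "ervo".
either = re.compile(r'Crow|Servo')
whole_crow = either.fullmatch('Crow') is not None
whole_servo = either.fullmatch('Servo') is not None
mixed = either.fullmatch('Croervo') is not None

# A literal '|' must be escaped or placed in a character class.
split_escaped = re.split(r'\|', 'a|b|c')
split_in_class = re.split(r'[|]', 'a|b|c')

print(whole_crow, whole_servo, mixed)
print(split_escaped, split_in_class)
```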

^

Matches at the beginning of a line. Unless the MULTILINE flag is set, it matches only at the beginning of the string.

If the MULTILINE flag is set, it also matches immediately after each newline in the string.

For example, to match the word From only at the beginning of a line, use the regular expression ^From.

 

>>> print(re.search('^From','From Here to Eternity'))
<_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From','Reciting From Memory'))
None
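The effect of the MULTILINE flag on ^ can be sketched as follows. The sample text is hypothetical; only re.findall and re.MULTILINE from the standard library are used:

```python
import re

text = 'From: alice\nSubject: hi\nFrom: bob'   # hypothetical sample text

# Without MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r'^From', text)

# With MULTILINE, ^ also matches immediately after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)

print(plain)
print(multi)
```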

$

Matches at the end of a line: at the end of the string, or just before a newline that ends the string. If the MULTILINE flag is set, it also matches just before every newline. Take the following example:

>>> print(re.search('}$','{block}'))
<_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$','{block} '))
None
>>> print(re.search('}$','{block}\n'))
<_sre.SRE_Match object; span=(6, 7), match='}'>

To match a literal '$', use \$ or put it in a character class, as in [$].
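A small sketch of how $ behaves with and without MULTILINE, using a made-up three-line string in which only the first two lines end with a brace:

```python
import re

text = 'first}\nsecond}\nthird'   # hypothetical sample text

# Without MULTILINE, $ is tested only at the end of the whole string,
# where there is no '}' to match.
plain = [m.start() for m in re.finditer(r'}$', text)]

# With MULTILINE, $ also matches just before each newline.
multi = [m.start() for m in re.finditer(r'}$', text, re.MULTILINE)]

print(plain)
print(multi)
```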

\A

Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode they differ: \A still matches only at the start of the string, while ^ also matches at the beginning of each line, that is, immediately after each newline.

\Z

Matches only at the end of the string.
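The difference between the string anchors and the line anchors can be illustrated with a short sketch (the sample text is hypothetical):

```python
import re

text = 'one\ntwo\n'   # hypothetical sample text with a trailing newline

# In MULTILINE mode ^ matches at every line start, \A only at the string start.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
string_start = re.findall(r'\A\w+', text, re.MULTILINE)

# $ tolerates a single trailing newline; \Z requires the true end of the string.
dollar = re.search(r'two$', text) is not None
big_z = re.search(r'two\Z', text) is not None

print(line_starts, string_start)
print(dollar, big_z)
```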

\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches class only when it is a complete word; it does not match when class is contained inside another word.

 

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None
>>> print(p.search('test:#class$'))
<_sre.SRE_Match object; span=(6, 11), match='class'>

 

There are two subtleties to remember when using this special sequence.

First, Python's string literals and regular expression sequences collide here. In a Python string literal, \b is the backspace character (ASCII value 8). If you do not use a raw string, Python converts \b to a backspace, and the regular expression will not match as you expect. The following example uses the same regular expression as the previous one but omits the 'r' prefix in front of the pattern string:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b'+'class'+'\b'))
<_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>

Second, this assertion has no use inside a character class, so there \b represents the backspace character, for compatibility with Python's string literals.

 

\B
This zero-width assertion is the opposite of \b: it matches only when the current position is not at a word boundary.
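As a sketch of the contrast with \b, the pattern below requires class to be embedded inside a longer word. It reuses the sample strings from the \b example above:

```python
import re

inside_word = re.compile(r'\Bclass\B')

# 'class' is surrounded by word characters here, so both \B assertions hold.
embedded = inside_word.search('the declassified algorithm') is not None

# Here 'class' is a standalone word, so there is a boundary on each side.
standalone = inside_word.search('no class at all') is not None

print(embedded, standalone)
```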


4.2. Grouping

Frequently you need more information than just whether the regular expression matched. Regular expressions are often used to dissect strings by writing a pattern divided into several groups, each matching a different component of interest.

For example, each line in an RFC-822 header is divided into a name and a value, separated by a colon ':':

 

From: [email protected]
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: [email protected]

This can be handled by writing a regular expression that matches an entire header line, with one group matching the header name and another group matching the header's value.

Groups are marked by the parentheses '(' and ')'. Parentheses have much the same meaning as in mathematical expressions: they group together the expressions they contain, and you can repeat the contents of a group with a repeating qualifier such as *, +, ?, or {m,n}. For example, (ab)* matches zero or more repetitions of ab.

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

Groups marked with '(' and ')' also capture the starting and ending index of the text they match; these can be retrieved by passing a group number to group(), start(), end(), and span(). Groups are numbered starting from 0. Group 0 is always present: it is the whole regular expression, so the match object methods all use group 0 as their default argument.

 

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, starting from 1. Groups can also be nested; to determine a group's number, count the opening parentheses from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
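The same group numbers can be passed to start(), end(), and span(). A minimal sketch using the nested pattern from the example above:

```python
import re

p = re.compile('(a(b)c)d')
m = p.match('abcd')

whole_span = m.span()      # group 0: the whole match
outer_span = m.span(1)     # group 1: 'abc'
inner_start = m.start(2)   # group 2: 'b' starts at index 1
inner_end = m.end(2)       # and ends just before index 2

print(whole_span, outer_span, inner_start, inner_end)
```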
The group() method can also be passed several group numbers at once, in which case it returns a tuple containing the corresponding values for those groups:

 

 

>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are:

>>> m.groups()
('abc', 'b')
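Returning to the header example, one possible way to split a header line into a name and a value with two groups is sketched below. The pattern and the variable names are illustrative assumptions, not a complete RFC-822 parser:

```python
import re

# Group 1 captures the header name, group 2 the value after the colon.
# This is a simplified, hypothetical pattern for illustration only.
header = re.compile(r'([^:\s]+):\s*(.*)')

lines = [
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
]
parsed = [header.match(line).groups() for line in lines]

for name, value in parsed:
    print(name, '=>', value)
```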
Backreferences in a pattern let you specify that the contents of an earlier capturing group must also be found at the current position in the string. For example, \1 succeeds if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by digits to insert arbitrary characters into a string, so be sure to use a raw string when using backreferences in a regular expression.

For example, the following regular expression detects doubled words in a string:

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this are not often useful for just searching through a string, because few text formats repeat data in this way. However, they are very useful when performing string substitutions.
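As a brief preview of the substitution use case, here is a sketch that collapses a doubled word with re.sub(), reusing the pattern from the example above. The replacement string r'\1' refers to group 1:

```python
import re

doubled = re.compile(r'(\b\w+)\s+\1')

# Replace each doubled word with a single copy of the captured word.
fixed = doubled.sub(r'\1', 'Paris in the the spring')

print(fixed)
```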

4.3. Non-capturing and named groups
Elaborate regular expressions may use many groups, both to capture substrings of interest and to group and structure the regular expression itself. In complex regular expressions it becomes difficult to keep track of the group numbers. Two features help with this problem. Both use a common syntax for regular expression extensions, so we'll look at that first.

Perl 5 is well known for its powerful additions to standard regular expressions. For these new features the Perl developers could not choose new single-character metacharacters or new special sequences beginning with a backslash without making Perl's regular expressions confusingly different from standard ones. If they had chosen & as a new metacharacter, for example, old expressions would be treating & as an ordinary character and would not have escaped it by writing \& or [&].

The solution chosen by the Perl developers was to use (?...) as the extension syntax. A ? immediately after an opening parenthesis was a syntax error, because the ? would have nothing to repeat, so this did not introduce any compatibility problems. The characters immediately after the ? indicate which extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python supports several of Perl's extensions and adds an extension syntax of its own. If the first character after the question mark is a P, you know that it is an extension specific to Python.

Now that we've looked at the general extension syntax, we can return to the features that simplify working with groups in complex regular expressions.


Sometimes you want to use a group to denote a part of a regular expression, but are not interested in retrieving the group's contents. You can make this explicit by using a non-capturing group: (?:...), where the ... can be replaced with any other regular expression.

>>> m = re.match('([abc])+','abc')
>>> m.groups()
('c',)
>>> m = re.match('(?:[abc])+','abc')
>>> m.groups()
()

Except for the fact that you cannot retrieve the contents of what the group matched, a non-capturing group behaves exactly like a capturing group: you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing pattern, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there is no performance difference in searching between capturing and non-capturing groups.
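The point about group numbering can be sketched as follows. The patterns and the input strings are hypothetical; they only illustrate that adding a non-capturing group leaves the existing numbers alone, while adding a capturing group shifts them:

```python
import re

# Original pattern: group 1 is the number, group 2 is the word.
original = re.match(r'(\d+)-(\w+)', '42-answer').groups()

# Adding an optional prefix as a non-capturing group keeps the numbering.
non_capturing = re.match(r'(?:id|no)?(\d+)-(\w+)', 'id42-answer').groups()

# Adding the same prefix as a capturing group shifts every later group.
capturing = re.match(r'(id|no)?(\d+)-(\w+)', 'id42-answer').groups()

print(original)
print(non_capturing)
print(capturing)
```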

A more significant feature is named groups: instead of referring to them by number, groups can be referenced by name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...), where name is the name of the group. Named groups behave exactly like capturing groups, and additionally associate a name with the group. The match object methods that deal with capturing groups all accept either an integer referring to the group by number or a string containing the group's name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

 

 

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('((((Lots of punctuation)))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
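Named groups can also be retrieved all at once as a dictionary with the match object's groupdict() method. A minimal sketch, with hypothetical group names and input:

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')

by_name = m.group('last')     # access by name
by_number = m.group(2)        # the same group, accessed by number
as_dict = m.groupdict()       # all named groups as a dictionary

print(by_name, by_number, as_dict)
```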
  

By using a name to access a group, you do not need to remember the serial number of the group, which makes processing easier. The following is an example of a regular expression in the imaplib module:

 

>>> InternalDate = re.compile(r'INTERNALDATE "'
...     r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
...     r'(?P<year>[0-9][0-9][0-9][0-9])'
...     r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
...     r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
...     r'"')

 

It is obviously much easier to retrieve m.group('zonem') than to remember that the value is in group 9.
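To see the named groups in action, the sketch below applies such a pattern to a made-up INTERNALDATE string. The group names day, mon, year, hour, min, sec, zonen, zoneh, and zonem are assumed here (only zonem is named in the text above), and the sample input is hypothetical:

```python
import re

# Assumed group names; only 'zonem' is confirmed by the surrounding text.
internal_date = re.compile(
    r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'   # hypothetical input
m = internal_date.match(sample)

zone_minutes_by_name = m.group('zonem')
zone_minutes_by_number = m.group(9)
fields = m.groupdict()

print(zone_minutes_by_name, zone_minutes_by_number)
print(fields['day'], fields['mon'], fields['year'])
```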

The syntax for backreferences in an expression such as (...)\1 refers to the group by number. There is naturally a variant that uses the group name instead. This is another Python extension: (?P=name) indicates that the contents of the group called name should again be matched at the current position. The regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'


4.4. Lookahead assertions
Lookahead assertions, also called forward assertions, are another kind of zero-width assertion. They are available in both positive and negative form:

(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., matches at the current position, and fails otherwise. Once the contained expression has been tried, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion: it succeeds if the contained expression does not match at the current position in the string, and fails otherwise.
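A minimal sketch of both forms, using made-up strings. Note that the text inside the assertion is checked but never becomes part of the match:

```python
import re

# Positive lookahead: 'foo' must be followed by 'bar', but 'bar' is not consumed.
pos = re.search(r'foo(?=bar)', 'foobar')
pos_text, pos_end = pos.group(), pos.end()

# Negative lookahead: 'foo' must NOT be followed by 'bar'.
neg_blocked = re.search(r'foo(?!bar)', 'foobar')
neg_allowed = re.search(r'foo(?!bar)', 'foobaz').group()

print(pos_text, pos_end)
print(neg_blocked, neg_allowed)
```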

To make this concrete, let's look at a case where a lookahead is useful. Consider a simple pattern to match a filename and split it into a base name and an extension, separated by a '.'. For example, in news.rc, news is the base name and rc is the filename's extension.

The pattern for matching file names is very simple, as follows:

 

.*[.].*$
Notice that the '.' needs special treatment because it is a metacharacter, so it is placed inside a character class to match only that specific character. Also notice the trailing $, which ensures that all the rest of the string is included in the extension. This regular expression matches foo.bar, autoexec.bat, sendmail.cf, and printers.conf.

Now, consider complicating the problem a bit: what if you want to match filenames whose extension is not bat? Here is a first, incorrect attempt:
.*[.][^b].*$
This attempt tries to exclude bat by requiring that the first character of the extension is not a b. This is wrong, because the pattern then also fails to match foo.bar, whose extension bar begins with b as well.

The expression gets messier when you try to patch up the first attempt by requiring one of the following cases to match:
.*[.]([^b]..|.[^a].|..[^t])$
This matches if the first character of the extension is not b, the second character is not a, or the third character is not t. It accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and will not accept a filename with a two-letter extension such as sendmail.cf. So we complicate the pattern again in an effort to fix it:
.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
In this third attempt, the second and third letters are made optional with ? in order to allow extensions shorter than three characters, such as sendmail.cf.
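The behavior of the three attempts can be checked with a short sketch. The dictionary labels are made up for illustration; the patterns and filenames are the ones discussed above:

```python
import re

# The three character-class attempts from the text, keyed by a short label.
attempts = {
    'first letter not b': r'.*[.][^b].*$',
    'three-letter only': r'.*[.]([^b]..|.[^a].|..[^t])$',
    'optional letters': r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
}
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf']

# For each attempt, list the filenames it accepts.
results = {
    label: [name for name in names if re.match(pattern, name)]
    for label, pattern in attempts.items()
}

for label, accepted in results.items():
    print(label, '->', accepted)
```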

The pattern is getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you want to exclude both bat and exe as extensions, the pattern would get even more complicated and confusing.

A negative lookahead cuts through all this confusion:
.*[.](?!bat$).*$
The negative lookahead means: if the expression bat does not match at this point, try the rest of the pattern; if bat$ does match, the whole pattern fails. The trailing $ in (?!bat$) is required to ensure that something like sample.batch, where the extension only starts with bat, is still allowed.

 


Excluding another filename extension is now easy: simply add it as an alternative inside the assertion. The following pattern excludes filenames that end in either bat or exe:

.*[.](?!bat$|exe$).*$
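The two lookahead patterns can be tried on a few filenames as sketched below. The last part goes beyond the text above: it illustrates that with a name containing several dots, the trailing .*$ lets the engine retry from an earlier dot, and a variant ending in [^.]*$ (an assumption added here, not one of the patterns above) avoids that:

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sample.batch', 'setup.exe', 'sendmail.cf']
kept_no_bat = [n for n in names if not_bat.match(n)]
kept_no_bat_exe = [n for n in names if not_bat_or_exe.match(n)]

# Caveat with multiple dots: the engine can backtrack to the first dot,
# where the lookahead sees 'tar.bat' rather than 'bat'.
loose = not_bat.match('archive.tar.bat') is not None

# Hypothetical stricter variant: the extension may not contain a dot.
strict_not_bat = re.compile(r'.*[.](?!bat$)[^.]*$')
strict = strict_not_bat.match('archive.tar.bat') is not None

print(kept_no_bat)
print(kept_no_bat_exe)
print(loose, strict)
```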

 
