Python Regular Expression Learning Notes

Source: Internet
Author: User


A regular expression is a compact pattern language for describing and matching text, and Python's re module provides support for it. A regular expression is made up of ordinary characters and metacharacters. Ordinary characters, such as uppercase and lowercase letters and digits, match themselves, while metacharacters have special meanings.

When a regular expression contains only ordinary characters, matching behaves like a plain substring search. For example, the regular expression "testing" contains no metacharacters; it matches strings such as "testing" and "testing123", but it does not match "Testing", because matching is case sensitive. Metacharacters are not treated as ordinary characters. They are: . ^ $ * + ? { [ ] \ | ( )

. matches any single character other than a newline.
\w matches a single letter, digit, or underscore (a "word character"); for ASCII text it is equivalent to [a-zA-Z0-9_]. \W matches any single character that is not a letter, digit, or underscore.
\b matches the boundary between a word character and a non-word character, or between a word character and the start or end of the string.
\s matches a single whitespace character (space, newline, carriage return, tab, or form feed) and is roughly equivalent to [ \n\r\t\f]. \S matches any single non-whitespace character.
\t, \n, and \r match a tab, a newline, and a carriage return, respectively.
\d matches a single decimal digit; for ASCII text it is equivalent to [0-9].
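The character classes above can be tried out with a short sketch like the following. The sample string is made up for illustration, and the sketch uses re.findall, which is covered later in these notes, simply to list every match.

```python
import re

# Made-up sample text; '\t' in this (non-raw) string is a real tab character.
sample = 'Room 42,\tfloor_3!'

words = re.findall(r'\w+', sample)       # runs of letters, digits, underscores
non_words = re.findall(r'\W+', sample)   # runs of everything else
digits = re.findall(r'\d+', sample)      # runs of decimal digits
spaces = re.findall(r'\s', sample)       # single whitespace characters

# \b requires a word/non-word boundary on each side of the pattern.
whole_word = re.search(r'\bfloor_3\b', sample)
# No match: '_' is a word character, so there is no boundary after 'floor'.
partial_word = re.search(r'\bfloor\b', sample)

print(words)
print(digits)
```

Note that the underscore counts as a word character, which is why \bfloor\b does not match inside floor_3.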

^ marks the start of a string and $ marks the end. \ escapes a character: \. matches a literal dot, and \\ matches a literal backslash. If you are not sure whether a punctuation character needs to be escaped, you can put a backslash in front of it; writing \@ for @, for example, is harmless.
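A minimal sketch of anchors and escaping follows. The test strings are invented for illustration; re.escape is a standard helper in the re module that adds the backslashes for you when the text to match is not known in advance.

```python
import re

# ^ and $ pin the pattern to the start and end of the string.
starts = re.search(r'^cat', 'catalog')      # matches: 'cat' is at the start
not_start = re.search(r'^cat', 'bobcat')    # None: 'cat' is not at the start
ends = re.search(r'cat$', 'bobcat')         # matches: 'cat' is at the end

# A backslash turns a metacharacter into a literal character.
literal_dot = re.search(r'a\.b', 'a.b')     # matches the real dot
no_wildcard = re.search(r'a\.b', 'axb')     # None: \. does not accept 'x'
wildcard = re.search(r'a.b', 'axb')         # matches: . accepts any character
backslash = re.search(r'\\', r'C:\temp')    # matches one real backslash
at_sign = re.search(r'\@', 'user@host')     # escaping '@' changes nothing

# re.escape() escapes every metacharacter in a piece of literal text.
escaped = re.escape('1+1=2')
literal_sum = re.search(escaped, 'so 1+1=2 holds')

print(starts.group(), ends.group(), literal_sum.group())
```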

import re

text = 'a cute word:cat!!'

# Look for 'word:' followed by exactly three word characters.
match = re.search(r'word:\w\w\w', text)

# search() returns None when nothing matches, so test the result first.
if match:
    print('found', match.group())

Here re.search(r'word:\w\w\w', text) searches text for the pattern. The r prefix marks the pattern as a raw string, in which backslashes are not processed as escape sequences; \n, for example, is not turned into a newline once the string is marked with r. The name match refers to the result of the search.

Output: found word:cat
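The effect of the r prefix, and the fact that a failed search gives back None, can be checked directly. This is a small sketch with made-up input strings, not part of the original notes.

```python
import re

plain = '\n'   # one character: a newline
raw = r'\n'    # two characters: a backslash followed by 'n'
lengths = (len(plain), len(raw))

# A successful search returns a match object; a failed one returns None.
hit = re.search(r'word:\w\w\w', 'a cute word:cat!!')
miss = re.search(r'word:\w\w\w', 'a cute word:ca')   # only two word characters

print(lengths)
print(hit.group() if hit else 'no match')
print(miss.group() if miss else 'no match')
```

Calling group() on None raises an AttributeError, which is why the result is tested before it is used.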

import re

print(re.search(r'..g', 'piiig').group())
print(re.search(r'\d\d\d', 'p123g').group())
print(re.search(r'\w\w\w', '@@abcd!').group())

Output:
iig
123
abc

Note: '..g' finds any two characters followed by g.
'\d\d\d' finds three consecutive digits in the string.
'\w\w\w' finds three consecutive word characters in the string ('\w\w\w\w' would output abcd).

In regular expressions, + and * express repetition: * means zero or more repetitions of the preceding element, and + means one or more.

import re

print(re.search(r'pi+', 'piiig').group())
print(re.search(r'pi*', 'pg').group())

Output:
piii
p


What do square brackets do? They define a set of characters, any one of which may match at that position, so they act like an "or" over single characters. A dash inside the brackets denotes a range, so [a-d] means a, b, c, or d.

import re

# Matches the first run made up only of the characters a, b, and c.
print(re.search(r'[abc]+', 'xxxacbbcbbadddedede').group())
# Matches the first run made up only of the letters a through d.
print(re.search(r'[a-d]+', 'xxxacbbcbbadddedede').group())

Output:
acbbcbba
acbbcbbaddd
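
Parentheses in a pattern define groups, which let you pick out parts of the matched text:

import re

text = 'purple alice-b@jisuanke.com monkey dishwasher'

# Group 1 captures the user name, group 2 captures the host.
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

if match:
    print(match.group())     # the whole match
    print(match.group(1))    # the user name
    print(match.group(2))    # the host

Output:
alice-b@jisuanke.com
alice-b
jisuanke.com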

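Note the difference between calling group() with and without an argument: with no argument it returns the entire matched text, while group(n) returns only the text matched by the n-th pair of parentheses.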
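The different ways of reading a match object can be put side by side. The sketch below reuses the e-mail pattern from above; group(0), groups(), and the multi-argument form of group() are standard match-object methods that the notes do not mention.

```python
import re

match = re.search(r'([\w.-]+)@([\w.-]+)', 'purple alice-b@jisuanke.com monkey')

whole = match.group()         # entire matched text
also_whole = match.group(0)   # group 0 is the same as passing no argument
user = match.group(1)         # first parenthesized group
host = match.group(2)         # second parenthesized group
both = match.groups()         # all parenthesized groups, as a tuple
pair = match.group(1, 2)      # several groups at once, also a tuple

print(whole)
print(user, host)
print(both)
```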

Regular expressions pack a lot of meaning into a short combination of characters, but they are so dense that you can spend a lot of time trying to get one right. Here are some simple suggestions to help you debug regular expressions more efficiently.

You can put together a list of test strings for debugging, some of which should match the regular expression and some of which should not. When designing these strings, make them differ from one another in meaningful ways, so that together they cover the various mistakes we might have made in the regular expression. For example, for a regular expression that uses +, consider including a string that would match the same pattern written with * but does not match it with +.

You can then write a loop that checks each string in the list against the regular expression and compares the outcome with the expected results, which you keep in a second list. If the two disagree, consider whether the regular expression needs to be changed. If they agree, consider refining the test strings or adding new ones.
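One way to set up such a check is sketched below. The helper name check, the pattern under test, and the test cases are all hypothetical; here the expected result is stored next to each test string rather than in a separate list, which is only one possible layout.

```python
import re

# Pattern under test: 'p' followed by one or more 'i'.
pattern = r'pi+'

# Each case is (input string, expected match text or None). Hypothetical data.
cases = [
    ('piiig', 'piii'),
    ('pig', 'pi'),
    ('pg', None),    # would match 'pi*' but must not match 'pi+'
    ('PIG', None),   # differs only in case
]

def check(pattern, cases):
    """Return (text, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        match = re.search(pattern, text)
        actual = match.group() if match else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

failures = check(pattern, cases)
print('all cases pass' if not failures else failures)
```

Running check(r'pi*', cases) instead reports the 'pg' case as a failure, which is the kind of mistake the * versus + test string is meant to catch.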

Besides the search method used so far, the re module offers findall, which finds every non-overlapping match in the string rather than just the first. If the pattern contains more than one group, findall returns a list of tuples, one tuple per match, with one element per group.

import re

text = 'purple alice@jisuanke.com, blah monkey bob@abc.com blah'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(tuples)

Output: [('alice', 'jisuanke.com'), ('bob', 'abc.com')]
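What findall returns depends on how many groups the pattern has. The following sketch runs three variants of the same e-mail pattern over the text used above; the variable names are chosen for illustration.

```python
import re

text = 'purple alice@jisuanke.com, blah monkey bob@abc.com blah'

# No groups: a list of whole matches.
no_groups = re.findall(r'[\w.-]+@[\w.-]+', text)
# One group: a list of the text captured by that group.
one_group = re.findall(r'[\w.-]+@([\w.-]+)', text)
# Two groups: a list of tuples, one element per group.
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```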



The functions in the re module accept optional flags. Both search() and findall() take them as an extra argument; for example, in re.search(pat, str, re.IGNORECASE), the flag re.IGNORECASE is passed as the third argument.

The re module defines several such flags. IGNORECASE, mentioned above, makes matching ignore the difference between uppercase and lowercase. DOTALL allows . to match a newline as well, so that .* can match across lines instead of stopping at the end of a line. MULTILINE makes ^ and $ match at the start and end of every line in a multi-line string; without it, ^ and $ match only at the start and end of the whole string.
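The three flags can be compared on a small multi-line string. This is a sketch with made-up text; combining flags with | is standard re usage that the notes do not cover.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: 'line' also matches 'LINE'.
ignore_case = re.findall(r'line', text, re.IGNORECASE)

# Without DOTALL, . stops at the newline, so the pattern cannot span two lines.
default_dot = re.search(r'First.*second', text)
dotall = re.search(r'First.*second', text, re.DOTALL)

# Without MULTILINE, ^ matches only at the very start of the string.
default_anchor = re.findall(r'^\w+', text)
multiline = re.findall(r'^\w+', text, re.MULTILINE)

# Flags can be combined with the | operator.
combined = re.findall(r'^\w+ line$', text, re.IGNORECASE | re.MULTILINE)

print(ignore_case)
print(default_anchor, multiline)
print(combined)
```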

Besides flags, we also need to understand the "greedy" behavior of matching. Suppose we have the text <b>foo</b> and <i>so on</i> and want to use (<.*>) to extract all the HTML tags. What do you think the result will be? Will we get the desired <b>, </b>, <i>, </i>?

The actual result may be a little unexpected. Because .* is "greedy", it tries to match as much text as possible, so we get the whole of <b>foo</b> and <i>so on</i> as a single match. To get the desired result, we need the match to be non-greedy. In a regular expression, * and + are greedy by default, and adding a ? after them makes them non-greedy.

That is, if we change (<.*>) to (<.*?>), the regular expression matches <b>, then </b>, then <i>, and then </i>, one at a time, which is exactly what we wanted. Likewise, wherever + is used, we can change it to +? to make the match non-greedy.
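The difference is easy to see by running both patterns over the sample text. The last pattern, which uses a negated character set instead of a non-greedy quantifier, is an alternative that is not part of the notes above and is included only for comparison.

```python
import re

html = '<b>foo</b> and <i>so on</i>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'(<.*>)', html)
# Non-greedy: .*? stops at the first '>' it can.
lazy = re.findall(r'(<.*?>)', html)

# The same idea applies to + and +?.
greedy_plus = re.search(r'a+', 'aaa').group()
lazy_plus = re.search(r'a+?', 'aaa').group()

# Alternative (an assumption, not from the notes): exclude '>' inside the tag.
negated = re.findall(r'<[^>]*>', html)

print(greedy)
print(lazy)
```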
