Python Regular Expressions

Last Update:2016-08-14 Source: Internet

Author: User

A regular expression is a compact pattern language for describing and matching text, and the re module in Python provides support for it. A regular expression consists of ordinary characters and metacharacters. Ordinary characters, such as uppercase and lowercase letters and digits, match themselves, while metacharacters have special meanings.

When a regular expression contains only ordinary characters, matching is a plain substring search. For example, the regular expression "testing" contains no metacharacters; it matches strings such as "testing" and "testing123", but it does not match "Testing", because matching is case-sensitive. Metacharacters are not treated as ordinary characters. They are: . ^ $ * + ? { } [ ] \ | ( )

. matches any single character except a newline. \w matches a single letter, digit, or underscore (for ASCII text, equivalent to [a-zA-Z0-9_]), and \W matches any single character that is not a letter, digit, or underscore. \b matches the boundary between a word character (\w) and a non-word character (\W). \s matches a single whitespace character (space, newline, carriage return, tab, or form feed; for ASCII text, roughly [ \n\r\t\f]), and \S matches any single non-whitespace character. \t, \n, and \r match a tab, a newline, and a carriage return, respectively. \d matches a single decimal digit (for ASCII text, equivalent to [0-9]).
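The character classes above can be tried out on a small string. The following is a minimal sketch; the sample string and the variable names are made up for illustration and do not come from the original article.

```python
import re

# Hypothetical sample text; "\t" in this ordinary string is a real tab character.
sample = "id_42 costs $7.50\ttoday"

words = re.findall(r"\w+", sample)       # runs of letters, digits, underscores
digits = re.findall(r"\d+", sample)      # runs of decimal digits
whitespace = re.findall(r"\s", sample)   # single whitespace characters
non_word = re.findall(r"\W", sample)     # single non-word characters
chunks = re.findall(r"\S+", sample)      # runs of non-whitespace characters

# \b requires a word boundary on both sides, so "cost" inside "costs" fails.
partial = re.search(r"\bcost\b", sample)
whole = re.search(r"\bcosts\b", sample)

print(words)    # ['id_42', 'costs', '7', '50', 'today']
print(digits)   # ['42', '7', '50']
print(chunks)   # ['id_42', 'costs', '$7.50', 'today']
print(partial, whole.group())   # None costs
```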

^ marks the start of the string and $ marks the end. \ escapes a metacharacter: \. matches a literal period, \\ matches a literal backslash, and so on. If you are not sure whether a punctuation character needs to be escaped, you can put a backslash in front of it anyway; for example, writing @ as \@ causes no problem.
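A short sketch of anchoring and escaping follows. The test strings are invented for illustration. The last part uses re.escape, a helper from the standard re module that the original article does not mention; it is shown here only as one possible way to escape a literal string without doing it by hand.

```python
import re

# ^ and $ tie the pattern to the start and end of the string.
at_start = re.search(r"^cat", "cat food")   # match
not_at_start = re.search(r"^cat", "a cat")  # None: "cat" is not at the start
at_end = re.search(r"cat$", "a cat")        # match

# An unescaped . matches any character; \. matches only a literal period.
loose = re.search(r"a.c", "abc")     # match
strict = re.search(r"a\.c", "abc")   # None
literal = re.search(r"a\.c", "a.c")  # match

# Escaping punctuation that does not need it is harmless.
at_sign = re.search(r"user\@host", "user@host")  # match

# re.escape builds a pattern that matches the given text literally.
pattern = re.escape("1+1=2?")
found = re.search(pattern, "is 1+1=2? yes")

print(bool(at_start), not_at_start, bool(at_end))
print(bool(loose), strict, bool(literal), bool(at_sign))
print(found.group())
```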

import re
text = 'A cute word:cat!!'
match = re.search(r'word:\w\w\w', text)

if match:
    print('found', match.group())

Here re.search looks through text for the first place where the pattern word:\w\w\w matches. The r in front of the pattern marks it as a raw string, in which backslashes are not processed as escape sequences; for example, \n in a raw string is a backslash followed by n, not a newline. The match variable holds the match object if a match is found and None otherwise, which is why it is tested with if before it is used.

The output is: found word:cat
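The effect of the r prefix can be checked directly. This is a minimal sketch; the path string is a made-up example.

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n
print(len(plain), len(raw))   # 1 2

# Hypothetical Windows-style path, used only to have a backslash to search for.
path = r"C:\temp\notes.txt"

# In a raw string, the regex \\ (a literal backslash) is written as is.
with_raw = re.search(r"\\temp", path)

# Without the r prefix, every backslash has to be doubled again.
without_raw = re.search("\\\\temp", path)

print(with_raw.group() == without_raw.group())   # True
```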

import re
print(re.search(r'..g', 'piiig').group())
print(re.search(r'\d\d\d', 'p123g').group())
print(re.search(r'\w\w\w', '@@abcd!!').group())

The output is: iig   123   abc
Note: '..g' finds g together with the two characters before it
    '\d\d\d' finds three consecutive digits in the string
    '\w\w\w' finds three consecutive word characters in the string (the output would be abcd with '\w\w\w\w')

In regular expressions, + and * express repetition: * means zero or more repetitions of the preceding item, and + means one or more repetitions.

import re
print(re.search(r'pi+', 'piiig').group())
print(re.search(r'pi*', 'pg').group())

Output: piii   p
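The difference between + and * shows up when the repeated character is absent. A small sketch with illustrative variable names; the last line also shows that search returns the leftmost match, not the longest one in the string.

```python
import re

plus_hit = re.search(r"pi+", "piiig")    # p followed by one or more i
plus_miss = re.search(r"pi+", "pg")      # None: + needs at least one i
star_hit = re.search(r"pi*", "pg")       # * accepts zero i, so "p" matches

# search stops at the first (leftmost) place the pattern can match.
leftmost = re.search(r"i+", "piigiiii")

print(plus_hit.group())   # piii
print(plus_miss)          # None
print(star_hit.group())   # p
print(leftmost.group())   # ii
```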


Square brackets define a character set: [abc] matches any single one of the characters a, b, or c, and a dash gives a range, so [a-d] matches any single character from a to d.

import re
print(re.search(r'[abc]+', 'xxxacbbcbbadddedede').group())
print(re.search(r'[a-d]+', 'xxxacbbcbbadddedede').group())

The output is: acbbcbba   acbbcbbaddd

# The former matches a run of consecutive characters drawn from a, b, and c
# The latter matches a run of consecutive characters drawn from the letters a to d

import re
text = 'purple alice-b@jisuanke.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))

Output: alice-b@jisuanke.com
alice-b
jisuanke.com

Note: Parentheses in the pattern create groups. Calling group() with no argument returns the whole match, while group(1) and group(2) return only the text matched by the first and second pair of parentheses.
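The different ways of calling group() can be compared side by side. A minimal sketch; the address uses the placeholder domain example.com and is not taken from the original article.

```python
import re

match = re.search(r"([\w.-]+)@([\w.-]+)",
                  "purple alice-b@example.com monkey dishwasher")

whole = match.group()     # no argument: the entire match
also_whole = match.group(0)
user = match.group(1)     # text matched by the first pair of parentheses
host = match.group(2)     # text matched by the second pair of parentheses
both = match.groups()     # all groups at once, as a tuple

print(whole)   # alice-b@example.com
print(user)    # alice-b
print(host)    # example.com
print(both)    # ('alice-b', 'example.com')
```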

Regular expressions pack a lot of meaning into a few characters. Because they are so dense, getting one right can take a lot of time. Here are some simple suggestions to help you debug regular expressions more efficiently.

Build a list of test strings for debugging: some that the regular expression should match and some that it should not. When designing these strings, make them differ from one another as much as possible, so that they cover the different mistakes the regular expression might contain. For example, for a regular expression that uses +, consider including a string that would match if * were used but does not match with +.

Then write a loop that checks each string in the list against the regular expression and compares the result with the expected result, stored in a second list. If any result differs from what you expected, consider whether the regular expression needs to be changed. If all the results agree, consider refining the test strings or adding new ones.
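One way such a check loop might look is sketched below. The pattern, the test strings, and the names cases and failures are all illustrative assumptions; here the inputs and expected results are kept together as pairs rather than in two separate lists, which is only one possible arrangement.

```python
import re

# Pattern under test (hypothetical example).
pattern = re.compile(r"pi+")

# Each case pairs an input string with the expected match (None = no match).
cases = [
    ("piiig", "piii"),
    ("pig", "pi"),
    ("pg", None),    # would match pi*, but must not match pi+
    ("xyz", None),
]

failures = []
for text, expected in cases:
    m = pattern.search(text)
    actual = m.group() if m else None
    if actual != expected:
        failures.append((text, expected, actual))

# An empty list means the pattern behaved as expected on every case.
print(failures or "all cases pass")
```

If a case fails, the tuple in failures shows the input, the expected result, and what the pattern actually returned, which is usually enough to see which part of the pattern to revisit.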

Besides search, there is another function, findall, which finds every match in the string rather than only the first. When the pattern contains parenthesized groups, findall returns a list of tuples, one tuple per match, with one element per group.

import re
text = 'purple alice@jisuanke.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(tuples)

Output: [('alice', 'jisuanke.com'), ('bob', 'abc.com')]
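The shape of the list that findall returns depends on how many groups the pattern contains. The following sketch compares zero, one, and two groups; the addresses use placeholder domains and are assumptions made for illustration.

```python
import re

text = "purple alice@example.com, blah monkey bob@example.org blah dishwasher"

# No groups: a list of whole matches.
no_groups = re.findall(r"[\w.-]+@[\w.-]+", text)

# One group: a list of what that group matched.
one_group = re.findall(r"([\w.-]+)@[\w.-]+", text)

# Two groups: a list of tuples, one element per group.
two_groups = re.findall(r"([\w.-]+)@([\w.-]+)", text)

print(no_groups)    # ['alice@example.com', 'bob@example.org']
print(one_group)    # ['alice', 'bob']
print(two_groups)   # [('alice', 'example.com'), ('bob', 'example.org')]
```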


The functions in the re module accept optional flags. Both search() and findall() take a flag as an extra argument; for example, re.search(pat, text, re.IGNORECASE) passes the IGNORECASE flag defined in re.

The re module defines several such flags. IGNORECASE, mentioned above, makes matching ignore the difference between uppercase and lowercase. DOTALL lets . match a newline as well, so that .* can match across lines instead of stopping at the end of a line. MULTILINE makes ^ and $ match at the start and end of each line in a string that contains several lines; without it, ^ and $ match only at the start and end of the whole string.
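The three flags can be compared on one multi-line string. This is a minimal sketch with a made-up sample text and illustrative variable names.

```python
import re

text = "First line\nsecond LINE\nthird line"

# IGNORECASE: "line" also matches "LINE".
case_sensitive = re.findall(r"line", text)
case_insensitive = re.findall(r"line", text, re.IGNORECASE)

# DOTALL: . also matches newlines, so .* can span several lines.
one_line_only = re.search(r"First.*third", text)
across_lines = re.search(r"First.*third", text, re.DOTALL)

# MULTILINE: ^ matches at the start of every line, not just the string.
string_start = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

print(case_sensitive)     # ['line', 'line']
print(case_insensitive)   # ['line', 'LINE', 'line']
print(one_line_only)      # None
print(string_start)       # ['First']
print(line_starts)        # ['First', 'second', 'third']
```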

Besides flags, we also need to understand that regular expression matching is "greedy". Suppose we have the text <b>foo</b> and <i>so on</i> and we use the pattern (<.*>) hoping to extract each HTML tag. What do you think the result will be? Do we get <b>, </b>, <i>, </i> as desired?

The result may be unexpected. Because .* is greedy, it takes the longest match it can, so the pattern matches the whole string <b>foo</b> and <i>so on</i> as a single result. To get the desired result, the match must be non-greedy. In a regular expression, * and + are greedy by default, and adding ? after either of them makes it non-greedy.

That is, if we change (<.*>) to (<.*?>), the regular expression matches <b> first, then </b>, then <i>, and finally </i>, exactly as we expected. Likewise, where + is used, we can change + to +? to get a non-greedy match.
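Greedy and non-greedy matching can be compared directly. In this sketch the HTML snippet is an assumed sample input, and the second part applies the same idea to + and +? on a made-up string.

```python
import re

# Assumed sample text containing four tags.
html = "<b>foo</b> and <i>so on</i>"

greedy = re.findall(r"<.*>", html)    # .* runs to the last > in the string
lazy = re.findall(r"<.*?>", html)     # .*? stops at the first > it can

print(greedy)   # ['<b>foo</b> and <i>so on</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# The same applies to + and +?.
plus_greedy = re.search(r"a+", "aaa").group()
plus_lazy = re.search(r"a+?", "aaa").group()

print(plus_greedy, plus_lazy)   # aaa a
```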

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
