Regular expressions in Python

Source: Internet
Author: User

First, the Prelude:
1. Regular expressions: Regular expressions are a powerful weapon for matching strings. Its design idea is to use a descriptive language to define a rule for a string, and any string that conforms to the rule, we think it "matches", otherwise the string is illegal.


2. Because the regular expression is also represented by a string, we want to first understand how to use characters to describe a character, the following are the most commonly used descriptive characters:
\d: Indicates that a number can be matched, such as: ' 00\d ' can match ' 001 ', ' \d\d\d ' can match ' 010 '
\w: Indicates that a letter or number can be matched, for example: ' \w\w\w ' can match ' abc '
All of the above matches are exact matches, that is, a letter or a number to match one by one of the corresponding


3. So, for the variable length matching we need to use other characters, with * for any character (including 0), with a + to represent at least one character, with? represents 0 or 1 characters, with {n} representing n characters, with {n,m} representing n-m characters, in fact, as long as you understand that these symbols correspond to the role, It is not very difficult to understand that you can use it again by practicing.
Example: \d{3}\s+\w{3,8}:\d{3} represents three number combinations, \s+ has at least one space, \w{3,8} indicates 3 to 8 letters


4. To make a more accurate match, you can use [] to represent the range

(1). For example: [0-9a-za-z\_] can match a number, letter or underline, note that there is an underscore in front of the need to use a \ to escape the character;

[0-9a-za-z\_]+ can match a string of at least one number, letter, or underscore, note that a \ is used before the underscore to escape the character;

[A-za-z\_] [0-9a-za-z\_]* can be matched by a letter or underscore, followed by a string consisting of a number, letter, or underscore;

[A-za-z\_] [0-9a-za-z\_] {0, 19} More precisely restricts the length of the variable to 1-20 characters (the preceding 1 characters + 19 characters later);
^ Represents the beginning of a line, and ^\d indicates that it must begin with a number.
$ represents the end of the line, and \d$ indicates that it must end with a number.
(2). In fact [] the role of a delimiter, each delimiter in the middle of the required number or letter range, followed by the value of {} is determined in the range of [].


Second, use:
1. We are using the match method in the RE module for matching, Re.match (R ' Regular expression ', ' need matching string '), we strongly recommend using the Python r prefix, do not consider the problem of escaping.
The 2.match () method determines if the match is successful, returns a Match object if the match succeeds, otherwise returns none, generally we can use it after the IF statement to make a successful match.

Example: Import re,time
A=input (' Please enter a landline number where you want to find your place of ownership ')
If Re.match (R ' [0-9]{4}[\-][0-9]{7} ', a):
Print (' The number you entered is correct, please wait for your inquiry ... ')
Time.sleep (3)
Print (' Hello, this is Xinjiang's landline number ')
Else
Print (' Input error please re-enter ')


>>> ================================ RESTART ================================
>>>
Please enter a landline number 0993-6999720 where you want to find your place of ownership
The number you entered is correct, please wait for your inquiry ...
Hello, this is the landline number of Xinjiang


3. If you want to match a comma, underline these special characters, you need to precede the regular expression with a \.


Three, cut the string:
1. We can use split in regular expressions using re.split (regular expression, a string that needs to be split), where the regular expression is split as the basis of the split string.

Example:>>> re.split (R ' [\s\;\,]+ ', ' as,sd we;was;,;bi ')

[' as ', ' SD ', ' we ', ' was ', ' bi ']
We can use this method to convert the non-canonical input to the correct list output.


Four, group:
1. In addition to simply judging whether or not to match, the regular expression also has the powerful function of extracting the substring, in () means the grouping to be extracted (group), in fact, the original []{} division () is expanded to represent a grouping.

Example: >>> m = re.match (R ' ^ ([0-9]{4})-([0-9]{7}) $ ', ' 1111-6666666 ')
>>> m
<_sre. Sre_match object; span= (0,), match= ' 1111-6666666 ' >
>>> M.group (1)
' 1111 '
>>> M.group (0)
' 1111-6666666 '
>>> M.group (2)
' 6666666 '
>>> m.groups ()
(' 1111 ', ' 6666666 ')
2. The occurrence of a () in the regular expression indicates a grouping, each () will match the subsequent string, the match succeeds in the grouping, we can use group to intercept the matching values in each group.
Note that group (0) is all intercepted, and the parameters inside are 0-based, and groups () returns the matching values in all groups as tuples


Five, greedy match:
1. A greedy match exists for grouping matches on regular expressions, because the regular match is a greedy match by default, which is to match as many characters as possible

Example: >>> m = re.match (R ' ^ (\d+) (6*) $ ', ' 1101016666666 ')
>>> m.groups ()
(' 1101016666666 ', ')
In this case the first group will all match the string, so the second grouping is the null value


2. Do we add a question mark to the last face of the first group? Make the group adopt a non-greedy match

Example: >>> m = re.match (R ' ^ (\d+?) (6*) $ ', ' 1101016666666 ')
>>> m.groups ()
(' 110101 ', ' 6666666 ')


Six, compile:
1. When we use regular expressions in Python, there are two things inside the RE module: ①, compiling a regular expression, or an error if the string of the regular expression itself is not valid;
②, use the compiled regular expression to match the string.


2. If a regular expression is to be reused thousands of times, for efficiency reasons, we can precompile the regular expression, and we don't need to compile this step when we re-use it.
Direct match: >>> m = re.compile (R ' ^ (\d+?) (6*) $ ')
>>> m.match (' 1231313666666 '). Groups ()
(' 1231313 ', ' 666666 ')
The M object is generated after compilation, because the object itself contains a regular expression, so the corresponding method is called without giving a regular string.


Vii. Summary:
For regular expressions I think it can be understood that the so-called match is to see if the string needs to be checked to meet the regular rules you write, so-called regular rules are you define a number of situations, such as [com]* this regular rule, * means matching 0 or more times, so this regular rule we can understand as ' C ', ' com ', ' Co ', ' Om ', ' comcom ' and many more strings by c,o,m combination, as long as these conditions are counted to match successfully,
And [COM] means (the default 1 times) only match ' com ' and for the other cases mentioned above is wrong (note that if the parentheses are fixed string then {n} is a match n times, such as (COM) {2} So it is necessary to match comcom), I think the regular expression is actually a function of checking.

Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.

Regular expressions in Python

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.