Further exploration of regular expressions in Python

Source: Internet
Author: User
Strings are the most data structure involved in programming, and the need to manipulate strings is almost ubiquitous. For example, to determine whether a string is a legitimate email address, although it can be programmed to extract the substring before and after, and then judge whether it is a word and domain name, but this is not only cumbersome, and the code is difficult to reuse.

A regular expression is a powerful weapon used to match strings. Its design idea is to use a descriptive language to define a rule for a string, and any string that conforms to the rule, we think it "matches", otherwise the string is illegal.

So the way we judge whether a string is a legitimate email is:

    • Create a regular expression that matches the email;
    • Use the regular expression to match the user's input to determine whether it is legal.

Because the regular expression is also represented by a string, we first know how to describe the character with characters.

In regular expressions, if a character is given directly, it is exactly the exact match. With \d you can match a number, \w can match a letter or a number, so:

    • ' 00\d ' can match ' 007 ', but cannot match ' 00A ';
    • ' \d\d\d ' can match ' 010 ';
    • ' \w\w\d ' can match ' py3 ';

. can match any character, so:

' py. ' Can match ' pyc ', ' pyo ', ' py! ' Wait a minute.

To match the variable length character, in the regular expression, use * to represent any character (including 0), with + for at least one character, to represent 0 or 1 characters, to represent n characters with {n}, and {n,m} to represent n-m characters:

Take a look at a complex example: \d{3}\s+\d{3,8}.

Let's read from left to right:

    • \d{3} indicates a match of 3 digits, e.g. ' 010 ';
    • \s can match a space (also including tab and other whitespace), so \s+ indicates at least one space, such as "," and so on;
    • \d{3,8} represents 3-8 digits, such as ' 1234567 '.

Together, the above regular expression can match a telephone number with an area code separated by any space.

What if I want to match a number like ' 010-12345 '? Because '-' is a special character, in the regular expression, to be escaped with ' \ ', so, the above is \d{3}\-\d{3,8}.

However, you still cannot match ' 010-12345 ' because there is a space. So we need more complex ways of matching.
Advanced

To make a more accurate match, you can use [] to represent a range, such as:

    • [0-9a-za-z\_] can match a number, letter or underline;
    • [0-9a-za-z\_]+ can match a string consisting of at least one number, letter, or underscore, such as ' A100 ', ' 0_z ', ' Py3000 ', etc.;
    • [A-za-z\_] [0-9a-za-z\_]* can be matched by a letter or underscore, followed by a string consisting of a number, letter, or underscore, which is a valid Python variable;
    • [A-za-z\_] [0-9a-za-z\_] {0, 19} More precisely restricts the length of a variable to 1-20 characters (the preceding 1 characters + 19 characters later).

a| B can match A or b, so [P|p]ython can match ' python ' or ' python '.

^ Represents the beginning of a line, and ^\d indicates that it must begin with a number.

$ represents the end of the line, and \d$ indicates that it must end with a number.

You may have noticed that the Py can also match ' Python ', but with ^py$ it becomes an entire line match and only matches ' py '.
Re module

With the knowledge of readiness, we can use regular expressions in Python. Python provides the RE module, which contains the functionality of all regular expressions. Because the Python string itself is also escaped with \, pay special attention to:

s = ' abc\\-001 ' # python string # corresponding to the regular expression string becomes: # ' abc\-001 '

Therefore, we strongly recommend that you use the Python R prefix without having to consider escaping the problem:

s = R ' abc\-001 ' # python string # corresponding regular expression string invariant: # ' abc\-001 '

Let's look at how to tell if a regular expression matches:

>>> Import re>>> re.match (R ' ^\d{3}\-\d{3,8}$ ', ' 010-12345 ') <_sre. Sre_match object at 0x1026e18b8>>>> re.match (R ' ^\d{3}\-\d{3,8}$ ', ' 010 12345 ') >>>

The match () method determines if the match is true and returns a match object if the match succeeds, otherwise none is returned. The common judgment method is:

Test = ' user input string ' if Re.match (R ' Regular expression ', test): print ' OK ' else:print ' failed '

Slicing a string

Using regular expressions to slice a string is more flexible than a fixed character, see the normal segmentation code:

>>> ' a b C '. Split (') [' A ', ' B ', ', ' ', ' C ']

Well, you can't recognize contiguous spaces, try using regular expressions:

>>> Re.split (R ' \s+ ', ' a b C ') [' A ', ' B ', ' C ']

No matter how many spaces can be divided normally. To join, try:

>>> Re.split (R ' [\s\,]+ ', ' A, B, C d ') [' A ', ' B ', ' C ', ' d ']

Re-join; try:

>>> Re.split (R ' [\s\,\;] + ', ' A-B;; C d ') [' A ', ' B ', ' C ', ' d ']

If the user enters a set of tags, next time remember to use regular expressions to convert the nonstandard input into the correct array.
Group

In addition to simply judging whether a match is matched, the regular expression also has the power to extract substrings. The group (group) to be extracted is represented by (). Like what:

^ (\d{3})-(\d{3,8}) $ defines two groups, which can extract the area code and local numbers directly from the matching string:

>>> m = Re.match (R ' ^ (\d{3})-(\d{3,8}) $ ', ' 010-12345 ') >>> m<_sre. Sre_match object at 0x1026fb3e8>>>> m.group (0) ' 010-12345 ' >>> m.group (1) ' 010 ' >>> M.group (2) ' 12345 '

If a group is defined in a regular expression, a substring can be extracted with the group () method on the match object.

Notice that group (0) is always the original string, group (1), Group (2) ... Represents the 1th, 2 、...... Substring.

Extracting substrings is useful. Look at a more vicious example:

>>> t = ' 19:05:30 ' >>> m = re.match (R ' ^ (0[0-9]|1[0-9]|2[0-3]|[ 0-9]) \:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]| [0-9]) \:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]| [0-9]) $ ', T) >>> m.groups (' 19 ', ' 05 ', ' 30 ')

This regular expression can directly identify the legal time. However, there are times when it is not possible to fully validate with regular expressions, such as identifying dates:

' ^ (0[1-9]|1[0-2]| [0-9]) -(0[1-9]|1[0-9]|2[0-9]|3[0-1]| [0-9]) $'

For the ' 2-30 ', ' 4-31 ' such illegal date, with regular or not recognized, or write out to be very difficult, then the need for a program to identify the cooperation.
Greedy match

Finally, it should be noted that the regular match is a greedy match by default, which is to match as many characters as possible. For example, match the 0 following the number:

>>> Re.match (R ' ^ (\d+) (0*) $ ', ' 102300 '). Groups () (' 102300 ', ')

Since the \d+ uses greedy matching, the following 0 are all matched directly, the result 0* can only match the empty string.

You must let \d+ use a non-greedy match (that is, as few matches as possible) in order to match the back of the 0, add a? You can let the \d+ use a non-greedy match:

>>> Re.match (R ' ^ (\d+?) (0*) $ ', ' 102300 '). Groups () (' 1023 ', ' 00 ')

Compile

When we use regular expressions in Python, two things are done inside the RE module:

    1. Compiles the regular expression, if the regular expression string itself is illegal, will error;
    2. Use the compiled regular expression to match the string.

If a regular expression is to be reused thousands of times, for efficiency reasons, we can precompile the regular expression and then reuse it without compiling this step, directly matching:

>>> Import re# compilation:>>> Re_telephone = Re.compile (R ' ^ (\d{3})-(\d{3,8}) $ ') # using:>>> re_ Telephone.match (' 010-12345 '). Groups (' 010 ', ' 12345 ') >>> re_telephone.match (' 010-8086 '). Groups () (' 010 ', ' 8086 ')

The regular expression object is generated after compilation, because the object itself contains a regular expression, so the corresponding method is called without giving a regular string.
Summary

The regular expression is very powerful, it is impossible to finish it in a short section. You can write a thick book if you want to know everything about the regular. If you frequently encounter problems with regular expressions, you may need a reference book for regular expressions.

Please try to write a regular expression that validates the email address. Version one should be able to verify a similar email:

Someone@gmail.combill.gates@microsoft.comtry

Version two can verify and extract the name of the email address:

 
  
   
   tom@voyager.org
 
  
  • Contact Us

    The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

    If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

    A Free Trial That Lets You Build Big!

    Start building with 50+ products and up to 12 months usage for Elastic Compute Service

    • Sales Support

      1 on 1 presale consultation

    • After-Sales Support

      24/7 Technical Support 6 Free Tickets per Quarter Faster Response

    • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.