41 Regular Expressions

Source: Internet
Author: User

Strings are the most commonly used data structure in programming, and the need to manipulate them is nearly ubiquitous. For example, to determine whether a string is a valid email address, you could programmatically extract the substrings before and after the @ and then check whether they are a word and a domain name, but this is cumbersome and the code is hard to reuse.

A regular expression is a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match", and any other string is considered invalid.

So the way to check whether a string is a valid email address is:

    1. Create a regular expression that matches email addresses;

    2. Use the regular expression to match the user's input and determine whether it is valid.
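As a rough sketch of these two steps in Python: the pattern below is a deliberately simplified assumption, not a complete email rule, and looks_like_email is a hypothetical helper name. The syntax it uses is explained in the rest of this section.

```python
import re

# Step 1: a hypothetical, simplified rule (NOT a full email validator):
# name characters, an @, then a domain made of dot-separated labels.
EMAIL_PATTERN = r'^[0-9a-zA-Z_.]+@[0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)+$'


def looks_like_email(text):
    """Step 2: match the user's input against the rule from step 1."""
    return re.match(EMAIL_PATTERN, text) is not None


print(looks_like_email('someone@example.com'))  # True
print(looks_like_email('not an email'))         # False
```

The rule lives in one string, so it can be reused wherever the check is needed, which is the reuse problem the hand-written approach runs into.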

Because a regular expression is itself written as a string, we first need to know how to describe characters with characters.

In a regular expression, a character given literally matches exactly that character. \d matches one digit, and \w matches one letter, digit, or underscore, so:

    • '00\d' can match '007', but cannot match '00A';

    • '\d\d\d' can match '010';

    • '\w\w\d' can match 'py3';

. matches any single character, so:

    • 'py.' can match 'pyc', 'pyo', 'py!', and so on.
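The examples above can be checked with a short script. This is a minimal sketch that assumes Python's re module and uses re.fullmatch, which requires the whole string to match the pattern.

```python
import re

# (pattern, candidate) pairs taken from the examples above
cases = [
    (r'00\d', '007'),
    (r'00\d', '00A'),
    (r'\d\d\d', '010'),
    (r'\w\w\d', 'py3'),
    (r'py.', 'pyc'),
    (r'py.', 'py!'),
]

# True where the whole candidate string matches the pattern
results = [re.fullmatch(pattern, text) is not None for pattern, text in cases]

for (pattern, text), matched in zip(cases, results):
    print(pattern, repr(text), matched)
```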

To match a variable number of characters, a regular expression uses * for any number of characters (including 0), + for at least one character, ? for 0 or 1 characters, {n} for exactly n characters, and {n,m} for n to m characters.

Consider a more complex example: \d{3}\s+\d{3,8} .

Let's read from left to right:

    1. \d{3} matches exactly 3 digits, for example '010';

    2. \s matches one whitespace character (a space, a tab, and so on), so \s+ means at least one whitespace character, for example ' ' or '   ';

    3. \d{3,8} matches 3 to 8 digits, for example '1234567'.

Put together, the regular expression above matches a telephone number with an area code, separated from the rest by any amount of whitespace.
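A small sketch of this pattern in action, assuming Python's re module. The sample strings are made up for illustration.

```python
import re

pattern = r'\d{3}\s+\d{3,8}'

# Hypothetical sample inputs
samples = ['010 12345', '010   1234567', '010\t12345', '01012345', '010 12']

# Map each sample to whether the whole string matches the pattern
matches = {s: re.fullmatch(pattern, s) is not None for s in samples}

for text, matched in matches.items():
    print(repr(text), matched)
```

The last two samples fail for different reasons: '01012345' has no whitespace for \s+ to consume, and '010 12' has only two digits where \d{3,8} needs at least three.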

What if you want to match a number like '010-12345'? Because '-' has a special meaning inside [] (it denotes a range), it is escaped here with '\' to be safe, so the pattern above becomes \d{3}\-\d{3,8} .

However, this still cannot match '010 - 12345' because of the spaces. So we need more sophisticated ways of matching.
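The difference can be seen in a short sketch. The second pattern is an assumption added here, not something from the text: it combines \s with the * quantifier introduced earlier to allow optional whitespace around the hyphen.

```python
import re

strict = r'\d{3}\-\d{3,8}'
# Assumed extension (not from the text): optional whitespace around '-'
lenient = r'\d{3}\s*-\s*\d{3,8}'

strict_plain = re.fullmatch(strict, '010-12345') is not None
strict_spaced = re.fullmatch(strict, '010 - 12345') is not None
lenient_plain = re.fullmatch(lenient, '010-12345') is not None
lenient_spaced = re.fullmatch(lenient, '010 - 12345') is not None

print(strict_plain, strict_spaced)    # True False
print(lenient_plain, lenient_spaced)  # True True
```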

Advanced

For more precise matching, you can use [] to denote a range, for example:

    • [0-9a-zA-Z\_] matches one digit, letter, or underscore;

    • [0-9a-zA-Z\_]+ matches a string of at least one digit, letter, or underscore, for example 'a100', '0_Z', 'Py3000', and so on;

    • [a-zA-Z\_][0-9a-zA-Z\_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is a valid Python variable name;

    • [a-zA-Z\_][0-9a-zA-Z\_]{0,19} more precisely limits the length of the variable name to 1-20 characters (1 leading character + up to 19 following characters).
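As an illustration, the length-limited variable-name pattern can be wrapped in a small check. is_short_identifier is a hypothetical helper name, and the sketch assumes Python's re module.

```python
import re

# One leading letter or underscore, then up to 19 more name characters.
# The repeat count is written as {0,19}, with no space after the comma.
NAME_PATTERN = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'


def is_short_identifier(name):
    """Return True if the whole string is a 1-20 character variable name."""
    return re.fullmatch(NAME_PATTERN, name) is not None


for candidate in ['Py3000', '_x', '3py', 'a' * 20, 'a' * 21, '']:
    print(repr(candidate), is_short_identifier(candidate))
```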

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.

^ denotes the start of a line, so ^\d means the string must start with a digit.

$ denotes the end of a line, so \d$ means the string must end with a digit.

You may have noticed that py can also match 'python', but with ^py$ it becomes a whole-line match and only matches 'py'.
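A brief sketch of alternation and anchors, assuming Python's re module. It uses re.search, which looks for the pattern anywhere in the string, so the effect of ^ and $ is visible; the sample strings are made up.

```python
import re

# Alternation: either branch of (P|p) is accepted before 'ython'
alt = [re.fullmatch(r'(P|p)ython', s) is not None
       for s in ('Python', 'python', 'Jython')]

# Anchors, checked with re.search, which scans the whole string
starts_with_digit = re.search(r'^\d', '3 apples') is not None
ends_with_digit = re.search(r'\d$', 'room 101') is not None

# Without anchors 'py' is found inside 'python'; with ^ and $ it is not
py_in_python = re.search(r'py', 'python') is not None
whole_line_python = re.search(r'^py$', 'python') is not None
whole_line_py = re.search(r'^py$', 'py') is not None

print(alt)                                             # [True, True, False]
print(py_in_python, whole_line_python, whole_line_py)  # True False True
```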

re module

With this background, we can use regular expressions in Python. Python provides the re module, which contains all of the regular expression functionality. Because Python strings themselves also use \ for escaping, pay special attention to:

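    s = 'abc\\-001'  # Python string
    # the corresponding regular expression string becomes:
    # 'abc\-001'

Therefore, we strongly recommend using Python's r prefix, so that you do not have to think about escaping:

    s = r'abc\-001'  # Python string
    # the corresponding regular expression string is unchanged:
    # 'abc\-001'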
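A quick way to confirm that the two spellings produce the same string is to compare them directly. This is a minimal sketch; the variable names are chosen for illustration.

```python
import re

plain = 'abc\\-001'  # backslash written twice in an ordinary string
raw = r'abc\-001'    # backslash written once in a raw string

print(plain == raw)  # True: both hold the same characters
print(len(raw))      # 8

# Used as a regular expression, the string matches the text 'abc-001'
matched = re.match(raw, 'abc-001') is not None
print(matched)       # True
```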

Let's look at how to tell if a regular expression matches:

    >>> import re
    >>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
    <_sre.SRE_Match object; span=(0, 9), match='010-12345'>
    >>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
    >>>

The match() method checks whether the string matches: it returns a Match object if the match succeeds, and None otherwise. The usual way to test the result is:

    test = 'user-entered string'
    if re.match(r'regular expression', test):
        print('ok')
    else:
        print('failed')

Splitting a string

Splitting a string with a regular expression is more flexible than splitting on a fixed character. Consider the ordinary splitting code:

    >>> 'a b   c'.split(' ')
    ['a', 'b', '', '', 'c']

This cannot handle consecutive spaces. Try a regular expression instead:

    >>> re.split(r'\s+', 'a b   c')
    ['a', 'b', 'c']

No matter how many spaces there are, the string is split correctly. Now add , to the separators:

    >>> re.split(r'[\s\,]+', 'a,b, c  d')
    ['a', 'b', 'c', 'd']

Then add ; as well:

    >>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
    ['a', 'b', 'c', 'd']

If users enter a set of tags, remember to use a regular expression next time to convert the irregular input into a proper list.
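The tag use case might look like the following sketch. parse_tags is a hypothetical helper name, and dropping empty strings is an added assumption that handles input with leading or trailing separators.

```python
import re


def parse_tags(raw_input):
    """Split on runs of whitespace, commas, or semicolons; drop empty items."""
    return [tag for tag in re.split(r'[\s,;]+', raw_input) if tag]


print(parse_tags(' python,  regex;;web  '))  # ['python', 'regex', 'web']
```

The filter is needed because re.split returns an empty string at either end when the input starts or ends with a separator.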

Group

Besides simply testing whether a string matches, regular expressions can also extract substrings. Parentheses () mark the groups to be extracted. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

    >>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
    >>> m
    <_sre.SRE_Match object; span=(0, 9), match='010-12345'>
    >>> m.group(0)
    '010-12345'
    >>> m.group(1)
    '010'
    >>> m.group(2)
    '12345'

If groups are defined in a regular expression, you can extract the substrings from the Match object with its group() method.

Note that group(0) is always the entire matched text, while group(1), group(2), ... are the 1st, 2nd, ... substrings.
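In a program, the groups are usually unpacked into variables after checking that the match succeeded. A minimal sketch follows; split_phone is a hypothetical helper name, and groups() returns all of the substrings as a tuple.

```python
import re


def split_phone(text):
    """Return (area_code, local_number), or None if the text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.groups()


print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None
```

Checking for None first matters because calling group() or groups() on a failed match raises an AttributeError.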

Extracting substrings is very useful. Here is a more brutal example:

    >>> t = '19:05:30'
    >>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
    >>> m.groups()
    ('19', '05', '30')

This regular expression directly recognizes valid times. However, sometimes a regular expression cannot do the full validation, for example when recognizing dates:

    '^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

For invalid dates such as '2-30' and '4-31', a regular expression either cannot detect the problem or becomes very hard to write. In that case the program has to help with the validation.
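One way to combine the two is to let the regular expression check the shape and extract the numbers, then let the standard library check the calendar. This is only a sketch: is_valid_month_day is a hypothetical helper name, and the year parameter is an assumption added here because the pattern carries no year (it decides whether '2-29' is accepted).

```python
import re
from datetime import date

DATE_PATTERN = r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'


def is_valid_month_day(text, year=2023):
    """Regex checks the shape; datetime.date checks the calendar.

    `year` is an assumption of this sketch, not part of the original pattern.
    """
    m = re.match(DATE_PATTERN, text)
    if m is None:
        return False
    month, day = (int(g) for g in m.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


for text in ['2-30', '4-31', '4-30', '12-31']:
    print(text, is_valid_month_day(text))
```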

Greedy match

Finally, note that regular expressions match greedily by default, that is, they match as many characters as possible. For example, to match the zeros at the end of a number:

    >>> re.match(r'^(\d+)(0*)$', '102300').groups()
    ('102300', '')

Because \d+ is greedy, it consumes all of the trailing zeros itself, so 0* can only match the empty string.

To let 0* match the trailing zeros, \d+ must use non-greedy matching (that is, match as few characters as possible). Adding a ? after \d+ makes it non-greedy:

    >>> re.match(r'^(\d+?)(0*)$', '102300').groups()
    ('1023', '00')
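The two behaviours can be compared side by side in a small helper. split_trailing_zeros and its greedy flag are hypothetical names introduced for this sketch.

```python
import re


def split_trailing_zeros(number_text, greedy=True):
    """Split a digit string into (leading part, trailing zeros)."""
    digits = r'\d+' if greedy else r'\d+?'
    return re.match(r'^(' + digits + r')(0*)$', number_text).groups()


print(split_trailing_zeros('102300'))                # ('102300', '')
print(split_trailing_zeros('102300', greedy=False))  # ('1023', '00')
```

In the non-greedy case \d+? still has to grow until the rest of the pattern can match, which is why it stops at '1023' and not at '1'.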

Compile

When we use a regular expression in Python, the re module does two things internally:

    1. Compile the regular expression; if the regular expression string itself is invalid, an error is raised;

    2. Use the compiled regular expression to match the string.

If a regular expression is going to be reused thousands of times, then for efficiency we can precompile it, so that later uses skip the compilation step and match directly:

    >>> import re
    # compile:
    >>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
    # use:
    >>> re_telephone.match('010-12345').groups()
    ('010', '12345')
    >>> re_telephone.match('010-8086').groups()
    ('010', '8086')

Compilation produces a regular expression object. Because the object itself contains the regular expression, its methods are called without passing the pattern string.
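A sketch of both points: reusing one compiled object across many inputs, and the error raised at compile time for an invalid pattern. The input list and the broken pattern are made up for illustration.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

# Reuse the compiled object for every input; keep only successful matches
numbers = ['010-12345', '010-8086', 'not a number']
parsed = [m.groups() for m in map(re_telephone.match, numbers) if m]
print(parsed)  # [('010', '12345'), ('010', '8086')]

# An invalid pattern fails at compile time with re.error
try:
    re.compile(r'(\d{3}')  # unbalanced parenthesis
    compile_failed = False
except re.error:
    compile_failed = True
print(compile_failed)  # True
```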

Summary

Regular expressions are very powerful, and it is impossible to cover them fully in one short section. Covering everything about them would fill a thick book. If you often run into regular expression problems, you may need a reference book on the subject.

Reference source

regex.py
