Explore Regular Expressions in Python

Source: Internet
Author: User
Tags valid email address

Explore Regular Expressions in Python

This article mainly introduces some usage of Regular Expressions in Python. The use of regular expressions is an important knowledge in Python learning. For more information, see

String is the most involved data structure during programming, and the demand for string operations is almost everywhere. For example, to determine whether a string is a valid Email address, although it can be programmed to extract the substring before and after @ and then determine whether it is a word or a domain name separately, this is not only troublesome, but also difficult to reuse the code.

Regular Expressions are powerful weapons used to match strings. It is designed to define a rule for a string in a descriptive language. Any character string that complies with the rule will be considered "matched". Otherwise, this string is invalid.

Therefore, we can determine whether a string is valid by Email:

Create a regular expression that matches the Email;

Use this regular expression to match the user's input to determine if it is legal.

Because regular expressions are also represented by strings, we need to first understand how to use characters to describe characters.

In a regular expression, if a character is directly given, exact match is performed. Use \ d to match a number, and \ w to match a letter or number, so:

'00 \ d' can match '007 ', but cannot match '00a ';

'\ D \ d' can match '010 ';

'\ W \ d' can match 'py3 ';

. Can match any character, so:

'Py. 'Can match 'pyc', 'pyo', and 'py! 'And so on.

To match variable-length characters, in a regular expression, use * to represent any character (including 0 characters), use + to represent at least one character, and use? 0 or 1 characters, {n} represents n characters, and {n, m} represents n-m characters:

Let's take a complex example: \ d {3} \ s + \ d {3, 8 }.

Let's explain from left to right:

\ D {3} indicates that three numbers are matched, for example, '010 ';

\ S can match a space (including tabs and other blank characters), SO \ s + indicates at least one space, such as matching ''and;

\ D {3, 8} indicates 3-8 numbers, for example, '123 '.

In combination, the above regular expression can match a telephone number with a zone number separated by any space.

What if I want to match a number like '010-12345? Because '-' is a special character, it must be escaped using '\' in a regular expression. Therefore, the above regular expression is \ d {3} \-\ d {3 }.

However, it still cannot match '010-12345 'because it contains spaces. Therefore, we need more complex matching methods.

Advanced

To perform more precise matching, you can use [] to indicate the range, for example:

[0-9a-zA-Z \ _] can match a number, letter, or underline;

[0-9a-zA-Z \ _] + can match strings consisting of at least one digit, letter, or underline, such as 'a100', '0 _ Z', and 'py3000;

[A-zA-Z \ _] [0-9a-zA-Z \ _] * can match a string starting with a letter or underline followed by any number, letter, or underline, that is, the Python valid variable;

[A-zA-Z \ _] [0-9a-zA-Z \ _] {0, 19} more precisely, the variable length is limited to 1-20 characters (the first 1 character + the last 19 characters at most ).

A | B can match A or B, so [P | p] ython can match 'python' or 'python '.

^ Indicates the beginning of a row, and ^ \ d indicates that it must start with a number.

$ Indicates the end of the row, and \ d $ indicates that the end must be a number.

You may have noticed that py can also match 'python', but adding ^ py $ will change to a full line match, so you can only match 'py.

Re Module

With the preparation knowledge, we can use regular expressions in Python. Python provides the re module, including all regular expressions. Because the Python string itself uses \ escape, pay special attention to the following:

?

1

2

3

S = 'abc \-001' # Python string

# The Corresponding Regular Expression string is changed:

# 'Abc \-001'

Therefore, we strongly recommend that you use the r prefix of Python so that you do not need to consider escaping:

?

1

2

3

S = r'abc \-001' # Python string

# The Corresponding Regular Expression string remains unchanged:

# 'Abc \-001'

First, let's see how to determine whether the regular expression matches:

?

1

2

3

4

5

>>> Import re

>>> Re. match (R' ^ \ d {3} \-\ d {3, 8} $ ', '010-12345 ')

<_ Sre. SRE_Match object at 0x1026e18b8>

>>> Re. match (R' ^ \ d {3} \-\ d {3, 8} $ ', '010 12345 ')

>>>

The match () method is used to determine whether a Match exists. If the match succeeds, a Match object is returned. Otherwise, None is returned. Common judgment methods are:

?

1

2

3

4

5

Test = 'user-input string'

If re. match (r'regular expression', test ):

Print 'OK'

Else:

Print 'failed'

Split string

Splitting strings with regular expressions is more flexible than using fixed characters. Please refer to the normal splitting code:

?

1

2

>>> 'A B C'. split ('')

['A', 'B', '','', 'C']

Well, we cannot identify consecutive spaces. Try using regular expressions:

?

1

2

>>> Re. split (R' \ s + ', 'a B C ')

['A', 'B', 'C']

No matter how many spaces are separated. Join and try:

?

1

2

>>> Re. split (R' [\ s \,] + ', 'a, B, c D ')

['A', 'B', 'C', 'D']

Try again:

?

1

2

>>> Re. split (R' [\ s \, \;] + ', 'a, B; c D ')

['A', 'B', 'C', 'D']

If you have entered a group of tags, remember to use a regular expression to convert the nonstandard input into a correct array next time.

Group

In addition to a simple match, regular expressions also provide powerful functions to extract substrings. () Indicates the Group to be extracted ). For example:

^ (\ D {3})-(\ d {3, 8}) $ defines two groups respectively. You can extract the area code and local number from the matching string directly:

?

1

2

3

4

5

6

7

8

9

>>> M = re. match (R' ^ (\ d {3})-(\ d {3, 8}) $ ', '010-12345 ')

>>> M

<_ Sre. SRE_Match object at 0x1026fb3e8>

>>> M. group (0)

'010-12345'

>>> M. group (1)

'0'

>>> M. group (2)

'123'

If a group is defined in a regular expression, the substring can be extracted using the group () method on the Match object.

Note that group (0) is always the original string, group (1), group (2 )...... Indicates 1st, 2 ,...... Substring.

It is very useful to extract substrings. Let's look at a more ferocious example:

?

1

2

3

4

>>> T = '19: 05: 30'

>>> M = re. match (R' ^ (0 [0-9] | 1 [0-9] | 2 [0-3] | [0-9]) \ :( 0 [0-9] | 1 [0-9] | 2 [0-9] | 3 [0-9] | 4 [0-9] | 5 [0 -9] | [0-9]) \ :( 0 [0-9] | 1 [0-9] | 2 [0-9] | 3 [0-9] | 4 [0-9] | 5 [0 -9] | [0-9]) $ ', t)

>>> M. groups ()

('19', '05 ', '30 ')

This regular expression can directly identify valid time. However, in some cases, regular expressions cannot be used for full verification, such as date identification:

?

1

'^ (0 [1-9] | 1 [0-2] | [0-9]) -(0 [1-9] | 1 [0-9] | 2 [0-9] | 3 [0-1] | [0-9]) $'

Invalid dates like '2-30' and '4-31 'cannot be identified by regular expressions, or are difficult to write. In this case, the program must be used for identification.

Greedy match

In the end, we must note that regular expression matching is greedy by default, that is, matching as many characters as possible. For example, match the value 0 after the number:

?

1

2

>>> Re. match (R' ^ (\ d +) (0 *) $ ', '123'). groups ()

('20140901 ','')

Because \ d + uses greedy match, all the following 0 is matched directly, and the result 0 * can only match null strings.

The \ d + must adopt non-Greedy match (that is, as little as possible) to match the following 0 and add? We can make \ d + adopt non-Greedy match:

?

1

2

>>> Re. match (R' ^ (\ d + ?) (0 *) $ ', '123'). groups ()

('20140901', '00 ')

Compile

When we use a regular expression in Python, the re module does two tasks internally:

Compile a regular expression. If the string of the regular expression is invalid, an error is returned;

Use the compiled regular expression to match the string.

If a regular expression needs to be reused several thousand times, we can pre-compile the regular expression for efficiency reasons. Then, we do not need to compile this step when repeating the regular expression. We will directly match the regular expression:

?

1

2

3

4

5

6

7

8

>>> Import re

# Compilation:

>>> Re_telephone = re. compile (R' ^ (\ d {3})-(\ d {3, 8}) $ ')

# Use:

>>> Re_telephone.match ('010-12345 '). groups ()

('010 ', '123 ')

>>> Re_telephone.match ('010-8086 '). groups ()

('010 ', '123 ')

The Regular Expression object is generated after compilation. Because the object itself contains a Regular Expression, you do not need to provide a Regular string when calling the corresponding method.

Summary

Regular Expressions are very powerful, and it is impossible to finish it in a short section. To understand all the regular expressions, you can write a thick book. If you often encounter regular expressions, you may need a reference book for regular expressions.

Please try to write a regular expression to verify the Email address. Version 1 should be able to verify a similar Email:

?

1

2

3

Someone@gmail.com

Bill.gates@microsoft.com

Try

Version 2 verifies and extracts the Email address with the name:

?

1

Tom@voyager.org <Tom Paris>

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.