Exploring Regular Expressions in Python

Source: Internet
Author: User
Tags: data structures, regular expression, split, valid

This article introduces some uses of regular expressions in Python. Regular expressions are an important part of learning Python, and readers who need them can use this article as a reference.

Strings are the data structure most often encountered in programming, and the need to manipulate them is almost everywhere. For example, to determine whether a string is a valid email address, you could write code that extracts the substrings before and after the @ and then checks whether each is a valid name and a valid domain. But this is tedious, and the code is hard to reuse.

A regular expression is a powerful tool for matching strings. The idea is to define a rule in a descriptive language: any string that conforms to the rule is considered "matched", and any string that does not is considered invalid.

So the way to check whether a string is a valid email address is:

Create a regular expression that matches email addresses;

Use that regular expression to match the user's input and decide whether it is valid.

Because regular expressions are themselves written as strings, we first need to know how to describe characters using characters.

In a regular expression, a character given literally is an exact match. \d matches one digit and \w matches one letter or digit (\w also matches the underscore), so:

'00\d' matches '007', but does not match '00A';

'\d\d\d' matches '010';

'\w\w\d' matches 'py3';

. matches any single character (except a newline by default), so:

'py.' matches 'pyc', 'pyo', 'py!', and so on.
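A quick way to check these basic patterns is to run them through Python's re module. The sketch below uses re.fullmatch, which succeeds only when the whole string fits the pattern. The helper name fits is made up for illustration.

```python
import re

def fits(pattern, text):
    # Hypothetical helper: True only if the entire text fits the pattern.
    return re.fullmatch(pattern, text) is not None

print(fits(r'00\d', '007'))    # True
print(fits(r'00\d', '00A'))    # False: 'A' is not a digit
print(fits(r'\d\d\d', '010'))  # True
print(fits(r'\w\w\d', 'py3'))  # True
print(fits(r'py.', 'py!'))     # True: '.' accepts any single character
```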

To match a variable number of characters, a regular expression uses * for any number of characters (including 0), + for at least one character, ? for 0 or 1 characters, {n} for exactly n characters, and {n,m} for n to m characters.

Look at a more complex example: \d{3}\s+\d{3,8}.

Let's read it from left to right:

\d{3} matches 3 digits, such as '010';

\s matches a whitespace character (a space, a tab, and so on), so \s+ means at least one whitespace character, such as ' ' or '  ';

\d{3,8} matches 3 to 8 digits, such as '1234567'.

Taken together, the regular expression above matches a phone number whose area code is separated from the rest by any amount of whitespace.
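The left-to-right reading above can be tried out directly. This is a minimal sketch; the names PHONE and looks_like_phone are invented for the example, and re.fullmatch is used so that the whole input has to fit the pattern.

```python
import re

# 3 digits, at least one whitespace character, then 3 to 8 digits.
PHONE = re.compile(r'\d{3}\s+\d{3,8}')

def looks_like_phone(text):
    # Hypothetical helper: the entire text must fit the pattern.
    return PHONE.fullmatch(text) is not None

print(looks_like_phone('010 12345'))     # True
print(looks_like_phone('010\t1234567'))  # True: a tab counts as whitespace
print(looks_like_phone('010    123'))    # True: several spaces are fine
print(looks_like_phone('01012345'))      # False: \s+ needs at least one space
print(looks_like_phone('010 12'))        # False: fewer than 3 digits at the end
```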

What if you want to match a number like '010-12345'? The '-' character has a special meaning in some contexts (inside [] it denotes a range), so this article escapes it with '\'; outside [] the escape is optional but harmless. The regular expression becomes \d{3}\-\d{3,8}.

However, this still cannot match '010 - 12345' because of the spaces. So we need more sophisticated ways of matching.

Advanced

To match more precisely, you can use [] to denote a range, for example:

[0-9a-zA-Z_] matches one digit, letter, or underscore;

[0-9a-zA-Z_]+ matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', and so on;

[a-zA-Z_][0-9a-zA-Z_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a valid Python variable name;

[a-zA-Z_][0-9a-zA-Z_]{0,19} more precisely limits the length of the variable name to 1-20 characters (1 leading character plus up to 19 following characters).

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.

^ denotes the beginning of a line, so ^\d means the string must begin with a digit.

$ denotes the end of a line, so \d$ means the string must end with a digit.

You may have noticed that py can also match 'python', but ^py$ turns it into a whole-line match that can only match 'py'.
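The range, alternation, and anchor rules above can be checked with a short sketch. The names IDENTIFIER and is_short_identifier are assumptions made for this example, and the pattern only covers the ASCII form described in the text.

```python
import re

# A variable name of 1-20 characters: one leading letter or underscore,
# then up to 19 letters, digits, or underscores.
IDENTIFIER = re.compile(r'^[a-zA-Z_][0-9a-zA-Z_]{0,19}$')

def is_short_identifier(name):
    # Hypothetical helper built on the pattern above.
    return IDENTIFIER.match(name) is not None

print(is_short_identifier('Py3000'))  # True
print(is_short_identifier('3py'))     # False: starts with a digit
print(is_short_identifier('a' * 21))  # False: longer than 20 characters

# (P|p)ython accepts either capitalization of the first letter.
print(bool(re.match(r'^(P|p)ython$', 'Python')))  # True
print(bool(re.match(r'^(P|p)ython$', 'python')))  # True

# Without anchors 'py' is found inside 'python'; with ^ and $ it is not.
print(bool(re.search(r'py', 'python')))    # True
print(bool(re.search(r'^py$', 'python')))  # False
```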

The re module

With this groundwork in place, we can use regular expressions in Python. Python provides the re module, which contains all of the regular expression functionality. Because Python strings themselves also use \ for escaping, pay special attention to:

    s = 'ABC\\-001'  # Python string
    # The corresponding regular expression string becomes:
    # 'ABC\-001'

So we strongly recommend using Python's r prefix, so that you do not have to think about escaping:

    s = r'ABC\-001'  # Python string
    # The corresponding regular expression string is unchanged:
    # 'ABC\-001'
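To see that the two spellings describe the same pattern, the following sketch compares an escaped literal with a raw string. The variable names escaped and raw are chosen only for this example.

```python
import re

escaped = 'ABC\\-001'  # backslash doubled for the Python string literal
raw = r'ABC\-001'      # raw string: the backslash is kept as written

print(escaped == raw)  # True: both hold the 8 characters ABC\-001
print(len(raw))        # 8
print(bool(re.match(raw, 'ABC-001')))  # True: \- matches a literal '-'
```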

Let's look at how to check whether a regular expression matches:

    >>> import re
    >>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
    <_sre.SRE_Match object at 0x1026e18b8>
    >>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
    >>>

The match() function checks whether the pattern matches: it returns a Match object if the match succeeds, and None otherwise. A common way to test the result is:

    test = 'user input string'
    if re.match(r'regular expression', test):
        print('ok')
    else:
        print('failed')
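The same test can be wrapped in a small function that turns the Match-or-None result into a plain boolean. This is only a sketch: is_phone_number is a hypothetical name, and the pattern is the phone number pattern used earlier in this article.

```python
import re

def is_phone_number(text):
    # re.match returns a Match object on success and None on failure,
    # so comparing against None gives a True/False answer.
    return re.match(r'^\d{3}\-\d{3,8}$', text) is not None

for candidate in ['010-12345', '010 12345']:
    if is_phone_number(candidate):
        print(candidate, 'ok')
    else:
        print(candidate, 'failed')
```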

Splitting strings

Splitting a string with a regular expression is more flexible than splitting on a fixed character. Look at the ordinary split code:

    >>> 'a b   c'.split(' ')
    ['a', 'b', '', '', 'c']

Well, it cannot handle consecutive spaces. Try a regular expression:

    >>> re.split(r'\s+', 'a b   c')
    ['a', 'b', 'c']

It splits correctly no matter how many spaces there are. Add ',' and try:

    >>> re.split(r'[\s,]+', 'a,b, c  d')
    ['a', 'b', 'c', 'd']

Add ';' and try again:

    >>> re.split(r'[\s,;]+', 'a,b;; c  d')
    ['a', 'b', 'c', 'd']

If users enter a set of tags, remember next time to use a regular expression to convert the irregular input into a proper list.
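As a sketch of the tag-cleaning idea, the function below splits on runs of whitespace, commas, and semicolons. The name parse_tags and the sample input are made up for illustration. Note that re.split returns empty strings when the input starts or ends with a separator, so the sketch filters them out.

```python
import re

def parse_tags(raw):
    # Split on one or more whitespace characters, commas, or semicolons,
    # then drop the empty pieces produced by leading/trailing separators.
    return [tag for tag in re.split(r'[\s,;]+', raw) if tag]

print(parse_tags(' python, regex;;split   valid '))
# ['python', 'regex', 'split', 'valid']
```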

Group

Besides simply checking whether a string matches, regular expressions can also extract substrings. Parentheses () mark the groups to extract. For example:

^(\d{3})-(\d{3,8})$ defines two groups, which extract the area code and the local number directly from the matched string:

    >>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
    >>> m
    <_sre.SRE_Match object at 0x1026fb3e8>
    >>> m.group(0)
    '010-12345'
    >>> m.group(1)
    '010'
    >>> m.group(2)
    '12345'

If groups are defined in the regular expression, you can extract the substrings with the group() method of the Match object.

Note that group(0) is always the entire matched string, while group(1), group(2), ... are the 1st, 2nd, ... captured substrings.
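Groups can also be given names with Python's (?P<name>...) syntax, which can be easier to read than numeric indexes. The sketch below is one way to wrap the extraction; the names PHONE, split_phone, area, and local are assumptions for this example and do not come from the original text.

```python
import re

# Same two groups as above, but named so they can be fetched by name.
PHONE = re.compile(r'^(?P<area>\d{3})-(?P<local>\d{3,8})$')

def split_phone(text):
    # Hypothetical helper: returns (area_code, local_number) or None.
    m = PHONE.match(text)
    if m is None:
        return None
    return m.group('area'), m.group('local')

print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None

m = PHONE.match('010-12345')
print(m.group(0))  # 010-12345 (the entire match)
print(m.group(1))  # 010 (numeric indexes still work with named groups)
```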

Extracting substrings is very useful. Look at a more extreme example:

    >>> t = '19:05:30'
    >>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
    >>> m.groups()
    ('19', '05', '30')

This regular expression can recognize a valid time directly. Sometimes, however, a regular expression cannot do the full validation, for example when recognizing a date:

    '^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

For invalid dates such as '2-30' and '4-31', a regular expression either cannot reject them or becomes very hard to write, so the check needs help from program code.
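One way to combine the regular expression with program code is to let the pattern check the shape and then use the standard calendar module to check the day count. This is a sketch under assumptions: the function name is_valid_month_day is invented, and because the text gives no year, the year parameter (defaulting to 2023, a non-leap year) is an assumption needed to decide how long February is.

```python
import calendar
import re

MONTH_DAY = re.compile(
    r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'
)

def is_valid_month_day(text, year=2023):
    # Step 1: the regular expression checks the overall month-day shape.
    m = MONTH_DAY.match(text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    # The single-digit alternative [0-9] also lets '0' through, so reject it.
    if month == 0 or day == 0:
        return False
    # Step 2: program code checks the number of days in that month.
    days_in_month = calendar.monthrange(year, month)[1]
    return day <= days_in_month

print(is_valid_month_day('2-30'))   # False: February has no 30th
print(is_valid_month_day('4-31'))   # False: April has 30 days
print(is_valid_month_day('12-25'))  # True
```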

Greedy match

Finally, it is worth pointing out that regular expression matching is greedy by default, which means it matches as many characters as possible. For example, to match the trailing 0s of a number:

    >>> re.match(r'^(\d+)(0*)$', '102300').groups()
    ('102300', '')

Because \d+ is greedy, it consumes all the trailing 0s, and 0* can only match the empty string.

To let 0* match the trailing 0s, \d+ must use non-greedy matching (that is, match as few characters as possible). Adding a ? after \d+ makes it non-greedy:

    >>> re.match(r'^(\d+?)(0*)$', '102300').groups()
    ('1023', '00')
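The difference between greedy and non-greedy matching shows up with other quantifiers too. The sketch below uses an invented input string with angle brackets, plus a hypothetical helper split_trailing_zeros that wraps the non-greedy pattern from the text.

```python
import re

text = '<a><b>'
greedy = re.match(r'<.+>', text).group(0)  # .+ runs to the last '>'
lazy = re.match(r'<.+?>', text).group(0)   # .+? stops at the first '>'
print(greedy)  # <a><b>
print(lazy)    # <a>

def split_trailing_zeros(number_text):
    # Non-greedy \d+? leaves as many trailing zeros as possible to 0*.
    m = re.match(r'^(\d+?)(0*)$', number_text)
    return m.groups() if m else None

print(split_trailing_zeros('102300'))  # ('1023', '00')
```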

Compile

When we use a regular expression in Python, the re module does two things internally:

Compile the regular expression; if the regular expression string itself is invalid, an error is raised;

Match the string against the compiled regular expression.

If a regular expression will be reused thousands of times, for efficiency we can precompile it. Later uses then skip the compile step and match directly:

    >>> import re
    # Compile:
    >>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
    # Use:
    >>> re_telephone.match('010-12345').groups()
    ('010', '12345')
    >>> re_telephone.match('010-8086').groups()
    ('010', '8086')

Compiling produces a regular expression object. Because the object itself contains the regular expression, you call its methods without passing the pattern string.
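A typical use of a precompiled pattern is to compile it once at module level and reuse it inside a loop. The sketch below also shows the error raised for an invalid pattern string, which Python reports as re.error. The names RE_TELEPHONE, extract_numbers, and is_valid_pattern are assumptions for this example.

```python
import re

# Compiled once, reused for every line.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def extract_numbers(lines):
    # Return (area_code, local_number) for every line that matches.
    results = []
    for line in lines:
        m = RE_TELEPHONE.match(line)
        if m:
            results.append(m.groups())
    return results

print(extract_numbers(['010-12345', 'not a number', '010-8086']))
# [('010', '12345'), ('010', '8086')]

def is_valid_pattern(pattern):
    # An invalid regular expression string fails at compile time.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern(r'(\d{3}'))  # False: the parenthesis is never closed
```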

Summary

Regular expressions are so powerful that it is impossible to cover them in one short section. Explaining everything about them would take a thick book. If you often run into regular expression problems, you may need a reference book on regular expressions.

Try writing a regular expression that validates email addresses. Version one should be able to validate emails like these:

    someone@gmail.com
    bill.gates@microsoft.com

Version two should validate and extract email addresses that include a name:

    <Tom Paris> tom@voyager.org
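If you want something to compare your own attempt against, here is one possible starting point. It is a deliberately simple sketch, not a complete email validator: the patterns only cover the shapes shown in the examples above, and the names EMAIL, NAMED_EMAIL, is_email, and parse_named_email are invented for illustration.

```python
import re

# Version one: word characters or dots, an @, then a dotted domain name.
EMAIL = re.compile(r'^[\w.]+@[\w-]+(\.[\w-]+)+$')

# Version two: a name in angle brackets, then the address.
NAMED_EMAIL = re.compile(r'^<([\w\s]+)>\s*([\w.]+@[\w-]+(?:\.[\w-]+)+)$')

def is_email(text):
    return EMAIL.match(text) is not None

def parse_named_email(text):
    # Returns (name, address), or None when the text does not fit.
    m = NAMED_EMAIL.match(text)
    return m.groups() if m else None

print(is_email('someone@gmail.com'))         # True
print(is_email('bill.gates@microsoft.com'))  # True
print(is_email('bob#example.com'))           # False
print(parse_named_email('<Tom Paris> tom@voyager.org'))
# ('Tom Paris', 'tom@voyager.org')
```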