Regular Expressions in Python
Basics
Regular expressions are widely used in Python because they can describe almost any text pattern we want to match or extract. Python supports them through the re module, and many projects rely on it to solve matching problems.
Strings are the most common data structure in programming, and the need to manipulate them comes up almost everywhere. For example, to decide whether a string is a valid Email address, we could extract the substrings before and after the @ and check separately whether each is a valid name or domain, but that is tedious and the code is hard to reuse.
Regular expressions are a powerful tool for matching strings. The idea is to define a rule for strings in a descriptive language: any string that complies with the rule is considered "matched", and any other string is considered invalid.
Therefore, we can determine whether a string is a valid Email address as follows:
1. Create a regular expression that matches Email addresses;
2. Use this regular expression to match the user's input and determine whether it is valid.
Because regular expressions are also represented by strings, we need to first understand how to use characters to describe characters.
In a regular expression, a character given literally is matched exactly. \d matches a digit, and \w matches a letter, a digit, or an underscore, so:
•'00\d'
Can match '007', but cannot match '00a';
•'\d\d\d'
Can match '010';
•'\w\w\d'
Can match 'py3'.
A dot (.) can match any single character, so:
•'py.'
Can match 'pya', 'pyb', 'py!', and so on.
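These claims can be checked directly with Python's re module, which is introduced later in this article. The sketch below uses re.fullmatch, which succeeds only when the whole string fits the pattern; the helper name matches is an assumption made for illustration and is not part of the re module.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))    # True: \d accepts the digit 7
print(matches(r'00\d', '00a'))    # False: a is not a digit
print(matches(r'\d\d\d', '010'))  # True
print(matches(r'\w\w\d', 'py3'))  # True
print(matches(r'py.', 'py!'))     # True: . accepts any character except a newline
```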
To match a variable number of characters, use * for any number of the preceding character (including zero), + for at least one, ? for zero or one, {n} for exactly n, and {n,m} for n to m.
Let's look at a more complex example: \d{3}\s+\d{3,8}.
Let's read it from left to right:
1. \d{3}
Matches three digits, for example '010';
2. \s
Matches one whitespace character (including tabs and other blank characters), so \s+ matches at least one whitespace character, for example ' ' or '  ';
3. \d{3,8}
Matches 3 to 8 digits, for example '20140901'.
In combination, the regular expression above can match a telephone number whose area code is separated from the local number by any amount of whitespace.
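The left-to-right reading above can be tried out with a short script. This is only a sketch: the helper name is_spaced_phone and the sample inputs are assumptions chosen for illustration.

```python
import re

PHONE_WITH_SPACES = r'\d{3}\s+\d{3,8}'

def is_spaced_phone(text):
    """Hypothetical helper: True if text is 3 digits, whitespace, then 3-8 digits."""
    return re.fullmatch(PHONE_WITH_SPACES, text) is not None

print(is_spaced_phone('010 12345'))        # True
print(is_spaced_phone('010    20140901'))  # True: \s+ absorbs all four spaces
print(is_spaced_phone('010\t1234'))        # True: a tab also counts as whitespace
print(is_spaced_phone('01012345'))         # False: \s+ needs at least one space
print(is_spaced_phone('010 12'))           # False: \d{3,8} needs at least 3 digits
```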
What if we want to match a number like '010-12345'? The '-' character is only special inside [], where it denotes a range, but escaping it with '\' is always safe, so the regular expression becomes \d{3}\-\d{3,8}.
However, this still cannot match '010 - 12345' because of the spaces around the hyphen. For that we need more flexible matching.
Enhancement
To perform more precise matching, you can use [] to indicate a range, for example:
•[0-9a-zA-Z\_]
Can match a single digit, letter, or underscore;
•[0-9a-zA-Z\_]+
Can match a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', and 'py3000';
•[a-zA-Z\_][0-9a-zA-Z\_]*
Can match a string that starts with a letter or underscore followed by any number of digits, letters, or underscores, that is, a valid Python variable name;
•[a-zA-Z\_][0-9a-zA-Z\_]{0,19}
More precisely, limits the variable name to 1-20 characters (1 leading character plus at most 19 following characters).
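The last pattern can be turned into a small check. This is a minimal sketch: the helper name is_short_identifier is hypothetical, and the pattern covers only ASCII names, so it does not exclude Python keywords or accept the Unicode identifiers that Python itself allows.

```python
import re

# 1 leading letter or underscore, then at most 19 letters, digits, or underscores
SHORT_IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'

def is_short_identifier(name):
    """Hypothetical helper: True for ASCII variable names of 1-20 characters."""
    return re.fullmatch(SHORT_IDENTIFIER, name) is not None

print(is_short_identifier('py3000'))  # True
print(is_short_identifier('_tmp'))    # True
print(is_short_identifier('3py'))     # False: starts with a digit
print(is_short_identifier('a' * 20))  # True: exactly 20 characters
print(is_short_identifier('a' * 21))  # False: one character too long
```

Note that the quantifier must be written {0,19} without a space; Python treats '{0, 19}' as literal text rather than a repetition count.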
•A|B
Can match A or B, so (P|p)ython can match 'Python' or 'python'.
•^
Indicates the beginning of a line; ^\d means the string must start with a digit.
•$
Indicates the end of a line; \d$ means the string must end with a digit.
You may have noticed that py can also match 'python', but ^py$ becomes a whole-line match and can only match 'py'.
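The effect of ^, $, and | is easiest to see with re.search, which looks for the pattern anywhere in the string. The helper name found and the sample strings below are assumptions made for illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern occurs anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r'py', 'python'))            # True: 'py' occurs inside 'python'
print(found(r'^py$', 'python'))          # False: the anchors demand the whole line
print(found(r'^py$', 'py'))              # True
print(found(r'(P|p)ython', 'Python 3'))  # True
print(found(r'^\d', '3 apples'))         # True: starts with a digit
print(found(r'\d$', 'apples 3'))         # True: ends with a digit
print(found(r'^\d', 'apples 3'))         # False
```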
The re Module
With this background, we can use regular expressions in Python. Python provides the re module, which contains all of the regular expression functionality. Because Python strings themselves use \ as an escape character, pay special attention to the following:
s = 'abc\\-001' # Python string
# the corresponding regular expression string becomes:
# 'abc\-001'
Therefore, we strongly recommend using Python's r prefix so that you do not need to worry about escaping:
s = r'abc\-001' # Python string
# the corresponding regular expression string stays unchanged:
# 'abc\-001'
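To see that the two spellings produce the same pattern, we can compare them directly. This is a small sketch; the variable names plain and raw are chosen only for illustration.

```python
import re

plain = 'abc\\-001'  # ordinary string: the backslash is typed twice
raw = r'abc\-001'    # raw string: the backslash is typed once

print(plain == raw)  # True: both hold the same 8 characters
print(len(raw))      # 8
print(raw)           # abc\-001

# As a pattern, \- stands for a literal hyphen
print(re.match(raw, 'abc-001') is not None)  # True
```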
First, let's see how to determine whether a regular expression matches:
>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>
The match() method checks whether the pattern matches at the beginning of the string. If it does, a Match object is returned; otherwise None is returned. A common way to test this is:
test = 'user-input string'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
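The same test can be wrapped in a function that returns a plain boolean. The sketch below is illustrative, and the name is_valid_phone is an assumption. It also shows why the trailing $ matters: re.match only anchors the start of the string.

```python
import re

def is_valid_phone(text):
    """Hypothetical helper: True if text is 3 digits, a hyphen, then 3-8 digits."""
    return re.match(r'^\d{3}\-\d{3,8}$', text) is not None

print(is_valid_phone('010-12345'))   # True
print(is_valid_phone('010 12345'))   # False: a space is not a hyphen
print(is_valid_phone('010-12345x'))  # False: $ rejects the trailing x

# Without $, re.match is satisfied by a matching prefix
prefix_only = re.match(r'^\d{3}\-\d{3,8}', '010-12345x') is not None
print(prefix_only)                   # True
```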
Split string
Splitting strings with regular expressions is more flexible than splitting on fixed characters. Consider the ordinary split:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
As you can see, it cannot handle consecutive spaces. Try a regular expression instead:
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']
It splits correctly no matter how many spaces there are. Add commas and try:
>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']
Then add semicolons:
>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']
If users enter a set of tags, remember that a regular expression can turn the irregular input into a clean list.
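For the tag scenario, one detail is worth handling: if the input begins or ends with a separator, re.split returns empty strings at the edges. The sketch below strips and filters them; the function name parse_tags and the sample input are assumptions made for illustration.

```python
import re

def parse_tags(raw):
    """Split user-typed tags on whitespace, commas, and semicolons (hypothetical helper)."""
    parts = re.split(r'[\s,;]+', raw.strip())
    # Drop the empty strings left by leading or trailing separators
    return [part for part in parts if part]

print(parse_tags(' python, regex;;  tips '))  # ['python', 'regex', 'tips']
print(parse_tags(',a,b,'))                    # ['a', 'b']
print(parse_tags(''))                         # []
```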
Group
In addition to simple matching, regular expressions can also extract substrings. Parentheses () mark a group to be extracted. For example:
^(\d{3})-(\d{3,8})$
defines two groups, which extract the area code and the local number directly from the matched string:
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<re.Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
If groups are defined in a regular expression, the substrings can be extracted with the group() method of the Match object.
Note that group(0) is always the entire matched string, while group(1), group(2), ... are the 1st, 2nd, ... captured substrings.
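In practice the Match object may be None, so it is worth checking before calling group(). The sketch below wraps the extraction in a function; the name split_phone is hypothetical. It uses groups(), which returns all captured substrings at once as a tuple.

```python
import re

def split_phone(text):
    """Hypothetical helper: return (area_code, local_number), or None if text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.groups()  # same as (m.group(1), m.group(2))

print(split_phone('010-12345'))  # ('010', '12345')
print(split_phone('010 12345'))  # None
```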
Extracting substrings is very useful. Let's look at a more intimidating example:
>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')
This regular expression can directly recognize a valid time. However, in some cases regular expressions cannot do the full validation, such as recognizing dates:
'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'
Invalid dates such as '2-30' and '4-31' still pass this regular expression, and rejecting them with a regular expression alone is impossible or very hard to write. In this case, the check has to be done in program code.
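One way to combine the two is to let the regular expression pull out the month and day, then let ordinary code decide whether the day exists. The sketch below is only one possible approach: the function name is_valid_month_day is hypothetical, and because the pattern carries no year, the year parameter (defaulting to the leap year 2024) is an assumption needed to judge February.

```python
import calendar
import re

DATE_PATTERN = r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

def is_valid_month_day(text, year=2024):
    """Hypothetical helper: check a 'month-day' string against the real calendar."""
    m = re.match(DATE_PATTERN, text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    if month == 0 or day == 0:
        return False  # the bare [0-9] alternatives let a lone 0 through
    # monthrange returns (weekday of the 1st, number of days in the month)
    days_in_month = calendar.monthrange(year, month)[1]
    return day <= days_in_month

print(is_valid_month_day('12-31'))            # True
print(is_valid_month_day('2-30'))             # False
print(is_valid_month_day('4-31'))             # False
print(is_valid_month_day('2-29'))             # True: 2024 is a leap year
print(is_valid_month_day('2-29', year=2023))  # False
```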
Greedy match
Finally, note that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to match the trailing zeros of a number:
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
Because \d+ is greedy, it consumes all the trailing zeros as well, and 0* can only match the empty string.
To let 0* match the trailing zeros, \d+ must use non-greedy matching (that is, match as few characters as possible). Adding a ? after \d+ makes it non-greedy:
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
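The difference can be compared side by side on other inputs. The helper split_trailing_zeros below is a hypothetical wrapper for illustration, and it assumes the input consists only of digits.

```python
import re

def split_trailing_zeros(number, lazy=True):
    """Hypothetical helper: split a digit string into (head, trailing zeros).

    Assumes number contains only digits; otherwise re.match returns None.
    """
    pattern = r'^(\d+?)(0*)$' if lazy else r'^(\d+)(0*)$'
    return re.match(pattern, number).groups()

print(split_trailing_zeros('5000'))              # ('5', '000')
print(split_trailing_zeros('5000', lazy=False))  # ('5000', '')
print(split_trailing_zeros('1001'))              # ('1001', ''): no trailing zeros
```

In the last call the non-greedy \d+? still has to expand over the inner zeros, because 0* followed by $ cannot match while a non-zero digit remains.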
Compile
When we use a regular expression in Python, the re module does two things internally:
1. Compile the regular expression. If the regular expression string is invalid, an error is raised;
2. Use the compiled regular expression to match the string.
If a regular expression will be reused thousands of times, we can precompile it for efficiency. The compile step is then skipped on later uses and the match is performed directly:
>>> import re
>>> # compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
>>> # use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-12345').groups()
('010', '12345')
Compiling produces a regular expression object. Because the object already contains the pattern, you do not need to pass the pattern string when calling its methods.
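A typical use is to compile once and then apply the object to many inputs. The sketch below is illustrative: the sample list is made up, and the second half shows the error raised in step 1 when a pattern is invalid.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

# Reuse the compiled object for every input
numbers = ['010-12345', '021-6789', 'not a number', '010-12']
parsed = []
for number in numbers:
    m = re_telephone.match(number)
    if m:
        parsed.append(m.groups())
print(parsed)  # [('010', '12345'), ('021', '6789')]

# An invalid pattern fails at compile time with re.error
invalid_rejected = False
try:
    re.compile(r'(\d{3}')  # unbalanced parenthesis
except re.error as exc:
    invalid_rejected = True
    print('invalid pattern:', exc)
```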
Summary
Regular expressions are very powerful, and it is impossible to cover them fully in a short section. Explaining everything about them would fill a thick book. If you work with regular expressions often, you may want a dedicated reference book.
As an exercise, try writing a regular expression that validates Email addresses. Version 1 should be able to validate addresses like these:
Someone@gmail.com
Demonzjs93@gmail.com
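If you want something to compare your answer against, here is one possible version 1. It is a deliberately simple sketch and not a complete Email validator: the pattern EMAIL_V1 and the helper looks_like_email are assumptions made for illustration, and many valid real-world addresses would be rejected.

```python
import re

# Illustrative only: name characters, @, then dot-separated domain labels
EMAIL_V1 = r'^[0-9a-zA-Z_.]+@[0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)+$'

def looks_like_email(text):
    """Hypothetical helper: True if text fits the simple version 1 pattern."""
    return re.match(EMAIL_V1, text) is not None

print(looks_like_email('Someone@gmail.com'))     # True
print(looks_like_email('Demonzjs93@gmail.com'))  # True
print(looks_like_email('no-at-sign.com'))        # False
print(looks_like_email('a@b'))                   # False: the domain needs a dot
```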
To sum up the commonly used matching characters in Python:
\w
Matches a letter, digit, or underscore.
\d
Matches a digit.
\d+
Matches one or more digits.
\d+?
Matches one or more digits, as few as possible (non-greedy).
^
Matches the beginning of a line.
$
Matches the end of a line.
^\d
The string must start with a digit.
\d$
The string must end with a digit.
\s
Matches a whitespace character.
\d{3,8}
Matches 3 to 8 digits.
[0-9a-zA-Z\_]
Matches a single digit, letter, or underscore.
[0-9a-zA-Z\_]+
Matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', and 'py3000'.
[a-zA-Z\_][0-9a-zA-Z\_]*
Matches a string that starts with a letter or underscore followed by any number of digits, letters, or underscores, that is, a valid Python variable name.
[a-zA-Z\_][0-9a-zA-Z\_]{0,19}
More precisely, limits the variable name to 1-20 characters (1 leading character plus at most 19 following characters).
.
Matches any single character.
*
Matches any number of the preceding character (including zero).
?
Matches zero or one of the preceding character.
+
Matches at least one of the preceding character.
{n}
Matches exactly n of the preceding character.
{n,m}
Matches n to m of the preceding character.
>>> 'Demon is a good %s' % ('boy')
'Demon is a good boy'
Once you are fluent with the matching characters above, you can use regular expressions for many tasks. The more you use them, the more you will appreciate how powerful they are.