Why use regular expressions
String manipulation is one of the most important features of almost every programming language. The reason is simple: people convey information mainly through text, that is, strings, but much of that text is not exactly what we want, so we write programs to extract or validate parts of a string.
A regular expression is a tool for matching strings. It defines a syntax in which a few descriptive characters express a rule; when a string conforms to the rule, we say that it matches.
For example, to determine whether a string is a valid email address:
- Create a regular expression that describes a valid email address.
- Use that regular expression to match the input string and decide whether it is valid.
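The two steps above can be sketched in Python as follows. The pattern is a deliberately simplified assumption made for illustration (real email addresses allow more forms than this), and the names EMAIL_RE and looks_like_email are hypothetical.

```python
import re

# Step 1: a simplified, illustrative pattern (an assumption, not a full
# email grammar): name characters, an @, then a dotted domain.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(\.[\w-]+)+')

def looks_like_email(text):
    """Step 2: return True if the whole string fits the simplified pattern."""
    return EMAIL_RE.fullmatch(text) is not None

print(looks_like_email('someone@example.com'))
print(looks_like_email('not an email'))
```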
Regular expression meta-characters
Use \d to match a digit and \w to match a letter, digit, or underscore.
Metacharacter | Matches
. | Any single character except a line break
\w | A letter, digit, or underscore
\s | A whitespace character (space, tab, and so on)
\d | A digit
For example, 'py.' matches 'pyc', 'pyo', 'py!', and so on. Because . stands for any character, it can match an ordinary letter as well as !.
Note that a metacharacter stands for exactly one character: \w, for example, matches only a single letter, digit, or underscore.
Use [] to express a range. For example, [0-9] matches any single digit from 0 to 9.
[0-9a-zA-Z_] matches a digit, letter, or underscore, which for ASCII text is equivalent to \w.
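A short sketch of these rules using Python's re module. The sample strings are made up for illustration; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# '.' stands for any single character, so 'py.' accepts all three strings.
dot_results = [re.fullmatch(r'py.', s) is not None for s in ['pyc', 'pyo', 'py!']]
print(dot_results)

# A metacharacter covers exactly one character, so a four-character string fails.
too_long = re.fullmatch(r'py.', 'pyco')
print(too_long)

# For ASCII input, the explicit class and \w accept the same characters.
ascii_chars = 'aZ09_-! '
by_class = [c for c in ascii_chars if re.fullmatch(r'[0-9a-zA-Z_]', c)]
by_word = [c for c in ascii_chars if re.fullmatch(r'\w', c)]
print(by_class == by_word)
```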
Sometimes you need to match any character that is not in a character class that is easy to define. This is negation:
Code/Syntax | Matches
[^x] | Any character other than x
[^aeiou] | Any character other than the letters a, e, i, o, u
Match a variable length
To match a run of characters of variable length, use * for 0 or more characters, + for 1 or more characters, and ? for 0 or 1 character.
You can also use curly braces: {n} for exactly n characters and {n,m} for n to m characters.
Code/Syntax | Description
* | Repeat 0 or more times, equivalent to {0,}
+ | Repeat 1 or more times, equivalent to {1,}
? | Repeat 0 or 1 times, equivalent to {0,1}
{n} | Repeat exactly n times
{n,} | Repeat n or more times
{n,m} | Repeat n to m times
So what kinds of strings does \d{3}\s+\d{3,8} match?
Read from left to right:
- \d{3} matches 3 digits, for example '010';
- \s matches one whitespace character (a space, a tab, and so on), so \s+ matches at least one whitespace character, such as ' ' or '  ';
- \d{3,8} matches 3 to 8 digits, for example '1234567'.
What if you want to match a number like '010-12345'? A hyphen has a special meaning only inside [] (where it denotes a range), so outside a character class it can be written directly: \d{3}-\d{3,8}. Escaping it as \- is also accepted.
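The two patterns just discussed can be checked with a few made-up inputs. This is a minimal sketch; the variable names spaced and hyphen are hypothetical.

```python
import re

# Three digits, at least one whitespace character, then 3 to 8 digits.
spaced = re.compile(r'\d{3}\s+\d{3,8}')
spaced_results = [
    spaced.fullmatch('010 1234567') is not None,    # single space
    spaced.fullmatch('010\t \t12345') is not None,  # tabs count as whitespace
    spaced.fullmatch('0101234567') is not None,     # no whitespace at all
]
print(spaced_results)

# Three digits, a literal hyphen, then 3 to 8 digits.
hyphen = re.compile(r'\d{3}-\d{3,8}')
hyphen_results = [
    hyphen.fullmatch('010-12345') is not None,
    hyphen.fullmatch('010-12') is not None,  # only 2 digits after the hyphen
]
print(hyphen_results)
```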
[0-9a-zA-Z_]+
matches a string of at least one digit, letter, or underscore, for example 'a100', '0_Z', 'Py3000', and so on;
[a-zA-Z_][0-9a-zA-Z_]*
matches a letter or underscore followed by any number of digits, letters, or underscores, which is the form of a valid Python variable name;
[a-zA-Z_][0-9a-zA-Z_]{0,19}
limits the length of the variable name more precisely to 1-20 characters (1 leading character plus up to 19 more). Write the quantifier without a space inside the braces.
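As a quick check of the length-limited pattern, here is a small sketch. The helper name is_short_identifier is hypothetical, and the test only covers the ASCII form described here, not every name Python itself accepts.

```python
import re

# One leading letter or underscore, then up to 19 more word characters.
IDENT_RE = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_identifier(name):
    """Return True if name is 1 to 20 characters and has the identifier shape."""
    return IDENT_RE.fullmatch(name) is not None

for candidate in ['a100', '_tmp', '0_Z', 'x' * 20, 'x' * 21]:
    print(candidate, is_short_identifier(candidate))
```

'0_Z' is rejected here because it starts with a digit, even though it passes the looser [0-9a-zA-Z_]+ pattern.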
Note the difference from wildcards: on the Linux Bash command line, * on its own stands for any run of characters. In a regular expression, you must write .* to mean any run of characters.
Returning to the earlier phone number example, a more complex expression such as \(?0\d{2}[) -]?\d{8} matches (010)88886666, 022-22334455, 02912345678, and so on.
- First comes an escaped (, which may occur 0 or 1 times (?),
- then a 0 followed by 2 digits (\d{2}),
- then one of ), -, or a space, which may occur once or not at all (?),
- and finally 8 digits (\d{8}).
But this expression also matches "incorrect" formats such as 010)12345678 or (022-87654321. The branch conditions described later address this problem.
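The phone number pattern can be tried against both well-formed and malformed inputs. This is a minimal sketch; the name PHONE_RE and the sample list are illustrative.

```python
import re

PHONE_RE = re.compile(r'\(?0\d{2}[) -]?\d{8}')

samples = [
    '(010)88886666',  # parentheses around the area code
    '022-22334455',   # hyphen separator
    '02912345678',    # no separator
    '010)12345678',   # unbalanced: closing parenthesis only
    '(022-87654321',  # unbalanced: opening parenthesis only
]

# The optional pieces are independent of each other, so the unbalanced
# forms are accepted as well.
results = {s: PHONE_RE.fullmatch(s) is not None for s in samples}
for s, ok in results.items():
    print(s, ok)
```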
Boundary qualifier
Boundary | Matches
^ | Start of the string
$ | End of the string
For example, ^\d{5,12}$ starts matching at the beginning of the string and stops at its end, so the whole string must consist of 5 to 12 digits.
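The effect of the two anchors is easiest to see by comparing the anchored pattern with the same pattern without them. A small sketch with made-up inputs; the variable names are hypothetical.

```python
import re

anchored = re.compile(r'^\d{5,12}$')
unanchored = re.compile(r'\d{5,12}')

# Without anchors, search finds digits anywhere in the text.
found_anywhere = unanchored.search('id: 1234567').group()
print(found_anywhere)

# With anchors, the whole string has to be 5 to 12 digits.
anchored_results = [
    anchored.search('1234567') is not None,
    anchored.search('id: 1234567') is not None,  # extra text before the digits
    anchored.search('1234') is not None,         # too short
]
print(anchored_results)
```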
Branch conditions
A branch condition works like a logical "or": the string matches if it satisfies any one of the conditions. Use | to separate the alternative rules.
Take the phone number matching discussed earlier as an example.
0\d{2}-\d{8}|0\d{3}-\d{7}
This expression matches
- a three-digit area code and an 8-digit local number (such as 010-12345678),
- a four-digit area code and a 7-digit local number (such as 0376-2233445).
\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}
This expression is divided by | into two conditions:
- The expression on the left, \(0\d{2}\)[- ]?\d{8}: \(0\d{2}\) matches an area code in parentheses such as (010), and [- ]? means the separator can be a -, a space, or absent.
- The expression on the right, 0\d{2}[- ]?\d{8}: the area code is not enclosed in parentheses.
Note: when branch conditions are matched, each condition is tested from left to right, and once one branch is satisfied, the remaining conditions are not tried.
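The left-to-right rule matters when one branch is a prefix of another. In the sketch below, the phone pattern comes from the text, while the five-digit/nine-digit code pattern is a hypothetical example chosen only to make the ordering effect visible.

```python
import re

# Either branch may satisfy the match.
phone = re.compile(r'0\d{2}-\d{8}|0\d{3}-\d{7}')
phone_results = [
    phone.fullmatch('010-12345678') is not None,
    phone.fullmatch('0376-2233445') is not None,
]
print(phone_results)

# Branch order: the first branch that succeeds wins, even if a later
# branch could have matched more text.
short_first = re.search(r'\d{5}|\d{5}-\d{4}', '12345-6789').group()
long_first = re.search(r'\d{5}-\d{4}|\d{5}', '12345-6789').group()
print(short_first)
print(long_first)
```

Putting the longer alternative first is a common way to avoid the truncated match.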
Group
So far we have seen how to repeat a single character (just put a quantifier directly after it).
But what if you want to repeat several characters? Use parentheses to specify a sub-expression (also called a group), and then specify how many times the sub-expression repeats.
For example, (\d{1,3}\.){3}\d{1,3} can be analyzed in order:
- \d{1,3} matches a number of 1 to 3 digits,
- (\d{1,3}\.){3} matches 1 to 3 digits followed by a period (together these form the group), repeated 3 times,
- and finally \d{1,3} matches another 1 to 3 digits.
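A brief sketch of the grouped pattern in Python. Note that it only checks the shape of a dotted address, not that each number is in the range 0-255; the name DOTTED_RE is hypothetical.

```python
import re

DOTTED_RE = re.compile(r'(\d{1,3}\.){3}\d{1,3}')

shape_results = [
    DOTTED_RE.fullmatch('192.168.1.1') is not None,
    DOTTED_RE.fullmatch('999.999.999.999') is not None,  # shape only, no range check
    DOTTED_RE.fullmatch('1.2.3') is not None,            # only two repetitions of the group
]
print(shape_results)

# When a group repeats, it keeps the text of its last repetition.
last_repeat = DOTTED_RE.fullmatch('192.168.1.1').group(1)
print(last_repeat)
```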
Summary
Seeing so many symbols at once is bound to be confusing. Let's summarize what the symbols {}, [], and () are used for.
{2,3}: must be combined with the item in front of it; for example, a{2,3} means a occurs two or three times.
[]: has 3 meanings:
[a-z]: a range, that is, one character between a and z.
[.*]: inside [], . and * lose their earlier special meanings and are treated as ordinary symbols. This example matches a single character that is either a period or an asterisk.
[^a]: any character other than a. Do not confuse it with ^a, which means an a at the beginning of a line.
(): groups several characters into a sub-expression so that a quantifier applies to the whole group.
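These distinctions can be checked directly with re.findall. The input strings below are made up for illustration.

```python
import re

# a{2,3}: the quantifier applies to the 'a' in front of it.
runs = re.findall(r'a{2,3}', 'a aa aaa aaaa')
print(runs)

# Inside [], '.' and '*' are ordinary characters.
literals = re.findall(r'[.*]', 'a.b*c')
print(literals)

# [^a] is "any character other than a"; ^a is "a at the start".
not_a = re.findall(r'[^a]', 'banana')
starts_with_a = re.findall(r'^a', 'aba')
no_leading_a = re.findall(r'^a', 'banana')
print(not_a, starts_with_a, no_leading_a)
```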
Greedy Match and Lazy match
When a regular expression contains a quantifier that accepts a variable number of repetitions, the default behavior is to match as many characters as possible. Take a.*b: it matches the longest string that starts with a and ends with b. When searching aabab, for example, it matches the entire string aabab. This is greedy matching.
Lazy matching means matching as few characters as possible. Adding a ? after .* switches it to lazy mode: .*? means use the fewest repetitions that still let the match succeed. For example, a.*?b applied to aabab matches aab and then ab.
Why is the first match aab rather than ab? Because regular expressions have a rule: the match that begins earliest has the highest priority.
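Code/Syntax | Description
*? | Repeat 0 or more times, but as few times as possible
+? | Repeat 1 or more times, but as few times as possible
?? | Repeat 0 or 1 times, but as few times as possible
{n,m}? | Repeat n to m times, but as few times as possible
{n,}? | Repeat n or more times, but as few times as possible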
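The aabab example can be reproduced with re.findall, which returns every non-overlapping match from left to right. The extra digit examples are illustrative additions.

```python
import re

# Greedy: one match that runs to the last b.
greedy = re.findall(r'a.*b', 'aabab')
print(greedy)

# Lazy: each match stops at the first b it can reach. The first match
# still starts at the earliest a, which is why it is 'aab' and not 'ab'.
lazy = re.findall(r'a.*?b', 'aabab')
print(lazy)

# Lazy versions of other quantifiers take the minimum they are allowed.
one_digit = re.match(r'\d+?', '12345').group()
two_digits = re.match(r'\d{2,4}?', '12345').group()
print(one_digit, two_digits)
```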
Match Chinese characters
The expression for matching Chinese characters is [\u4E00-\u9FA5]. This is a range of Unicode code points covering the commonly used Chinese characters, not a range of UTF-8 bytes.
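A short sketch of this character range in Python, where str patterns work on Unicode code points. The sample sentence is made up for illustration.

```python
import re

text = 'Python 正则表达式 regex 教程'

# One or more consecutive characters in the U+4E00 to U+9FA5 range.
chinese_runs = re.findall(r'[\u4e00-\u9fa5]+', text)
print(chinese_runs)
```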
Using regular expressions in Python
Python provides the re module, which contains all of the regular expression functionality. Because Python strings themselves also use \ as an escape character, pay special attention:
For example, s = 'ABC\\-001' is a Python string whose value, the regular expression actually used, is ABC\-001.
It is therefore best to prefix the string with r so that you do not have to think about escaping, for example s = r'ABC\-001'.
How to tell whether a regular expression matches:
- Import the re module: import re
- Use the match method. If the match succeeds, it returns a Match object; otherwise it returns None.
test = 'user-entered string'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
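The escaping point and the return value of match can both be verified in a few lines. A minimal sketch; the variable names are hypothetical.

```python
import re

# The doubled backslash and the raw string spell the same 8 characters.
escaped = 'ABC\\-001'
raw = r'ABC\-001'
same_pattern = escaped == raw
print(same_pattern, len(raw))

# match returns a Match object on success and None on failure,
# so the result can be used directly in an if statement.
hit = re.match(raw, 'ABC-001')
miss = re.match(raw, 'ABC 001')
print(hit is not None, miss is None)
```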
Splitting a string
Splitting a string with a regular expression is more flexible than splitting on fixed characters.
With the ordinary split method, consecutive spaces are not treated as a single separator:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
A regular expression allows more complex splitting:
>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']
Group
Besides testing whether a string matches, regular expressions can also extract substrings, which is a powerful feature. Each part enclosed in () is a group to be extracted.
For example:
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
This regular expression defines two groups, which match the parts before and after the -.
m.group(0): returns '010-12345'
m.group(1): returns '010'
m.group(2): returns '12345'
group(0) is always the entire matched text, and group(1), group(2), ... are the 1st, 2nd, ... captured substrings.
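The following sketch shows the same groups through groups(), and uses an unanchored search on a longer, made-up input to show what group(0) contains when the pattern covers only part of the text.

```python
import re

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
all_groups = m.groups()  # every captured group as a tuple
print(all_groups)

# With an unanchored search, group(0) is the matched portion only.
found = re.search(r'(\d{3})-(\d{3,8})', 'tel: 010-12345 (office)')
whole_match = found.group(0)
area_code = found.group(1)
print(whole_match, area_code)
```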
Greedy match
Regular expressions are greedy by default. For example:
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
Because \d+ is greedy, it consumes all of the trailing zeros, leaving 0* nothing but the empty string.
To let 0* match the trailing zeros, \d+ must be non-greedy (match as few characters as possible). Adding a ? makes \d+ non-greedy:
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
Another example:
import re
line = "boooooobby123"
reg_str = ".*(b.*b).*"
match_obj = re.match(reg_str, line)
if match_obj:
    print(match_obj.group(1))  # bb
Because the leading .* is greedy, it consumes as much as it can, boooooo, so the parentheses capture only bb.
In non-greedy mode, that is, with a ? added after .*:
import re
line = "boooooobby123"
reg_str = ".*?(b.*?b).*"
match_obj = re.match(reg_str, line)
if match_obj:
    print(match_obj.group(1))  # boooooob
Example: Extract Date
Suppose we want to extract the birthday (生日) automatically from a piece of text, but the format is not fixed in advance, so the date may be written in any of these ways:
- 出生于2018年1月23日 (born on January 23, 2018)
- 出生于2018/1/23
- 出生于2018-1-23
- 出生于2018-01-23
- 出生于2018-01
- 出生于2018年01月 (born in January 2018)
We need a regular expression that matches all of the date formats above.
First match the year part of the date. From the text above, it appears only in the forms 2018年, 2018-, and 2018/. So use \d{4} for the digits and [年/-] for the symbol. Together:
regex = r"出生于(\d{4}[年/-])"
- Next, the digits of the month (月份) take only two forms, 01 and 1: \d{1,2}
The part after the month is more complex. As before, we can sort the cases and then combine them with branch conditions.
- For 2018年1月23日, 2018-01-23, and 2018/1/23, the part after the month is matched by [月/-]\d{1,2}日?
- For 2018年01月, the part after the month is matched by [月/-]$
- For 2018-01, nothing follows the month, so use the end anchor directly: $
Finally, use () and | to combine the cases:
([月/-]\d{1,2}日?|[月/-]$|$)
Finally, merge all the parts together.
import re
lines = [
    "出生于2018年1月23日",
    "出生于2018/1/23",
    "出生于2018-1-23",
    "出生于2018-01-23",
    "出生于2018-01",
    "出生于2018年01月",
]
regex = r"出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}日?|[月/-]$|$))"
for line in lines:
    m = re.match(regex, line)
    if m:
        print(m.group(1))
Compile
When you use a regular expression, the re module does two things internally:
- It compiles the regular expression; the syntax is analyzed at this point, and an error is raised if the expression itself is not valid;
- It uses the compiled regular expression to match the string.
If a regular expression is used many times, you can precompile it for efficiency:
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')