This article is written primarily for beginners who have no experience with regular expressions.
If you reprint this article, please credit the source.
Intermediate introduction (subexpressions, lookahead and lookbehind, backreferences): http://www.cnblogs.com/chuxiuhong/p/5907484.html
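A definition is just a definition, and a very formal one is of little practical use. Let's take an example: suppose you are writing a web crawler and you have fetched the HTML source of a web page. One section of it contains the text hello world.
You want to extract this hello world. If you only know Python string processing, your first reaction might be to search for the surrounding markup and slice the text out by hand.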
The real content starts here
Entry level
Let's continue with the example above. How would we handle it with a regular expression?
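    import re
    key = r"..."    # the text you want to search (the HTML source)
    p1 = r"(?<=..."    # the regular expression rule we wrote; it is fine if you cannot read it yet
    pattern1 = re.compile(p1)            # compile the regular expression
    matcher1 = re.search(pattern1, key)  # search the source text for the part that matches
    print(matcher1.group(0))             # print it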
You can try running the code above and see whether the result is what we expected (I am using a Python 2.7 environment). Notice how short and simple the code is? Read on. Regular expressions are actually much simpler than their strange appearance suggests.
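Because the snippet above depends on the page's markup, here is a minimal runnable sketch of the same idea. The HTML string and the h1 tags are assumptions made for illustration, not the page the article refers to. The pattern uses a lookbehind (?<=...) and a lookahead (?=...), which belong to the intermediate material.

```python
import re

# Hypothetical HTML source; the <h1> tags are an assumption for illustration.
key = r"<html><body><h1>hello world</h1></body></html>"

# (?<=<h1>) requires "<h1>" right before the match and (?=</h1>) requires
# "</h1>" right after it; neither is included in the matched text.
p1 = r"(?<=<h1>).+?(?=</h1>)"

pattern1 = re.compile(p1)
matcher1 = pattern1.search(key)
extracted = matcher1.group(0)
print(extracted)  # hello world
```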
First, start with the most basic regular expression.
Suppose we want to match every "python" in a string. Let's see how to do it.
    import re
    key = r"javapythonhtmlvhdl"          # the source text
    p1 = r"python"                       # the regular expression we wrote
    pattern1 = re.compile(p1)            # compile it, as before
    matcher1 = re.search(pattern1, key)  # search, as before
    print(matcher1.group(0))
After reading this code, are you thinking: wait, that is a regular expression? You just write the text out directly?
Indeed, regular expressions are not as strange as they look on the surface. Unless we deliberately change the meaning of certain symbols, what you see is exactly what gets matched.
So clear your mind and, for now, think of a regular expression as looking just like the string you want to match. We will build on that gradually in the exercises below.
Beginner level
0. Both Python and regular expressions are case-sensitive, so if you replace "python" with "Python" in the example above, it will no longer match your beloved python.
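A small sketch of the case-sensitivity point, reusing the lowercase source text from the example above. re.search returns None when nothing matches, so check the result before calling group.

```python
import re

key = "javapythonhtmlvhdl"

lower_match = re.search(r"python", key)  # same case as the text
upper_match = re.search(r"Python", key)  # capital P does not occur in key

print(lower_match.group(0))  # python
print(upper_match)           # None
```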
1. Go back to the match in the very first example. What happens if I write it like this?
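    import re
    key = r"..."    # the source text
    p1 = r"..."     # the regular expression we wrote; explained below
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))  # did you notice that I wrote findall this time? Why the change?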
From the entry-level example we know that the text at both ends consists of ordinary characters, but what is the part in the middle?
In a regular expression, . stands for any single character (including "." itself; by default it does not match a newline).
findall returns a list of all the matches. Even when there is only one match, you still get a list back.
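A short sketch of both points, using a made-up string (an assumption for illustration): "." stands in for exactly one character, and findall always returns a list.

```python
import re

# Hypothetical text chosen for illustration.
key = "cat cut c.t ct"

# "c.t" needs exactly one character between c and t, so "ct" is skipped.
dot_matches = re.findall(r"c.t", key)
print(dot_matches)  # ['cat', 'cut', 'c.t']

# A single match still comes back wrapped in a list.
single = re.findall(r"c.t", "cat")
print(single)  # ['cat']
```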
Clever as you are, you may suddenly ask: what if I only want to match a literal "."? As it stands, everything gets returned to me. For this, regular expressions have the character \. If you have some programming experience, you will recognize it as the "escape character" used in many places. In regular expressions it is usually used to turn special symbols into ordinary ones and ordinary ones into special ones 23333 (and no, "2333" is not a special symbol; I only realized after writing it that someone with a vivid imagination might read it that way).
For example, suppose you really want to match the mailbox "someone@hit.edu.cn" (a stand-in for my own mailbox). You can write the regular expression like this:
    import re
    key = r"someone@hit.edu.cn"      # "someone" is a stand-in local part
    p1 = r"someone@hit\.edu\.cn"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Notice that we put the escape character \ in front of each ".". This does not mean "match \."; it means match only a literal "."!
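To see the difference the backslash makes, here is a minimal sketch on a made-up string (an assumption for illustration): the unescaped dot accepts any character, the escaped dot accepts only a real ".".

```python
import re

# Hypothetical text chosen for illustration.
key = "a.b axb"

unescaped = re.findall(r"a.b", key)   # "." matches any single character
escaped = re.findall(r"a\.b", key)    # "\." matches only a literal dot

print(unescaped)  # ['a.b', 'axb']
print(escaped)    # ['a.b']
```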
I don't know whether you noticed, but the first time we used ".", it was followed by a "+". What does the plus sign do? It is not hard to work out. We said that "." stands for any single character, including itself, but hello world is not a single character. The purpose of + is to repeat the preceding character or subexpression one or more times.
For example, the expression "ab+" matches "abbbbb" but does not match "a": it requires at least one b, with no upper limit. If you ask whether there is a way to say "none is fine, and many is fine too", the answer is yes.
* placed after a symbol matches that symbol 0 or more times
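The difference between + and * can be checked with the "ab" example from the text. The test string is made up for illustration.

```python
import re

# Hypothetical text: a lone "a", then "a" followed by one b, then by five.
key = "a ab abbbbb"

one_or_more = re.findall(r"ab+", key)   # needs at least one b
zero_or_more = re.findall(r"ab*", key)  # the b is optional

print(one_or_more)   # ['ab', 'abbbbb']
print(zero_or_more)  # ['a', 'ab', 'abbbbb']
```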
For example, the links we come across in a web page may start with http:// or with https://. How do we handle both?
    import re
    key = r"http://www.nsfbuhwe.com and https://www.auhfisna.com"  # made-up URLs, never mind them
    p1 = r"https*://"  # look at that asterisk!
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['http://', 'https://']
2. Say we have the string "cat hat mat qat". The first three are real words, and the last one I made up (though Baidu says it is the abbreviation of a Queensland English institute). If you know that "at" preceded by c, h, or m makes a word and you want to match those, would you write three regular expressions based on what you have learned so far? No need, because there is a way to accept one of several characters.
[] matches any one of the characters listed inside it
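Applied to the string from the text, a character class handles all three words with one pattern. A minimal sketch:

```python
import re

key = "cat hat mat qat"

# [chm] accepts exactly one character: c, h, or m.
words = re.findall(r"[chm]at", key)
print(words)  # ['cat', 'hat', 'mat']
```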
Another example: we find that some programmers really overdo it and mix upper and lower case inside tags such as <html></html>, which keeps us from catching what we want. How should we deal with that? Write 16*16 separate regular expressions? No.
    import re
    key = r"lalala..."    # the source text, with mixed-case tags
    p1 = r"<[Hh][Tt][Mm][Ll]>.+?</[Hh][Tt][Mm][Ll]>"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['...</Html>']
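A runnable sketch of the mixed-case tag match. The source string here is hypothetical, made up so that the opening and closing tags use different capitalization; the pattern is the one from the text.

```python
import re

# Hypothetical source text with inconsistently capitalized tags.
key = r"lalala<HTml>hello</Html>heiheihei"

# Each [Xx] class accepts either case of one letter of the tag name.
p1 = r"<[Hh][Tt][Mm][Ll]>.+?</[Hh][Tt][Mm][Ll]>"

pattern1 = re.compile(p1)
tags = pattern1.findall(key)
print(tags)  # ['<HTml>hello</Html>']
```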
Since we have a range of matches, we naturally have a range of exclusions.
[^] matches any one character except those listed inside it
Take an example like cat, hat, mat again, this time with pat as the odd one out. We want to match everything except pat, so we write:
    import re
    key = r"mat cat hat pat"
    p1 = r"[^p]at"  # matches "at" preceded by anything except p
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['mat', 'cat', 'hat']
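One detail worth checking for yourself: a negated class still has to consume one character, and that character can be a space. The string below is made up for illustration.

```python
import re

# Hypothetical text: "pat" is excluded, but the bare word "at" is preceded by a space.
key = "pat at bat"

# [^p] accepts any single character other than p, including the space before "at".
matches = re.findall(r"[^p]at", key)
print(matches)  # [' at', 'bat']
```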
To make it easier to write concise regular expressions, the following shorthand notations are also provided:
Regular expression | Characters it matches
[0-9] | any one of 0123456789
[a-z] | any one lowercase letter
[A-Z] | any one uppercase letter
\d | equivalent to [0-9]
\D | equivalent to [^0-9], matches a non-digit
\w | equivalent to [a-z0-9A-Z_], matches upper and lowercase letters, digits, and the underscore
\W | equivalent to [^a-z0-9A-Z_], the negation of the previous one
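The shorthand classes can be tried out on a short made-up string (an assumption for illustration) that contains letters, digits, an underscore, and a punctuation mark.

```python
import re

# Hypothetical ASCII text chosen for illustration.
key = "a1_B2!"

digits = re.findall(r"\d", key)          # digits only
non_digits = re.findall(r"\D", key)      # everything that is not a digit
word_chars = re.findall(r"\w", key)      # letters, digits, underscore
non_word_chars = re.findall(r"\W", key)  # everything else

print(digits)          # ['1', '2']
print(non_digits)      # ['a', '_', 'B', '!']
print(word_chars)      # ['a', '1', '_', 'B', '2']
print(non_word_chars)  # ['!']
```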
3. By now we have a grasp of how a regular expression is put together, but in practice we often run into matches that are not what we intended. For example:
    import re
    key = r"someone@hit.edu.cn"  # "someone" is a stand-in local part
    p1 = r"@.+\."  # I want to match from @ up to the ".", which here means hit
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['@hit.edu.']
Whoa! Why did we get more than we asked for? The result I wanted was @hit., so where did the extra come from? This is because regular expressions are "greedy" by default. As we said earlier, "+" means the character repeats one or more times, but we never said exactly how many times. So it greedily matches as many characters as it can, which in this case means matching all the way to the last ".".
How do we solve this? Just add a "?" after the "+".
    import re
    key = r"someone@hit.edu.cn"  # "someone" is a stand-in local part
    p1 = r"@.+?\."  # I want to match from @ up to the ".", which here means hit
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['@hit.']
By adding a "?", we changed the greedy "+" into the lazy "+?". The same works for [abc]+, \w*, and the like.
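Greedy and lazy matching side by side, as a minimal sketch. The address below is hypothetical and chosen only because it contains more than one ".".

```python
import re

# Hypothetical address used for illustration only.
key = "user@example.edu.cn"

greedy = re.findall(r"@.+\.", key)  # ".+" runs as far as it can, up to the last "."
lazy = re.findall(r"@.+?\.", key)   # ".+?" stops at the first "." it can

print(greedy)  # ['@example.edu.']
print(lazy)    # ['@example.']
```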
Quiz: find a way to get the same result in the example above without using a lazy match.
Personal advice: whenever you use "+" or "*", first think about whether you need the greedy form or the lazy form, especially when the item being repeated covers a broad range of characters, because it is very likely to hand you back more characters than you wanted!
To control the number of repetitions precisely, regular expressions also provide
{a,b} (meaning a <= number of repetitions <= b)
Another example: we have sas, saas, and saaas, and we want only sas and saas. How do we handle it?
    import re
    key = r"saas and sas and saaas"
    p1 = r"sa{1,2}s"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['saas', 'sas']
If you omit the 2 from {1,2}, leaving {1,}, it means at least 1 repetition. What is that equivalent to?
If you omit the 1 from {1,2}, leaving {,2}, it means at most 2 repetitions.
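The three forms of the brace quantifier can be compared on one made-up string (an assumption for illustration). Note that with the lower bound omitted, zero repetitions also count.

```python
import re

# Hypothetical text with zero, one, two, and three a's between the s's.
key = "ss sas saas saaas"

between_1_and_2 = re.findall(r"sa{1,2}s", key)  # one or two a's
at_least_1 = re.findall(r"sa{1,}s", key)        # one or more a's
at_most_2 = re.findall(r"sa{,2}s", key)         # zero, one, or two a's

print(between_1_and_2)  # ['sas', 'saas']
print(at_least_1)       # ['sas', 'saas', 'saaas']
print(at_most_2)        # ['ss', 'sas', 'saas']
```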
Here are some metacharacters and what they do in regular expressions:
Metacharacter | Description
. | Matches any single character
| | Logical OR operator
[ ] | Matches any one of the characters inside
[^] | Negates a character set
- | Defines a range inside a character set
\ | Escapes the next character (usually ordinary to special, or special to ordinary)
* | Matches the previous character or subexpression 0 or more times
*? | Lazy version of the previous one
+ | Matches the previous character or subexpression 1 or more times
+? | Lazy version of the previous one
? | Matches the previous character or subexpression 0 or 1 times
{n} | Matches the previous character or subexpression exactly n times
{m,n} | Matches the previous character or subexpression at least m times and at most n times
{n,} | Matches the previous character or subexpression at least n times
{n,}? | Lazy version of the previous one
^ | Matches the beginning of a string
\A | Matches the start of the string
$ | Matches the end of a string
[\b] | Backspace character
\c | Matches a control character (in some regex flavors; Python's re module does not support it)
\d | Matches any digit
\D | Matches any character other than a digit
\t | Matches a tab
\w | Matches any letter, digit, or underscore
\W | Matches any character other than a letter, digit, or underscore
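A few rows of the table that the examples in this article did not exercise can be sketched as follows. All strings are made up for illustration. The first two lines also show why "?" can be a tighter choice than "*" for the optional s in https.

```python
import re

# Hypothetical text; "httpss://" is deliberately malformed.
urls = "http:// https:// httpss://"

optional_s = re.findall(r"https?://", urls)  # s may appear 0 or 1 times
any_s = re.findall(r"https*://", urls)       # s may appear any number of times

either = re.findall(r"cat|dog", "cat dog bird")  # | is a logical OR
exact = re.findall(r"a{3}", "aa aaa")            # exactly three a's

starts = re.search(r"^java", "javapython")         # anchored at the start
not_at_start = re.search(r"^python", "javapython")
ends = re.search(r"python$", "javapython")         # anchored at the end

print(optional_s)    # ['http://', 'https://']
print(any_s)         # ['http://', 'https://', 'httpss://']
print(either)        # ['cat', 'dog']
print(exact)         # ['aaa']
print(not_at_start)  # None
```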
Getting started with Python regular expressions