Python Regular Expressions (Beginner)

Last Update:2016-12-14 Source: Internet

Author: User

Introduction

First, what is a regular expression?

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept from computer science. A regular expression uses a single string to describe and match a set of strings that follow a certain syntax rule. In many text editors, regular expressions are used to find and replace text that matches a certain pattern.

Many programming languages support string operations using regular expressions. For example, Perl has a powerful regular expression engine built in. The concept of regular expressions was first popularized by Unix tools such as sed and grep. "Regular expression" is usually shortened to "regex". The singular forms include regexp and regex, and the plural forms include regexps, regexes, and regexen.

Referenced from Wikipedia https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F

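A definition is just a definition, and it is too formal to be of much use on its own. Let's take an example: suppose you are writing a crawler and you have fetched the HTML source code of a webpage. Somewhere in it is a short section (the markup itself is missing from this copy of the article) that wraps the text hello world in a pair of tags.

You want to extract the hello world. If you only know how to process Python strings, your first idea may be to store the source in a string s, search for the tag in front of the text, and then find the next tag after it.

At this point, regular expressions are the better helper.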
Getting Practical
Entry Level
Let's go back to the example we just mentioned. How would we handle it with a regular expression?
    import re
    key = r"
[The rest of this listing is missing from the source.]
You can try running the code to see whether it is as simple as we expect (the author uses a Python 2.7 environment). Didn't follow it? Read on. Regular expressions are much simpler than their strange appearance suggests.
First, let's start with the most basic regular expression.
Suppose we want to find "python" inside a string. Let's give it a try.
    import re
    key = r"javapythonhtmlvhdl"  # This is the source text
    p1 = r"python"  # This is the regular expression
    pattern1 = re.compile(p1)  # Compile the pattern
    matcher1 = re.search(pattern1, key)  # Search the source text
    print(matcher1.group(0))  # Print the matched text
After reading this code, you may be thinking: wait, is that really a regular expression? You just write the text directly?
Indeed, regular expressions are not as strange as they look. As long as we do not deliberately change the meaning of certain symbols, what you see is what gets matched.
So first clear your head and assume that the regular expression looks exactly like the string you want to match. We will build on this step by step in the exercises that follow.
Elementary
0. Both Python and regular expressions are case sensitive, so if you replace "python" with "Python" in the example above, the pattern no longer matches your beloved python.
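The case sensitivity described above can be checked with a few lines. This is a minimal sketch that assumes Python 3 (the article itself uses 2.7); re.IGNORECASE is a standard flag of the re module and is shown only as one possible way to relax the match.

```python
import re

key = "javapythonhtmlvhdl"  # same source text as the example above

# The lowercase pattern is found; the capitalized one is not.
lower_match = re.search(r"python", key)
upper_match = re.search(r"Python", key)

print(lower_match.group(0))  # python
print(upper_match)           # None, because "P" is not "p"

# re.IGNORECASE switches case sensitivity off for this search.
relaxed_match = re.search(r"Python", key, re.IGNORECASE)
print(relaxed_match.group(0))  # python
```

Note that re.search returns None when nothing matches, so calling .group(0) on the failed match would raise an AttributeError.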
1. Let's return to the tag match from the first example. What if I write it like this?

    import re
    key = r"
[The rest of this listing is missing from the source.]
From the entry-level section we know that the two tags are ordinary characters, but what is the part in the middle?

. in a regular expression stands for any single character (including the dot itself; by default it does not match a newline).
findall returns a list of all the matches that meet the requirement. Even if there is only one match, it still comes back as a list.
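Here is a small sketch of both points, assuming Python 3. The sample text is made up for illustration and is not the listing that was lost above.

```python
import re

# Illustrative text invented for this sketch.
key = "cat cot c-t ct"

# "." stands for exactly one character, so "c.t" needs something between c and t.
one_char = re.findall(r"c.t", key)
print(one_char)  # ['cat', 'cot', 'c-t']

# A single hit still comes back wrapped in a list...
single = re.findall(r"c-t", key)
print(single)  # ['c-t']

# ...and no hit at all gives an empty list rather than None.
nothing = re.findall(r"dog", key)
print(nothing)  # []
```

The plain "ct" is not matched by "c.t" because the dot must consume one character.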
If you are quick, you may ask: what if I only want to match a literal "."? The dot matches everything! This is where the \ character comes in. If you have some programming experience, you will recognize it as the "escape character" used in many places. In a regular expression, this symbol is usually used to turn a special symbol into an ordinary one, and an ordinary one into a special one.
For example, if you really want to match the mailbox "chuxiuhong@hit.edu.cn" (my mailbox), you can write the regular expression as below:
    import re
    key = r"afiouwehrfuichuxiuhong@hit.edu.cnaskdjhfiosueh"
    p1 = r"chuxiuhong@hit\.edu\.cn"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Notice that we put the escape character \ in front of each . but that does not mean we are matching "\."; we are matching just "."!
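To see what the escape actually changes, here is a minimal comparison, assuming Python 3 and a made-up sample string in which one candidate contains a real dot and the other does not.

```python
import re

key = "hit.edu hitXedu"  # made-up text: one real dot, one impostor

unescaped = re.findall(r"hit.edu", key)  # "." matches any character
escaped = re.findall(r"hit\.edu", key)   # "\." matches only a literal dot

print(unescaped)  # ['hit.edu', 'hitXedu']
print(escaped)    # ['hit.edu']
```

Without the backslash the pattern quietly accepts "hitXedu" as well, which is the kind of over-matching that is easy to miss in a mailbox or URL pattern.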
You may have noticed that the first time we used ., it was followed by a +. What does the plus sign do?
It is not hard to work out. We said that "." stands for any single character (including itself), but "hello world" is more than one character.
+ repeats the previous character or subexpression one or more times.
For example, the expression "ab+" matches "abbbbb" but not "a": it requires at least one b. If you ask whether there is an expression for "may or may not be there, and any number of them is fine", the answer is yes.
* placed after another symbol matches it zero or more times.
For example, a webpage contains links that may start with http:// or https://. What should we do?
    import re
    key = r"http://www.nsfbuhwe.com and https://www.auhfisna.com"  # Made-up URLs, ignore them
    p1 = r"https*://"  # Note the asterisk!
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['http://', 'https://']
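The difference between + and * is easiest to see side by side. The sketch below assumes Python 3.4 or later (for re.fullmatch); the odd URL is an invented input, and the ? quantifier (zero or one) is shown only as a stricter alternative, not as what the example above uses.

```python
import re

samples = ["a", "ab", "abbbbb"]

# re.fullmatch succeeds only when the entire string fits the pattern.
plus_hits = [s for s in samples if re.fullmatch(r"ab+", s)]
star_hits = [s for s in samples if re.fullmatch(r"ab*", s)]

print(plus_hits)  # ['ab', 'abbbbb']
print(star_hits)  # ['a', 'ab', 'abbbbb']

# "*" has no upper limit, so "https*" also accepts extra s characters.
# "?" (zero or one) is the stricter choice for an optional s.
odd_url = "httpsss://example.invalid"  # made-up input
loose = re.findall(r"https*://", odd_url)
strict = re.findall(r"https?://", odd_url)

print(loose)   # ['httpsss://']
print(strict)  # []
```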
2. Suppose we have the string "cat hat mat qat". The first three are real words, and the last one I made up (or perhaps it is the abbreviation of the Queensland College of English). If you already know that "at" preceded by c, h, or m forms a word, and you want to match exactly those, would you write three regular expressions based on what we have learned so far? There is no need, because there is a way to match one of several characters.
[] matches any one of the characters listed inside the brackets.
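The cat, hat, mat case just described can be sketched as follows (assuming Python 3):

```python
import re

key = "cat hat mat qat"

# [chm] accepts exactly one of c, h, or m in front of "at".
words = re.findall(r"[chm]at", key)
print(words)  # ['cat', 'hat', 'mat']
```

One character class replaces the three separate patterns "cat", "hat", and "mat", and "qat" is left out because q is not in the set.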
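As another example, some mischievous programmers write a pair of tags (the tag names are missing from this copy of the article) in a random mix of upper and lower case, so a fixed pattern cannot catch what we want. How should we deal with it? Write 16*16 regular expressions and try them one by one? No.
    import re
    key = r"lalala
[The rest of this listing and its output are missing from the source.]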

Since we can match a set of characters, we can naturally exclude a set as well.
[^] matches any one character except those listed inside the brackets.
In an example like mat, cat, hat, and pat, suppose we want to match everything except pat. We write it like this:
    import re
    key = r"mat cat hat pat"
    p1 = r"[^p]at"  # Match any character except p, followed by "at"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['mat', 'cat', 'hat']
To keep regular expressions short, the following shorthand forms are also provided:

Regular expression    Matches
[0-9]                 Any one digit from 0123456789
[a-z]                 Any one lowercase letter
[A-Z]                 Any one uppercase letter
\d                    Equivalent to [0-9]
\D                    Equivalent to [^0-9]; matches a non-digit
\w                    Equivalent to [a-z0-9A-Z_]; matches a letter (upper or lower case), a digit, or an underscore
\W                    Equivalent to [^a-z0-9A-Z_]; the negation of the previous entry
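A quick way to get a feel for these shorthand classes is to run them over one small string. This sketch assumes Python 3 and an invented ASCII-only sample. In Python 3, \d and \w also accept non-ASCII digits and letters when the pattern is a str, so the bracket equivalents in the table are exact only for ASCII text or when the re.ASCII flag is used.

```python
import re

key = "room_42 costs $7"  # made-up ASCII sample

digit_runs = re.findall(r"\d+", key)      # runs of digits
non_digit_runs = re.findall(r"\D+", key)  # everything between the digits
word_runs = re.findall(r"\w+", key)       # letters, digits, underscores
non_word_chars = re.findall(r"\W", key)   # whatever \w rejects

print(digit_runs)      # ['42', '7']
print(non_digit_runs)  # ['room_', ' costs $']
print(word_runs)       # ['room_42', 'costs', '7']
print(non_word_chars)  # [' ', ' ', '$']

# On ASCII text the shorthand and the explicit class agree.
same_digits = re.findall(r"[0-9]+", key) == digit_runs
same_words = re.findall(r"[a-z0-9A-Z_]+", key) == word_runs
print(same_digits, same_words)  # True True
```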
            
           
         

3. By now we have more or less mastered the general way of building regular expressions, but in practice we often run into matches that are not quite accurate. For example:
    import re
    key = r"chuxiuhong@hit.edu.cn"
    p1 = r"@.+\."  # I want to match from @ up to the next ".", which should give @hit.
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output result
['@hit.edu.']
Oh! Why did we get more than we asked for? The result I wanted was @hit. so where did the extra part come from? This is because regular expressions are greedy by default. As mentioned earlier, "+" means the previous character is repeated one or more times, but we never said exactly how many times. So it matches as many characters as it greedily can. In this example, it matched all the way to the last ".".
How can we solve this problem? Just add a "?" after the "+".
    import re
    key = r"chuxiuhong@hit.edu.cn"
    p1 = r"@.+?\."  # I want to match from @ up to the next ".", which should give @hit.
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output result
['@hit.']
By adding a "?", we changed the greedy "+" into the lazy "+?". The same works for [abc]+, \w*, and so on.
Quiz: can you get the same result for the example above without using lazy matching?
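One possible answer to the quiz, offered as a sketch rather than as the author's intended solution (assuming Python 3): instead of making "+" lazy, forbid it from crossing a dot by using a negated character class.

```python
import re

key = "chuxiuhong@hit.edu.cn"

lazy = re.findall(r"@.+?\.", key)       # lazy quantifier stops at the first "."
negated = re.findall(r"@[^.]+\.", key)  # greedy, but [^.] can never consume a "."

print(lazy)     # ['@hit.']
print(negated)  # ['@hit.']
```

Inside a character class the dot is an ordinary character, so [^.] means "anything except a literal dot" and needs no backslash.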
**Personal suggestion: whenever you use "+" or "*", first decide whether you want the greedy or the lazy form. Especially when you are matching over a large range of text, a greedy pattern is very likely to return far more characters than you wanted!**
Regular expressions also provide
{a,b} (meaning a <= number of repetitions <= b)
For example, we have sas, saas, and saaas. What should we do if we want only sas and saas?
    import re
    key = r"saas and sas and saaas"
    p1 = r"sa{1,2}s"
    pattern1 = re.compile(p1)
    print(pattern1.findall(key))
Output
['saas', 'sas']
If you omit the 2 in {1,2} and write {1,}, it means the match must occur at least once, which is equivalent to +.
If you omit the 1 in {1,2} and write {,2}, it means the match may occur at most twice (zero, one, or two times).
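The open-ended forms can be tried out like this. It is a minimal sketch assuming Python 3, with an invented sample string that includes "ss" so the zero-repetition case is visible.

```python
import re

key = "ss sas saas saaas"  # made-up sample with 0, 1, 2, and 3 a's

at_least_one = re.findall(r"sa{1,}s", key)  # same as sa+s
at_most_two = re.findall(r"sa{,2}s", key)   # zero to two a's
exactly_two = re.findall(r"sa{2}s", key)    # {n} means exactly n times

print(at_least_one)  # ['sas', 'saas', 'saaas']
print(at_most_two)   # ['ss', 'sas', 'saas']
print(exactly_two)   # ['saas']

# {1,} and + describe the same set of strings.
plus_form = re.findall(r"sa+s", key)
print(plus_form == at_least_one)  # True
```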
The following table lists some regular expression metacharacters and what they do.

Metacharacter   Description
.               Matches any single character
\
[]              Matches any one of the characters inside the brackets
[^]             Negates the character set
-               Defines a range (inside a character set)
\               Escapes the next character (usually turning an ordinary character into a special one, or a special one into an ordinary one)
*               Matches the previous character or subexpression zero or more times
*?              Lazy version of *
+               Matches the previous character or subexpression one or more times
+?              Lazy version of +
?               Matches the previous character or subexpression zero times or one time
{n}             Matches the previous character or subexpression exactly n times
{m,n}           Matches the previous character or subexpression at least m times and at most n times
{n,}            Matches the previous character or subexpression at least n times
{n,}?           Lazy version of {n,}
^               Matches the start of a string
\A              Matches the start of a string
$               Matches the end of a string
[\b]            Matches a backspace character
\c              Matches a control character (in regex flavors that support it; Python's re module does not)
\d              Matches any digit
\D              Matches any character other than a digit
\t              Matches a tab
\w              Matches any digit, letter, or underscore
\W              Matches any character other than a digit, letter, or underscore
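A few of the metacharacters in the table have not appeared in the examples so far, namely the anchors and the ? quantifier. The sketch below tries them out, assuming Python 3 and made-up sample strings.

```python
import re

key = "python is fun, python"  # made-up sample: the word appears twice

anywhere = re.findall(r"python", key)   # no anchor: both occurrences
at_start = re.findall(r"^python", key)  # only at the start of the string
at_end = re.findall(r"python$", key)    # only at the end of the string
also_start = re.findall(r"\Apython", key)  # \A also anchors at the start

print(len(anywhere))  # 2
print(at_start)       # ['python']
print(at_end)         # ['python']
print(also_start)     # ['python']

# "?" makes the previous character optional: zero times or one time.
optional_u = re.findall(r"colou?r", "color colour colouur")
print(optional_u)  # ['color', 'colour']

# \t matches a tab character.
tab_parts = re.split(r"\t", "name\tvalue")
print(tab_parts)  # ['name', 'value']
```

In Python, ^ and \A behave the same unless the re.MULTILINE flag is set, in which case ^ also matches right after each newline while \A still matches only at the very beginning.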
            
           
         

The above is all the content of this article. I hope the content of this article will help you in your study or work. If you have any questions, you can leave a message and share it with us!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
