Getting Started with Python Regular Expressions (Introductory Article)

Source: Internet
Author: User

Introduction

First of all, what is a regular expression?

A regular expression (English: Regular Expression, often abbreviated in code as regex, regexp, or RE) is a concept from computer science. A regular expression uses a single string to describe and match a set of strings that conform to a syntactic rule. In many text editors, regular expressions are used to search for and replace text that matches a pattern.

Many programming languages support regular expressions for string manipulation. Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix tools such as sed and grep. Regular expressions are usually abbreviated as "regex"; the singular forms are regexp and regex, and the plural forms are regexps, regexes, and regexen.

Quoted from Wikipedia: https://zh.wikipedia.org/wiki/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F

A definition is just a definition, and too dry to be useful. Let's take an example: suppose you are writing a web crawler and you have the

HTML source of a web page. In it there is a section of

You want to extract the "Hello world" from it. If all you know is Python string processing, your first reaction may be

s =  
 

Then search forward from this position to the next

This is where regular expressions become the helper of choice.
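To make the contrast concrete, here is a minimal sketch of both approaches. The HTML fragment and all variable names are assumptions invented for illustration, not the snippet this article originally used, and the code is written for Python 3. It uses a capture group (the parentheses) to pull out only the text between the tags.

```python
import re

# Hypothetical page source, invented for this sketch.
html = "<html><body><h1>Hello world</h1></body></html>"

# Plain string processing: find the opening tag, then search
# forward from that position to the closing tag.
start = html.find("<h1>") + len("<h1>")
end = html.find("</h1>", start)
by_find = html[start:end]

# Regular expression: describe the surrounding tags once and
# capture whatever sits between them.
match = re.search(r"<h1>(.+?)</h1>", html)
by_regex = match.group(1)

print(by_find)
print(by_regex)
```

Both variables end up holding the same text; the regular expression simply states the pattern in one line instead of two index computations.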

Down to business

Entry level

Let's continue with the example we just had. How would we handle it with a regular expression?

import re
key = r " 
 

You can try running the code above to see whether it behaves as we expect (the author is using Python 2.7). Did you notice how short and simple the code is? Read on. Regular expressions are actually much simpler than their strange appearance suggests.

Let's start with the most basic regular expression.

Suppose we want to match every "python" in a string. Let's try it.

import re
key = r"javapythonhtmlvhdl"  # this is the source text
p1 = r"python"  # this is the regular expression we wrote
pattern1 = re.compile(p1)  # compile it, as before
matcher1 = re.search(pattern1, key)  # search, as before
print(matcher1.group(0))

After reading this code, are you thinking: wait, this is a regular expression? Isn't that just writing the string out directly?

Indeed, regular expressions are not as exotic as they look. As long as we don't deliberately change the meaning of certain symbols, what you see is what gets matched.

So let's clear our heads first and treat a regular expression as being the same as the string we want to match. We will build on that gradually in the exercises that follow.
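As a small sketch of this "what you see is what you match" idea, the snippet below searches a made-up string that contains "python" twice (the string is an assumption for illustration; the code is written for Python 3). It also shows that search() reports only the first occurrence, while findall() collects all of them.

```python
import re

key = "javapythonhtmlpythonvhdl"  # hypothetical text containing "python" twice

# search() stops at the first occurrence and returns a match object.
first = re.search("python", key)
print(first.group(0), first.start())

# findall() returns every non-overlapping occurrence as a list.
all_matches = re.findall("python", key)
print(all_matches)
```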

Primary

0. Both Python and regular expressions are case-sensitive, so if you replace "python" with "Python" in the pattern above, it will no longer match your beloved python.
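A quick check of this, as a sketch for Python 3. The last line uses the standard re.IGNORECASE flag, which this article does not cover; it is shown only as one possible way to relax case sensitivity.

```python
import re

key = "javapythonhtmlvhdl"

lower_hit = re.search("python", key)   # matches
upper_hit = re.search("Python", key)   # no match: key has no capital "P"

print(lower_hit is not None)
print(upper_hit is not None)

# Optional: ignore case explicitly with a flag.
relaxed_hit = re.search("Python", key, re.IGNORECASE)
print(relaxed_hit.group(0))
```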

1. Go back to the match in the first example. What if we write it like this?

import re
key = r " 
 

From the entry-level section we know that the two at either end are ordinary characters, but what is the part in the middle?

In a regular expression, . stands for any single character (including . itself, but by default not a newline).

findall returns a list of all the matches that meet the requirement. Even if there is only one match, it still returns a list.
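A minimal sketch of both points, using made-up sample text (an assumption for illustration) and Python 3 syntax:

```python
import re

key = "cat cut c-t ct"  # hypothetical sample text

# "." stands for exactly one character, so "ct" (nothing in between) is skipped.
several = re.findall(r"c.t", key)
print(several)  # ['cat', 'cut', 'c-t']

# Even with a single match, findall() still hands back a list.
single = re.findall(r"c.t", "cat")
print(single)  # ['cat']
```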

Clever as you are, you may suddenly ask: what if I only want to match a literal "."? Won't it match everything instead? For this, regular expressions have the character \. If you have some programming experience, you will recognize it from many other places as the "escape character". In regular expressions, this symbol is usually used to turn special symbols into ordinary ones, and ordinary characters into special ones.

For example, suppose you want to match the mailbox "chuxiuhong@hit.edu.cn" (my mailbox). You can write the regular expression like this:

import re
key = r"afiouwehrfuichuxiuhong@hit.edu.cnaskdjhfiosueh"
p1 = r"chuxiuhong@hit\.edu\.cn"
pattern1 = re.compile(p1)
print(pattern1.findall(key))

As you can see, we put the escape character \ in front of each ".". This does not mean matching "\."; it means matching only a literal ".".
I don't know how careful you were, but did you notice that the first time we used ., it was followed by a +? What is that plus sign for?
It is not hard to work out. We said that . stands for any single character (including itself), but "Hello world" is not a single character.
The + repeats the preceding character or subexpression one or more times.
For example, the expression "ab+" can match "abbbbb" but cannot match "a". It requires at least one b; more is fine, fewer is not. If you ask whether there is a way to say "none is fine, and many is fine too", the answer is yes.
* matches the character or subexpression before it zero or more times.
For example, the links we come across on the web may start with http:// or with https://. How do we handle that?

import re
key = r"http://www.nsfbuhwe.com and https://www.auhfisna.com"  # made-up URLs, never mind them
p1 = r"https*://"  # look at that asterisk!
pattern1 = re.compile(p1)
print(pattern1.findall(key))

Output

['http://', 'https://']
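The difference between +, *, and an escaped dot can be checked side by side. This is a small sketch with made-up test strings (assumptions for illustration), written for Python 3:

```python
import re

# "+" needs at least one "b"; "*" also accepts zero.
plus_hits = re.findall(r"ab+", "a ab abbbbb")
star_hits = re.findall(r"ab*", "a ab abbbbb")
print(plus_hits)  # ['ab', 'abbbbb']
print(star_hits)  # ['a', 'ab', 'abbbbb']

# An unescaped "." matches any character; "\." matches only a literal dot.
text = "hit.edu hit-edu"
loose_hits = re.findall(r"hit.edu", text)
strict_hits = re.findall(r"hit\.edu", text)
print(loose_hits)   # ['hit.edu', 'hit-edu']
print(strict_hits)  # ['hit.edu']
```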

2. Suppose we have the string "cat hat mat qat". The first three are real words, and the last one I made up (according to a Baidu search it is an abbreviation for an English institute in Queensland = =). Suppose you already know that "at" forms a word when preceded by one of c, h, or m, and you want to match exactly those. Based on what we have learned so far, are you thinking of writing three regular expressions? Actually, that is not necessary, because there is a way to match one of several characters.

[] matches any one of the characters inside the brackets.
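Applied to the string above, a single pattern is enough. A minimal sketch in Python 3:

```python
import re

key = "cat hat mat qat"

# [chm] accepts exactly one of c, h, or m, so "qat" is left out.
words = re.findall(r"[chm]at", key)
print(words)  # ['cat', 'hat', 'mat']
```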

Here is another example. We find that some programmers go too far and mix upper and lower case in their tags, so we keep failing to catch what we want. How should we handle this? Write 16*16 regular expressions, one for each combination? No.

import re
key = r "Lalala 
 

Output

['

Since we can match a range, naturally we can also exclude a range.

[^] matches any character except the ones listed inside.

Take an example similar to cat, hat, mat, qat. Given "mat cat hat pat", we want to match everything except pat, so we write:

import re
key = r"mat cat hat pat"
p1 = r"[^p]at"  # this matches everything except p
pattern1 = re.compile(p1)
print(pattern1.findall(key))

Output

['mat', 'cat', 'hat']

To make it easier to write concise regular expressions, the following shorthand is also provided:

Regular expression    Characters matched
[0-9]                 any one of 0123456789
[a-z]                 any one lowercase letter
[A-Z]                 any one uppercase letter
\d                    equivalent to [0-9]
\D                    equivalent to [^0-9], matches non-digits
\w                    equivalent to [a-zA-Z0-9_], matches upper and lower case letters, digits, and underscore
\W                    equivalent to [^a-zA-Z0-9_], the negation of the previous one
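The shorthand classes can be tried on a short made-up string (an assumption for illustration). This sketch is written for Python 3 and uses ASCII text only; with Unicode text, \d and \w in Python 3 also match non-ASCII digits and letters, so the bracket equivalents in the table hold for ASCII input.

```python
import re

key = "user_42 paid $3.50"  # hypothetical sample text

digits = re.findall(r"\d", key)        # single digits
non_digits = re.findall(r"\D+", key)   # runs of non-digit characters
words = re.findall(r"\w+", key)        # runs of letters, digits, underscore
non_words = re.findall(r"\W+", key)    # runs of everything else

print(digits)      # ['4', '2', '3', '5', '0']
print(non_digits)  # ['user_', ' paid $', '.']
print(words)       # ['user_42', 'paid', '3', '50']
print(non_words)   # [' ', ' $', '.']
```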

3. By now we have covered the general structure of regular expressions, but in practice we often run into matches that are not quite what we intended. For example:

import re
key = r"chuxiuhong@hit.edu.cn"
p1 = r"@.+\."  # I want to match from @ to the following ".", which here is hit
pattern1 = re.compile(p1)
print(pattern1.findall(key))

Output results

['@hit.edu.']

Whoa, why is there extra? My ideal result was @hit, so why did I get more? This is because regular expressions are "greedy" by default. As we said before, "+" means the preceding character repeats one or more times, but we never said how many times. So it "greedily" matches as many characters as it can, which in this case means matching all the way to the last ".".

How do we solve this problem? Just add a "?" after the "+".

import re
key = r"chuxiuhong@hit.edu.cn"
p1 = r"@.+?\."  # I want to match from @ to the following ".", which here is hit
pattern1 = re.compile(p1)
print(pattern1.findall(key))

Output results

['@hit.']

By adding a "?", we changed the greedy "+" into the lazy "+?". The same works for [abc]+, \w*, and so on.

Quiz: the example above can also be solved without lazy matching. Think of a way to get the same result.
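One possible answer to the quiz, offered as a sketch rather than the only solution: replace the "any character" dot with a negated set that refuses to cross a ".". Inside square brackets the dot is an ordinary character, so no escape is needed there. The code is written for Python 3.

```python
import re

key = "chuxiuhong@hit.edu.cn"

lazy = re.findall(r"@.+?\.", key)       # lazy quantifier, as in the article
negated = re.findall(r"@[^.]+\.", key)  # greedy, but cannot run past a "."

print(lazy)     # ['@hit.']
print(negated)  # ['@hit.']
```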

Personal advice: whenever you use "+" or "*", first think about whether you want the greedy or the lazy form, especially when the repeated item covers a wide range of characters, because it may well match far more characters than you wanted!

To control the number of repetitions precisely, regular expressions also provide

{a,b} (meaning a <= number of repetitions <= b)

Another example: we have sas, saas, and saaas, and we want sas and saas. How do we handle that?

import re
key = r"saas and sas and saaas"
p1 = r"sa{1,2}s"
pattern1 = re.compile(p1)
print(pattern1.findall(key))

Output

['saas', 'sas']

If you omit the 2 from {1,2}, giving {1,}, it means at least one repetition. Which symbol is that equivalent to?

If you omit the 1 from {1,2}, giving {,2}, it means at most 2 repetitions.
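The open-ended forms can be compared on a made-up string (an assumption for illustration). This is a sketch for Python 3; note that it also gives away the answer to the question above.

```python
import re

key = "ss sas saas saaas"  # hypothetical sample text

bounded = re.findall(r"sa{1,2}s", key)    # one or two "a"
at_least_one = re.findall(r"sa{1,}s", key)  # one or more "a"
plus_form = re.findall(r"sa+s", key)      # same meaning as {1,}
at_most_two = re.findall(r"sa{,2}s", key)   # zero, one, or two "a"

print(bounded)       # ['sas', 'saas']
print(at_least_one)  # ['sas', 'saas', 'saaas']
print(plus_form)     # ['sas', 'saas', 'saaas']
print(at_most_two)   # ['ss', 'sas', 'saas']
```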

The following is a list of metacharacters in regular expressions and what they do

Metacharacter    Description
.                Matches any single character
[ ]              Matches any one of the characters inside
[^]              Negates the character set
-                Defines a range
\                Escapes the next character (usually special to ordinary, or ordinary to special)
*                Matches the preceding character or subexpression 0 or more times
*?               Lazy version of the previous one
+                Matches the preceding character or subexpression 1 or more times
+?               Lazy version of the previous one
?                Matches the preceding character or subexpression 0 or 1 times
{n}              Matches the preceding character or subexpression n times
{m,n}            Matches the preceding character or subexpression at least m times and at most n times
{n,}             Matches the preceding character or subexpression at least n times
{n,}?            Lazy version of the previous one
^                Matches the start of the string
\A               Matches the start of the string
$                Matches the end of the string
[\b]             Backspace character
\c               Matches a control character (in other regex flavors; Python's re module does not support it)
\d               Matches any digit
\D               Matches any character other than a digit
\t               Matches a tab
\w               Matches any letter, digit, or underscore
\W               Matches anything other than a letter, digit, or underscore
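A few of the metacharacters in the table were not demonstrated earlier. The sketch below tries ^, $, \A, and ? on made-up strings (assumptions for illustration), using Python 3:

```python
import re

key = "python java python"  # hypothetical sample text

at_start = re.findall(r"^python", key)  # only the "python" at the very start
at_end = re.findall(r"python$", key)    # only the "python" at the very end
java_first = re.search(r"\Ajava", key)  # None: the string does not start with "java"

# "?" makes the preceding "s" optional: zero or one occurrence.
schemes = re.findall(r"https?://", "http:// and https://")

print(at_start)    # ['python']
print(at_end)      # ['python']
print(java_first)  # None
print(schemes)     # ['http://', 'https://']
```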

That is all for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a message.
