Python module: the re module

Last Update:2015-05-04 Source: Internet

Author: User

Tags mail account


http://blog.csdn.net/pipisorry/article/details/45476817
Besides the methods that come with the str object, Python's re module provides powerful text processing based on regular expressions.

Escaping characters in Python string literals

Regular expressions use the backslash "\" to introduce special forms and to escape special characters. This collides with Python's own string syntax, which also uses the backslash as an escape character. To match one literal "\", the regular expression needs the two-character pattern "\\", and in an ordinary Python string literal each of those backslashes must itself be escaped, so the pattern has to be written as "\\\\".

That notation is cumbersome, so to keep regular expressions readable Python provides the raw string. A raw string is a string literal prefixed with 'r', for example r"\n", which is the two characters "\" and "n" rather than a newline. This form is recommended when writing regular expressions in Python. One caveat: raw strings are a poor fit for file paths, because a raw string literal cannot end with a single backslash (r"C:\dir\" is a syntax error).
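
As a small sketch of the difference (the variable names are illustrative and not from the original article), the following compares an ordinary string with a raw string and shows both spellings of a pattern that matches one literal backslash:

```python
import re

# An ordinary string turns the escape into a single newline character;
# a raw string keeps the backslash and the letter n as two characters.
plain = "\n"
raw = r"\n"
print(len(plain), len(raw))

# Two equivalent ways to write a pattern that matches one literal backslash.
escaped_pattern = "\\\\"   # four backslashes in the source, two in the pattern
raw_pattern = r"\\"        # two backslashes in the source, two in the pattern
print(escaped_pattern == raw_pattern)

path = "C:\\temp"          # the text C:\temp, containing one backslash
found = re.findall(raw_pattern, path)
print(found)
```

Both spellings produce the same two-character pattern; the raw form is simply easier to read.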

Regular expression metacharacters

.      matches any character except a newline
^      matches the start of the string
$      matches the end of the string
[]     matches any one character from the specified set
?      repeats the preceding character 0 or 1 times
*      repeats the preceding character 0 or more times
{m}    repeats the preceding character exactly m times
{m,n}  repeats the preceding character m to n times
\d     matches a digit, equivalent to [0-9]
\D     matches any non-digit character, equivalent to [^0-9]
\s     matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S     matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
\w     matches any alphanumeric character or underscore, equivalent to [a-zA-Z0-9_]
\W     matches any character that is not alphanumeric or underscore, equivalent to [^a-zA-Z0-9_]
\b     matches the beginning or end of a word

Basic rules

'[]' character set
First, how to specify a character set. Characters enclosed in square brackets form a set that matches any one of the characters it contains. For example, [abc123] matches any one of 'a', 'b', 'c', '1', '2' or '3'.
You can also specify a range with the '-' sign inside '[]'. For example, [a-zA-Z] matches any upper-case or lower-case English letter, because the letters are ordered from smallest to largest. The order cannot be reversed: writing [z-a] is an error.
If '^' is the first character inside '[]', the set is negated and matches any character not listed. For example, [^a-zA-Z] matches any character that is not an English letter. If '^' is not at the beginning, it no longer means negation and simply stands for itself: [a-z^A-Z] matches all English letters and the character '^'.
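
The three cases can be checked with a short sketch. The sample text and variable names are made up for illustration:

```python
import re

text = "a1-Z^9"

# Any one of the listed characters.
listed = re.findall(r"[abc123]", text)

# A leading caret negates the set: anything that is not an English letter.
non_letters = re.findall(r"[^a-zA-Z]", text)

# A caret that is not first in the set is an ordinary character.
letters_or_caret = re.findall(r"[a-zA-Z^]", text)

print(listed, non_letters, letters_or_caret)

# A reversed range is rejected when the pattern is compiled.
try:
    re.compile(r"[z-a]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False
print(reversed_range_ok)
```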

'|' alternation
Joining two rules with '|' means that a match succeeds as soon as either rule is satisfied. For example,
[a-zA-Z]|[0-9] matches a letter or a digit, and is equivalent to [a-zA-Z0-9].
Two points to note about '|':
First, inside '[]' it no longer means alternation but stands for the literal character '|'. To match a literal '|' outside '[]', escape it with a backslash: '\|'.
Second, its scope is the whole rule on either side of it. 'dog|cat' matches 'dog' or 'cat', not 'g' or 'c'. To limit its scope, wrap the alternatives in a non-capturing group '(?:)'. For example, to match 'I have a dog' or 'I have a cat', write r'I have a (?:dog|cat)' rather than r'I have a dog|cat'.
Example:

>>> s = 'I have a dog, I have a cat'
>>> re.findall(r'I have a (?:dog|cat)', s)
['I have a dog', 'I have a cat']    # just what we wanted
Now see what happens without the non-capturing group:
>>> re.findall(r'I have a dog|cat', s)
['I have a dog', 'cat']    # 'I have a dog' and 'cat' are treated as the two alternatives
Non-capturing groups are explained in detail further on, so skip them for now.

'.' matches any character
Matches any character except the newline '\n'. With the 'S' (DOTALL) option, it matches every character including '\n'.
Example:

>>> s = '123\n456\n789'
>>> re.findall(r'.+', s)
['123', '456', '789']
>>> re.findall(r'.+', s, re.S)
['123\n456\n789']

'^' and '$' match the start and end of the string
Note that '^' must not be placed inside '[]', where its meaning changes (see the description of '[]' above). In multiline mode, the 'M' (MULTILINE) option, '^' and '$' also match at the start and end of each line.

'\d' matches a digit
This is an escape sequence introduced by '\'. '\d' matches one digit and is equivalent to [0-9].
'\D' matches a non-digit
The complement of the above: it matches one non-digit character and is equivalent to [^0-9]. Pay attention to the case. Many escape sequences in Python's regular expressions come in such lower-case/upper-case pairs that are complements of each other, which makes them easy to remember.

'\w' matches letters, digits and the underscore
Matches any English letter, digit or underscore, equivalent to [a-zA-Z0-9_].
'\W' matches anything that is not a letter, digit or underscore
The complement of '\w', equivalent to [^a-zA-Z0-9_].

'\s' matches whitespace
Matches whitespace characters such as the space, tab, carriage return and newline, equivalent to [ \t\r\n\f\v]. (Note the space at the front.)
'\S' matches non-whitespace
The complement of '\s', equivalent to [^ \t\r\n\f\v].
In Python 3, when the pattern is a str, these classes also match the corresponding Unicode characters unless the re.ASCII flag is given.
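
To see the complementary pairs side by side, here is a minimal sketch on a made-up ASCII sample string (the names are illustrative):

```python
import re

text = "id_7 =\t42\n"

digits = re.findall(r"\d", text)        # single digits
non_digits = re.findall(r"\D", text)    # everything that is not a digit
words = re.findall(r"\w+", text)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)        # whitespace characters
non_spaces = re.findall(r"\S+", text)   # runs of non-whitespace

print(digits, words, spaces, non_spaces)

# Each character is matched by exactly one class of a complementary pair.
print(len(digits) + len(non_digits) == len(text))
```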

'\A' matches the start of the string
Matches the start of the string. Unlike '^', '\A' matches only at the start of the whole string; even in 'M' mode it does not match at the start of other lines.
'\Z' matches the end of the string
Matches the end of the string. Unlike '$', '\Z' matches only at the end of the whole string; even in 'M' mode it does not match at the end of other lines.
Example:

>>> s = '12 34\n56 78\n90'
>>> re.findall(r'^\d+', s, re.M)    # numbers at the start of a line
['12', '56', '90']
>>> re.findall(r'\A\d+', s, re.M)    # numbers at the start of the string
['12']
>>> re.findall(r'\d+$', s, re.M)    # numbers at the end of a line
['34', '78', '90']
>>> re.findall(r'\d+\Z', s, re.M)    # numbers at the end of the string
['90']

'\b' matches a word boundary
It matches the boundary of a word, for example the position next to a space. It is a zero-length match, so the matched string does not include the delimiter itself. If you match with '\s' instead, the result will contain the delimiter.
Example:

>>> s = 'abc abcde bc bcd'
>>> re.findall(r'\bbc\b', s)    # match the standalone word 'bc', not 'bc' inside another word
['bc']    # only the standalone 'bc' is found
>>> re.findall(r'\sbc\s', s)    # match the standalone word 'bc'
[' bc ']    # only the standalone 'bc' again, but note the space on each side

'\B' matches a non-boundary
The opposite of '\b': it matches only at positions that are not word boundaries. It is also a zero-length match.
Continuing the example:

>>> re.findall(r'\Bbc\w+', s)    # match words that contain 'bc' but do not start with 'bc'
['bcde']    # matches the 'bcde' in 'abcde', but not 'bcd'

'(?:)' non-capturing group
When you want to treat part of a rule as a unit, for example to specify how many times it repeats, surround it with '(?:' and ')'. A plain pair of parentheses gives a quite unexpected result.
Example: match the repeated 'ab' in a string

>>> s = 'ababab abbabb aabaab'
>>> re.findall(r'\b(?:ab)+\b', s)
['ababab']
With only a plain pair of parentheses the result is:
>>> re.findall(r'\b(ab)+\b', s)
['ab']
This is because a plain pair of parentheses creates a capturing group, and findall then returns the content of the group instead of the whole match.

'(?#)' comment
Python allows comments inside a regular expression: everything between '(?#' and ')' is ignored.
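
A minimal sketch of an inline comment, using a made-up date-like pattern:

```python
import re

# The (?#...) parts are ignored by the regex engine; they only document the pattern.
pattern = r"\d{4}(?#year)-\d{2}(?#month)"
match = re.search(pattern, "released 2015-05")
print(match.group())
```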


'(?iLmsux)' inline compilation options
Options for a Python regular expression can be passed as the flags argument of findall or compile, or written inside the pattern itself as part of the expression, which is convenient in some cases.
Here 'i' is equivalent to IGNORECASE, 'L' to LOCALE, 'm' to MULTILINE, 's' to DOTALL, 'u' to UNICODE and 'x' to VERBOSE.
Pay attention to their case. You may specify just a subset: '(?i)' ignores case only, while '(?im)' ignores case and enables multiline mode.
Also note that these options apply to the entire rule. Older Python versions accepted them anywhere in the pattern, but since Python 3.11 such global inline options must be placed at the very start of the expression.
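
As a hedged illustration on a made-up three-line string, an inline option behaves the same as the matching flags argument:

```python
import re

text = "Dog\ndog\nDOG"

# (?i) at the start of the pattern has the same effect as passing re.I.
inline = re.findall(r"(?i)dog", text)
with_flag = re.findall(r"dog", text, re.I)
print(inline == with_flag)

# (?im) combines ignore-case and multiline, so ^ matches at every line start.
line_starts = re.findall(r"(?im)^d", text)
print(line_starts)
```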

Lookbehind and lookahead assertions
Sometimes you need to match a string only when it follows, or is followed by, some specific content. Python provides lookbehind and lookahead assertions for this:
'(?<=...)' lookbehind assertion
The '...' in the parentheses is the content that must appear immediately before the string you want to match.
'(?=...)' lookahead assertion
The '...' in the parentheses is the content that must appear immediately after the string you want to match.
Example: you want to find the comments in C source code. They are enclosed between '/*' and '*/', but you do not want the match results to include the '/*' and '*/' themselves:

>>> s = r'/* comment 1 */code/* comment 2 */'
>>> re.findall(r'(?<=/\*).+?(?=\*/)', s)
[' comment 1 ', ' comment 2 ']
Note that a non-greedy (minimal) match is used here to avoid matching the entire string.
Note also that the expression inside a lookbehind must have a fixed width, so you cannot put a variable-length rule inside its parentheses. For example, to find the number sandwiched between letters in the following string, a lookbehind cannot be used:
Example:
>>> s = 'aaa111aaa, bbb222, 333ccc'
>>> re.findall(r'(?<=[a-z]+)\d+(?=[a-z]+)', s)    # wrong usage
It gives the error message:
error: look-behind requires fixed-width pattern

If you only need the numbers that are followed by letters, the rule can go inside a lookahead:

>>> re.findall(r'\d+(?=[a-z]+)', s)
['111', '333']
If you must match the number sandwiched between letters, use a group:
>>> re.findall(r'[a-z]+(\d+)[a-z]+', s)
['111']
Groups are explained in more detail later.

Negative lookbehind and negative lookahead assertions
'(?<!...)' negative lookbehind assertion (there is no space between '<' and '!')
Matches only if the string you want is not preceded by '...'.
'(?!...)' negative lookahead assertion
Matches only if the string you want is not followed by '...'.
Continuing the example above, to match the numbers that are not followed by letters:

>>> re.findall(r'\d+(?!\w+)', s)
['222']
Note that \w is used here instead of [a-z] as before, because with [a-z] the result would be:
>>> re.findall(r'\d+(?![a-z]+)', s)
['11', '222', '33']
That is not quite what we expected. The reason is that the first two digits of '111' and of '333' also satisfy the condition, since each is followed by a digit rather than a letter. Regular expressions therefore need to be written with care; I wrote it this way at first and only understood after seeing the result. Fortunately Python makes it easy to experiment, which is one of the great advantages of a scripting language: you can try things step by step and get results quickly, without a cumbersome compile-and-link cycle. So experiment a lot while learning Python; the path may twist and turn, but it is also fun.
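
The negative lookbehind has no example above, so here is a sketch on the same sample string. It runs into the same pitfall: excluding only letters still lets the tail of a number match, so the set in the lookbehind also has to exclude digits.

```python
import re

s = 'aaa111aaa, bbb222, 333ccc'

# Naive attempt: numbers not preceded by a letter. The second digit of '111'
# is preceded by a digit, not a letter, so a partial number slips through.
naive = re.findall(r'(?<![a-z])\d+', s)

# Excluding digits as well forces the match to begin at the start of a number.
strict = re.findall(r'(?<![a-z\d])\d+', s)

print(naive, strict)
```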

Basics of groups
So far we have seen many basic uses of Python's regular expressions. With those rules alone, however, some tasks are very awkward, such as the earlier example of matching a number sandwiched between letters. That is hard to do with the rules covered so far, but easy with a group.
'()' unnamed group
The most basic group is a regular expression enclosed in a pair of parentheses, such as the (\d+) in the earlier example of a number sandwiched between letters. Let's review that example:

>>> s = 'aaa111aaa, bbb222, 333ccc'
>>> re.findall(r'[a-z]+(\d+)[a-z]+', s)
['111']
You can see that findall returns only the content captured by '()'. The parts before and after it also matched, but they are not included in the result.

Besides this basic form, a group can be given a name:
'(?P<name>...)' named group
'(?P' marks a Python syntax extension, and the '<...>' holds the name you give to the group. For example, a group made up of digits could be called 'num' and written as '(?P<num>\d+)'. Once a group has a name, it can be referred to by that name later in the same expression:
'(?P=name)' reference to a matched named group
Note that the reference stands for the text the group has already matched, so it matches exactly the same content as the earlier named group.
Some more examples follow; note the characteristics of each substring in the string below.

>>> s = 'aaa111aaa,bbb222,333ccc,444ddd444,555eee666,fff777ggg'
Now look at what the following expressions return:
>>> re.findall(r'([a-z]+)\d+([a-z]+)', s)    # letters on both sides of a number
[('aaa', 'aaa'), ('fff', 'ggg')]
>>> re.findall(r'(?P<g1>[a-z]+)\d+(?P=g1)', s)    # the same letters on both sides of a number
['aaa']
>>> re.findall(r'[a-z]+(\d+)([a-z]+)', s)    # letters first, then capture the number in the middle and the letters after it
[('111', 'aaa'), ('777', 'ggg')]

A named group can be referred to by its name, but giving a name is not required.
'\number' reference to a matched group by number
Every group in an expression has a number, assigned from left to right starting at 1, and a matched group can be referred to by that number.
For example, the search above for the same letters on both sides of a number can also be written as:

>>> re.findall(r'([a-z]+)\d+\1', s)
['aaa']
The result is the same.
One more example:
>>> s = '111aaa222aaa111, 333bbb444bb33'
>>> re.findall(r'(\d+)([a-z]+)(\d+)(\2)(\1)', s)    # find a fully symmetric number-letters-number-letters-number sequence
[('111', 'aaa', '222', 'aaa', '111')]

Conditional matching (re module in Python 2.4 and later)
'(?(id/name)yes-pattern|no-pattern)' tests whether the specified group has matched and applies the corresponding rule
If the group given by id/name matched earlier in the expression, yes-pattern is used for the rest of the match; otherwise no-pattern is used.
A typical use is matching e-mail addresses where some are written bare and others are enclosed in a pair of angle brackets '<>': the closing '>' should be required only when the opening '<' was matched.
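
A minimal sketch of such a conditional pattern follows. The addresses are hypothetical placeholders, and the deliberately simple address rule is the one used for illustration in the Python documentation of the re module, not a recovery of the article's own example:

```python
import re

# Group 1 optionally matches "<". The conditional (?(1)>|$) then requires
# ">" if group 1 matched, and the end of the string otherwise.
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "user@host.com",     # bare address
    "<user@host.com>",   # properly bracketed
    "<user@host.com",    # opening bracket only
    "user@host.com>",    # closing bracket only
]
results = {c: bool(pattern.match(c)) for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first two candidates match; the unbalanced forms are rejected because the conditional ties the closing bracket to the opening one.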

Importing the regular expression module

3.1. Import the regular expression module

>>> import re
3.2. View the attributes of the regular expression module
>>> dir(re)
['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl', '_expand', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error', 'escape', 'findall', 'finditer', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'sys', 'template']

Methods of the match object

1. group([group1, ...])
Returns one or more subgroups of the match. With one argument the result is a single string; with several arguments the result is a tuple with one item per argument. group1 defaults to 0, which returns the whole match. If a groupN argument is 0, the corresponding return value is the entire matching string; if it is in the range [1..99], it is the string matched by the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError is raised. If a group is in a part of the pattern that did not match, its value is None. If a group matches several times, the last match is returned. Subgroups are numbered from left to right by their opening parentheses.
>>> m = re.match(r"(\w+) (\w+)", "abcd efgh, chaj")
>>> m.group()    # the whole match
'abcd efgh'
>>> m.group(1)    # the subgroup in the first pair of parentheses
'abcd'
>>> m.group(2)
'efgh'
>>> m.group(1, 2)    # several arguments return a tuple
('abcd', 'efgh')
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "sam lee")
>>> m.group("first_name")    # use group() to get a named subgroup
'sam'
>>> m.group("last_name")
'lee'

With the parentheses removed:
>>> m = re.match(r"\w+ \w+", "abcd efgh, chaj")
>>> m.group()
'abcd efgh'
>>> m.group(1)
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    m.group(1)
IndexError: no such group

If a group matches multiple times, only the last match is accessible:
>>> m = re.match(r"(..)+", "a1b2c3")
>>> m.group(1)
'c3'
>>> m.group()
'a1b2c3'
The argument of group() defaults to 0, which returns the string matched by the whole pattern:

>>> s = "afkak1aafal12345adadsfa"
>>> pattern = r"(\d)\w+(\d{2})\w"
>>> m = re.match(pattern, s)
>>> print(m)
None
>>> m = re.search(pattern, s)
>>> m
<_sre.SRE_Match object at 0x00C2FDA0>
>>> m.group()
'1aafal12345a'
>>> m.group(1)
'1'
>>> m.group(2)
'45'
>>> m.group(1, 2, 0)
('1', '45', '1aafal12345a')

2. groups([default])
Returns a tuple containing all subgroups. The default argument sets the value used for groups that did not take part in the match; it defaults to None.
>>> m = re.match(r"(\d+)\.(\d+)", "23.123")
>>> m.groups()
('23', '123')
>>> m = re.match(r"(\d+)\.?(\d+)?", "24")    # the second \d+ does not match, so the default None is used
>>> m.groups()
('24', None)
>>> m.groups("0")
('24', '0')

3. groupdict([default])
Returns a dictionary of all named subgroups of the match. Each key is a group name and each value is the text that group matched. The default argument is the value used for groups that did not take part in the match, exactly as in groups(); it defaults to None.
>>> m = re.match(r"(\w+) (\w+)", "hello world")
>>> m.groupdict()
{}
>>> m = re.match(r"(?P<first>\w+) (?P<secode>\w+)", "hello world")
>>> m.groupdict()
{'secode': 'world', 'first': 'hello'}
As the example shows, groupdict() ignores subgroups that have no name.
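
The default argument of groupdict() is described above but not demonstrated. A small sketch, with group names 'whole' and 'frac' chosen only for illustration:

```python
import re

# The fractional part is optional, so the 'frac' group may not take part in the match.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()     # unmatched groups map to None
with_default = m.groupdict("0")     # unmatched groups map to the given default
print(without_default)
print(with_default)
```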

Python re module usage summary

Common regular expression functions

1. re.search
re.search looks for a pattern anywhere in the string. It stops at the first match and returns a match object, or None if nothing in the string matches.
Tip: when you are not sure how to use a function of the module, call help on it

>>> help(re.search)
search(pattern, string, flags=0)
First argument: the rule (pattern)
Second argument: the string to search
Third argument: flags that control how the regular expression is matched
Example: the following looks for 'kuangl'
>>> name = "Hello,my name is kuangl,nice to meet ..."
>>> k = re.search(r'k(uan)gl', name)
>>> if k:
...     print("%s %s" % (k.group(0), k.group(1)))
... else:
...     print("Sorry,not search!")
...
kuangl uan

2. re.match
re.match tries to match the pattern at the start of the string only, that is, the string must begin with the pattern.

>>> help(re.match)
match(pattern, string, flags=0)

First argument: the rule (pattern)
Second argument: the string to match
Third argument: flags that control how the regular expression is matched
Example: the following matches the word 'Hello'

>>> name = "Hello,my name is kuangl,nice to meet ..."
>>> k = re.match(r"(H....)", name)
>>> if k:
...     print("%s\n%s" % (k.group(0), k.group(1)))
... else:
...     print("Sorry,not match!")
...
Hello
Hello

The difference between re.match and re.search: re.match matches only at the start of the string; if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.
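
The difference is easy to confirm with a short sketch on a made-up string:

```python
import re

text = "my name is kuangl"

# 'kuangl' is not at the start of the string, so match() fails and returns None.
at_start = re.match(r"kuangl", text)

# search() scans forward and finds the first occurrence anywhere in the string.
anywhere = re.search(r"kuangl", text)

# match() succeeds when the pattern fits the beginning of the string.
prefix = re.match(r"my", text)

print(at_start, anywhere.group(), prefix.group())
```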

3. re.findall
re.findall returns all the substrings of the target string that match a rule

>>> help(re.findall)
findall(pattern, string, flags=0)

First argument: the rule (pattern)
Second argument: the target string
Third argument: optional flags
The result is a list of the strings that match the rule, or an empty list if nothing matches.
Example: find mail accounts

>>> mail = '[email protected] [email protected] [email protected]'    # the third one deliberately has no angle brackets
>>> re.findall(r'(\[email protected]....[a-z]{3})', mail)
['[email protected]', '[email protected]', '[email protected]']
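
A runnable sketch of the same idea, under stated assumptions: the addresses below are hypothetical and the pattern is a deliberately simple one chosen for illustration, not the article's own.

```python
import re

# Hypothetical addresses; the third one deliberately has no angle brackets.
mail = "<user01@mail.com> <user02@mail.com> user04@mail.com"

# Word characters, an '@', a host name, a dot and a three-letter suffix.
found = re.findall(r"\w+@\w+\.[a-z]{3}", mail)
print(found)

# When nothing matches, findall returns an empty list rather than None.
nothing = re.findall(r"\w+@\w+\.[a-z]{3}", "no addresses here")
print(nothing)
```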

4. re.sub
re.sub replaces the parts of a string that match a rule

>>> help(re.sub)
sub(pattern, repl, string, count=0)

First argument: the rule (pattern)
Second argument: the replacement string
Third argument: the string to process
Fourth argument: the maximum number of replacements; the default 0 means every match is replaced
Example: replace whitespace with '-'

>>> test = "Hi, nice to meet you where are you from?"
>>> re.sub(r'\s', '-', test)
'Hi,-nice-to-meet-you-where-are-you-from?'
>>> re.sub(r'\s', '-', test, count=5)    # replace only the first 5
'Hi,-nice-to-meet-you-where are you from?'

5. re.split
re.split splits a string wherever a rule matches

>>> help(re.split)
split(pattern, string, maxsplit=0)

First argument: the rule (pattern)
Second argument: the string to split
Third argument: the maximum number of splits; the default 0 means the string is split at every match
Example: split the whole string

>>> test = "Hi, nice to meet you where are you from?"
>>> re.split(r"\s+", test)
['Hi,', 'nice', 'to', 'meet', 'you', 'where', 'are', 'you', 'from?']
>>> re.split(r"\s+", test, maxsplit=3)    # split only at the first three matches
['Hi,', 'nice', 'to', 'meet you where are you from?']

6. re.compile
re.compile compiles a regular expression into a pattern object. Compiling expressions that are used often into pattern objects can improve efficiency somewhat.

>>> help(re.compile)
compile(pattern, flags=0)

First argument: the rule (pattern)
Second argument: flags
Example:

>>> test = "Hi, nice to meet you where are you from?"
>>> k = re.compile(r'\w*o\w*')    # match words that contain an 'o'
>>> dir(k)
['__copy__', '__deepcopy__', 'findall', 'finditer', 'match', 'scanner', 'search', 'split', 'sub', 'subn']
>>> print(k.findall(test))    # show all the words that contain an 'o'
['to', 'you', 'you', 'from']
>>> print(k.sub(lambda m: '[' + m.group(0) + ']', test))    # enclose each word containing an 'o' in []
Hi, nice [to] meet [you] where are [you] [from]?
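
The flags argument of compile is not demonstrated above. The following sketch, on made-up sample text, shows the I, M, S and X options; flags are combined with '|':

```python
import re

text = "first line\nSecond line\nTHIRD line"

# re.I ignores case and re.M lets ^ match at the start of every line.
line_words = re.compile(r"^[a-z]+", re.I | re.M)
first_words = line_words.findall(text)

# Without re.S the dot stops at a newline; with re.S it crosses lines.
spans_lines = re.compile(r"first.*THIRD", re.S).search(text) is not None
single_line = re.compile(r"first.*THIRD").search(text) is not None

# re.X (VERBOSE) ignores whitespace in the pattern and allows # comments.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
decimal = number.search("pi is 3.14").group()

print(first_words, spans_lines, single_line, decimal)
```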

Examples of Python regular expressions

Log parsing: given the string
"10.10.1.1 [2015/04/22 +0800] /ab/cd/?test0=123&test2=234 xxxx"
the goal is to extract 2015/04/22, /ab/cd/ and 234 from it.

log_line = "10.10.1.1 [2015/04/22 +0800] /ab/cd/?test0=123&test2=234 xxxx"
print(re.findall(r"\d{4}/\d{2}/\d{2}|/\w{2}/\w{2}/|(?<=test2=)\d+", log_line))
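
An alternative sketch uses named groups, so each extracted value is labelled. It assumes the log line has the layout shown above (the exact spacing is an assumption), and the group names 'date', 'path' and 'test2' are chosen here for illustration:

```python
import re

log_line = "10.10.1.1 [2015/04/22 +0800] /ab/cd/?test0=123&test2=234 xxxx"

pattern = re.compile(
    r"\[(?P<date>\d{4}/\d{2}/\d{2})[^\]]*\]\s*"   # date inside the square brackets
    r"(?P<path>/[^?\s]*)"                          # request path up to the '?'
    r".*?test2=(?P<test2>\d+)"                     # value of the test2 parameter
)

m = pattern.search(log_line)
fields = m.groupdict() if m else {}
print(fields)
```

Compared with the single alternation, this version fails as a whole when the line does not have the expected shape, which makes malformed lines easier to detect.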

Script to download files with the urllib2, re and os modules

#!/usr/bin/env python
# Python 2 script (urllib2 is not available in Python 3)
import urllib2
import re
import os

URL = 'http://image.baidu.com/channel/wallpaper'
read = urllib2.urlopen(URL).read()
# Find every src="http://....js"> attribute in the page
pat = re.compile(r'src="http://.+?\.js">')
urls = re.findall(pat, read)
for i in urls:
    # Strip the attribute wrapper to leave the bare URL
    url = i.replace('src="', '').replace('">', '')
    try:
        iread = urllib2.urlopen(url).read()
        name = os.path.basename(url)
        with open(name, 'wb') as jsname:
            jsname.write(iread)
    except Exception:
        print("%s url error" % url)