Python Regular Expression Notes
I recently studied the third chapter of Programming Collective Intelligence, which uses regular expressions for data extraction. There are many explanations online, but I find most of them not concrete enough: they use terms without explaining them, give too few examples, and rely mostly on text and tables, which makes them hard to follow. So, after learning the material, I wrote this article about regular expressions in my own words:
What is a regular expression?
A regular expression (or RE) is a small, highly specialized programming language embedded in Python and made available through the re module. Regular expressions are used in the following situations:
1. String matching
2. Replacing specified strings
3. Searching for specified strings, for example English sentences, e-mail addresses, commands, etc.
4. Splitting strings
First, a few cold appetizers. If you are impatient, or already know the basics, you can skip straight to the main course ~
1. Introduction to re
Although Python's re module cannot handle every complex matching requirement, in most cases it is enough to analyze complex strings effectively and extract the relevant information.
2. Regular expression syntax in re
The regular expression syntax is as follows:
After reading this far, you may be getting a little impatient. The good dishes are only now starting to arrive at the table.
To use a regular expression in Python, we first define the expression and then match it against the string we want to process. So how do we define a regular expression? Let's look at an example first.
import re
s = r'abc'  # the regular expression
print(re.findall(s, "aaaa"))  # match the pattern against the target string "aaaa"
print(re.findall(s, "abcaaaaaaa"))
Output
[]
['abc']
You may be wondering why there is an 'r' in front of the regular expression. The backslash has a special meaning both in ordinary Python string literals and in regular expressions. To avoid confusion, regular expressions are usually written as raw strings, r'expression': Python then leaves every backslash in the literal untouched and passes the text to re exactly as written. The r prefix only affects how Python reads the literal; metacharacters such as '+' keep their special meaning inside the regular expression.
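A minimal sketch of what the r prefix changes. The sample strings and variable names are made up for illustration; the only point is that a raw string keeps its backslashes, so re receives \b (word boundary) instead of a backspace character.

```python
import re

# In a normal string literal, "\b" is a single backspace character.
# In a raw string literal, r"\b" is two characters (backslash + "b"),
# which the re module then reads as "word boundary".
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal))  # 5
print(len(raw))     # 7

text = "cat concat cat"
print(re.findall(normal, text))  # []
print(re.findall(raw, text))     # ['cat', 'cat']
```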
There are several common metacharacters in regular expressions, which make matching rules more flexible.
. ^ $ * + ? {} [] \ | ()
Input
import re
st = "top tip tqp twp tep"
res = r't[io]p'  # [io] matches a single character, i or o, so this finds top and tip
print(re.findall(res, st))
Output:
['top', 'tip']
import re
r = r"[0-9]"  # any single digit from 0 to 9
print(re.findall(r, '1234567'))
Output
['1', '2', '3', '4', '5', '6', '7']
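Note that most metacharacters lose their special meaning inside a character set []. One exception is ^, which negates the set when it is the first character. See the following example:
import re
st = "top tip tqp twp tep"
res = r't[^io]p'  # [^io] matches any character except i and o, so tip and top are excluded
print(re.findall(res, st))
Output
['tqp', 'twp', 'tep']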
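To make the behaviour of metacharacters inside a set more concrete, here is a small sketch on a made-up sample string. It compares '.' outside a set with '.', '+' and '*' inside one, and shows ^ negating the set.

```python
import re

text = "a.b a+b a*b axb"

# Outside a set, "." matches any character, so all four items match.
any_char = re.findall(r"a.b", text)
print(any_char)   # ['a.b', 'a+b', 'a*b', 'axb']

# Inside a set, ".", "+" and "*" are ordinary characters.
literal = re.findall(r"a[.+*]b", text)
print(literal)    # ['a.b', 'a+b', 'a*b']

# "^" as the first character of the set negates it.
negated = re.findall(r"a[^.+*]b", text)
print(negated)    # ['axb']
```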
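import re
s = "hello world, hello boy"
r = r"^hello"  # ^ anchors the match at the start of the string
g = r"boy$"    # $ anchors the match at the end of the string
print(re.findall(r, s))
print(re.findall(g, s))
Output
['hello']
['boy']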
3. \
A. Followed by different characters, the backslash forms sequences with different special meanings.
B. It can also be used to cancel the special meaning of any metacharacter, for example \[ or \\
The following table lists the special sequences of regular expressions:
import re
r = r'\d'  # \d matches a single digit
print(re.findall(r, "1254a2"))
Output
['1', '2', '5', '4', '2']
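A few of the commonly used special sequences, sketched on a made-up sample string: \d and \D (digit and non-digit), \w (word character) and \s (whitespace). The variable names are only for illustration.

```python
import re

sample = "id_42 costs $7.50\n"

digits = re.findall(r"\d", sample)       # single digits
non_digits = re.findall(r"\D+", sample)  # runs of non-digit characters
words = re.findall(r"\w+", sample)       # runs of letters, digits and "_"
spaces = re.findall(r"\s", sample)       # whitespace, including the newline

print(digits)      # ['4', '2', '7', '5', '0']
print(non_digits)  # ['id_', ' costs $', '.', '\n']
print(words)       # ['id_42', 'costs', '7', '50']
print(spaces)      # [' ', ' ', '\n']
```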
Let's write a small Python program to match a phone number. Assume that Beijing numbers have the following format:
010-12345678
Using what was described above, the program looks like this:
import re
r = r"^010-\d\d\d\d\d\d\d\d"  # 010- followed by eight digits
print(re.findall(r, "010-12345678"))
Output
['010-12345678']
However, typing \d eight times is tedious. With 100 digits your hand would cramp and it would be easy to miss one. Regular expressions have a group of metacharacters that solve this repetition problem, so the number matching program above can be changed to the following:
import re
r = r"^010-\d{8}"  # {8} repeats \d exactly eight times
print(re.findall(r, "010-12345678"))
Output
['010-12345678']
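One thing worth noticing: the pattern ^010-\d{8} has no end anchor, so it also accepts longer strings that merely start with a valid number. A possible validation sketch adds $ at the end. The helper name is_beijing_number and the constant PHONE_RE are hypothetical, and the format is only the one assumed in this article.

```python
import re

# Hypothetical helper; the "010-" plus eight digits format is the one assumed above.
PHONE_RE = re.compile(r"^010-\d{8}$")

def is_beijing_number(text):
    """Return True if text has the form 010-XXXXXXXX."""
    return PHONE_RE.match(text) is not None

print(is_beijing_number("010-12345678"))   # True
print(is_beijing_number("010-1234567"))    # False (only seven digits)
print(is_beijing_number("010-123456789"))  # False (nine digits)

# Without "$", the nine-digit string still yields a match on its first part.
print(re.findall(r"^010-\d{8}", "010-123456789"))  # ['010-12345678']
```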
Repeated characters
1. *
Matches the preceding character zero or more times (up to about 2 billion times)
import re
r = r"ab*"  # an a followed by zero or more b
print(re.findall(r, "a"))
print(re.findall(r, "ab"))
print(re.findall(r, "abbbbbb"))
Output
['a']
['ab']
['abbbbbb']
2. +
Matches the preceding character one or more times
import re
r = r"ab+"  # an a followed by at least one b
print(re.findall(r, "a"))
print(re.findall(r, "ab"))
print(re.findall(r, "abbbbbb"))
Output
[]
['ab']
['abbbbbb']
Conclusion: the difference between '+' and '*' is that with '+' the character must appear at least once.
3. ?
Matches once or zero times; it can be thought of as marking something as optional.
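A short sketch of ? in use, with a made-up pattern and sample string: the "u" is optional, so both spellings match, but a doubled "u" does not.

```python
import re

r = r"colou?r"  # the "u" may appear once or not at all
result = re.findall(r, "color colour colouur")
print(result)   # ['color', 'colour']
```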
4. {m,n}
m and n are integers; the preceding character must appear at least m times and at most n times
import re
r = r"a{1,3}"  # between one and three a, as many as possible
print(re.findall(r, 'a'))
print(re.findall(r, 'aa'))
print(re.findall(r, 'aaaa'))
Output
['a']
['aa']
['aaa', 'a']
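The other repetition metacharacters can be read as shorthand for the braces form: * behaves like {0,}, + like {1,} and ? like {0,1}. A small sketch on a made-up sample string that checks this:

```python
import re

text = "a ab abb abbb"

star = re.findall(r"ab*", text)
plus = re.findall(r"ab+", text)
optional = re.findall(r"ab?", text)

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']

# Each shorthand gives the same result as its {m,n} spelling.
print(star == re.findall(r"ab{0,}", text))       # True
print(plus == re.findall(r"ab{1,}", text))       # True
print(optional == re.findall(r"ab{0,1}", text))  # True
```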
So how does one use regular expressions in Python?
The re module in Python provides an interface to the regular expression engine. It lets us compile RE strings into objects and use them for matching.
3. Main functions of re
Common functions include: compile, search, match, split, findall (finditer), sub (subn)
(1) compile
re.compile(pattern[, flags])
Purpose: compiles a regular expression pattern into a regular expression object. We often use compile to define a regular expression object and then match with it. Why use compile instead of writing the regular expression directly where it is used? Because if the same regular expression is used for matching very frequently, compiling it into a regular expression object once can speed up matching.
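A sketch of the compile-once, reuse-many-times idea. The log lines and the name number_re are made up for illustration; how much time this saves in practice depends on the program, so treat it as a pattern of use rather than a benchmark.

```python
import re

# Compile the pattern a single time, outside the loop.
number_re = re.compile(r"\d+")

lines = ["error 404 at line 12", "ok", "retry 3 of 5"]

results = []
for line in lines:
    # The compiled object offers the same methods as the module-level functions.
    found = number_re.findall(line)
    results.append(found)
    print(found)
# Prints ['404', '12'], then [], then ['3', '5']
```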
The flags include:
re.I: case-insensitive matching
re.L: makes the special character sets \w, \W, \b, \B, \s, \S depend on the current locale
re.M: multiline mode
re.S: makes '.' match any character, including line breaks (by default, '.' does not match line breaks)
re.U: makes the special character sets \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database
Why do we need flags? Let's take a look at the example below:
import re
vsvt_re = re.compile(r'vsvt', re.I)  # perform case-insensitive matching
print(vsvt_re.findall('vsvt'))
Output
['vsvt']
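To see what the flags actually change, here is a small sketch that runs the same pattern with and without re.I, re.M and re.S. The sample strings are made up for illustration.

```python
import re

text = "VSVT vsvt Vsvt"
plain = re.findall(r"vsvt", text)
ignore_case = re.findall(r"vsvt", text, re.I)
print(plain)        # ['vsvt']
print(ignore_case)  # ['VSVT', 'vsvt', 'Vsvt']

lines = "hello boy\nhello girl"
single = re.findall(r"^hello", lines)
multi = re.findall(r"^hello", lines, re.M)  # ^ also matches after each newline
print(single)       # ['hello']
print(multi)        # ['hello', 'hello']

no_dotall = re.findall(r"a.b", "a\nb")
dotall = re.findall(r"a.b", "a\nb", re.S)   # "." now matches the newline too
print(no_dotall)    # []
print(dotall)       # ['a\nb']
```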
(2) search
re.search(pattern, string[, flags])
search(string[, pos[, endpos]])
Purpose: finds the first location in the string where the regular expression pattern matches and returns a MatchObject instance. If no match is found, None is returned.
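A brief sketch of working with the object that search returns. The sample strings are made up; group() and span() are standard methods of the match object, and the second argument of the compiled pattern's search is the pos parameter from the signature above.

```python
import re

m = re.search(r"\d+", "order 1234 shipped")
if m is not None:      # always check for None before calling group()
    print(m.group())   # 1234
    print(m.span())    # (6, 10)

missing = re.search(r"\d+", "no digits here")
print(missing)         # None

# With a compiled pattern, pos tells search where to start looking.
pat = re.compile(r"\d+")
later = pat.search("12 and 34", 2)
print(later.group())   # 34
```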
(3) match
re.match(pattern, string[, flags])
match(string[, pos[, endpos]])
Purpose: match() only tries to match the regular expression at the start of the string, that is, it only reports a match beginning at position 0, whereas search() scans the whole string for a match. If you want to look for a match anywhere in the string, use search().
The following three small examples show the difference between match() and search().
import re
vsvt_re = re.compile(r'vsvt')
print(vsvt_re.match('vsvt hello').group())  # match() returns a match object, so call group() to get the text
Output
vsvt
import re
vsvt_re = re.compile(r'vsvt')
print(vsvt_re.match('hello vsvt'))  # no match at position 0, so match() returns None
Output
None
import re
vsvt_re = re.compile(r'vsvt')
print(vsvt_re.search('hello vsvt').group())  # search() scans the whole string
Output
vsvt
(4) finditer ()
Find all the substrings matching the RE and return them as an iterator.
import re
vsvt_re = re.compile(r'vsvt')
x = vsvt_re.finditer("vsvt hello vsvt and vsvt")
print(next(x).group())  # take the first match object from the iterator
Output
vsvt
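Since finditer returns an iterator of match objects, it is usually consumed with a for loop. A minimal sketch on the same sample string, collecting each match together with its start position:

```python
import re

vsvt_re = re.compile(r'vsvt')

found = []
for m in vsvt_re.finditer("vsvt hello vsvt and vsvt"):
    # Each m is a match object, so position information is available.
    found.append((m.group(), m.start()))

print(found)  # [('vsvt', 0), ('vsvt', 11), ('vsvt', 20)]
```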
(5) sub ()
Function: replaces matches in a string.
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl. repl can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it's passed the match object and must return a replacement string to be used.
pattern is the regular expression, repl is the replacement, and string is the string in which the replacements are made.
import re
rs = r'c..t'  # c, any two characters, then t
print(re.sub(rs, "python", "caat cact kite hello"))
Output
python python kite hello
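A sketch of the other options mentioned in the signature, using the same sample string: count limits the number of replacements, repl may be a callable that receives the match object, and subn additionally returns how many replacements were made.

```python
import re

text = "caat cact kite hello"

# count=1 replaces only the leftmost match.
first_only = re.sub(r"c..t", "python", text, count=1)
print(first_only)  # python cact kite hello

# A callable repl receives the match object and returns the replacement.
upper = re.sub(r"c..t", lambda m: m.group().upper(), text)
print(upper)       # CAAT CACT kite hello

# subn returns a (new_string, number_of_replacements) tuple.
counted = re.subn(r"c..t", "python", text)
print(counted)     # ('python python kite hello', 2)
```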
(6) split ()
Function: used to split strings.
import re
oper_re = re.compile(r'[\+\-\*]')  # split on +, - or *
print(oper_re.split("123+23-34*2"))
Output
['123', '23', '34', '2']
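Two variations on split that are often useful, sketched on the same kind of expression string: wrapping the pattern in a capturing group keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

expr = "123+23-34*2"

parts = re.split(r"[+\-*]", expr)
print(parts)      # ['123', '23', '34', '2']

# A capturing group makes split return the separators as well.
with_ops = re.split(r"([+\-*])", expr)
print(with_ops)   # ['123', '+', '23', '-', '34', '*', '2']

# maxsplit=1 stops after the first split.
once = re.split(r"[+\-*]", expr, maxsplit=1)
print(once)       # ['123', '23-34*2']
```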
When learning a module, if you want to see what the module or its functions offer, you can use the following two commands:
dir(re)  # lists the attributes and methods of the re module
help(re.sub)  # shows the documentation for a function
Links to related articles and videos:
http://www.jb51.net/article/34642.htm
http://www.crifan.com/python_re_sub_detailed_introduction/
http://www.169it.com/article/9913111281939258943.html
http://www.icoolxue.com/play/1943 (video)