- Understanding Regular Expressions
A regular expression is a logical formula for operating on strings: a "rule string", built from predefined special characters and combinations of them, that expresses filtering logic for text. Regular expressions are a very powerful tool for matching strings. Many programming languages support them, and Python is no exception. Here we use regular expressions to extract the content we want from the returned page content.
- The approximate matching process of regular expressions
- Take each element of the expression in turn and compare it with the characters in the text.
- If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
- If there are quantifiers or boundaries in an expression, the process is slightly different.
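The character-by-character process above can be sketched in a few lines. This is a deliberately simplified illustration that handles only literal characters (no quantifiers, boundaries, or special characters); the function name `literal_match` is made up for this example and is not part of Python's re module.

```python
def literal_match(pattern: str, text: str) -> bool:
    """Return True if text starts with pattern, comparing one character at a time."""
    if len(text) < len(pattern):
        return False  # the text runs out before the pattern does
    for expected, actual in zip(pattern, text):
        if expected != actual:
            return False  # first mismatch: the match fails
    return True  # every character matched


print(literal_match("cat", "cats are smart"))  # True
print(literal_match("cat", "car"))             # False
```

A real regular expression engine does considerably more work, since quantifiers force it to try several possible match lengths.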
- Syntax rules for regular expressions (Python)
- Greedy and non-greedy quantifiers
Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy quantifier "ab*?" finds "a".
Note: Non-greedy mode is generally the one used for extraction.
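A small sketch of the difference between greedy and non-greedy quantifiers, using the string "abbbc". The HTML fragment in the second half is a made-up sample, chosen to show why non-greedy mode is usually preferred when extracting content from a page.

```python
import re

text = "abbbc"
greedy = re.search(r"ab*", text).group()   # takes as many "b" as possible
lazy = re.search(r"ab*?", text).group()    # takes as few "b" as possible
print(greedy)  # abbb
print(lazy)    # a

# Hypothetical page fragment: two tags on one line
html = "<b>one</b><b>two</b>"
greedy_tag = re.findall(r"<b>(.*)</b>", html)   # runs on to the last </b>
lazy_tag = re.findall(r"<b>(.*?)</b>", html)    # stops at the nearest </b>
print(greedy_tag)  # ['one</b><b>two']
print(lazy_tag)    # ['one', 'two']
```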
As in most programming languages, "\" is the escape character in regular expressions, which can lead to a plague of backslashes. To match a literal "\" in the text, the regular expression written in the programming language needs four backslashes, "\\\\": the language's string escaping reduces the first pair and the second pair to one backslash each, leaving two backslashes, which the regular expression engine then reads as a single literal backslash.
Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d".
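To make the backslash issue concrete, the following sketch compares an ordinary string with a raw string. The sample text `C:\temp` and the variable names are assumptions made for illustration.

```python
import re

text = r"C:\temp"  # sample text containing one literal backslash

plain = "\\\\"  # ordinary string: four typed backslashes become two characters
raw = r"\\"     # raw string: two typed backslashes stay two characters

print(plain == raw)  # True: both hold the same two-character regex
print(len(raw))      # 2

found = re.search(raw, text).group()  # matches the single backslash in text
print(len(found))    # 1

digits = re.findall(r"\d", "a1b22")   # r"\d" is the same regex as "\\d"
print(digits)        # ['1', '2', '2']
```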
Python ships with the re module, which provides support for regular expressions. The main usage is as follows:
# Returns a pattern object
re.compile(string[, flag])
# The following functions are used for matching
The pattern concept:
A pattern can be understood as a compiled matching pattern. How do we get one? Simply call the re.compile method. For example:
pattern = re.compile(r'hello')
We pass a raw string as the argument, re.compile builds a pattern object from it, and we then use that object for further matching.
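As a minimal sketch of how a compiled pattern object might then be used, the example below calls its `match` method on two sample strings (both strings are assumptions chosen for illustration).

```python
import re

pattern = re.compile(r"hello")

m = pattern.match("hello world")       # the string starts with "hello"
matched = m.group() if m else None
print(matched)  # hello

missed = pattern.match("say hello")    # "hello" is not at the start
print(missed)   # None
```

Compiling once is mainly useful when the same pattern is applied to many strings.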
You may also have noticed the other parameter, flags. Its meaning is as follows:
The flags parameter sets the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.
The optional values are:
- re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, here and below).
- re.M (full name: MULTILINE): multiline mode; changes the behavior of '^' and '$'.
- re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.'.
- re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting.
- re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode.
- re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
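The effect of the most commonly used flags can be seen in a short sketch. The sample text and the phone-like pattern are assumptions made for illustration.

```python
import re

text = "Hello\nworld"

# re.I: letters match regardless of case
case_hit = re.search(r"hello", text, re.I).group()
print(case_hit)  # Hello

# re.M: '^' also matches at the start of each line
line_starts = re.findall(r"^\w+", text, re.M)
print(line_starts)  # ['Hello', 'world']

# re.S: '.' also matches the newline character
with_dotall = re.search(r"Hello.world", text, re.S) is not None
without_dotall = re.search(r"Hello.world", text) is not None
print(with_dotall, without_dotall)  # True False

# Flags combined with the bitwise OR operator
combined = re.findall(r"^[a-z]+", text, re.I | re.M)
print(combined)  # ['Hello', 'world']

# re.X: whitespace in the pattern is ignored and comments are allowed
verbose = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.X)
verbose_hit = verbose.search("call 555-1234").group()
print(verbose_hit)  # 555-1234
```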
The pattern is then used with other functions, such as re.match, which are described below.
Note: The flags parameter of the seven functions below also sets the matching mode. If the flags were already specified when the pattern was compiled, they do not need to be passed again to those functions.
re.match tries to match a pattern at the beginning of the string.
Function syntax:
re.match(pattern, string, flags=0)
Function parameters:
Parameter | Description
pattern | The regular expression to match.
string | The string to match against.
flags | Flag bits that control how the regular expression is matched, such as case sensitivity, multiline matching, and so on.
If the match succeeds, re.match returns a match object; otherwise it returns None.
We can use the match object's group(num) or groups() methods to get the matched text.
Match object method | Description
group(num=0) | The string matched by the entire expression. group() can take more than one group number at a time, in which case it returns a tuple containing the values of those groups.
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.
Instance:
#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# Group 1 is the text before " are ", group 2 is the first word after it
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group():", matchObj.group())
    print("matchObj.group(1):", matchObj.group(1))
    print("matchObj.group(2):", matchObj.group(2))
else:
    print("No match!!")
The result of running the instance above:
matchObj.group(): Cats are smarter than dogs
matchObj.group(1): Cats
matchObj.group(2): smarter
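The table of match object methods also mentions groups() and calling group() with several group numbers. A brief sketch of both, using a made-up phone-like string and a three-group pattern as assumptions:

```python
import re

m = re.match(r"(\d+)-(\d+)-(\d+)", "2004-959-559")

whole = m.group()          # same as m.group(0): the entire match
all_groups = m.groups()    # every group, starting from group 1
selected = m.group(1, 3)   # several group numbers return a tuple

print(whole)       # 2004-959-559
print(all_groups)  # ('2004', '959', '559')
print(selected)    # ('2004', '559')
```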
re.search scans the whole string and returns the first successful match.
Function syntax:
re.search(pattern, string, flags=0)
Instance:
#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# Same pattern as before, this time searched for anywhere in the string
matchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group():", matchObj.group())
    print("matchObj.group(1):", matchObj.group(1))
    print("matchObj.group(2):", matchObj.group(2))
else:
    print("No match!!")
Execution result:
matchObj.group(): Cats are smarter than dogs
matchObj.group(1): Cats
matchObj.group(2): smarter
- The difference between re.match and re.search
re.match only matches at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.
Instance:
import re

line = "Cats are smarter than dogs"

# re.match fails because "dogs" is not at the start of the string
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group():", matchObj.group())
else:
    print("No match!!")

# re.search succeeds because it scans the whole string
searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group():", searchObj.group())
else:
    print("No match!!")
Execution result:
No match!!
search --> searchObj.group(): dogs
The Python re module provides re.sub to replace matches in a string.
Syntax:
re.sub(pattern, repl, string, count=0)
It returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl. If the pattern is not found, the string is returned unchanged.
The optional parameter count is the maximum number of matches to replace; it must be a non-negative integer. The default value 0 means all matches are replaced.
Instance:
#!/usr/bin/python
import re

phone = "2004-959-559 # This is Phone number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num:", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num:", num)
Execution result:
Phone Num: 2004-959-559
Phone Num: 2004959559
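The count parameter is not exercised in the instance above. A brief sketch of its effect, reusing the same phone number (the variable names are assumptions for illustration):

```python
import re

phone = "2004-959-559"

first_only = re.sub(r"-", "", phone, count=1)  # replace only the first hyphen
all_hyphens = re.sub(r"-", "", phone)          # default count=0 replaces all
unchanged = re.sub(r"x", "", phone)            # no match: string comes back as-is

print(first_only)   # 2004959-559
print(all_hyphens)  # 2004959559
print(unchanged)    # 2004-959-559
```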