1. Common regular expression operators
. Matches any single character
[] Character set: gives the range of values for a single character. [abc] matches a, b, or c; [a-z] matches a single lowercase letter from a to z
[^] Negated character set: gives the exclusion range for a single character. [^abc] matches any single character other than a, b, or c
* Zero or more repetitions of the previous character. abc* matches ab, abc, abcc, abccc, etc.
+ One or more repetitions of the previous character. abc+ matches abc, abcc, abccc, etc.
? Zero or one repetition of the previous character. abc? matches ab, abc
| Either the left or the right expression. abc|def matches abc, def
{m} Exactly m repetitions of the previous character. ab{2}c matches abbc
{m,n} From m to n repetitions of the previous character (n inclusive). ab{1,2}c matches abc, abbc
^ Matches the beginning of the string. ^abc matches abc only at the beginning of a string
$ Matches the end of the string. abc$ matches abc only at the end of a string
() Grouping marker; the | operator can be used inside it to separate alternatives. (abc) matches abc; (abc|def) matches abc, def
\d Digit, equivalent to [0-9]
\w Word character, equivalent to [A-Za-z0-9_]
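The operators above can be tried out directly in Python. The following is a minimal sketch, not part of the original notes: it uses re.fullmatch so that each pattern has to cover the whole test string, and the sample strings and the helper name check are made up for illustration.

```python
import re

# Each entry: (pattern, strings that should match in full, strings that should not).
cases = [
    (r"abc*",     ["ab", "abc", "abccc"], ["a", "abcd"]),
    (r"abc+",     ["abc", "abcc"],        ["ab"]),
    (r"abc?",     ["ab", "abc"],          ["abcc"]),
    (r"abc|def",  ["abc", "def"],         ["abcdef"]),
    (r"ab{2}c",   ["abbc"],               ["abc", "abbbc"]),
    (r"ab{1,2}c", ["abc", "abbc"],        ["ac", "abbbc"]),
    (r"[^abc]",   ["d", "A"],             ["a", "b"]),
    (r"\w\d",     ["a1", "_9"],           ["1a"]),
]

def check(pattern, good, bad):
    """Hypothetical helper: True if every 'good' string fully matches and no 'bad' string does."""
    return (all(re.fullmatch(pattern, s) for s in good)
            and not any(re.fullmatch(pattern, s) for s in bad))

results = {pattern: check(pattern, good, bad) for pattern, good, bad in cases}
for pattern, ok in results.items():
    print(f"{pattern:10} {'ok' if ok else 'unexpected'}")
```

Note that matching is case-sensitive by default, which is why "A" is accepted by [^abc].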
2. Examples of classic regular expressions
^[A-Za-z]+$ A string made up of the 26 letters
^[A-Za-z0-9]+$ A string made up of the 26 letters and digits
^-?\d+$ A string in integer form
^[0-9]*[1-9][0-9]*$ A string in positive-integer form
[1-9]\d{5} Postal code in China, 6 digits
[\u4e00-\u9fa5] Matches Chinese characters
\d{3}-\d{8}|\d{4}-\d{7} Domestic phone number, e.g. 010-68913536
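As a quick check, the patterns in this list can be run against a few sample strings. This is only a sketch: the constant names and the sample inputs are invented for illustration, while the patterns themselves are the ones listed above.

```python
import re

# Patterns copied from the list above; the names are illustrative only.
LETTERS      = re.compile(r"^[A-Za-z]+$")
INTEGER      = re.compile(r"^-?\d+$")
POSITIVE_INT = re.compile(r"^[0-9]*[1-9][0-9]*$")
POSTAL_CODE  = re.compile(r"[1-9]\d{5}")
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fa5]")
PHONE        = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")

print(bool(LETTERS.search("Python")))            # True
print(bool(LETTERS.search("Python3")))           # False: contains a digit
print(bool(INTEGER.search("-42")))               # True
print(bool(POSITIVE_INT.search("000")))          # False: no non-zero digit
print(POSTAL_CODE.search("BIT 100081").group(0)) # 100081
print(CHINESE_CHAR.findall("re 库入门"))          # ['库', '入', '门']
print(PHONE.search("Tel: 010-68913536").group(0))  # 010-68913536
```

The patterns without ^ and $ (postal code, Chinese characters, phone number) find a match anywhere in the text, so they are better suited to extraction than to validation.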
Regular expression for strings in IP address form (an IP address has 4 segments, each in the range 0-255)
Rough form: \d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} (the dot must be escaped as \. to match a literal dot)
Exact form, segment by segment:
0-99: [1-9]?\d
100-199: 1\d{2}
200-249: 2[0-4]\d
250-255: 25[0-5]
Combined: (([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])
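The exact form can be assembled from the four segment cases rather than typed out twice. The sketch below makes two choices that are not in the original notes: it uses non-capturing groups (?:...) instead of plain parentheses, and it relies on fullmatch because the pattern has no ^ or $ anchors. The names SEGMENT, IP_PATTERN and is_ipv4 are illustrative.

```python
import re

# One segment in the range 0-255, built from the four cases listed above.
SEGMENT = r"(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"

# Three "segment followed by a literal dot" groups, then a final segment.
IP_PATTERN = re.compile(rf"(?:{SEGMENT}\.){{3}}{SEGMENT}")

def is_ipv4(text):
    """Return True if the whole string is four dot-separated segments, each 0-255."""
    return IP_PATTERN.fullmatch(text) is not None

for candidate in ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3", "01.2.3.4"]:
    print(candidate, is_ipv4(candidate))
```

With this pattern a leading zero such as 01 is rejected, because the 0-99 case only allows an optional first digit from 1 to 9.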
3. The re library is Python's standard library for regular expressions, used primarily for string matching
Usage: import re
Ways to write a regular expression:
Raw string type: the usual way to represent a regular expression. A raw string is a string in which backslashes are not interpreted as escape characters, written as r'text'
Ordinary string type: can also represent a regular expression, but every backslash has to be escaped again, which is more cumbersome
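The difference between the two notations is easiest to see side by side. In this small sketch the variable names and the sample text are made up for illustration.

```python
import re

raw_pattern = r"\d+\.\d+"      # raw string: backslashes reach the regex engine unchanged
plain_pattern = "\\d+\\.\\d+"  # ordinary string: each backslash has to be doubled

print(raw_pattern == plain_pattern)  # True: both spell the same pattern
print(re.search(raw_pattern, "version 3.11 released").group(0))  # 3.11
```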
Main functions:
(1) re.search(pattern, string, flags=0) searches a string for the first position that matches the regular expression and returns a match object (or None if nothing matches)
- pattern: the regular expression, as a string or a raw string
- string: the string to be searched
- flags: control flags applied when the regular expression is used
re.I, re.IGNORECASE: ignores case in the regular expression, so [A-Z] also matches lowercase characters
re.M, re.MULTILINE: the ^ operator can match at the start of each line of the given string
re.S, re.DOTALL: the . operator matches every character; by default it matches every character except a newline
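The effect of the three flags on re.search can be seen with a short multi-line string. This is a sketch with an invented sample text; flags are combined with the | operator.

```python
import re

text = "first line\nPython is fun\npython again"

# Without flags the search is case-sensitive and ^ only matches at the very start.
plain = re.search(r"^python", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
flagged = re.search(r"^python", text, flags=re.I | re.M)

# By default . stops at a newline; with re.S it also matches "\n".
no_dotall = re.search(r"first.*fun", text)
dotall = re.search(r"first.*fun", text, flags=re.S)

print(plain)                    # None
print(flagged.group(0))         # Python
print(no_dotall)                # None
print(repr(dotall.group(0)))    # 'first line\nPython is fun'
```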
(2) re.match(pattern, string, flags=0) matches the regular expression starting at the beginning of a string and returns a match object
(3) re.findall(pattern, string, flags=0) searches a string and returns all matching substrings as a list
(4) re.split(pattern, string, maxsplit=0, flags=0) splits a string at each match of the regular expression and returns a list
- maxsplit: the maximum number of splits; the remainder is returned as the last element
(5) re.finditer(pattern, string, flags=0) searches a string and returns an iterator over the matches, where each element is a match object
(6) re.sub(pattern, repl, string, count=0, flags=0) replaces every substring that matches the regular expression and returns the resulting string
- repl: the string that replaces each match
- count: the maximum number of replacements
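The remaining functions can be compared on one sample string. The following sketch reuses the postal-code pattern from section 2; the sample text and the replacement string ":zipcode" are assumptions chosen for illustration.

```python
import re

text = "BIT100081 TSU100084"
code = r"[1-9]\d{5}"

# re.match only succeeds when the pattern matches at position 0.
at_start = re.match(code, text)              # None: the text starts with "BIT"
at_start_ok = re.match(code, "100081 BIT")   # matches "100081"

all_codes = re.findall(code, text)
pieces = re.split(code, text)
pieces_once = re.split(code, text, maxsplit=1)
spans = [m.span() for m in re.finditer(code, text)]
masked = re.sub(code, ":zipcode", text)
masked_once = re.sub(code, ":zipcode", text, count=1)

print(at_start)       # None
print(all_codes)      # ['100081', '100084']
print(pieces)         # ['BIT', ' TSU', '']
print(pieces_once)    # ['BIT', ' TSU100084']
print(spans)          # [(3, 9), (13, 19)]
print(masked)         # BIT:zipcode TSU:zipcode
print(masked_once)    # BIT:zipcode TSU100084
```

Because re.match returns None when the start of the string does not match, the result should be checked before calling .group(0) on it.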
4. Object-oriented usage: compile once, then perform multiple operations
regex = re.compile(pattern, flags=0) compiles the string form of a regular expression into a regular expression object
- pattern: the regular expression, as a string or a raw string
- flags: control flags applied when the regular expression is used
regex.search() searches a string for the first position that matches the regular expression and returns a match object
regex.match() matches the regular expression starting at the beginning of a string and returns a match object
regex.findall() searches a string and returns all matching substrings as a list
regex.split() splits a string at each match of the regular expression and returns a list
regex.finditer() searches a string and returns an iterator over the matches, where each element is a match object
regex.sub() replaces every substring that matches the regular expression and returns the resulting string
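A compiled pattern offers the same operations as methods, without the pattern argument. In this sketch the variable name postal, the sample lines and the replacement text are invented for illustration.

```python
import re

# Compile once, then reuse the pattern object for several operations.
postal = re.compile(r"[1-9]\d{5}")

lines = ["BIT 100081", "TSU 100084", "no code here"]

found = []
for line in lines:
    m = postal.search(line)      # same as re.search(pattern, line)
    if m:                        # search returns None when nothing matches
        found.append(m.group(0))

hidden = postal.sub("******", "BIT 100081")

print(found)           # ['100081', '100084']
print(hidden)          # BIT ******
print(postal.pattern)  # [1-9]\d{5}
```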
5. The match object is the result of one match and contains a lot of information about it
Attributes:
.string the text that was searched
.re the pattern object (regular expression) used for the match
.pos the position in the text where the regular expression search starts
.endpos the position in the text where the regular expression search ends
Methods:
.group(0) returns the matched string
.start() the start position of the matched string in the original string
.end() the end position of the matched string in the original string
.span() returns (.start(), .end())
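The attributes and methods listed above can be inspected on a single match. This sketch reuses the postal-code pattern from section 2 with an invented sample string.

```python
import re

m = re.search(r"[1-9]\d{5}", "BIT100081 TSU100084")

print(m.string)          # BIT100081 TSU100084  (the text that was searched)
print(m.re.pattern)      # [1-9]\d{5}           (pattern of the compiled object in m.re)
print(m.pos, m.endpos)   # 0 19                 (search range within the text)
print(m.group(0))        # 100081               (only the first match is returned)
print(m.start(), m.end())  # 3 9
print(m.span())          # (3, 9)
```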
6. Greedy matching and minimal matching
By default the re library uses greedy matching, that is, it returns the longest matching substring.
Minimal-match operators:
*? Zero or more repetitions of the previous character, minimal match
+? One or more repetitions of the previous character, minimal match
?? Zero or one repetition of the previous character, minimal match
{m,n}? From m to n repetitions of the previous character (n inclusive), minimal match
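The difference between the greedy and minimal forms shows up as soon as more than one substring could match. The sample strings in this sketch are made up for illustration.

```python
import re

text = "PYANBNCNDN"

greedy = re.search(r"PY.*N", text).group(0)     # runs on to the last N
minimal = re.search(r"PY.*?N", text).group(0)   # stops at the first N

print(greedy)    # PYANBNCNDN
print(minimal)   # PYAN

# The same idea applies to the other minimal-match operators.
print(re.search(r"a+", "aaa").group(0))          # aaa
print(re.search(r"a+?", "aaa").group(0))         # a
print(re.search(r"ab?", "ab").group(0))          # ab
print(re.search(r"ab??", "ab").group(0))         # a
print(re.search(r"a{2,4}", "aaaaa").group(0))    # aaaa
print(re.search(r"a{2,4}?", "aaaaa").group(0))   # aa
```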
Python Web Crawler and Information Extraction: 6. Getting started with the re (regular expression) library