A regular expression is a powerful tool for working with strings, and he is not some kind of programming cloud.
Regular expressions have an independent endurance engine, and the syntax for regular expressions is the same regardless of programming language.
The matching process of regular expressions
1. Take out the expression and compare the characters in the text at once.
2. If each character matches, the match succeeds, and the match fails once a match is unsuccessful.
3. If there are two or more convenient expressions, the process will be slightly different.
Here are some examples of symbols
[....]
Character set (character Class). The corresponding location can be any character in the character set. Characters in the character set can be listed by Brother Pig, or can be given a range, such as [ABC] or [A-c]. The first character, if it is ^, is reversed if [^ABC] represents other characters that are not ABC. All the special characters in the character set are to go to some original special meaning. In the character set if it is used],-or ^, you can precede the transfer character backslash \, or put],-put in the first character, put ^ in the non-first character.
Predefined character sets (can be written in the character set [....] Medium):
\d number: [0-9]
\d non-numeric: [^\d]
\s White space character:[< space >\t\r\n\f\v]
\s non-whitespace characters: [^\s]
\w Word character: [a-za-z0-9_]
\w Fly Word character: [^\w]
The number of words (used in characters or (...). After
* Match the previous character 0 or unlimited times
+ match 1 or unlimited times before
? Match 0 or 1 times before
{m} matches the previous character m times
{M,n} matches the previous character M to n times (more than n times fails)
M and n can be omitted: if M is omitted, the match is 0 to n; if n is omitted, it matches m to infinity
Boundary match (does not consume characters in the string to be matched)
^ matches the beginning of the string. Matches the beginning of each row in multiline mode.
$ matches the end of the string. Matches the end of each line in a multiline mode.
\a matches only the beginning of the string.
\z matches only the end of the string.
\b Match between \w and \w
\b [^\b]
Logic, grouping:
| Represents an arbitrary match between the left and right expressions. (Analogous to a C language or statement, it always matches the left expression first, and once a successful match skips the expression to the right.) If | is not included in the (), its scope is the entire regular expression. )
(...) The enclosed expression will be grouped, starting from the left side of the expression without encountering a grouped opening parenthesis ' (', number +1. In addition, the fractional expression as a whole can be a number of posterior street words. Only valid in this group in the expression.
(? P<name>, ...) Group, specifying an additional alias in addition to the original number.
The \<number> reference number is the string to which the <number> group matches.
(? P=name) refers to the string that the alias of <name> is matched to.
Special constructs (not grouped):
(?:...) (...) Non-grouped versions for edible ' | ' Or a number of words followed.
Each character of the (? ilmsux) Ilmsux represents a matching pattern that can be used only at the beginning of a regular expression and optionally multiple.
(?#...) #后的内容将作为注释被忽略.
(?=...) The subsequent string content needs to match the expression in order to successfully match. Does not consume string content.
(?! ...) The subsequent string content requires an unmatched expression to match successfully. does not consume strings.
(? <= ...) The previous string content needs to match the expression in order to successfully match. Does not consume string content.
(?<!...) The previous string content requires an unmatched expression to match successfully. Does not consume string content.
(? (id/name) Yes-pattern|no-pattern) If a group with the number ID/alias name is matched to a string, matching yes-pattern is required, otherwise a match no-=attern is required. [No-pattern] can be omitted.
Greedy mode and non-greedy mode of quantitative words
Regular expressions are typically used to find matching strings in text.
Greedy mode: Always try to skim as many characters as possible; (Python's number words are greedy by default)
Non-greedy mode: always try to match as few characters as possible. (in greedy mode * or + after adding, it becomes non-greedy mode)
How to use regular expressions in Python
In Python, a regular expression is supported by a package called "re".
The results are as follows:
Let's analyze the pattern = Re.compile (R ' \d+\.\d* ') this statement:
\d = number [0-9]
+ indicates repeated occurrences of the last match 1 or n times
\. Represents the character '. '
* Indicates repeated occurrences of the last match 0 or n times
R is actually a python that tells the compiler that all escape characters in this string are invalidated and processed according to the original string.
So \d+.\d* actually represents a rule that matches some decimals. However, this expression does not correctly match all decimals, such as ' 0 '. Such characters will also be matched, and this example is purely for the purpose of speaking more than a few symbols.
Since we have established a pattern object that matches the ' \d+.\d* ' rule.
The FindAll method of pattern can be used to match the string we want.
Returns a list of strings [].
Crawler premise--Regular expression syntax and its use in Python