Learn Python Regular Expressions

Source: Internet
Author: User
Tags control characters

To capture arbitrary characters from a block of text with a regular expression, I first wrote the following pattern:
(.*)
When I ran it, the text after the first line break was not captured. The manual explains why: in a regular expression, "." (the dot) matches every character except the newline "\n".

The following pattern matches correctly:
([\s\S]*)
"([\d\D]*)" and "([\w\W]*)" work equally well, because a shorthand class combined with its complement covers every character, including "\n".
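
The difference described above can be checked with a short sketch. The sample text is made up for illustration, and the re.S flag used in the last call is the one listed in the flags section of this article.

```python
import re

text = "first line\nsecond line"  # hypothetical sample input

# "." stops at the newline, so only the first line is captured.
dot_default = re.match(r"(.*)", text).group(1)

# A shorthand class paired with its complement matches every character.
any_char = re.match(r"([\s\S]*)", text).group(1)

# Alternative: keep "." and pass re.S so that it also matches "\n".
dot_all = re.match(r"(.*)", text, re.S).group(1)

print(repr(dot_default))  # 'first line'
print(repr(any_char))     # 'first line\nsecond line'
print(repr(dot_all))      # 'first line\nsecond line'
```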

 

Character
Description
\
Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and '\(' matches "(".
^
Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$
Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*
Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+
Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?
Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}
n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}
n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}
m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", whereas 'o+' matches all the o's.
.
Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'. (Inside square brackets "." is a literal dot, so '[.\n]' would match only a dot or a newline.)
(pattern)
Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)
Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)
Positive lookahead: matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)
Negative lookahead: matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?=...)
Positive lookahead assertion. It succeeds if the contained regular expression, written here as ..., matches at the current position, and fails otherwise. Once the contained expression has been tried, however, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started. For example, Isaac (?=Asimov) matches 'Isaac ' only if it is followed by 'Asimov'. Here ... stands for any expression; the same applies to the entries below.
(?!...)
Negative lookahead assertion. The opposite of the positive assertion: it succeeds if the contained expression does not match at the current position in the string. For example, Isaac (?!Asimov) matches 'Isaac ' only if it is not followed by 'Asimov'.
(?<=...)
Positive lookbehind assertion. Matches if the current position in the string is preceded by a match for ... that ends at the current position.
(?<!...)
Negative lookbehind assertion. Matches if the current position in the string is not preceded by a match for ....
x|y
Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]
Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]
Negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]
Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]
Negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b
Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B
Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx
Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d
Matches a digit character. Equivalent to [0-9].
\D
Matches a non-digit character. Equivalent to [^0-9].
\f
Matches a form feed. Equivalent to \x0c and \cL.
\n
Matches a newline. Equivalent to \x0a and \cJ.
\r
Matches a carriage return. Equivalent to \x0d and \cM.
\s
Matches any whitespace character, including space, tab, and form feed. Equivalent to [ \f\n\r\t\v].
\S
Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t
Matches a tab. Equivalent to \x09 and \cI.
\v
Matches a vertical tab. Equivalent to \x0b and \cK.
\w
Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W
Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn
Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num
Matches num, where num is a positive integer. It is a reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n
Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm
Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m. If neither condition holds and n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml
Matches the octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
\un
Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
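
Several rows of the table can be tried directly in Python. The following is a minimal sketch that reuses the table's own example strings where possible; "bookkeeper" is an extra sample word chosen here for the backreference.

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo".
greedy = re.match(r"o+", "oooo").group()    # takes all four o's
lazy = re.match(r"o+?", "oooo").group()     # takes a single o

# Bounded repetition: {1,3} takes at most three of the o's.
bounded = re.search(r"o{1,3}", "fooooood").group()

# Lookahead: "Windows " matches only when a listed version follows,
# and the version itself is not part of the match.
ahead = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
no_ahead = re.search(r"Windows (?=95|98|NT|2000)", "Windows 3.1")

# Negative lookahead is the mirror image.
neg_ahead = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

# Word boundary: 'er\b' needs the word to end right after "er".
boundary = re.search(r"er\b", "never")
no_boundary = re.search(r"er\b", "verb")

# Backreference: (.)\1 finds two identical consecutive characters.
doubled = re.search(r"(.)\1", "bookkeeper").group()

print(greedy, lazy, bounded)          # oooo o ooo
print(repr(ahead.group()), no_ahead)  # 'Windows ' None
print(repr(neg_ahead.group()))        # 'Windows '
print(boundary.span(), no_boundary)   # (3, 5) None
print(doubled)                        # oo
```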
>>> import re
>>> s = '2017-2-23 11:00:22'
>>> pattern = re.compile(r'(\d{4})-(\d{1,2})-(\d{1,2})(.*)')
>>> result = pattern.search(s)
>>> reslist = result.groups()
>>> print(reslist)
('2017', '2', '23', ' 11:00:22')
Regular expressions are often used to dissect strings: you write an RE divided into several groups, each matching a component of interest. For example, an RFC-822 header line consists of a header name and a value separated by ':'. It can be handled by a regular expression that matches the entire header line, with one group matching the header name and another group matching the header value.
Groups are marked by the "(" and ")" metacharacters. "(" and ")" have much the same meaning as in mathematical expressions: they combine the expressions inside them into a group. For example, you can repeat the contents of a group with a quantifier such as *, +, ?, or {m,n}. For example, (ab)* matches zero or more repetitions of "ab".
Groups marked with "(" and ")" also capture the start and end index of the text they match. These can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting from 0. Group 0 always exists; it is the whole RE, so the MatchObject methods all take group 0 as their default argument. Later we will see how to write groups that do not capture the span of text they match.
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
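
The index methods mentioned above, start(), end(), and span(), accept the same group numbers as group(). A small sketch using the same pattern:

```python
import re

p = re.compile(r"(a(b)c)d")
m = p.match("abcd")

whole_span = m.span()      # group 0 is the default argument
outer_span = m.span(1)     # span of 'abc'
inner_start = m.start(2)   # index where 'b' begins
inner_end = m.end(2)       # index just past 'b'

print(whole_span, outer_span, inner_start, inner_end)  # (0, 4) (0, 3) 1 2
```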
The groups() method returns a tuple containing the strings of all the groups, from group 1 up to however many groups the pattern contains.
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> reslist = m.groups()
>>> reslist
()
>>> m.group()
'egg'
>>> m = re.search(r'(?<=-)(\w+)', 'spam-egg')
>>> reslist = m.groups()
>>> reslist
('egg',)
>>> m.group(0)
'egg'
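
The header example mentioned earlier can be sketched as follows. The header line and the pattern are illustrative assumptions, not a complete RFC-822 parser. The last lines show a non-capturing group, which groups a subexpression without adding an entry to groups().

```python
import re

# Hypothetical header line, used only for illustration.
header = "Content-Type: text/plain; charset=utf-8"

# Group 1 is the header name, group 2 is the value after the colon.
header_re = re.compile(r"([^:\s]+):\s*(.*)")
m = header_re.match(header)
name, value = m.groups()

# (?:...) groups without capturing, so only the inner (b) is reported.
nc = re.match(r"(?:a(b)c)d", "abcd")
nc_groups = nc.groups()

print(name)       # Content-Type
print(value)      # text/plain; charset=utf-8
print(nc_groups)  # ('b',)
```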

I. Several re Functions

1: compile(pattern[, flags])
Generates a regular expression object from the regular expression string pattern and the optional flags (see the section on regular expression objects below).
The flags are defined as follows:
I case-insensitive matching
L makes some special character classes depend on the current locale
M multi-line mode: ^ and $ match at the start and end of each line, not only at the start and end of the string
S "." matches any character including '\n'; without this flag '\n' is excluded
U makes \w, \W, \b, \B, \d, \D, \s and \S depend on the Unicode character properties database
X verbose mode: whitespace and # comments in the pattern are ignored, so that regular expressions can be written more readably
S is the one most commonly used.
It is applied as follows:
import re
re.compile(......, re.S)
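
The effect of the other flags can be sketched in the same way. The sample text below is made up for illustration. Flags are combined with the bitwise OR operator.

```python
import re

text = "Python\npython rocks"  # hypothetical sample input

# re.I: case-insensitive, so both spellings are found.
both = re.findall(r"python", text, re.I)

# re.M: ^ also matches at the start of each line.
line_starts = re.findall(r"^\w+", text, re.M)
only_first = re.findall(r"^\w+", text)

# re.X: whitespace and # comments inside the pattern are ignored.
date_re = re.compile(r"""
    (\d{4})     # year
    -(\d{1,2})  # month
    -(\d{1,2})  # day
""", re.X)
date_parts = date_re.match("2017-2-23").groups()

# Combining flags with |.
combined = re.findall(r"^python", text, re.I | re.M)

print(both)         # ['Python', 'python']
print(line_starts)  # ['Python', 'python']
print(only_first)   # ['Python']
print(date_parts)   # ('2017', '2', '23')
print(combined)     # ['Python', 'python']
```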
2: match(pattern, string[, flags])
Matches pattern against the beginning of string. The flags are the same as for compile.
Returns a MatchObject (see the MatchObject section below), or None if there is no match.
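
Because match() only tries the beginning of the string and returns None on failure, the result is usually tested before it is used. A minimal sketch; first_number is a hypothetical helper name, and search(), used earlier in this article, appears only for contrast.

```python
import re

# match() succeeds only if the pattern matches at the start of the string.
at_start = re.match(r"\d+", "2017 was the year")
not_at_start = re.match(r"\d+", "the year 2017")

# search() scans the whole string instead.
found_later = re.search(r"\d+", "the year 2017")

def first_number(s):
    """Return the leading digits of s, or None when there are none."""
    m = re.match(r"\d+", s)
    return m.group() if m else None

print(at_start.group())      # 2017
print(not_at_start)          # None
print(found_later.group())   # 2017
print(first_number("42abc"), first_number("abc"))  # 42 None
```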
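3: split(pattern, string[, maxsplit=0])
Splits string at the occurrences of pattern.
>>> re.split(r'\W+', 'words, words.')
['words', 'words', '']
Parentheses '()' have a special function in pattern: if the pattern contains capturing groups, the text matched by the groups is also included in the result list. Please refer to the manual for details.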
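
The effect of parentheses and of maxsplit can be sketched like this:

```python
import re

text = "Words, words, words."

# Without a capturing group the separators are dropped.
plain = re.split(r"\W+", text)

# With a capturing group the separators are kept in the result.
with_seps = re.split(r"(\W+)", text)

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r"\W+", text, maxsplit=1)

print(plain)      # ['Words', 'words', 'words', '']
print(with_seps)  # ['Words', ', ', 'words', ', ', 'words', '.', '']
print(limited)    # ['Words', 'words, words.']
```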
4: findall(pattern, string[, flags])
Frequently used.
Finds all non-overlapping matches of pattern in string and returns them as a list.
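
What the returned list contains depends on the number of groups in the pattern. A small sketch with made-up sample text:

```python
import re

text = "2017-2-23 and 2018-11-5"  # hypothetical sample input

# No groups: a list of the full matches.
dates = re.findall(r"\d{4}-\d{1,2}-\d{1,2}", text)

# Several groups: a list of tuples, one tuple per match.
parts = re.findall(r"(\d{4})-(\d{1,2})-(\d{1,2})", text)

# Exactly one group: a list of that group's strings.
years = re.findall(r"(\d{4})-\d{1,2}-\d{1,2}", text)

# Matches never overlap: "aaaa" contains two, not three, matches of "aa".
non_overlap = re.findall(r"aa", "aaaa")

print(dates)        # ['2017-2-23', '2018-11-5']
print(parts)        # [('2017', '2', '23'), ('2018', '11', '5')]
print(years)        # ['2017', '2018']
print(non_overlap)  # ['aa', 'aa']
```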
5: sub(pattern, repl, string[, count])
repl can be a string or a function.
When repl is a string, each substring that matches pattern is replaced with repl.
When repl is a function, it is called for each non-overlapping match of pattern with the match object as its argument, and the matched substring is replaced with the return value.
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
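
The optional count argument and the use of group references in repl can be sketched as follows. The sample strings and the helper name double are illustrative.

```python
import re

# count limits the number of replacements; 0 (the default) replaces all.
first_only = re.sub(r"\s+", "_", "a b  c   d", count=1)
all_runs = re.sub(r"\s+", "_", "a b  c   d")

# A backreference in a string repl reuses captured text.
swapped = re.sub(r"(\w+)-(\w+)", r"\2-\1", "spam-egg")

# A function repl receives the match object and returns the replacement.
def double(matchobj):
    return str(int(matchobj.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 14 pears")

print(first_only)  # a_b  c   d
print(all_runs)    # a_b_c_d
print(swapped)     # egg-spam
print(doubled)     # 6 apples and 28 pears
```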
II. Regular expression objects
Created by: re.compile(pattern[, flags])
match(string[, pos[, endpos]]): returns the MatchObject for matching pattern against string[pos:endpos] (see the MatchObject section below)
split(string[, maxsplit=0])
findall(string[, pos[, endpos]])
sub(repl, string[, count=0])
These methods do the same work as the functions of the same name in the re module; only the calling form differs slightly.
If a program uses these functions many times with the same pattern, calling the methods of a compiled regular expression object is more efficient.
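
A short sketch of the object form, including the pos and endpos arguments that only the object methods accept. The names word_re and space_re and the sample text are illustrative.

```python
import re

text = "spam egg ham"            # hypothetical sample input
word_re = re.compile(r"\w+")     # compiled once, reused below
space_re = re.compile(r"\s+")

# The module function and the object method give the same result.
via_module = re.findall(r"\w+", text)
via_object = word_re.findall(text)

# pos and endpos limit matching to text[pos:endpos] without slicing the string.
from_pos = word_re.match(text, 5).group()
clipped = word_re.match(text, 5, 7).group()
tail = word_re.findall(text, 5)

# The object methods take no pattern argument.
pieces = space_re.split(text, maxsplit=1)
joined = space_re.sub("-", text, count=1)

print(via_module == via_object)  # True
print(from_pos, clipped, tail)   # egg eg ['egg', 'ham']
print(pieces)                    # ['spam', 'egg ham']
print(joined)                    # spam-egg ham
```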
III. MatchObject

Returned by re.match(......) and by re.compile(......).match(......).
This object has the following methods and attributes:
Methods:
group([group1, ...])
groups([default])
groupdict([default])
start([group])
end([group])
The best way to illustrate these methods is with an example.
import re
# Despite its name, matchobj holds the compiled pattern; m is the match object.
matchobj = re.compile(r"(?P<int>\d+)\.(\d*)")
m = matchobj.match('3.14sss')
# m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14sss')
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(1, 2))
print(m.group(0, 1, 2))
print(m.groups())
print(m.groupdict())
print(m.start(2))
print(m.string)
The output is as follows:
3.14
3.14
3
14
('3', '14')
('3.14', '3', '14')
('3', '14')
{'int': '3'}
2
3.14sss
Thus group() and group(0) both return the string matched by the whole expression.
group(i) returns the content matched by the i-th pair of parentheses "()" in the regular expression.
The tuple ('3.14', '3', '14') illustrates this best.
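
Named groups and the default argument of groups() can be sketched in the same way. The group name frac and the pattern number_re are additions made here for illustration; the original example names only the int group.

```python
import re

number_re = re.compile(r"(?P<int>\d+)\.(?P<frac>\d*)")
m = number_re.match("3.14sss")

by_name = m.group("int")                       # a named group is also group 1
named = m.groupdict()                          # only named groups appear here
frac_span = (m.start("frac"), m.end("frac"))   # indices of '14' in the string

# default replaces groups that did not take part in the match.
opt = re.match(r"(\d+)(?:\.(\d+))?", "42")
without_default = opt.groups()
with_default = opt.groups("0")

print(by_name, m.group(1))   # 3 3
print(named)                 # {'int': '3', 'frac': '14'}
print(frac_span)             # (2, 4)
print(without_default)       # ('42', None)
print(with_default)          # ('42', '0')
```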
