Python Learning notes (regular expressions)

What is a regular expression

A regular expression is a special sequence of characters that lets you check whether a string matches a pattern. The simplest regular expression is an ordinary string, which matches itself: the regular expression 'python' matches the string 'python'. You can use this matching behavior to search for patterns in a text, to replace certain patterns with computed values, or to split a text into pieces.

Wildcard characters

A regular expression can match more than one string, and special characters are used to build such patterns. For example, a dot (.) matches any single character except a newline. This is similar to file search in Windows, where a question mark (?) matches any one character. Symbols of this kind are called wildcards.
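
As a minimal sketch of the dot wildcard, the snippet below checks a few strings against the pattern 'p.thon'. The pattern and the test strings are made up for this illustration and do not come from the original notes.

```python
import re

# '.' stands for exactly one character (any character except a newline).
pattern = 'p.thon'

results = {
    'python': bool(re.search(pattern, 'python')),    # '.' matches 'y'
    'pathon': bool(re.search(pattern, 'pathon')),    # '.' matches 'a'
    'pthon': bool(re.search(pattern, 'pthon')),      # one character is missing
    'p\nthon': bool(re.search(pattern, 'p\nthon')),  # '.' skips newlines by default
}

for candidate, matched in results.items():
    print(repr(candidate), matched)
```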

Escape of special characters

Given the above, can we match "python.org" by using 'python.org' directly? It works, but the pattern also matches "pythonzorg", which is not the desired result.

To match a literal dot we need to escape it by putting a backslash in front of it. The pattern 'python\\.org' therefore matches only "python.org".

Why use two backslashes?

Two levels of escaping are involved: 1. the Python interpreter, which processes backslashes in ordinary string literals, and 2. the re module, which must receive a single backslash followed by the dot. If you do not want to write two backslashes, use a raw string, such as r'python\.org'.
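
The two levels of escaping can be checked directly. The sketch below compares the unescaped pattern with both spellings of the escaped one. It uses re.fullmatch so that the whole string has to match; the variable names are only illustrative.

```python
import re

wildcard = 'python.org'        # the dot matches any character
escaped = 'python\\.org'       # ordinary string: backslash written twice
escaped_raw = r'python\.org'   # raw string: backslash written once

# Both spellings produce the same pattern text: python\.org
same_pattern = (escaped == escaped_raw)

samples = ('python.org', 'pythonzorg')
wildcard_hits = [s for s in samples if re.fullmatch(wildcard, s)]
escaped_hits = [s for s in samples if re.fullmatch(escaped_raw, s)]

print(same_pattern)    # True
print(wildcard_hits)   # ['python.org', 'pythonzorg']
print(escaped_hits)    # ['python.org']
```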

Character sets

You can create a character set by enclosing characters in square brackets ([]). A set matches any one of the characters it contains. You can use a range, such as '[a-z]' to match any character from a to z, and you can combine ranges by writing them one after another, such as '[a-zA-Z0-9]' to match any uppercase or lowercase letter or digit.

To invert a character set, put ^ at the beginning: '[^abc]' matches any character except a, b, and c.
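
A short sketch of both kinds of set, using re.findall to list what each one matches. The input strings are invented for the example.

```python
import re

alnum = re.compile(r'[a-zA-Z0-9]+')   # runs of letters and digits
not_abc = re.compile(r'[^abc]')       # any single character other than a, b, c

alnum_runs = alnum.findall('user_42 logged in')   # '_' and ' ' are not in the set
others = not_abc.findall('abcabd')

print(alnum_runs)  # ['user', '42', 'logged', 'in']
print(others)      # ['d']
```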

Alternatives

Sometimes you want to match only the strings 'python' and 'perl'. For this you can use the alternation character, the pipe symbol (|). The desired pattern can be written as 'python|perl'.

Subpatterns

Sometimes you do not want the alternation to apply to the entire pattern, only to part of it. In that case, enclose that part in parentheses to make it a subpattern. The previous example can be written as 'p(ython|erl)'.
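
The two forms accept the same strings, which the following sketch checks. The extra word 'pearl' is a made-up non-matching input.

```python
import re

whole = re.compile(r'python|perl')       # alternation over the whole pattern
grouped = re.compile(r'p(ython|erl)')    # alternation limited to a subpattern

words = ['python', 'perl', 'pearl']
whole_hits = [w for w in words if whole.fullmatch(w)]
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# The parenthesized part is also available as group 1.
suffix = grouped.fullmatch('perl').group(1)

print(whole_hits)    # ['python', 'perl']
print(grouped_hits)  # ['python', 'perl']
print(suffix)        # erl
```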

Optional and repeated subpatterns

Adding a question mark after a subpattern makes it optional: it may appear in the matched string, but it is not required.

r'(http://)?(www\.)?python\.org'

This pattern matches only the following strings:

'http://www.python.org'

'http://python.org'

'www.python.org'

'python.org'

(pattern)*: the pattern is repeated 0 or more times

(pattern)+: the pattern is repeated 1 or more times

(pattern){m,n}: the pattern is repeated m to n times

For example:

r'w*\.python\.org' matches 'www.python.org', '.python.org', and 'wwwwwww.python.org'

r'w+\.python\.org' matches 'w.python.org' but cannot match '.python.org'

r'w{3,4}\.python\.org' matches only 'www.python.org' and 'wwww.python.org'
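
The optional and repeated forms can be tried out as follows. This is a sketch that uses re.fullmatch so each pattern must cover the whole string; the 'ftp://' entry is a hypothetical non-matching input added for contrast.

```python
import re

url = re.compile(r'(http://)?(www\.)?python\.org')
candidates = [
    'http://www.python.org', 'http://python.org', 'www.python.org', 'python.org',
    'ftp://python.org',   # hypothetical input that should be rejected
]
accepted = [c for c in candidates if url.fullmatch(c)]

# '*' allows zero repetitions, '+' needs at least one, '{3,4}' needs three or four.
star = bool(re.fullmatch(r'w*\.python\.org', '.python.org'))
plus = bool(re.fullmatch(r'w+\.python\.org', '.python.org'))
braces = bool(re.fullmatch(r'w{3,4}\.python\.org', 'wwwww.python.org'))

print(accepted)            # the first four candidates
print(star, plus, braces)  # True False False
```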

Start and end of string

A caret (^) marks the start of a string; a dollar sign ($) marks the end of a string.

For example, '^python$' matches the string 'python' but not 'python.org' or 'my python'.
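
A small sketch of the effect of the anchors, comparing an anchored and an unanchored pattern with re.search. The sample strings are made up.

```python
import re

anchored = re.compile(r'^python$')   # must start and end with 'python'
unanchored = re.compile(r'python')   # 'python' may appear anywhere

samples = ['python', 'python.org', 'my python']
anchored_hits = [s for s in samples if anchored.search(s)]
unanchored_hits = [s for s in samples if unanchored.search(s)]

print(anchored_hits)    # ['python']
print(unanchored_hits)  # ['python', 'python.org', 'my python']
```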

The re module

The re module contains a number of functions for working with regular expressions. The most common are:

1 compile(pattern[, flags]) creates a pattern object from a string containing a regular expression

2 search(pattern, string[, flags]) searches for the pattern in the string

3 match(pattern, string[, flags]) matches the pattern at the beginning of the string

4 split(pattern, string[, maxsplit=0]) splits the string at each match of the pattern

5 findall(pattern, string) returns a list of all occurrences of the pattern in the string

6 sub(pat, repl, string[, count=0]) replaces the matches of pat in the string with repl

7 escape(string) escapes all special regular expression characters in the string

(pattern: the regular expression to match; string: the string to search; flags: flags that control how the regular expression is matched, such as case sensitivity and multiline matching)

re.compile

re.compile converts a regular expression string into a pattern object. Matching is more efficient when the same pattern object is reused many times.
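
A minimal sketch of reusing a compiled pattern: the pattern object has the same methods as the module-level functions, without the pattern argument. The sample lines are invented for the example.

```python
import re

# Compile once, then reuse the pattern object for every line.
word = re.compile(r'[a-zA-Z]+')

lines = ['alpha beta', '123', 'gamma']
word_counts = [len(word.findall(line)) for line in lines]

# The module-level function gives the same result for a single call.
same_result = word.findall('alpha beta') == re.findall(r'[a-zA-Z]+', 'alpha beta')

print(word_counts)  # [2, 0, 1]
print(same_result)  # True
```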

re.search

re.search looks for the first substring of the given string that matches the regular expression. If one is found, the function returns a match object (which is true in a Boolean context); otherwise it returns None (which is false). Because of this, the function can be used in a conditional statement:

    if re.search(pat, string):
        print('Found it!')

re.match

re.match matches the regular expression at the beginning of the given string. Therefore re.match('p', 'python') returns a match object (true), and re.match('p', 'www.python') returns None (false).

The difference between re.match and re.search

re.match matches only at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.

    import re

    line = "Cats are smarter than dogs"

    # re.match only tries the start of the string, so 'dogs' is not found.
    matchObj = re.match(r'dogs', line, re.M | re.I)
    if matchObj:
        print("match --> matchObj.group():", matchObj.group())
    else:
        print("No match!!")

    # re.search scans the whole string and finds 'dogs' at the end.
    matchObj = re.search(r'dogs', line, re.M | re.I)
    if matchObj:
        print("search --> matchObj.group():", matchObj.group())
    else:
        print("No match!!")

Result:

    No match!!
    search --> matchObj.group(): dogs

re.split

Splits the string at each match of the pattern.

>>> import re
>>> some_text = 'alpha, beta,,,,gamma delta'
>>> re.split('[, ]+', some_text)
['alpha', 'beta', 'gamma', 'delta']

As the example shows, the return value is a list of substrings. The maxsplit parameter sets the maximum number of splits to perform.

>>> re.split('[, ]+', some_text, maxsplit=2)
['alpha', 'beta', 'gamma delta']
>>> re.split('[, ]+', some_text, maxsplit=1)
['alpha', 'beta,,,,gamma delta']

re.findall

Returns all occurrences of the given pattern as a list.

Find all the words in a string:

>>> pat = '[a-zA-Z]+'
>>> text = '"Hm... Err -- are you sure?" he said, sounding insecure.'
>>> re.findall(pat, text)
['Hm', 'Err', 'are', 'you', 'sure', 'he', 'said', 'sounding', 'insecure']

Find the punctuation:

>>> pat = r'[.?\-",]+'
>>> re.findall(pat, text)
['"', '...', '--', '?"', ',', '.']

re.sub

Replaces matches of a pattern in a string.

re.sub(pattern, repl, string[, count=0])

Returns the string obtained by replacing the leftmost non-overlapping matches of the pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional parameter count is the maximum number of matches to replace; count must be a non-negative integer. The default value is 0, which means that all matches are replaced.

>>> import re
>>> pat = '{name}'
>>> text = 'Dear {name}'
>>> re.sub(pat, 'Mr. Gumby', text)
'Dear Mr. Gumby'
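
The effect of the count parameter can be sketched as follows. The sample sentence and the replacement word are made up for the illustration.

```python
import re

text = 'spam, spam, spam and eggs'

all_replaced = re.sub(r'spam', 'ham', text)         # count defaults to 0: replace all
first_two = re.sub(r'spam', 'ham', text, count=2)   # replace only the two leftmost matches
unchanged = re.sub(r'bacon', 'ham', text)           # no match: the string comes back as is

print(all_replaced)  # ham, ham, ham and eggs
print(first_two)     # ham, ham, spam and eggs
print(unchanged)     # spam, spam, spam and eggs
```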

re.escape

re.escape is a utility function that escapes all characters in a string that could be interpreted as regular expression operators.

>>> re.escape('www.python.org')
'www\\.python\\.org'
>>> re.escape('But where is the ambiguity')
'But\\ where\\ is\\ the\\ ambiguity'
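
One typical use, sketched here under the assumption that the text to search for comes from outside the program (the variable user_text is hypothetical), is to escape it before compiling it into a pattern, so that it is matched literally.

```python
import re

user_text = 'www.python.org'   # hypothetical input that should be matched literally

unsafe = re.compile(user_text)             # dots act as wildcards
safe = re.compile(re.escape(user_text))    # dots are escaped and match only '.'

unsafe_hit = bool(unsafe.fullmatch('wwwxpythonyorg'))
safe_hit = bool(safe.fullmatch('wwwxpythonyorg'))
literal_hit = bool(safe.fullmatch('www.python.org'))

print(unsafe_hit, safe_hit, literal_hit)  # True False True
```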

Match objects and groups

In short, a group is a subpattern enclosed in parentheses. Groups are numbered by counting their opening parentheses from left to right, and group 0 is the entire pattern. So the following pattern:

'There (was a (wee) (cooper)) who (lived in Fyfe)'

contains the following groups:

0 There was a wee cooper who lived in Fyfe
1 was a wee cooper
2 wee
3 cooper
4 lived in Fyfe
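
The group numbering can be verified by matching the pattern against the sentence it describes. This is a sketch; groups() is the match object method that returns groups 1 and up as a tuple.

```python
import re

pattern = r'There (was a (wee) (cooper)) who (lived in Fyfe)'
m = re.match(pattern, 'There was a wee cooper who lived in Fyfe')

whole = m.group(0)      # the entire match
numbered = m.groups()   # groups 1..4, ordered by their opening parenthesis

print(whole)     # There was a wee cooper who lived in Fyfe
print(numbered)  # ('was a wee cooper', 'wee', 'cooper', 'lived in Fyfe')
```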

Important methods of re match objects

>>> m = re.match(r'www\.(.*)\..{3}', 'www.python.org')
>>> m.group(1)
'python'
>>> m.start(1)
4
>>> m.end(1)
10
>>> m.span(1)
(4, 10)

The group method returns the string matched by the given group of the pattern. If no group number is given, it defaults to 0, so m.group() == m.group(0); if a single group number is given, a single string is returned.

The start method returns the index at which the given group's match starts.

The end method returns the index of the end of the given group's match plus 1, that is, the index just past its last matched character.

The span method returns the start and end indices of the group as a tuple (start, end).
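
Because the end index is one past the last matched character, the pair returned by span can be used directly as a slice. A minimal sketch, with illustrative variable names:

```python
import re

text = 'www.python.org'
m = re.match(r'www\.(.*)\..{3}', text)

start, end = m.span(1)
sliced = text[start:end]         # slicing with the span reproduces the group
last_char = text[m.end(1) - 1]   # the last matched character sits at end - 1

print((start, end))  # (4, 10)
print(sliced)        # python
print(last_char)     # n
```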

Regular expression modifiers: optional flags

A regular expression can take optional flags that control how the pattern is matched. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:

re.I	Makes the match case-insensitive
re.L	Makes the match locale-aware
re.M	Multiline matching; affects ^ and $
re.S	Makes . match every character, including newlines
re.U	Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B.
re.X	Allows a more flexible layout (whitespace and comments), so that regular expressions are easier to read.
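
The following sketch shows four of these flags in action on a made-up three-line text. The patterns and names are only illustrative.

```python
import re

text = 'first line\nPython rocks\nlast line'

# re.I ignores case; re.M lets ^ match at the start of every line.
plain = re.search(r'^python', text)
flagged = re.search(r'^python', text, re.I | re.M)

# re.S lets '.' match the newline between two lines.
dot_default = re.search(r'line.Python', text)
dot_all = re.search(r'line.Python', text, re.S)

# re.X ignores whitespace in the pattern and allows comments.
number = re.compile(r'''
    \d+        # integer part
    (\.\d+)?   # optional fractional part
''', re.X)
is_number = bool(number.fullmatch('3.14'))

print(plain, flagged.group())              # None Python
print(dot_default, repr(dot_all.group()))  # None 'line\nPython'
print(is_number)                           # True
```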

Regular expression patterns

Pattern	Description
^	Matches the beginning of the string.
$	Matches the end of the string.
.	Matches any character except a newline. When the re.DOTALL flag is specified, it matches any character, including a newline.
[...]	Represents a set of characters, listed individually: [amk] matches 'a', 'm', or 'k'.
[^...]	Matches characters not in the brackets: [^abc] matches any character other than a, b, or c.
re*	Matches 0 or more occurrences of the preceding expression.
re+	Matches 1 or more occurrences of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression. This is greedy; a ? placed after another quantifier, as in re*?, makes that quantifier non-greedy.
re{n}	Matches exactly n occurrences of the preceding expression.
re{n,}	Matches n or more occurrences of the preceding expression.
re{n,m}	Matches n to m occurrences of the preceding expression, greedily.
a|b	Matches a or b.
(re)	Groups the expression in parentheses and remembers the matched text as a group.
(?imx)	Turns on the optional i, m, or x flags. In Python, these inline flags apply to the whole pattern and must appear at its start.
(?-imx)	Turns off the i, m, or x flags. Python accepts this only in the scoped form (?-imx:re).
(?:re)	Like (...), but does not create a group.
(?imx:re)	Turns on the i, m, or x flags within the parentheses.
(?-imx:re)	Turns off the i, m, or x flags within the parentheses.
(?#...)	Comment.
(?=re)	Positive lookahead assertion. Succeeds if re matches at the current position, without consuming any of the string; the rest of the pattern is then tried from that same position.
(?!re)	Negative lookahead assertion. The opposite of the positive assertion: succeeds when re does not match at the current position.
(?>re)	Atomic (independent) group; eliminates backtracking into the group.
\w	Matches an alphanumeric character or an underscore.
\W	Matches a character that is not alphanumeric or an underscore.
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; equivalent to [0-9].
\D	Matches any non-digit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Perl-style engines it also matches just before a final newline; in Python it matches only at the very end.
\z	Matches the end of the string.
\G	Matches the position where the last match ended (not supported by Python's re module).
\b	Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'.
\B	Matches a non-word boundary. 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never'.
\n, \t, etc.	Match a newline, a tab, and so on.
\1...\9	Matches the content of the nth group.
\10	Matches the content of group 10 if it has been matched. Otherwise, it refers to an octal character code.
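
A few of the less obvious entries in the table, sketched on made-up inputs: a non-capturing group, positive and negative lookahead, and a backreference.

```python
import re

# (?:...) groups without creating a numbered group, so 'python' is group 1.
m = re.match(r'(?:www\.)?(\w+)\.org', 'www.python.org')
only_group = m.groups()

hosts = 'python.org perl.com'
# (?=...) requires '.org' to follow, but does not include it in the match.
before_org = re.findall(r'\w+(?=\.org)', hosts)
# (?!...) rejects names whose suffix starts with 'org'.
not_org = re.findall(r'\w+\.(?!org)\w+', hosts)

# \1 must repeat exactly what group 1 matched: a doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'it was the the best').group()

print(only_group)  # ('python',)
print(before_org)  # ['python']
print(not_org)     # ['perl.com']
print(doubled)     # the the
```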

Regular expression examples: character classes

Example	Description
[Pp]ython	Matches "Python" or "python"
rub[ye]	Matches "ruby" or "rube"
[aeiou]	Matches any one of the letters within the brackets
[0-9]	Matches any digit; same as [0123456789]
[a-z]	Matches any lowercase letter
[A-Z]	Matches any uppercase letter
[a-zA-Z0-9]	Matches any letter or digit
[^aeiou]	Matches any character other than a, e, i, o, u
[^0-9]	Matches any character other than a digit
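
Some of the character classes from the table, tried on invented strings as a quick sketch:

```python
import re

variants = re.findall(r'[Pp]ython', 'Python and python')
rub_words = [w for w in ('ruby', 'rube', 'rubs') if re.fullmatch(r'rub[ye]', w)]
no_vowels = re.sub(r'[aeiou]', '', 'regular')     # delete every lowercase vowel
non_digits = re.findall(r'[^0-9]+', 'a1b22c')     # runs of characters that are not digits

print(variants)    # ['Python', 'python']
print(rub_words)   # ['ruby', 'rube']
print(no_vowels)   # rglr
print(non_digits)  # ['a', 'b', 'c']
```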

Special character classes

Example	Description
.	Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' (inside a character set, '.' is a literal dot, so '[.\n]' matches only a dot or a newline).
\d	Matches a digit. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\s	Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w	Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W	Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
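
The special classes can be sketched on a few made-up strings. The last two checks illustrate how a newline is handled: '.' does not match it by default, '[\s\S]' does, and a dot inside brackets is only a literal dot.

```python
import re

text = 'Order 42 shipped\non day 7'

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', 'snake_case, kebab-case')   # '_' is a word character, '-' is not
tokens = re.findall(r'\S+', 'a  b\tc')                 # split on any whitespace

dot_match = re.search(r'shipped.on', text)             # '.' stops at the newline
any_match = re.search(r'shipped[\s\S]on', text)        # [\s\S] matches any character
dot_in_set = re.fullmatch(r'[.\n]', 'x')               # only '.' or a newline would match

print(digits)  # ['42', '7']
print(words)   # ['snake_case', 'kebab', 'case']
print(tokens)  # ['a', 'b', 'c']
print(dot_match, repr(any_match.group()), dot_in_set)  # None 'shipped\non' None
```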

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
