Python growth path, Article 3 (2): regular expressions

Last Update:2016-03-01 Source: Internet

Author: User


Contents:

1. What is a regular expression? An introduction to regular expressions in Python

2. Contents of the re module

3. Small exercises

I. What is a regular expression (re)

Many people are already familiar with regular expressions. In Python they are supported by the re (regular expression) module. A regular expression is an ordinary string, written in a special syntax, that describes a pattern to be matched against fragments of text. In other words, regular expressions are used to process text strings.

Below are some basic concepts of regular expressions:

1. Wildcard characters

A wildcard is a special symbol that stands in for one or more real characters. For example, '.' matches any single character except a line break, so '.python' matches 'xpython', '+python', and so on.
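The wildcard behaviour can be checked with a short sketch. The candidate strings are illustrative, and re.fullmatch is used here (an assumption of this example, not something the article uses) so that the whole string has to fit the pattern.

```python
import re

# "." stands for exactly one character of any kind, except a line break.
pattern = r".python"
candidates = ["xpython", "+python", "python", "\npython"]

# fullmatch() requires the whole candidate to fit the pattern.
wildcard_results = {c: bool(re.fullmatch(pattern, c)) for c in candidates}
for candidate, matched in wildcard_results.items():
    print(repr(candidate), matched)
```

"python" on its own fails because the dot must consume one character, and "\npython" fails because the dot does not match a line break by default.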

2. Character sets

While the wildcard '.' can stand for any character, a character set restricts the match to a range: [a-z] matches any single character from a to z, and [a-zA-Z0-9] matches any upper- or lower-case letter or digit. A set can also be negated: [^a] matches any character except a.
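As a rough illustration of character sets, the sketch below runs three sets over made-up sample strings with re.findall; the sample text is an assumption chosen only to show the effect.

```python
import re

text = "Room B12, floor 3"  # illustrative sample text

# [a-z] matches one lowercase letter at a time.
lowercase = re.findall(r"[a-z]", text)

# [a-zA-Z0-9]+ matches runs of letters and digits.
alnum_runs = re.findall(r"[a-zA-Z0-9]+", text)

# [^a] matches any single character except "a".
not_a = re.findall(r"[^a]", "banana")

print(lowercase)
print(alnum_runs)
print(not_a)
```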

To match a special character such as '.' literally, escape it with a backslash. Note that in an ordinary Python string the escape is written not with a single \ but with a double \\, for example 'python\\.org'.

Why are two backslashes used?

Because the pattern passes through two levels of escaping: 1. the Python interpreter; 2. the re module. If you do not want to write two backslashes, use a raw string instead, for example r'python\.org'.
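The two levels of escaping can be seen in a small sketch. The test strings are illustrative; the point is only that the doubled-backslash string and the raw string are the same pattern, and that an unescaped dot still acts as a wildcard.

```python
import re

doubled = "python\\.org"  # ordinary string: the interpreter turns \\ into \
raw = r"python\.org"      # raw string: the backslash is kept as written
same_pattern = doubled == raw

# With the dot escaped, only a literal "." is accepted.
escaped_hit = re.search(raw, "see python.org")
escaped_miss = re.search(raw, "see pythonzorg")

# Without the escape, "." is a wildcard and also accepts "z".
unescaped_hit = re.search("python.org", "see pythonzorg")

print(same_pattern, bool(escaped_hit), bool(escaped_miss), bool(unescaped_hit))
```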

3. Alternation (selector)

Why is there a selector? Sometimes we want to match either of two strings, such as "aaa" or "bbb". For this we use the pipe sign (|) and write the pattern as 'aaa|bbb'. If the choice should apply to only part of the pattern, enclose that part in parentheses: 'p(aaa|bbb)' matches "paaa" or "pbbb".
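A minimal sketch of alternation follows, using illustrative input strings. It also shows why the parentheses matter: without them the pipe splits the entire pattern.

```python
import re

# Plain alternation: either "aaa" or "bbb".
either = re.findall(r"aaa|bbb", "aaa ccc bbb")

# Parentheses limit the choice to part of the pattern.
grouped = re.fullmatch(r"p(aaa|bbb)", "pbbb")
grouped_miss = re.fullmatch(r"p(aaa|bbb)", "bbb")

# Without parentheses the pattern means "paaa" OR "bbb".
ungrouped = re.fullmatch(r"paaa|bbb", "bbb")

print(either, grouped.group(1), grouped_miss, bool(ungrouped))
```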

4. Quantifiers (repeated patterns)

A quantifier states how many times a pattern may repeat. The main forms are:

(pattern)*: the pattern may be repeated zero or more times.

(pattern)+: the pattern may be repeated one or more times.

(pattern){m,n}: the pattern may be repeated m to n times.

(pattern){n}: the pattern is repeated exactly n times.

(pattern){n,}: the pattern is repeated n or more times, with a minimum of n.
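The quantifiers listed above can be compared side by side. This is only a sketch: the sub-pattern (ab) and the sample strings are assumptions chosen for illustration.

```python
import re

samples = ["", "ab", "abab", "ababab", "abababab"]
patterns = ["(ab)*", "(ab)+", "(ab){2,3}", "(ab){2}", "(ab){2,}"]

# For each quantifier, collect the samples that fit it completely.
accepted = {
    p: [s for s in samples if re.fullmatch(p, s)]
    for p in patterns
}

for p, hits in accepted.items():
    print(p, hits)
```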

5. Start and end

'^a' matches a string that starts with a, and 'a$' matches a string that ends with a.
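A short sketch of the two anchors, with illustrative words. re.search is used so that the anchors, not the function, decide where the match may occur.

```python
import re

starts_with_a = bool(re.search(r"^a", "apple"))     # "apple" starts with a
not_at_start = bool(re.search(r"^a", "banana"))     # a is present, but not first
ends_with_a = bool(re.search(r"a$", "banana"))      # "banana" ends with a
not_at_end = bool(re.search(r"a$", "apple"))        # a is present, but not last

print(starts_with_a, not_at_start, ends_with_a, not_at_end)
```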

II. Contents of the re module

Since re is a module, it provides a number of functions for us to use. Let's take a look.
1. compile(pattern[, flags]): creates a pattern object from a string containing a regular expression
2. search(pattern, string[, flags]): searches the string for the pattern
3. match(pattern, string[, flags]): matches the pattern at the beginning of the string
4. split(pattern, string[, maxsplit=0]): splits the string at each match of the pattern
5. findall(pattern, string): lists all matches of the pattern in the string
6. sub(pat, repl, string[, count=0]): replaces all matches of pat in the string with repl
7. escape(string): escapes all special regular expression characters in the string
compile is discussed at the end.
First, let's look at the function syntax: re.match(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to be matched
flags: flags that control how the regular expression matches, such as case-insensitive or multi-line matching
Matching modes (flags):
Each flag has a long and a short name, for example re.IGNORECASE or re.I.

I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
Makes matching case-insensitive, for both character classes and literal letters. For example, [A-Z] also matches lowercase letters, and Spam matches "Spam", "spam", or "spAM". This lowercasing does not take the current locale into account.

L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
Makes \w, \W, \b, and \B depend on the current locale. Locales are a feature of the C library intended to help write programs that take language differences into account. For example, if you are processing French text, you want \w+ to match words, but for 8-bit strings \w only matches the character class [A-Za-z]; it does not match "é" or "ç". If your system is configured appropriately and the locale is set to French, the C functions tell the program that "é" should also be considered a letter. When a regular expression is compiled with the LOCALE flag, these C functions are used to handle \w. This is slower, but \w+ then matches French words as expected. In Python 3 this flag can only be used with bytes patterns; str patterns already treat Unicode letters as word characters.

U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
Makes matching follow Unicode rules. This is the default for str patterns in Python 3.

M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
Normally "^" matches only at the start of the string, and "$" matches only at the end of the string and directly before a trailing line break (if any). When this flag is specified, "^" matches at the start of the string and at the start of each line in the string. Similarly, "$" matches at the end of the string and at the end of each line (directly before each line break).

S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
Makes the special character "." match any character at all, including line breaks; without this flag, "." matches any character except a line break.

X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
This flag gives you a more flexible format so that regular expressions are easier to write and read. When it is specified, whitespace in the pattern is ignored unless it is inside a character class or preceded by a backslash; this lets you organize and indent the pattern more clearly. It also lets you put comments in the pattern, which the engine ignores. A comment starts with "#", provided the "#" is not inside a character class or preceded by a backslash.
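The effect of the most commonly used flags can be sketched as follows. The sample text and the phone-number-like pattern are hypothetical and exist only to make each flag's effect visible; flags are combined with the | operator.

```python
import re

text = "Spam\nspam and eggs"  # illustrative two-line text

# re.I: case-insensitive matching.
ignore_case = re.findall(r"spam", text, re.I)

# re.M: "^" also matches at the start of every line.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.M)

# re.S: "." also matches the line break between the two lines.
dot_default = re.search(r"Spam.spam", text)
dot_all = re.search(r"Spam.spam", text, re.S)

# Flags can be combined with "|".
combined = re.findall(r"^spam", text, re.I | re.M)

# re.X: whitespace and "#" comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # first block of digits (hypothetical format)
    -         # literal separator
    (\d{4})   # second block of digits
""", re.X)
phone_groups = phone.fullmatch("555-1234").groups()

print(ignore_case, first_words_default, first_words_multiline)
print(dot_default, bool(dot_all), combined, phone_groups)
```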


Pattern syntax in re

Pattern	Description
^	Matches the start of the string
$	Matches the end of the string
.	Matches any character except a line break; when the re.DOTALL flag is specified, it matches any character including line breaks
[...]	Matches any one character from the set: [amk] matches 'a', 'm', or 'k'
[^...]	Matches any character not in the set: [^abc] matches any character other than a, b, and c
re*	Matches zero or more occurrences of the preceding expression
re+	Matches one or more occurrences of the preceding expression
re?	Matches zero or one occurrence of the preceding expression
re{n}	Matches exactly n occurrences of the preceding expression
re{n,}	Matches n or more occurrences of the preceding expression
re{n,m}	Matches n to m occurrences of the preceding expression (greedy)
a|b	Matches a or b
(re)	Matches the expression in parentheses and captures it as a group
(?imx)	Turns on the optional i, m, or x flags. In Python this form belongs at the start of the pattern and applies to the whole pattern
(?-imx)	Turns off the i, m, or x flags. Python's re does not accept this standalone form; use (?-imx:re) instead
(?:re)	Like (re), but does not create a group
(?imx:re)	Turns on the i, m, or x flags inside the parentheses only
(?-imx:re)	Turns off the i, m, or x flags inside the parentheses only
(?#...)	Comment
(?=re)	Positive lookahead. Succeeds if re matches at the current position, otherwise fails. No characters are consumed: once the contained expression has been tried, the rest of the pattern is matched from the same position
(?!re)	Negative lookahead, the opposite of the positive form. Succeeds only if re does not match at the current position
(?>re)	Atomic (independent) group; the engine does not backtrack into it. Supported by Python's re from version 3.11
\w	Matches letters, digits, and the underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; for ASCII text this is equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit; for ASCII text this is equivalent to [0-9]
\D	Matches any non-digit
\A	Matches the start of the string
\Z	Matches the end of the string. In Python it matches only at the very end; in some other regex flavours it also matches before a final line break
\z	Matches the end of the string. Older Python versions do not accept this spelling; use \Z
\G	The position where the last match ended. Not supported by Python's re module
\b	Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb"
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never"
\n, \t, and so on	Matches a line feed, a tab, and so on
\1...\9	Matches the contents of the nth group
\10	Matches the contents of group 10. Some regex flavours fall back to an octal character code if the group does not exist; in Python an octal escape needs a leading zero or three digits

[Pp]ython	Matches "Python" or "python"
rub[ye]	Matches "ruby" or "rube"
[aeiou]	Matches any one of the letters in the brackets
[0-9]	Matches any digit; same as [0123456789]
[a-z]	Matches any lowercase letter
[A-Z]	Matches any uppercase letter
[a-zA-Z0-9]	Matches any letter or digit
[^aeiou]	Matches any character other than the letters a, e, i, o, u
[^0-9]	Matches any character other than a digit
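A few of the less obvious entries in the tables above can be tried out directly. The sketch below uses made-up sample strings and covers word boundaries, non-capturing groups, backreferences, and lookahead.

```python
import re

# \b vs \B: "never verb" contains "er" at index 3 (word end) and index 7 (inside a word).
boundary_pos = re.search(r"er\b", "never verb").start()
non_boundary_pos = re.search(r"er\B", "never verb").start()

# (?:...) groups without capturing, so findall returns only the (\d) group.
digits_after_ab = re.findall(r"(?:ab)+(\d)", "abab1 ab2")

# \1 refers back to whatever group 1 matched: here, a doubled word.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group()

# (?=:) requires a following colon but does not include it in the match.
keys = re.findall(r"\w+(?=:)", "name:aaa,age:22")

# (?!:) requires that the whole word is NOT followed by a colon.
values = re.findall(r"\b\w+\b(?!:)", "name:aaa,age:22")

print(boundary_pos, non_boundary_pos, digits_after_ab, repeated, keys, values)
```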

1. re.match(pattern, string, flags=0)

re.match tries to match the pattern at the starting position of the string and returns a single match.

pattern: the regular expression
string: the string to be matched
flags: flags that control how the regular expression matches

Let's look at an example.

import re
text = "111apple222pear"
f1 = re.match(r"\d+", text)
if f1:
    print(f1.group())
else:
    print("NONE")

Run it and the match is 111. re.match matches only from the starting position: if the pattern matches there, a match object is returned; otherwise None is returned. "\d+" means one or more digits. If + is changed to {1,2}, the result is 11, because {1,2} matches one or two digits.

2. re.search(pattern, string, flags=0)

re.search scans the whole string for the pattern and returns only the first match.

import re
text = "aaa111apple222pear"
f1 = re.search(r"\d+", text)
if f1:
    print(f1.group())
else:
    print("NONE")

In this example re.search matches 111: it looks for the pattern anywhere in the string and returns only the first occurrence.

3. Differences between group() and groups()

import re
text = "jnj111apple222pear"
f1 = re.search(r"([0-9]+)([a-z]+)", text)
if f1:
    print(f1.group(0), f1.group(1), f1.group(2))
    print(f1.groups())
else:
    print("NONE")

Run the example to see the difference between group and groups. After a match, the re module stores the text matched by each parenthesized sub-group. group() with no argument (or with 0) returns the entire match, here 111apple; with an argument it returns the corresponding sub-group, here 111 and apple; groups() returns a tuple of all the sub-group values, here ('111', 'apple').

4. re.findall(pattern, string, flags=0)

The examples above each return a single match; findall returns all matches.

import re
text = "jnj111apple222pear"
f1 = re.findall(r"([0-9]+)", text)
if f1:
    print(f1)
else:
    print("NONE")

The result is a list containing every value that meets the condition, here ['111', '222']. ([0-9]+) can also be written as (\d+).

5. re.sub(pattern, repl, string, count=0, flags=0)

Replaces the matching parts of the string.

import re
text = "jnj111apple222pear"
f1 = re.sub(r"([a-z]+)", 'A', text)
if f1:
    print(f1)
else:
    print("NONE")

In the output, every run of lowercase letters is replaced by the uppercase letter A, giving A111A222A. This is similar to str.replace.

6. re.split(pattern, string, maxsplit=0, flags=0)

import re
content = "a1*b2c3*d4e5"
new_content = re.split(r'[\*]', content, maxsplit=2)
print(new_content)

The string is split using * as the delimiter, at most 2 times, and the pieces are returned in a list, here ['a1', 'b2c3', 'd4e5']. This is similar to str.split.

7. re.compile(pattern, flags=0)

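Compiles a matching rule into a pattern object that can be reused. This can improve matching speed when the same pattern is used many times.

import re
content = "a1b*2c3*d4e5"
aaa = re.compile(r'[\*]')
new_content = re.split(aaa, content)
print(new_content)

The result is the same as passing the pattern string to re.split directly. According to the official documentation, compiling a pattern once and reusing the object is more efficient when the pattern is used several times.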
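A compiled pattern object also has its own methods, so it can be reused without going through the module-level functions. The sketch below assumes a small list of illustrative input strings. Note that the re module already caches recently used patterns, so the gain from compiling is mostly readability and reuse unless many different patterns are in play.

```python
import re

star = re.compile(r"[*]")  # compiled once, reused below

contents = ["a1b*2c3*d4e5", "x*y", "no-star"]  # illustrative inputs

# Pattern objects offer the same operations as the module functions.
parts = [star.split(c) for c in contents]
replaced = star.sub("-", "x*y")
source = star.pattern  # the original pattern string

print(parts, replaced, source)
```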

8. re.escape(string)

Escape all special regular expression characters in the string

import re
content = "a1b*2c3*d4e5"
ccc = re.escape(content)
print(ccc)
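One typical use of re.escape is searching for text that should be taken literally even though it contains special characters. In this sketch, needle is a hypothetical piece of user-supplied text.

```python
import re

content = "a1b*2c3*d4e5"
needle = "b*2"  # hypothetical user input containing the special character *

# Unescaped, "b*2" means "zero or more b, then 2", so only "2" is found.
unescaped = re.search(needle, content)

# Escaped, the * is taken literally and the exact text is found.
escaped_pattern = re.escape(needle)
escaped = re.search(escaped_pattern, content)

print(unescaped.group(), escaped_pattern, escaped.group())
```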

III. Small exercises

This is enough for everyday use. For more complicated matching, we can search online. Below are some small exercises.
① Match the age field
The string is: "name:aaa,age:22,user:1112"

import re
str_in = 'name:aaa,age:22,user:1112'
# the literal text "age:" followed by 1 to 3 digits
new_str_in = re.findall(r"age:\d{1,3}", str_in)
print(new_str_in)

② Match all URLs in the string

The string is: "The url is www.aaa.com wwa.ccc.dsa www.cdsa.c"

import re
str_in = "The url is www.aaa.com wwa.ccc.dsa www.cdsa.c"
# "www." followed by any non-whitespace characters, a dot, then any 2 to 3 characters
new_str_in = re.findall(r"www\.\S*\..{2,3}", str_in)
print(new_str_in)

③ Calculate and replace the values in brackets

The string is: "The name is xiaoyan The money I have (5+5),6-1"

import re
str_in = 'The name is xiaoyan The money I have (5+5),6-1'
new_str_in = re.findall(r"\(*\d+[\+]+\d+\)*", str_in)  # match the expression in the brackets
value = new_str_in[0].strip('(,)')  # strip the brackets
n1, n2 = value.split('+')  # split on +
new_value = str(int(n1) + int(n2))  # calculate
aaa = str_in.replace(new_str_in[0], new_value)  # replace
print(aaa)
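As an alternative sketch for exercise ③ (not the author's approach), re.sub accepts a function as the replacement, which handles every bracketed sum in one pass. The helper name add_match and the extra "(12+30)" in the sample string are assumptions added for illustration.

```python
import re

def add_match(match):
    """Return the sum of the two numbers captured from '(a+b)' as a string."""
    return str(int(match.group(1)) + int(match.group(2)))

# Sample string from the exercise, extended with a second (hypothetical) sum.
str_in = "The name is xiaoyan The money I have (5+5),6-1 and (12+30)"

# Each "(digits+digits)" is passed to add_match and replaced by its result.
summed = re.sub(r"\((\d+)\+(\d+)\)", add_match, str_in)
print(summed)
```

Because the replacement is computed per match, no manual strip, split, or str.replace step is needed.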

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
