Learning regular expressions and applying them in Python

Source: Internet
Author: User

Contents:

I. Special symbols in regular expressions

II. Several important regular expressions

III. Using the re module in Python

IV. References

I. Special symbols in regular expressions

Special symbols are the key to regular expressions. Once you have mastered the important ones and can use them flexibly, you have basically mastered regular expressions, although I would not dare claim to have reached perfection myself.

. (dot) matches any single character except a newline. In Python, if the re.DOTALL flag is given, the dot matches a newline as well.

Eg: re.compile(r'\b\w.', re.DOTALL)
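A minimal sketch of the difference the flag makes. The sample text is made up for illustration; only re.findall and re.DOTALL from the standard library are used.

```python
import re

text = "ab\ncd"  # made-up sample containing a newline

# Without a flag, '.' refuses to match the newline between 'b' and 'c'.
plain = re.findall(r"b.c", text)

# With re.DOTALL, '.' matches the newline too.
dotall = re.findall(r"b.c", text, re.DOTALL)

print(plain)   # []
print(dotall)  # ['b\nc']
```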

Notes:

1. Do not confuse the metacharacter dot with a literal dot in the text. To match an actual period in a regular expression, escape it with a backslash: \.

2. Regular expressions use the backslash "\" to introduce special forms or to escape special characters. This conflicts with Python's own string escapes: in an ordinary string literal \b is the backspace character, but in a regular expression \b is a word boundary. One way around the conflict is to double every backslash (\\b), but doing that in every pattern quickly becomes cumbersome. Raw strings exist mainly to solve this problem: just put r in front of the pattern string, as in r'\b'.
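The conflict described in note 2 can be seen directly. This is a small sketch with a made-up sample sentence; it compares an ordinary string, a raw string, and doubled backslashes.

```python
import re

# In an ordinary string literal, "\b" is one character: backspace (ASCII 8).
backspace_len = len("\b")
# In a raw string, r"\b" is two characters: a backslash and the letter b.
raw_len = len(r"\b")

text = "the other then"  # made-up sample

with_raw = re.findall(r"\bthe\b", text)    # word boundary, as intended
without_raw = re.findall("\bthe\b", text)  # searches for backspace + "the" + backspace
escaped = re.findall("\\bthe\\b", text)    # doubled backslashes, same as the raw string

# Note 1: an escaped dot matches only a literal period.
literal_dot = re.findall(r"a\.b", "a.b axb")

print(with_raw, without_raw, escaped, literal_dot)
```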

^ matches the beginning of the string. If the string has multiple lines, add the re.M flag and ^ also matches at the start of each line. Eg: ^ds

$ is the opposite of ^: it matches the end of the string (and, with re.M, the end of each line).
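A short sketch of both anchors with and without re.M. The three-line sample string is an assumption chosen so that the two modes give different results.

```python
import re

text = "ds a ds\nds b ds\nc"  # made-up three-line sample

start_default = re.findall(r"^ds", text)          # only the start of the string
start_multi = re.findall(r"^ds", text, re.M)      # the start of every line

end_default = re.findall(r"ds$", text)            # the string ends with "c", so nothing
end_multi = re.findall(r"ds$", text, re.M)        # the end of every line

print(start_default, start_multi)  # ['ds'] ['ds', 'ds']
print(end_default, end_multi)      # [] ['ds', 'ds']
```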

* means the expression to its left appears 0 or more times.

+ means the expression to its left appears at least once.

? means the expression to its left appears 0 or 1 times.

The pipe symbol "|" (vertical bar) means "or": the pattern matches any one of the alternatives separated by the pipe.

[] matches any single character listed inside the square brackets. b[aeiu]t matches bat, bet, bit, and but. [0-9] matches any single digit from 0 to 9.

Curly braces {} hold either a single value or a pair of values. A single value {n} means exactly n repetitions; a pair {m,n} means anywhere from m to n repetitions.
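The quantifiers, alternation, and character classes above can be tried side by side. The sample strings below are made up for illustration.

```python
import re

s = "a ab abbb"  # made-up sample

star = re.findall(r"ab*", s)      # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", s)      # 'a' followed by at least one 'b'
optional = re.findall(r"ab?", s)  # 'a' followed by at most one 'b'

either = re.findall(r"bat|bet", "bat bet bit")             # alternation
vowels = re.findall(r"b[aeiu]t", "bat bet bit but bot")    # character class

exactly3 = re.findall(r"[0-9]{3}", "12 345 6789")    # exactly three digits
two_to_3 = re.findall(r"[0-9]{2,3}", "12 345 6789")  # two or three digits, greedy

print(star, plus, optional)
print(either, vowels)
print(exactly3, two_to_3)
```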

\A matches only at the beginning of the string.

\b matches a word boundary:

\bthe matches a word starting with "the"

the\b matches a word ending with "the"

\bthe\b matches only the whole word "the"

\d matches any digit, equivalent to [0-9]

\s matches any whitespace character (space, tab, newline, and so on).

\w matches any alphanumeric character or underscore, equivalent to [A-Za-z0-9_]

\Z matches only at the end of the string.

[^x] matches any character except the letter x; [^xty] matches any character except x, t, and y.

NOTE: The uppercase form of a class matches the opposite of the lowercase form: \D matches non-digits, \S non-whitespace, \W non-word characters, and \B a position that is not a word boundary.
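The following sketch exercises the special sequences above on made-up sample strings. The patterns add \w* around "the" only so that findall shows the whole word that matched.

```python
import re

words = "the theme bathe"  # made-up sample

starts = re.findall(r"\bthe\w*", words)  # words starting with "the"
ends = re.findall(r"\w*the\b", words)    # words ending with "the"
whole = re.findall(r"\bthe\b", words)    # the word "the" by itself

digits = re.findall(r"\d", "a1 b22")
non_digits = re.findall(r"\D+", "a1 b22")        # uppercase \D negates \d
word_chars = re.findall(r"\w+", "foo_1 bar-2")   # underscore counts as a word character
spaces = re.findall(r"\s", "a b\tc\n")           # space, tab, and newline all match
not_xty = re.findall(r"[^xty]", "xaty b")

# \A stays anchored to the start of the string even in multiline mode, unlike ^.
a_anchor = re.findall(r"\Aab", "ab\nab", re.M)
caret = re.findall(r"^ab", "ab\nab", re.M)

print(starts, ends, whole)
print(digits, non_digits, word_chars, spaces, not_xty)
print(a_anchor, caret)
```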

() creates a group.

Everything above repeats a single character. To repeat more than one character, use a group.

For example, (ab)+ matches one or more consecutive occurrences of ab.

A group also records the text it matched, which can be retrieved with the group() and groups() methods of the match object in re.
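A brief sketch of both uses of a group: repetition and recording. The e-mail-like sample and its pattern are assumptions made up for the illustration, not a recommended address validator.

```python
import re

# Repetition of more than one character.
m = re.search(r"(ab)+", "xxababab")
repeated = m.group()       # the whole match
last_rep = m.group(1)      # a repeated group keeps only its last repetition

# Recording: each pair of parentheses saves the text it matched.
m2 = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com now")
whole = m2.group(0)        # same as m2.group()
user = m2.group(1)
host = m2.group(2)
both = m2.groups()         # all subgroups as a tuple

print(repeated, last_rep)
print(whole, user, host, both)
```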

Let us analyze one pattern, mainly to illustrate the use of the escape character:

Regular expression: \(?0\d{2,3}[)-]?\d{8}

It is typically used to match a landline phone number: an area code followed by the local number.

We analyze it in detail:

1. The area code starts with the digit 0, followed by either two more digits (010) or three more digits (0451), hence 0\d{2,3}. The local number that follows is 8 digits, hence \d{8}.

2. Now look at the front. In \(? the leading \ is the escape character. A phone number can be written two ways, (0451)57561101 or 0451-57561101, and because the parenthesis ( is itself a metacharacter it must be escaped to be matched literally. The question mark after it means the parenthesis appears 0 or 1 times.

3. [)-]? matches an optional closing parenthesis ) or hyphen -.
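The pattern analyzed above can be checked against a few inputs. This is a sketch: the helper name is_landline is hypothetical, and re.fullmatch (Python 3.4+) is used here so that the whole string must match.

```python
import re

PHONE = re.compile(r"\(?0\d{2,3}[)-]?\d{8}")

def is_landline(text):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return PHONE.fullmatch(text) is not None

results = {
    "(0451)57561101": is_landline("(0451)57561101"),  # parenthesized area code
    "0451-57561101": is_landline("0451-57561101"),    # hyphenated form
    "010-12345678": is_landline("010-12345678"),      # three-digit area code
    "0451-5756110": is_landline("0451-5756110"),      # only 7 local digits
    "(0451-57561101": is_landline("(0451-57561101"),  # mismatched delimiters
}

for number, ok in results.items():
    print(number, ok)
```

Because the opening parenthesis and the closing delimiter are each optional on their own, the pattern also accepts mismatched forms such as "(0451-57561101"; a stricter check would need alternation between the two notations.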

II. Several important regular expressions

(?...): an opening parenthesis followed by a question mark introduces an extension. The first character after the '?' determines the meaning and syntax of the construct. Here are several of the different extensions of that form.

1. (?<=xyz) Positive lookbehind: matches at a position preceded by xyz, so the pattern after it matches only text that comes right after xyz.

Eg: r'(?<=xyz)abc' searched in 'adxyzabc' outputs abc

2. (?=...) Positive lookahead, the counterpart of the previous one: the pattern before it matches only if the text that follows matches the enclosed expression.

3. (?P<name>...) matches the enclosed pattern and saves the matched text in a group named name.

4. (?P=name) is a backreference to the named group: it matches whatever text the earlier group called name matched. Note that when retrieving the group from a match object, the name must be quoted: m.group('name')

(?P<quote>['"]).*?(?P=quote) (from python.org)

5. (?:...) is a non-capturing group: it groups for matching only, does not capture the matched text, and is not assigned a group number. (When I verified this, groups() returned an empty tuple.)

6. (?#...) is a comment; its content is ignored and has no effect on the regular expression.

7. (?!xyz) Negative lookahead: matches at a position that is not followed by xyz.

8. (?<!xyz) Negative lookbehind: matches at a position that is not preceded by xyz.
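The eight extensions above can be tried in one short script. Apart from the (?<=xyz)abc example and the quote pattern, the sample strings are made up for illustration.

```python
import re

# 1. Positive lookbehind: "abc" only when preceded by "xyz".
lookbehind = re.search(r"(?<=xyz)abc", "adxyzabc").group()

# 2. Positive lookahead: names followed by ".py" (the suffix is not consumed).
lookahead = re.findall(r"\w+(?=\.py)", "a.py b.txt c.py")

# 3 and 4. Named group and a backreference to it.
m = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say \"hi\" and 'bye'")
quoted = m.group()
quote_char = m.group("quote")

# 5. Non-capturing group: groups() stays empty.
no_groups = re.search(r"(?:ab)+", "ababab").groups()

# 6. A comment inside the pattern is ignored.
commented = re.search(r"\d+(?#digits)", "x42").group()

# 7. Negative lookahead: "foo" not followed by "bar".
neg_ahead = [mo.start() for mo in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# 8. Negative lookbehind: "abc" not preceded by "xyz".
neg_behind = [mo.start() for mo in re.finditer(r"(?<!xyz)abc", "xyzabc 1abc")]

print(lookbehind, lookahead, quoted, quote_char)
print(no_groups, commented, neg_ahead, neg_behind)
```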

III. Using the re module in Python

Calling dir(re) lists all the attributes the re module contains. The list below comes from a Python 2 interpreter (note copy_reg); other versions differ.

['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl', '_expand', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error', 'escape', 'findall', 'finditer', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'sys', 'template']

compile

Precompiles a regular expression so that later uses of it are more convenient; it returns a regular expression object (regex).

Eg: prog = re.compile(pattern), where pattern is a regular expression you define yourself.
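A small sketch of compiling once and reusing the object. The pattern and sample text are made up; the point is only that the compiled object offers the same operations as the module-level functions.

```python
import re

prog = re.compile(r"\d+")  # example pattern: one or more digits

text = "a1 b22 c333"  # made-up sample

# Methods on the compiled object take the string only.
compiled_result = prog.findall(text)

# The module-level function takes the pattern and the string.
module_result = re.findall(r"\d+", text)

print(compiled_result)
print(prog.pattern)  # the original pattern string is kept on the object
```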

search, match, findall

They are discussed together because they have both similarities and differences.

What they have in common: all of them look for text that matches a given regular expression.

Where they differ:

match only tries to match at the beginning of the string; if that fails, it stops and returns None.

search can match at any position in the string; if nothing in the entire string matches, it returns None.

findall is almost the same as search, except that it finds every match and always returns a list; if nothing matches, it returns an empty list []. search and match return a match object.

result = prog.match(string)
print(result.group())  # group() returns the matched text; use groups() to get all subgroups
result = prog.search(string)
print(result.group())
result = prog.findall(string)
print(result)
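The differences between the three calls show up clearly on a string whose first match is not at the beginning. The pattern and sample strings below are made up for illustration.

```python
import re

prog = re.compile(r"\d+")
s = "abc 123 def 456"  # made-up sample; the digits are not at the start

at_start = prog.match(s)             # None: the string does not begin with digits
first = prog.search(s)               # a match object for the first occurrence
everything = prog.findall(s)         # a list of every occurrence
nothing = prog.findall("no digits")  # an empty list rather than None

# match and search may return None, so check before calling group().
print(at_start.group() if at_start else None)
print(first.group() if first else None)
print(everything, nothing)
print(prog.match("123abc").group())  # match succeeds when the digits come first
```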

split(pattern, string, maxsplit=0) splits string into a list, using the text matched by the regular expression pattern as the delimiter, and returns the list of pieces. At most maxsplit splits are made (by default, the string is split at every match). Eg: re.split(':', 'FE:DW:EFG')
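A quick sketch of split with and without a limit. The third call uses a made-up pattern to show that the delimiter can be any regular expression, not just a fixed string.

```python
import re

parts = re.split(":", "FE:DW:EFG")               # split at every colon
limited = re.split(":", "FE:DW:EFG", maxsplit=1) # split at the first colon only

# The delimiter is a pattern: here a colon, semicolon, or comma plus optional spaces.
mixed = re.split(r"[:;,]\s*", "a: b;c,  d")

print(parts)
print(limited)
print(mixed)
```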

sub(pattern, repl, string, count=0) replaces the occurrences of the regular expression pattern in string with the string repl. If count is not given, every match is replaced. See also subn(), which does the same and additionally returns the number of replacements made.
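A minimal sketch of sub and subn on a made-up sample string, showing the effect of count and the extra value that subn returns.

```python
import re

text = "a1 b22 c333"  # made-up sample

all_replaced = re.sub(r"\d+", "#", text)             # replace every match
first_only = re.sub(r"\d+", "#", text, count=1)      # replace the first match only
new_text, how_many = re.subn(r"\d+", "#", text)      # also report the number of replacements

print(all_replaced)
print(first_only)
print(new_text, how_many)
```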

Several special flags:

    • re.I (re.IGNORECASE): ignore case (the full name is in parentheses, same below)
    • re.M (re.MULTILINE): multiline mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
    • re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline, as explained earlier
    • re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting
    • re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
    • re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace in the pattern is ignored, and comments can be added. The following two regular expressions are equivalent:

Eg:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
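One way to convince yourself that the two patterns behave the same is to run both on a few inputs. The sample inputs are made up, and the last lines add a one-line check of re.I as well.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

def matched_text(prog, s):
    """Return the text matched at the start of s, or None if there is no match."""
    m = prog.match(s)
    return m.group() if m else None

samples = ["3.14", "10.", ".5", "abc"]  # made-up inputs
verbose_results = [matched_text(a, s) for s in samples]
compact_results = [matched_text(b, s) for s in samples]

print(verbose_results)
print(compact_results)

# re.I makes the match case-insensitive.
ignore_case = re.findall(r"python", "Python PYTHON python", re.I)
print(ignore_case)
```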

The following is a simple script using the re module that I wrote with Vim under Cygwin; it has been verified. (The print calls are written in the function form so that it also runs under Python 3.)

import re

a = '800-555-122'
b = '555-123'
p = re.compile(r'(\d{3})?-?(\d{3})-?(\d*)')
print(p.findall(a))

c = re.search(r'[bh][aiu]t', 'bit')
print(c.group())
d = re.search(r'\w+,\s\w', 'Dywane, W')
print(d.group())

url = ['www.baidu.com', 'www.sina.cn', 'www.bjtu.edu.cn']
# The dot after www is escaped so that it matches only a literal period.
q = re.compile(r'(www\.)\w*\.(com|cn|edu)\.?\w*')
for u in url:
    try:
        print(u)
        r = q.search(u)
        if r is not None:
            print('match success', r.group())
    except Exception:
        print('Error')

print("15-4,match all Python's indicator")

print(re.search(r'\d+', '133234').group())
print(re.search(r'\d+\.?\d*', '2').group())

t = str(type(0))
print(t)
# Python 2 prints <type 'int'>; Python 3 prints <class 'int'>, which this
# pattern does not match, so check for None before calling group().
w = re.search(r"(?<=type\s\')\w+", t)
if w is not None:
    print(w.group())

print(re.search(r'[0-9][0-2]?', '2').group())

  
