Python Crawler Learning Notes (3): Summary of Regular Expression (re Module) Knowledge Points

Source: Internet
Author: User

1. Regular expressions

Regular expressions are patterns that can match text fragments.

1.1 Wildcard characters

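A regular expression can match more than one string; you create such patterns with special characters. For example, the period '.' is a wildcard that matches any single character except a newline, so '.ython' matches both 'python' and 'jython'. (The original post included an image from Cnblogs here.)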

1.2 Escape of special characters

Sometimes a special character must be treated as an ordinary character in a regular expression. To do that, escape it with '\'. For example, 'python\\.org' matches 'python.org'. Why two backslashes? Because two layers of escaping are involved: the Python interpreter processes the string literal and turns '\\' into a single backslash, and the re module then reads the resulting '\.' as a literal period. For the same reason, matching a single backslash requires '\\\\'. To keep patterns short, you can use raw strings: the two examples above can be written as r'python\.org' and r'\\'.
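The two layers of escaping can be checked directly. The following is a minimal sketch; the variable names and the sample text 'C:\temp' are chosen only for illustration.

```python
import re

# The same pattern written as a regular string and as a raw string.
escaped_plain = 'python\\.org'
escaped_raw = r'python\.org'
same_pattern = escaped_plain == escaped_raw

# The escaped period matches only a literal '.', not any character.
matches_literal = re.search(escaped_raw, 'python.org') is not None
matches_other = re.search(escaped_raw, 'pythonXorg') is not None

# Matching a single literal backslash.
backslash_plain = '\\\\'
backslash_raw = r'\\'
found = re.search(backslash_raw, 'C:\\temp')  # the text holds one backslash

print(same_pattern, matches_literal, matches_other)  # True True False
print(found.group() == '\\')                         # True
```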

1.3 Character Set

For example, '[pj]ython' matches 'python' and 'jython', and '[a-zA-Z0-9]' matches any uppercase letter, lowercase letter, or digit (note: exactly one character). To invert a character set, place '^' at the beginning of the set: '[^abc]' matches any character except a, b, and c.

Note: Special characters such as '.', '*', and '?' must be escaped with '\' to be used as literal characters in a pattern. Inside a character set the escape is not necessary, although it is legal. For the few characters that are special inside a set, you can usually avoid escaping by changing their position. Remember the following two rules:

A. If '^' appears at the beginning of the character set, it must be escaped unless you want it to invert the set.

B. The right bracket ']' and the hyphen '-' must either be placed at the beginning of the character set or be escaped.
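The character-set rules above can be tried out with re.findall. This is a small sketch; the sample strings are made up for illustration.

```python
import re

either = re.findall('[pj]ython', 'python jython cython')
alnum = re.findall('[a-zA-Z0-9]', 'a-B_9')
inverted = re.findall('[^abc]', 'abcde')

# Special characters need no escape inside a set.
specials = re.findall('[.*?]', 'a.b*c?')

# Rule A: '^' is literal when it is not the first character of the set.
caret = re.findall('[a^]', 'a^b')

# Rule B: '-' or ']' placed first in the set is literal.
hyphen = re.findall('[-a]', 'a-b')
bracket = re.findall('[]a]', 'a]b')

print(either)    # ['python', 'jython']
print(alnum)     # ['a', 'B', '9']
print(inverted)  # ['d', 'e']
print(specials)  # ['.', '*', '?']
print(caret)     # ['a', '^']
print(hyphen)    # ['a', '-']
print(bracket)   # ['a', ']']
```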

1.4 Alternatives and subpatterns

If you want to match only 'python' and 'perl', you can use the choice operator, the pipe symbol '|'. The pattern can be written as 'python|perl'.

If the choice should apply only to part of the pattern rather than the whole of it, enclose that part in parentheses. The example above becomes 'p(ython|erl)'. The parenthesized part is called a subpattern.
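A short sketch of both forms follows; the sample sentences are invented for the example. Note that a parenthesized subpattern also becomes a group that can be read back from the match.

```python
import re

# Choice over the whole pattern.
whole = re.findall('python|perl', 'python perl ruby')

# Choice limited to a subpattern.
m = re.search('p(ython|erl)', 'I like perl')
full_text = m.group(0)  # text matched by the whole pattern
sub_text = m.group(1)   # text matched by the subpattern only

print(whole)                 # ['python', 'perl']
print(full_text, sub_text)   # perl erl
```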

1.5 Optional and repeated subpatterns

Adding a question mark after a subpattern makes it optional.

(pattern)?: the pattern may occur 0 or 1 times.

(pattern)+: the pattern may occur 1 or more times.

(pattern)*: the pattern may occur 0 or more times.

(pattern){m,n}: the pattern may occur from m to n times.
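The four repetition operators can be compared side by side. In this sketch, matches is a hypothetical helper (not part of the re module) that checks whether a pattern matches a whole string.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if pattern matches all of text."""
    return re.match(pattern + '$', text) is not None

optional = [matches(r'(www\.)?python\.org', s)
            for s in ('python.org', 'www.python.org')]
one_or_more = [matches(r'(ab)+', s) for s in ('', 'ab', 'ababab')]
zero_or_more = [matches(r'(ab)*', s) for s in ('', 'ab', 'ababab')]
bounded = [matches(r'(ab){2,3}', s)
           for s in ('ab', 'abab', 'ababab', 'abababab')]

print(optional)      # [True, True]
print(one_or_more)   # [False, True, True]
print(zero_or_more)  # [True, True, True]
print(bounded)       # [False, True, True, False]
```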

1.6 Start and end of string

For example, the substring 'www' in both 'www.python.org' and 'python.www.org' matches the pattern 'w+'. If you want only 'www.python.org' to match, that is, the match must occur at the start of the string, use the pattern '^w+'. Likewise, if 'www' should match only at the end of a string, as in 'python.org.www', write the pattern as 'w+$'.
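A minimal sketch of the two anchors, using the same example strings:

```python
import re

start_hit = re.search('^w+', 'www.python.org')
start_miss = re.search('^w+', 'python.www.org')
end_hit = re.search('w+$', 'python.org.www')
end_miss = re.search('w+$', 'www.python.org')

print(start_hit.group())  # www
print(start_miss)         # None
print(end_hit.group())    # www
print(end_miss)           # None
```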

2. Functions of the re module

2.1. compile

Converts a regular expression into a pattern object for more efficient matching.

    import re
    pattern = re.compile(r'(^w+)\.python\.org')
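As a sketch of how a compiled pattern might be used: a pattern object has its own search and match methods, and it can also be passed to the module-level functions in place of a pattern string. The test strings below are made up for illustration.

```python
import re

pattern = re.compile(r'(^w+)\.python\.org')

# Call the method on the pattern object...
via_method = pattern.search('www.python.org').group(1)

# ...or pass the pattern object to the module-level function.
via_function = re.search(pattern, 'www.python.org').group(1)

# No match: the string does not start with one or more 'w'.
no_match = pattern.search('docs.python.org')

print(via_method, via_function, no_match)  # www www None
```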

2.2. search (important)

Finds the first substring in the given string that matches the given regular expression. If one is found, it returns a MatchObject whose contents can be retrieved with .group() (the concept of groups is introduced later). If none is found, it returns None.

You can first check whether a match was found and only then extract its contents. The example below assumes the pattern has two groups and returns the first one.

    # inside a function body
    have_character = re.search(pattern, text)
    if have_character:
        return have_character.group(1)
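The check-then-extract idea can be wrapped in a complete function. This is only a sketch: first_group, the two-group pattern, and the sample text are all hypothetical and chosen for illustration.

```python
import re

def first_group(pattern, text):
    """Return the first group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match:
        return match.group(1)
    return None

# A pattern with two groups: the part before '@' and the part after it.
two_groups = r'(\w+)@(\w+)\.com'

found = first_group(two_groups, 'contact: alice@example.com')
missing = first_group(two_groups, 'no address here')

print(found)    # alice
print(missing)  # None
```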

2.3. match

Matches the regular expression at the beginning of the given string. For example, re.match('p', 'python') returns a MatchObject, that is, the match succeeds. If the pattern should match the entire string, add '$' at the end of the pattern so that it must also match the end of the string.
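The difference between match and search, and the effect of a trailing '$', can be sketched as follows (the sample strings are illustrative):

```python
import re

at_start = re.match('p', 'python')         # succeeds: 'p' is at the start
not_at_start = re.match('p', 'jpython')    # fails: match only looks at the start
anywhere = re.search('p', 'jpython')       # search scans the whole string

prefix_only = re.match('python', 'python.org')     # matches a prefix
whole_fails = re.match('python$', 'python.org')    # '$' requires the end
whole_ok = re.match('python$', 'python')

print(at_start is not None, not_at_start, anywhere is not None)  # True None True
print(prefix_only is not None, whole_fails, whole_ok is not None)  # True None True
```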

2.4. split

Splits a string at every occurrence of a pattern. It is similar to the string method split, but a regular expression replaces the fixed separator string. For example, you can split on any sequence of commas and spaces of arbitrary length.

    text = 'a, b,,,, c  d'
    re.split('[, ]+', text)  # ['a', 'b', 'c', 'd']

The parameter maxsplit sets the maximum number of splits.

    text = 'a, b,,,, c  d'
    re.split('[, ]+', text, maxsplit=2)  # ['a', 'b', 'c  d']

2.5. findall (important)

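Returns all occurrences of the pattern in the string as a list. When the pattern contains more than one group, each item is a tuple of the group contents.

    pattern = re.compile(r'a(b+?)c(d+?)e')
    items = re.findall(pattern, 'abbcddeabbbcddde')
    print(items)  # items = [('bb', 'dd'), ('bbb', 'ddd')]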

2.6. sub(pattern, repl, string[, count=0]) (important)

Replaces all occurrences of pattern in string with repl.

    pattern = re.compile(r'\*([^\*]+)\*')
    re.sub(pattern, r'<em>\1</em>', 'Hello, *world*!')  # 'Hello, <em>world</em>!'

Of the three main parameters of sub, pattern is the pattern to look for, repl is the replacement form, and string is the string to be searched and replaced.

Replacement steps:

A. Search string for substrings that match pattern.

B. Rebuild the string using repl, that is, replace each substring of string that matches pattern with the replacement form.

The real power of sub comes from the ability to use group numbers in the replacement string. (For details, see http://stackoverflow.com/questions/5984633/python-re-sub-group-number-after-number and http://www.crifan.com/python_re_sub_detailed_introduction/)

    re.sub(r'(foo)', r'\g<1>123', 'foobar')  # 'foo123bar'
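A few more uses of group numbers in the replacement string, as a sketch with invented sample text. The \g<1> form is useful when the group reference is followed directly by digits, since it marks clearly where the group number ends. The count parameter from the signature above limits how many replacements are made.

```python
import re

# \g<1> followed by digits: unambiguous reference to group 1.
numbered = re.sub(r'(foo)', r'\g<1>123', 'foobar')

# Swap two words by referring to both groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# Replace only the first occurrence.
first_only = re.sub(r'\*([^\*]+)\*', r'<em>\1</em>', '*a* and *b*', count=1)

print(numbered)    # foo123bar
print(swapped)     # world hello
print(first_only)  # <em>a</em> and *b*
```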

2.7. escape

If a string is long and contains many special characters, and you do not want to type a large number of backslashes, you can use re.escape. It escapes every character in the string that could be interpreted as a regular expression operator, so that the whole string is treated as plain text.
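A minimal sketch of re.escape follows; the literal string is made up for illustration. The exact escaped form can differ between Python versions, so the example checks only the matching behaviour.

```python
import re

literal = 'www.python.org?x=1*2'
text = 'see www.python.org?x=1*2 here'

# Used as-is, '?' and '*' act as operators and the pattern fails to match.
unescaped = re.search(literal, text)

# After re.escape, the string matches itself as plain text.
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # www.python.org?x=1*2
```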

3. Match objects and groups

The search and match functions of the re module return a MatchObject when a match is found. For such an object m, you can use m.group() to retrieve the text matched by a group. The default group number for .group() is 0, which returns the entire matched string. .group(1) returns the string matched by the first subpattern, .group(2) the second, and so on.

The .start() method returns the starting index of the given group, and .end() returns its ending index. .span() returns the start and end positions of the group as a tuple. Each method takes the group number as its argument, which defaults to 0 when omitted.
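These methods can be tried on a small example. This is a sketch with an illustrative pattern; note that the end index is one past the last matched character.

```python
import re

m = re.match(r'www\.(.*)\..{3}', 'www.python.org')

whole = m.group()       # same as m.group(0)
first = m.group(1)      # text matched by the first subpattern
first_start = m.start(1)
first_end = m.end(1)
first_span = m.span(1)
whole_span = m.span()   # group number defaults to 0

print(whole)                    # www.python.org
print(first)                    # python
print(first_start, first_end)   # 4 10
print(first_span, whole_span)   # (4, 10) (0, 14)
```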

4. Greedy and non-greedy patterns

Repetition operators are greedy by default.

    pattern = r'\*(.+)\*'
    re.sub(pattern, r'<em>\1</em>', '*this* is *it*!')  # '<em>this* is *it</em>!'

As you can see, the greedy pattern matched everything from the first asterisk to the last asterisk, including the two asterisks in between.

Using (.+?) instead of (.+) gives a non-greedy pattern, which matches as little as possible.

    pattern = r'\*(.+?)\*'
    re.sub(pattern, r'<em>\1</em>', '*this* is *it*!')  # '<em>this</em> is <em>it</em>!'

Resources:

"Beginning Python: From Novice to Professional"

https://docs.python.org/2.7/library/re.html

When reprinting, please credit the source:

http://www.cnblogs.com/wuwenyan/p/4771422.html
