Python Web Crawler and Information Extraction, Part 6: Getting Started with the re (Regular Expression) Library

Source: Internet
Author: User
Tags python web crawler

1. Common regular expression operators

. matches any single character
[] character set, giving the allowed values for a single character: [abc] matches a, b, or c; [a-z] matches a single character from a to z
[^] negated character set, giving the excluded values for a single character: [^abc] matches a single character other than a, b, or c
* repeats the previous character zero or more times: abc* matches ab, abc, abcc, abccc, etc.
+ repeats the previous character one or more times: abc+ matches abc, abcc, abccc, etc.
? repeats the previous character zero or one time: abc? matches ab, abc
| matches either the left or the right expression: abc|def matches abc, def
{m} repeats the previous character exactly m times: ab{2}c matches abbc
{m,n} repeats the previous character m to n times (n included): ab{1,2}c matches abc, abbc
^ matches the beginning of the string: ^abc matches abc only at the beginning of a string
$ matches the end of the string: abc$ matches abc only at the end of a string
() grouping marker; alternatives inside are separated with the | operator: (abc) matches abc, (abc|def) matches abc, def
\d digit, equivalent to [0-9]
\w word character, equivalent to [A-Za-z0-9_]
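The repetition and alternation operators above can be checked directly in Python. The following is a minimal sketch; the helper name matches is hypothetical and only wraps re.fullmatch, which succeeds when the whole string fits the pattern.

```python
import re

# Hypothetical helper (not part of the re library): True when the
# entire string fits the pattern.
def matches(pattern, text):
    return re.fullmatch(pattern, text) is not None

star = [s for s in ["ab", "abc", "abcc", "abd"] if matches(r"abc*", s)]
plus = [s for s in ["ab", "abc", "abcc"] if matches(r"abc+", s)]
optional = [s for s in ["ab", "abc", "abcc"] if matches(r"abc?", s)]
exact = [s for s in ["abc", "abbc", "abbbc"] if matches(r"ab{2}c", s)]
ranged = [s for s in ["ac", "abc", "abbc", "abbbc"] if matches(r"ab{1,2}c", s)]
either = [s for s in ["abc", "def", "abd"] if matches(r"abc|def", s)]

print(star)      # ['ab', 'abc', 'abcc']
print(plus)      # ['abc', 'abcc']
print(optional)  # ['ab', 'abc']
print(exact)     # ['abbc']
print(ranged)    # ['abc', 'abbc']
print(either)    # ['abc', 'def']
```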

2. Examples of classic regular expressions

^[A-Za-z]+$ a string consisting only of the 26 letters
^[A-Za-z0-9]+$ a string consisting only of the 26 letters and digits
^-?\d+$ a string in integer form
^[0-9]*[1-9][0-9]*$ a string in positive integer form
[1-9]\d{5} a postal code in China, 6 digits
[\u4e00-\u9fa5] matches Chinese characters
\d{3}-\d{8}|\d{4}-\d{7} a domestic phone number, such as 010-68913536
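As a quick check of the classic patterns above, the sketch below applies a few of them to sample strings. The sample texts are made up for illustration.

```python
import re

# Sample inputs are illustrative assumptions, not from the article.
postal = re.search(r"[1-9]\d{5}", "BIT 100081")
phone = re.search(r"\d{3}-\d{8}|\d{4}-\d{7}", "Tel: 010-68913536")
is_integer = re.search(r"^-?\d+$", "-42") is not None
is_letters = re.search(r"^[A-Za-z]+$", "abc123") is not None
chinese = re.findall(r"[\u4e00-\u9fa5]", "Python 爬虫")

print(postal.group(0))  # 100081
print(phone.group(0))   # 010-68913536
print(is_integer)       # True
print(is_letters)       # False
print(chinese)          # ['爬', '虫']
```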

Regular expressions for strings in IP address form (an IP address has 4 segments, each in the range 0-255)

Loose form: \d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Precise form, built segment by segment:
0-99: [1-9]?\d
100-199: 1\d{2}
200-249: 2[0-4]\d
250-255: 25[0-5]
(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])
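The precise IP pattern can be assembled from the per-segment pieces and tested. This is a minimal sketch; the names OCTET, IP_PATTERN, and is_ip are hypothetical, and re.fullmatch is used so that the whole string must be an address.

```python
import re

# One segment in the range 0-255, built from the four cases above.
OCTET = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"

# Three segments each followed by a literal dot, then a final segment.
IP_PATTERN = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

def is_ip(text):
    """Return True when the whole string is a dotted IPv4 address."""
    return IP_PATTERN.fullmatch(text) is not None

print(is_ip("192.168.0.1"))  # True
print(is_ip("10.0.0.255"))   # True
print(is_ip("256.1.1.1"))    # False
print(is_ip("1.2.3"))        # False
```

Note that the dots are escaped as \. here; an unescaped . would match any character, not just a literal dot.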

3. The re library is part of Python's standard library and is used primarily for string matching

Usage: import re

Types used to write a regular expression:

The raw string type is the usual way to write a regular expression. A raw string is a string in which the backslash is not treated as an escape character, written as: r'text'

The ordinary string type can also represent a regular expression, but backslashes are then processed as escape characters, so each one in the pattern must itself be escaped
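The difference between the two types can be seen by writing the same pattern both ways. The sketch below uses a made-up decimal-number pattern and sample text.

```python
import re

# The same pattern written as a raw string and as an ordinary string.
raw_pattern = r"\d+\.\d+"
plain_pattern = "\\d+\\.\\d+"   # every backslash has to be doubled

same = raw_pattern == plain_pattern
price = re.search(raw_pattern, "cost: 12.50 USD").group(0)

print(same)   # True
print(price)  # 12.50
```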

Main functions:

(1) re.search(pattern, string, flags=0) searches a string for the first position that matches the regular expression and returns a Match object (or None if nothing matches)
- pattern: the regular expression, written as a string or a raw string
- string: the string to be searched
- flags: control flags applied when the regular expression is used

re.I, re.IGNORECASE: ignores case in the regular expression, so [A-Z] also matches lowercase characters
re.M, re.MULTILINE: the ^ operator matches at the start of each line of the given string
re.S, re.DOTALL: the . operator matches every character; by default it matches every character except a newline
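The effect of each flag on re.search (and on ^ with re.findall) can be sketched as follows. The two-line sample text is an assumption chosen for illustration.

```python
import re

text = "first line\nSecond LINE"

no_flag = re.search(r"second", text)              # case differs, no match
ignore_case = re.search(r"second", text, re.I)    # case is ignored

first_only = re.findall(r"^\w+", text)            # ^ matches only at the string start
line_starts = re.findall(r"^\w+", text, re.M)     # ^ matches at each line start

dot_default = re.search(r"line.Second", text)     # . does not match the newline
dot_all = re.search(r"line.Second", text, re.S)   # . matches the newline too

print(no_flag)               # None
print(ignore_case.group(0))  # Second
print(first_only)            # ['first']
print(line_starts)           # ['first', 'Second']
print(dot_default)           # None
print(dot_all is not None)   # True
```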
(2) re.match(pattern, string, flags=0) matches the regular expression starting at the beginning of the string and returns a Match object
(3) re.findall(pattern, string, flags=0) searches the string and returns all matching substrings as a list
(4) re.split(pattern, string, maxsplit=0, flags=0) splits the string at each match of the regular expression and returns a list

maxsplit: maximum number of splits; the remainder of the string is returned as the last element
(5) re.finditer(pattern, string, flags=0) searches the string and returns an iterator over the matches, where each element is a Match object
(6) re.sub(pattern, repl, string, count=0, flags=0) replaces all substrings that match the regular expression and returns the resulting string

repl: the string that replaces each match
count: maximum number of replacements
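The functions listed above can be compared side by side on one input. This is a minimal sketch that reuses the 6-digit postal code pattern from section 2; the sample text and the replacement string :zipcode are illustrative assumptions.

```python
import re

text = "BIT100081 TSU100084"
pattern = r"[1-9]\d{5}"

at_start = re.match(pattern, text)             # text starts with 'B', so no match
at_start_ok = re.match(pattern, "100081 BIT")  # the match begins at index 0

codes = re.findall(pattern, text)
parts = re.split(pattern, text)
parts_once = re.split(pattern, text, maxsplit=1)
spans = [m.span() for m in re.finditer(pattern, text)]
masked = re.sub(pattern, ":zipcode", text)
masked_once = re.sub(pattern, ":zipcode", text, count=1)

print(at_start)              # None
print(at_start_ok.group(0))  # 100081
print(codes)                 # ['100081', '100084']
print(parts)                 # ['BIT', ' TSU', '']
print(parts_once)            # ['BIT', ' TSU100084']
print(spans)                 # [(3, 9), (13, 19)]
print(masked)                # BIT:zipcode TSU:zipcode
print(masked_once)           # BIT:zipcode TSU100084
```

Because re.match and re.search return None when nothing matches, it is safer to test the result before calling .group(0) on it.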

4. Object-oriented usage: compile once, then perform multiple operations

regex = re.compile(pattern, flags=0) compiles the string form of a regular expression into a regular expression object
- pattern: the regular expression, written as a string or a raw string
- flags: control flags applied when the regular expression is used

regex.search() searches a string for the first position that matches the regular expression and returns a Match object
regex.match() matches the regular expression starting at the beginning of the string and returns a Match object
regex.findall() searches the string and returns all matching substrings as a list
regex.split() splits the string at each match of the regular expression and returns a list
regex.finditer() searches the string and returns an iterator over the matches, where each element is a Match object
regex.sub() replaces all substrings that match the regular expression and returns the resulting string
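A compiled pattern object offers the same operations as methods, without passing the pattern each time. The sketch below assumes the same postal code pattern and sample text as before; the variable name zip_regex is hypothetical.

```python
import re

# Compile once, then reuse the pattern object for several operations.
zip_regex = re.compile(r"[1-9]\d{5}")
text = "BIT100081 TSU100084"

first = zip_regex.search(text).group(0)
all_codes = zip_regex.findall(text)
replaced = zip_regex.sub("******", text)

print(first)      # 100081
print(all_codes)  # ['100081', '100084']
print(replaced)   # BIT****** TSU******
```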

5. The Match object is the result of one match and contains a lot of information about that match

Attributes:
.string the text that was searched
.re the pattern object (regular expression) used for the match
.pos the position in the text where the search starts
.endpos the position in the text where the search ends

Methods:
.group(0) returns the matched string
.start() returns the start position of the matched string in the original string
.end() returns the end position of the matched string in the original string
.span() returns (.start(), .end())
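The attributes and methods above can be inspected on a single match. This is a minimal sketch using an assumed sample string.

```python
import re

m = re.search(r"[1-9]\d{5}", "BIT100081 TSU100084")

print(m.string)      # BIT100081 TSU100084
print(m.re.pattern)  # [1-9]\d{5}
print(m.pos)         # 0
print(m.endpos)      # 19
print(m.group(0))    # 100081
print(m.start())     # 3
print(m.end())       # 9
print(m.span())      # (3, 9)
```

Only the first match is returned by re.search; to obtain a Match object for every occurrence, re.finditer can be used instead.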

6. Greedy matching and minimal matching

The re library uses greedy matching by default, that is, it returns the longest substring that matches.

Minimal matching operators:

*? repeats the previous character zero or more times, minimal match
+? repeats the previous character one or more times, minimal match
?? repeats the previous character zero or one time, minimal match
{m,n}? repeats the previous character m to n times (n included), minimal match
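The difference between the two modes shows up when several substrings could satisfy the pattern. The sketch below uses an assumed sample string in which N occurs several times.

```python
import re

text = "PYANBNCNDN"

greedy = re.search(r"PY.*N", text).group(0)    # .* extends to the last N
minimal = re.search(r"PY.*?N", text).group(0)  # .*? stops at the first N

print(greedy)   # PYANBNCNDN
print(minimal)  # PYAN
```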
