Python crawler (IV)--Python regular expressions

Last Update:2015-08-03 Source: Internet

Author: User


When learning to write crawlers, regular expressions are another topic you have to master.

A crawler fetches pages, and the results then have to be filtered down to the parts you need. Regular expressions do that filtering.

Whatever language you learn, you will run into regular expressions sooner or later, and they work in much the same way everywhere.

As stated at the start of this series, this is a basic crawler tutorial, so this article covers only the essentials of regular expressions in Python. Let's cut to the chase.

A regular expression is a special sequence of characters that lets you easily check whether a string matches a pattern. Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.

The re module gives Python full regular expression functionality.

The compile function builds a regular expression object from a pattern string and an optional flags argument. The object has a set of methods for regular expression matching and substitution.

The re module also provides module-level functions that mirror these methods; they take a pattern string as their first argument.
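As a minimal sketch of the two styles described above, the snippet below runs the same search once through a compiled pattern object and once through the module-level function. The variable names and the sample text are illustrative assumptions, not taken from the original article.

```python
import re

# Compile the pattern once; the resulting object can be reused many times.
word_pattern = re.compile(r"[a-z]+", re.I)

text = "Hello python"

# Method on the compiled pattern object ...
compiled_result = word_pattern.findall(text)

# ... and the equivalent module-level function, which takes the
# pattern string as its first argument.
module_result = re.findall(r"[a-z]+", text, re.I)

print(compiled_result)  # ['Hello', 'python']
print(module_result)    # ['Hello', 'python']
```

Compiling is mainly worthwhile when the same pattern is applied many times, as is typical in a crawler loop.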

Step one: regular expression patterns

A pattern string uses a special syntax to represent a regular expression.

Letters and digits stand for themselves: a letter or digit in a pattern matches that same character in the string. The special patterns are listed below.

1.1

^ --matches the beginning of the string, matching the beginning of each line in multiline mode

Example:

    import re
    value = "Hello python"
    value_last = re.match(r'^Hello', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

(Python 3.7 and later print the object as <re.Match object; ...>; the span and match fields are the same.)

If you change the pattern:

    import re
    value = "Hello python"
    value_last = re.match(r'^python', value)
    print(value_last)

The output is:

None

1.2

$ --matches the end of the string, matching the end of each line in multiline mode

It is used in the same way as ^ in 1.1.
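Since the text gives no example for $, here is a minimal sketch of its behavior with and without multiline mode. The two-line sample string is an assumption made for illustration.

```python
import re

value = "Hello python\nHello regex"

# Without re.M, $ only matches at the very end of the string
# (or just before a final trailing newline).
end_only = re.findall(r"\w+$", value)

# With re.M, $ also matches at the end of every line.
each_line = re.findall(r"\w+$", value, re.M)

print(end_only)   # ['regex']
print(each_line)  # ['python', 'regex']
```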

1.3

. --matches any character except a newline; when the re.DOTALL flag is specified, it matches any character, including a newline

How is it used? See the example:

    import re
    value = "Hello python"
    value_last = re.match(r'^.ello', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

You can also do this:

    import re
    value = "Hello python"
    value_last = re.match(r'^.ello.......', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 12), match='Hello python'>

A little cheeky, but it works.
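The re.DOTALL behavior mentioned in this item can be sketched as follows. The sample string with an embedded newline is an assumption for illustration only.

```python
import re

value = "Hello\npython"

# By default "." stops at the newline, so the whole string cannot match.
default_match = re.match(r"^Hello.python$", value)

# With re.DOTALL (short form re.S), "." also matches "\n".
dotall_match = re.match(r"^Hello.python$", value, re.DOTALL)

print(default_match)               # None
print(dotall_match is not None)    # True
```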

1.4

\ --escape character; it makes the character that follows lose its special meaning

What does that mean? Take the "." from 1.3: on its own it matches any single character, but with "\" in front it means a literal period "."

If that is still unclear, here is an example.

First, add \ in front of the "." in the 1.3 example:

    import re
    value = "Hello python"
    value_last = re.match(r'^\.ello', value)
    print(value_last)

The output is:

None

Now change value to:

    value = ".Hello"

This time the "." itself has to be matched. How? You might say a plain "." would do, since it matches any character, but what if the character were "*"? Escaping is the general answer:

    value_last = re.match(r'^\.Hello', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 6), match='.Hello'>

I hope the expression above makes sense now.
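To follow up on the "what if it's *" question, here is a small sketch, assuming a sample string that starts with an asterisk. It also uses re.escape(), a standard library helper that the original article does not mention, to build the escaped pattern automatically.

```python
import re

value = "*Hello"

# A bare "*" at the start has nothing to repeat, so the pattern is invalid.
try:
    re.compile(r"*Hello")
    bare_star_is_valid = True
except re.error:
    bare_star_is_valid = False

# Escaped, "\*" simply matches a literal asterisk.
escaped_match = re.match(r"^\*Hello", value)

# re.escape() adds the backslashes when the literal text is not known in advance.
auto_pattern = re.escape(value)
auto_match = re.match(auto_pattern, value)

print(bare_star_is_valid)     # False
print(escaped_match.group())  # *Hello
print(auto_match.group())     # *Hello
```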

1.5

[...] --character set; the corresponding position can be any one character from the set. Characters in a set can be listed individually or given as a range, such as [abc] or [a-c]. If the first character is ^, the set is negated: [^abc] matches any character other than a, b, or c. Most special characters lose their special meaning inside a set. For example, [abc] matches 'a', 'b', or 'c'.

    import re
    value = "Hello"
    value_last = re.match(r'^H.[a-z][^abce]', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 4), match='Hell'>
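The point that special characters become ordinary inside a set can be sketched like this. The sample string is an assumption chosen so that it contains ".", "*" and "+".

```python
import re

value = "a.b*c+d"

# Inside [...], ".", "*" and "+" are ordinary characters.
specials = re.findall(r"[.*+]", value)

# A leading ^ negates the set: every character except those three.
others = re.findall(r"[^.*+]", value)

print(specials)  # ['.', '*', '+']
print(others)    # ['a', 'b', 'c', 'd']
```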

1.6

* --matches the preceding character zero or more times

Here is what that means:

    import re
    value = "Heo"
    value_last = re.match(r'Hel*o', value)
    print(value_last)

As you can see, the string contains no "l", yet the pattern still matches: as stated above, * lets the preceding character occur zero times.

Now write it this way:

    value = 'Hellllllllo'
    value_last = re.match(r'Hel*o', value)
    print(value_last)

You have probably guessed the result already. Yes: <_sre.SRE_Match object; span=(0, 11), match='Hellllllllo'>

1.7

+ --matches the preceding character one or more times; it is like * except that at least one occurrence is required

1.8

? --matches the preceding character zero or one time
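The quantifiers + and ? can be sketched in the same style as the * example. The sample words below are assumptions for illustration; re.fullmatch() requires Python 3.4 or later.

```python
import re

# "+" requires at least one "l", so "Heo" no longer matches ...
plus_without_l = re.match(r"Hel+o", "Heo")
# ... while one or more "l" characters are accepted.
plus_with_l = re.match(r"Hel+o", "Helllo")

# "?" makes the preceding character optional: zero or one "u".
words = ["color", "colour", "colouur"]
optional_u = [re.fullmatch(r"colou?r", w) is not None for w in words]

print(plus_without_l)        # None
print(plus_with_l.group())   # Helllo
print(optional_u)            # [True, True, False]
```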

1.9

{m} --matches the preceding character exactly m times

1.11

{m,n} --matches the preceding character at least m and at most n times. If this is unclear, go back over the items above.
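A minimal sketch of the counted quantifiers, assuming a test string with exactly four "l" characters (built with string repetition so the count is unambiguous):

```python
import re

value = "He" + "l" * 4 + "o"   # "Hellllo", four "l" characters

exactly_four = re.match(r"Hel{4}o", value)
exactly_two = re.match(r"Hel{2}o", value)
two_to_five = re.match(r"Hel{2,5}o", value)
at_least_five = re.match(r"Hel{5,}o", value)   # open upper bound

print(exactly_four is not None)   # True
print(exactly_two is not None)    # False
print(two_to_five is not None)    # True
print(at_least_five is not None)  # False
```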

1.12

a|b --matches a or b

    import re
    value = "Hellllllllo"
    value_last = re.match(r'H|e', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 1), match='H'>

1.13

(...) --matches the expression inside the parentheses and also defines a group

After the 1.12 example you may ask: what if you want to match a whole run of those alternatives, such as "He", instead of a single character? Python handles this too: put the alternatives in a group and repeat the group.

    import re
    value = "HeeHo"
    value_last = re.match(r'(H|e){2,}', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 4), match='HeeH'>

1.14

(?:...) --like (...), but does not create a group
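The difference between a capturing and a non-capturing group shows up in groups(). The sketch below assumes the sample string "HeeHo"; the variable names are illustrative.

```python
import re

value = "HeeHo"

capturing = re.match(r"(H|e){2,}", value)
non_capturing = re.match(r"(?:H|e){2,}", value)

# Both patterns match the same text ...
print(capturing.group())        # HeeH
print(non_capturing.group())    # HeeH

# ... but only (...) records a group. A repeated group keeps
# the text of its last repetition.
print(capturing.groups())       # ('H',)
print(non_capturing.groups())   # ()
```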

1.15

Before moving on to the special constructs, a word about regular expression modifiers, the optional flags.

A regular expression can take optional flags that control how matching is done. Multiple flags are combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:

re.I --makes matching case-insensitive

re.L --locale-aware matching

re.M --multi-line matching; affects ^ and $

re.S --makes "." match every character, including newlines

re.U --interprets characters according to the Unicode character set (the default for str patterns in Python 3); affects \w, \W, \b, \B

re.X --verbose mode; allows whitespace and comments inside the pattern so that it is easier to read
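As a rough sketch of how the flags combine, the snippet below uses re.I | re.M on a three-line string and re.X on a commented date pattern. The sample strings and names are assumptions for illustration.

```python
import re

text = "hello\nHELLO\nHello"

# re.I ignores case; re.M lets ^ and $ match at every line.
all_lines = re.findall(r"^hello$", text, re.I | re.M)

# With re.I alone, ^ and $ only anchor to the whole string, so nothing matches.
whole_string_only = re.findall(r"^hello$", text, re.I)

# re.X allows whitespace and "#" comments inside the pattern.
date_pattern = re.compile(r"""
    (\d{4})    # year
    -          # separator
    (\d{1,2})  # month
    -
    (\d{1,2})  # day
""", re.X)
date_parts = date_pattern.match("2015-5-31").groups()

print(all_lines)          # ['hello', 'HELLO', 'Hello']
print(whole_string_only)  # []
print(date_parts)         # ('2015', '5', '31')
```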

1.16

With the optional flags covered, let's continue.

(?imx) --inline flags: turns on the flags named by the letters (any of i, m, x) from inside the pattern. Written this way at the start of the pattern, the flags apply to the whole expression; the scoped form (?i:...), available since Python 3.6, affects only the part inside the parentheses

See an example:

    import re
    value = "Hello"
    value_last = re.match(r'(?i)hello', value)
    print(value_last)

The output is:

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

1.17

(?#...) --a comment; the content inside the parentheses is ignored

1.18

(?=...) --lookahead: the match succeeds only if the text that follows matches the expression; that text is not consumed

See the examples:

    import re
    value = "Hello"
    value_last = re.match(r'(?i)H(?=[a-b])', value)
    print(value_last)

Here the "e" after "H" is not in the range a-b, so the text that follows does not match the lookahead. Although (?i)H matches "H", the match as a whole fails.

So the output is: None

    import re
    value = "Hello"
    value_last = re.match(r'(?i)H(?=[a-z])', value)
    print(value_last)

Here "e" is in the range a-z and satisfies the lookahead, so the match succeeds and the matched text is "H" (the lookahead itself consumes nothing).

1.19

(?!...) --negative lookahead: succeeds only if the text that follows does not match the expression

1.20

(?<=...) --lookbehind: succeeds only if the text before the current position matches the expression

1.21

(?<!...) --negative lookbehind: succeeds only if the text before the current position does not match the expression
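Items 1.19 to 1.21 come without examples, so here is a minimal sketch in the style of the lookahead example, again using "Hello" as the test string. The variable names are illustrative.

```python
import re

value = "Hello"

# (?!...) negative lookahead: "H" must NOT be followed by "a" ...
not_followed = re.match(r"H(?!a)", value)
# ... and the match fails when the forbidden text does follow.
blocked = re.match(r"H(?!e)", value)

# (?<=...) lookbehind: an "e" that is preceded by "H".
preceded = re.search(r"(?<=H)e", value)

# (?<!...) negative lookbehind: an "l" that is NOT preceded by "e",
# which skips the first "l" and finds the second one.
not_preceded = re.search(r"(?<!e)l", value)

print(not_followed.group())   # H
print(blocked)                # None
print(preceded.span())        # (1, 2)
print(not_preceded.span())    # (3, 4)
```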

1.22

After working through 1.1 to 1.21 you should have a good grasp of regular expression patterns. Here are some simple shorthand patterns.

\d --matches a digit; equivalent to [0-9] for ASCII text

\D --matches a non-digit; equivalent to [^\d]

\s --matches a whitespace character; equivalent to [ \t\r\n\f\v]

\S --matches a non-whitespace character; equivalent to [^\s]

\w --matches a word character; equivalent to [a-zA-Z0-9_] for ASCII text

\W --matches a non-word character; equivalent to [^\w]

\A --matches the start of the string

\Z --matches the end of the string. In Perl-style engines \Z also matches just before a final line break; in Python's re module it matches only at the very end of the string

\z --matches the absolute end of the string in Perl-style engines; in Python, use \Z

\G --matches the position where the previous match ended in Perl-style engines; Python's re module does not support it

\b --matches a word boundary, that is, the position between a word character and a non-word character such as a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'.

\B --matches a non-word boundary. 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never'.

\n, \t, etc. --match a line break, a tab character, and so on

\1...\9 --matches the same text that the nth group matched.
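A few of these shorthand patterns can be sketched together. The sample text combines the 'never' and 'verb' words from the description with a number; it and the variable names are assumptions for illustration.

```python
import re

text = "never verb 2015"

digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)

# "er" at a word boundary is the one that ends "never" (index 3).
boundary = re.search(r"er\b", text)
# "er" NOT at a word boundary is the one inside "verb" (index 7).
non_boundary = re.search(r"er\B", text)

# \1 repeats whatever group 1 matched, so (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "Hello")

print(digits)                # ['2', '0', '1', '5']
print(words)                 # ['never', 'verb', '2015']
print(boundary.start())      # 3
print(non_boundary.start())  # 7
print(doubled.group())       # ll
```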

That covers the matching patterns. Next come the commonly used functions for regular expressions in Python.

1.1

The first function is the one used in the examples above.

re.match() --tries to match a pattern at the beginning of the string

Syntax: re.match(pattern, string, flags=0)

pattern is the regular expression to match, string is the string to search, and flags is the set of flags that controls how the expression is matched

Returns a match object if the match succeeds; otherwise returns None

Use the match object methods group(num) or groups() to get the matched text

group(num=0) returns the string matched by the whole expression. group() can also take several group numbers at once, in which case it returns a tuple with the corresponding values

groups() returns a tuple of the strings of all groups, from group 1 up to the last group

    import re
    value = "Hello world,2015!"
    value_last = re.match(r'(^[A-Za-z]*)\s(\w*),(\d*.)', value)
    if value_last:
        print(value_last.group())    # the whole match
        print(value_last.group(1))   # first group: the leading letters
        print(value_last.group(2))   # second group: the word before the comma
        print(value_last.group(3))   # third group: digits plus one more character

The output is:

Hello world,2015!

Hello

world

2015!
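The multi-argument form of group() and the groups() method described above can be sketched with the same sample string. The span() call is a standard match object method that the article does not cover; it is included here only to show where a group was found.

```python
import re

value = "Hello world,2015!"
m = re.match(r"([A-Za-z]*)\s(\w*),(\d*.)", value)

# Several group numbers at once return a tuple.
first_and_third = m.group(1, 3)

# groups() returns every group, starting from group 1.
all_groups = m.groups()

# span(n) gives the start and end index of group n.
second_span = m.span(2)

print(first_and_third)  # ('Hello', '2015!')
print(all_groups)       # ('Hello', 'world', '2015!')
print(second_span)      # (6, 11)
```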

1.2

re.search() --scans the string for the pattern and stops at the first match

re.search(pattern, string, flags=0)

The parameters are the same as for match.

The difference between the two is:

re.match matches only at the beginning of the string; if the beginning of the string does not conform to the regular expression, the match fails and the function returns None,

whereas re.search() scans the whole string until it finds a match

    import re
    value = "Hello world,2015!"
    value_last = re.match(r'\d+', value)    # \d+ : one or more digits
    if value_last:
        print('match-->', value_last.group())
    else:
        print("No match!")
    value_end = re.search(r'\d+', value)
    print('Search-->', value_end.group())

The result of the output is:

No match!

Search--> 2015

1.3

re.sub(pattern, repl, string, count=0)

Replaces the parts of string that match pattern with repl; count is the maximum number of replacements, and 0 means replace all

    import re
    phone = "2015-5-31 # this is my num"
    num = re.sub(r'#.*$', "", phone)   # strip the trailing comment
    print("Phone num:", num)
    num = re.sub(r'\D', "", phone)     # remove everything that is not a digit
    print("Phone num:", num)

The output is:

Phone num: 2015-5-31

Phone num: 2015531
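The count parameter and group references in the replacement string can be sketched as follows. The date string and the output formats are assumptions chosen for illustration.

```python
import re

date = "2015-5-31"

# Replace every "-" (count defaults to 0, meaning no limit).
all_replaced = re.sub(r"-", "/", date)

# Replace only the first occurrence.
first_replaced = re.sub(r"-", "/", date, count=1)

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3.\2.\1", date)

print(all_replaced)    # 2015/5/31
print(first_replaced)  # 2015/5-31
print(reordered)       # 31.5.2015
```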

1.4

re.split(pattern, string[, maxsplit])

Splits the string at every place the pattern matches and returns the pieces as a list; maxsplit sets the maximum number of splits

    import re
    value = "1.one2.two3.three"
    value_last = re.split(r'\d.', value)
    print(value_last)

The output is:

['', 'one', 'two', 'three']
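A minimal sketch of maxsplit, plus the effect of a capturing group in the split pattern (standard re.split behavior that the article does not mention). The dot is escaped here so that it matches only a literal ".".

```python
import re

value = "1.one2.two3.three"

# Stop after the first split; the rest of the string stays in one piece.
split_once = re.split(r"\d\.", value, maxsplit=1)

# With a capturing group, the captured text is kept in the result list.
with_numbers = re.split(r"(\d)\.", value)

print(split_once)    # ['', 'one2.two3.three']
print(with_numbers)  # ['', '1', 'one', '2', 'two', '3', 'three']
```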

1.5

re.findall(pattern, string[, flags])

Searches the string and returns all matching substrings as a list

    import re
    value = "1.one2.two3.three"
    value_last = re.findall(r'\d.', value)
    print(value_last)

The output is:

['1.', '2.', '3.']
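When the pattern contains groups, findall returns the group contents instead of the whole match. A small sketch, assuming the same sample string and an illustrative two-group pattern:

```python
import re

value = "1.one2.two3.three"

# Two groups in the pattern: each result is a tuple (number, word).
pairs = re.findall(r"(\d)\.([a-z]+)", value)

print(pairs)  # [('1', 'one'), ('2', 'two'), ('3', 'three')]
```

This form is handy in a crawler when one pattern has to pull several fields out of each hit.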

1.6

re.finditer(pattern, string[, flags])

Returns an iterator over the match objects, which can be visited in order

    import re
    value = "1.one2.two3.three"
    value_last = re.finditer(r'\d.', value)
    for x in value_last:
        print(x.group())

The output is:

1.

2.

3.

If you are not familiar with iterators, please read my previous blog post.
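Because finditer yields match objects and not plain strings, each hit also carries its position. A minimal sketch, assuming the same sample string (the list name is illustrative):

```python
import re

value = "1.one2.two3.three"

# Collect the matched text together with its (start, end) position.
hits = [(m.group(), m.span()) for m in re.finditer(r"\d\.", value)]

print(hits)  # [('1.', (0, 2)), ('2.', (5, 7)), ('3.', (10, 12))]
```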

That is all for regular expressions. They take regular practice to get comfortable with. With the three foundations in place, set(), deque, and regular expressions, the next article can explain in detail how to write a crawler.

Zhongzhiyuan Nanjing 904727147, Jiangsu

