Getting Started with Python Crawlers (5): A Complete Guide to Regular Expressions

Last Update:2017-06-12 Source: Internet

Author: User


Preface

Regular expressions are a tool for processing text. Most programming languages support them, and they are used for tasks such as form validation, text extraction, and replacement. Crawlers rely on them heavily, because a well-chosen regular expression often achieves twice the result with half the effort.

Before introducing regular expressions, let's look at a problem. The text below is an excerpt from a Douban web page, trimmed for brevity. Question: how do you extract all the email addresses in the text?

html = """
<style>
.qrcode-app {display: block; background: url(/pics/qrcode_app4@2x.png) no-repeat;}
</style>
<div class="reply-doc content">
<p class="">34613453@qq.com, thank you</p>
<p class="">30604259@qq.com trouble landlord</p>
</div>
<p class="">490010464@163.com<br/>Thank you</p>
"""

If you have not worked with regular expressions before, you may be at a loss here, and without them it is hard to think of a better approach. Let's set this problem aside for now and come back to it after learning regular expressions.

String Representation

Python strings can be written in several forms. Strings prefixed with u are Unicode strings; they are not covered in this article. You have probably also seen these two forms:

>>> foo = "hello"
>>> bar = r"hello"

The former is a regular string; the latter, prefixed with r, is a raw string. What is the difference between the two? In the example above both consist only of ordinary text characters, so there is no difference, as the following shows. (The "is" result relies on the interpreter reusing identical string literals; == is the comparison that matters.)

>>> foo is bar
True
>>> foo == bar
True

However, what happens if a string contains special characters? Let's look at an example:

>>> foo = "\n"
>>> bar = r"\n"
>>> foo, len(foo)
('\n', 1)
>>> bar, len(bar)
('\\n', 2)
>>> foo == bar
False

"\n" is an escape sequence that represents a line feed in ASCII. r"\n" is a raw string. A raw string does not process escape sequences: it means literally what you see, a string made up of the two characters "\" and "n".

A raw string literal can start with a lowercase r or an uppercase R, as in r"\b" or R"\b". In Python, regular expressions are usually written as raw strings. Why?

Take "\b" as an example. In a regular Python string it is the ASCII backspace character, while in a regular expression it is a metacharacter that matches a word boundary. For the regular expression compiler to see the backslash and interpret it correctly, you need a raw string. Alternatively, you can escape the backslash with another backslash in a regular string:

>>> foo = "\\b"
>>> bar = r"\b"
>>> foo == bar
True
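To see why the raw-string form matters in practice, the following minimal sketch runs the same word-boundary search with and without the r prefix. The sample text and variable names are made up for this illustration.

```python
import re

text = "a cat sat"  # sample text, made up for this example

# In a regular string "\b" is the backspace character, so this pattern
# looks for backspace + "cat" + backspace, which the text does not contain.
plain = re.search("\bcat\b", text)

# In a raw string the backslash reaches the regex engine intact,
# so \b means "word boundary".
raw = re.search(r"\bcat\b", text)

print(plain)        # None
print(raw.group())  # cat
```

The first search fails silently rather than raising an error, which is why writing patterns as raw strings is a useful habit.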

Regular Expressions

A regular expression consists of two kinds of characters: ordinary text characters and special characters (metacharacters). Metacharacters have a special meaning in a regular expression and are what make it expressive. For example, in the regular expression r"a.d", the characters 'a' and 'd' are ordinary characters, while '.' is a metacharacter that stands for any character. The expression can match 'a1d', 'a2d', 'acd', and so on.

The Python built-in module re is specifically used to process regular expressions.

>>> import re
>>> rex = r"a.d"                      # regular expression text
>>> original_str = "and"              # original text
>>> pattern = re.compile(rex)         # regular expression object
>>> m = pattern.match(original_str)   # match object
>>> m
<_sre.SRE_Match object at 0x101c85b28>
# equivalent to
>>> re.match(r"a.d", "and")
<_sre.SRE_Match object at 0x10a15dcc8>

If the text matches the regular expression, match returns a Match object; otherwise it returns None. Form validation can therefore be done by checking whether m is None.
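The None check can be wrapped in a small helper. This is only a sketch: the function name matches_pattern and the candidate strings are assumptions made for illustration, and the pattern is the r"a.d" example from the text.

```python
import re

pattern = re.compile(r"a.d")

def matches_pattern(text):
    """Return True when text starts with something the pattern accepts."""
    # match() returns None on failure, so compare against None explicitly.
    return pattern.match(text) is not None

for candidate in ["and", "a1d", "bad"]:
    m = pattern.match(candidate)
    if m is None:
        # Calling m.group() here would raise AttributeError.
        print(candidate, "-> no match")
    else:
        print(candidate, "->", m.group())
```

Checking for None before calling group() avoids the AttributeError that a failed match would otherwise cause.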

Next, let's look at more metacharacters.

Basic metacharacters

.: Matches any character except a line break. For example, "a.c" fully matches "abc", and also matches the "abc" in "abcef".
\: Escape character that gives a special character its literal meaning. For example, 1\.2 matches 1.2.
[...]: Matches any one character listed in the square brackets. For example, a[bcd]e matches abe, ace, and ade. Ranges are supported: a to z can be written as "a-z" and 0 to 9 as "0-9". Note that most special characters lose their special meaning inside "[]" and are taken literally. For example, [.*] matches either . or *.
[^...]: Negates the character set, matching any one character that does not appear in the brackets. For example, a[^bcd]e matches aee, afe, and so on.

>>> re.match(r"a.c", "abc").group()
'abc'
>>> re.match(r"a.c", "abcef").group()
'abc'
>>> re.match(r"1\.2", "1.2").group()
'1.2'
>>> re.match(r"a[0-9]b", "a2b").group()
'a2b'
>>> re.match(r"a[0-9]b", "a5b11").group()
'a5b'
>>> re.match(r"a[.*?]b", "a.b").group()
'a.b'
>>> re.match(r"abc[^\w]", "abc!123").group()
'abc!'

The group method returns the substring ("abc") of the original string ("abcef") that matched the regular expression. It is available only when match succeeds and returns a Match object.

Preset metacharacters

\w matches any word character, that is, a letter, digit, or underscore; for ASCII text it is equivalent to [A-Za-z0-9_]. For example, a\wc matches abc and acc.
\W matches any non-word character, the opposite of \w; for ASCII text it is equivalent to [^A-Za-z0-9_]. For example, a\Wc matches a!c.
\s matches any whitespace character, such as a space or a line break. For example, a\sc matches a\nc, where \n is a line feed.
\S matches any non-whitespace character.
\d matches any digit; for ASCII text it is equivalent to [0-9]. For example, a\dc matches a1c, a2c, and so on.
\D matches any non-digit character.
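The six preset classes can be tried side by side. In this sketch each pattern is paired with one sample string it should accept; the sample strings are made up for illustration.

```python
import re

# Each pattern is paired with one sample string it is expected to match.
samples = {
    r"a\wc": "a_c",   # \w: letter, digit, or underscore
    r"a\Wc": "a!c",   # \W: anything that is not a word character
    r"a\sc": "a\nc",  # \s: whitespace, here a line feed
    r"a\Sc": "a-c",   # \S: anything that is not whitespace
    r"a\dc": "a7c",   # \d: a digit
    r"a\Dc": "axc",   # \D: anything that is not a digit
}

results = {p: re.match(p, s) is not None for p, s in samples.items()}
for pattern, matched in results.items():
    print(pattern, matched)

# A class and its uppercase opposite never accept the same character.
print(re.match(r"a\dc", "axc"))  # None
```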

Boundary match

Boundary-matching symbols constrain where a match may occur.

^ matches at the start of the string. For example, ^abc matches a string that begins with abc.
$ matches at the end of the string. For example, hello$ matches a string that ends with hello.

>>> re.match(r"^abc", "abc").group()
'abc'
>>> re.match(r"^abc$", "abc").group()
'abc'
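Because re.match already starts at the beginning of the string, the effect of ^ and $ is easier to see with re.search, which scans the whole string. The sample strings in this sketch are made up for illustration.

```python
import re

# Without an anchor, search() finds "abc" anywhere in the string.
anywhere = re.search(r"abc", "xxabc")
# With ^, the match must begin at the start of the string.
at_start = re.search(r"^abc", "xxabc")

# With $, the match must finish at the end of the string.
at_end = re.search(r"hello$", "say hello")
not_at_end = re.search(r"hello$", "hello world")

print(anywhere.group())  # abc
print(at_start)          # None
print(at_end.group())    # hello
print(not_at_end)        # None
```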

Duplicate match

The metacharacters so far each match a single character. To match a character repeatedly, for example an ID card number that is 18 characters long, you need the repetition metacharacters.

* repeats zero or more times
? repeats zero times or once
+ repeats once or more
{n} repeats exactly n times
{n,} repeats at least n times
{n,m} repeats n to m times

# Simple match for an ID card number: the first 17 characters are digits, the last is a digit or the letter X
>>> re.match(r"\d{17}[\dX]", "42350119900101153X").group()
'42350119900101153X'
# Match a QQ number of 5 to 12 digits
>>> re.match(r"\d{5,12}$", "4235011990").group()
'4235011990'
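The quantifiers *, ? and + can be compared on one small pattern. This is a minimal sketch; the strings "ac", "abc", "abbc" and "abbbc" are made up for illustration.

```python
import re

star_zero = re.match(r"ab*c", "ac")       # * allows zero b's
star_many = re.match(r"ab*c", "abbbc")    # ... or many
opt_one = re.match(r"ab?c", "abc")        # ? allows zero or one b
opt_two = re.match(r"ab?c", "abbc")       # two b's are too many
plus_zero = re.match(r"ab+c", "ac")       # + needs at least one b
at_least = re.match(r"ab{2,}c", "abbbc")  # {2,} needs two or more b's

print(star_zero.group())  # ac
print(star_many.group())  # abbbc
print(opt_one.group())    # abc
print(opt_two)            # None
print(plus_zero)          # None
print(at_least.group())   # abbbc
```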

Logical Branch

Consider matching a landline phone number. Different regions have different rules: some use a three-digit area code with an eight-digit number, while others use a four-digit area code with a seven-digit number, and the area code and number are separated by a hyphen (-). How do you handle this? You need the logical branch character |, which splits the expression into a left part and a right part. The left part is tried first; if it matches, the right part is not tried. This is a logical "or" relationship.

# abc|cde can match abc or cde, but abc is tried first
>>> re.match(r"aa(abc|cde)", "aaabccde").group()
'aaabc'

The expression 0\d{2}-\d{8}|0\d{3}-\d{7} starts with 0. It matches either a three-digit area code followed by an eight-digit number, or a four-digit area code followed by a seven-digit number.

>>> re.match(r"0\d{2}-\d{8}|0\d{3}-\d{7}", "0755-4348767").group()
'0755-4348767'
>>> re.match(r"0\d{2}-\d{8}|0\d{3}-\d{7}", "010-34827637").group()
'010-34827637'
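When this pattern is used for validation rather than extraction, it helps to anchor both branches, because | binds more loosely than everything around it. The sketch below is one possible way to do that, not the author's code: it wraps the alternatives in a non-capturing group (?:...), a feature this article does not cover, and the name is_landline is hypothetical.

```python
import re

# Same two alternatives as in the text, wrapped in (?:...) so that
# ^ and $ apply to both branches instead of only the nearest one.
LANDLINE = re.compile(r"^(?:0\d{2}-\d{8}|0\d{3}-\d{7})$")

def is_landline(number):
    """Return True when the whole string has one of the two shapes."""
    return LANDLINE.match(number) is not None

print(is_landline("010-34827637"))   # True
print(is_landline("0755-4348767"))   # True
print(is_landline("0755-43487671"))  # False: one digit too many
```

Without the anchors, the last input would still produce a match on its first twelve characters.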

Group

The repetition rules described above apply to a single character. To repeat several characters, use a subexpression (also called a group), written with parentheses "()". For example, (abc){2} matches abc twice. To match an IP address you can use (\d{1,3}\.){3}\d{1,3}. An IP address consists of four groups of digits separated by three dots, so the first three groups, each one to three digits followed by a dot, are treated as one group repeated three times, and the last part is one to three digits. For example, 192.168.0.1.
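The IP address pattern can be tried as follows. This sketch adds a trailing $ so that the whole string must fit, and the name IPV4_SHAPE is an assumption made for illustration. Note that the pattern checks only the shape of the address, not whether each number is at most 255.

```python
import re

# Three repetitions of "1-3 digits plus a dot", then 1-3 digits, then the end.
IPV4_SHAPE = re.compile(r"(\d{1,3}\.){3}\d{1,3}$")

ok = IPV4_SHAPE.match("192.168.0.1")
too_short = IPV4_SHAPE.match("192.168.0")
out_of_range = IPV4_SHAPE.match("999.999.999.999")

print(ok.group())                # 192.168.0.1
print(too_short)                 # None: only three parts
print(out_of_range is not None)  # True: digit values are not range-checked
```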

With groups, the group method can extract each matched part. By default, the match for the whole expression is group 0, returned by group() without arguments or by group(0). The match for the first pair of parentheses is returned by group(1), and so on.

>>> m = re.match(r"(\d+)(\w+)", "123abc")
# group 0: the match for the whole regular expression
>>> m.group()
'123abc'
# equivalent
>>> m.group(0)
'123abc'
# group 1: the match for the first pair of parentheses
>>> m.group(1)
'123'
# group 2: the match for the second pair of parentheses
>>> m.group(2)
'abc'

Through grouping, we can extract the desired information from the string. In addition, groups can be obtained by specifying names.

# The first group is named number
# The second group is named char
>>> m = re.match(r"(?P<number>\d+)(?P<char>\w+)", "123abc")
>>> m.group("number")
'123'
# equivalent
>>> m.group(1)
'123'
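Named groups are convenient when several fields are pulled out at once. As a small sketch, the pattern below splits a date string into parts; the group names year, month and day are chosen for this illustration.

```python
import re

date_match = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2017-06-12")

# groups() returns all groups as a tuple, in order.
print(date_match.groups())       # ('2017', '06', '12')
# groupdict() returns the named groups as a dictionary.
print(date_match.groupdict())    # {'year': '2017', 'month': '06', 'day': '12'}
print(date_match.group("year"))  # 2017
```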

Greedy and non-greedy

By default, a repetition matches as many characters as possible while still allowing the whole expression to match. This is called greedy mode. For example, r"a.*b" matches a string that starts with a and ends with b, with any number of characters in between. Applied to aaabcb, it matches the entire string.

>>> re.match(r"a.*b", "aaabcb").group()
'aaabcb'

Sometimes we want to match as few characters as possible. What should we do? Just add a question mark "?" after the quantifier, which makes it match as little as possible while still allowing the whole expression to match. For example, to match only aaab in the previous example, change the regular expression to r"a.*?b":

>>> re.match(r"a.*?b", "aaabcb").group()
'aaab'
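The difference becomes more visible when a pattern can match several times in one string. In this sketch the markup in text is made up for illustration, and re.findall returns the content of the single group for every match.

```python
import re

text = "<b>one</b> and <b>two</b>"  # sample markup, made up for this example

# Greedy: .* runs to the last </b>, so there is only one (too long) match.
greedy = re.findall(r"<b>(.*)</b>", text)
# Non-greedy: .*? stops at the first </b>, giving one match per tag.
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```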

Non-greedy mode is used frequently in crawlers. For example, in an earlier article on the "Python Zen" public account about crawling a website and converting it into a PDF file, the img tags on the web pages used relative paths, which had to be replaced with absolute paths.

>>> html = '<img src="/images/category.png"><img src="/images/js_framework.png">'
# Non-greedy mode matches the two img tags
# Change it to greedy mode to see how many it matches
>>> rex = r'<img.*?src="(.*?)">'
>>> re.findall(rex, html)
['/images/category.png', '/images/js_framework.png']

>>> def fun(match):
...     img_tag = match.group()
...     src = match.group(1)
...     full_src = "http://foofish.net" + src
...     new_img_tag = img_tag.replace(src, full_src)
...     return new_img_tag
...
>>> re.sub(rex, fun, html)

The sub function accepts a function as the replacement; the function's return value replaces the part matched by the regular expression. Here the regular expression covers the entire img tag, so group() returns the whole tag, for example <img src="/images/category.png">, and group(1) returns /images/category.png. Finally, the replace method swaps the relative path for the absolute path.
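The same callback technique works for other attributes. The sketch below is a variation, not the author's code: it rewrites href attributes on made-up sample markup and builds the absolute URL with urljoin from the standard library instead of string concatenation. The names BASE, HREF and absolutize are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE = "http://foofish.net"  # base URL borrowed from the example in the text
page = '<a href="/about">About</a> <a href="/archive">Archive</a>'  # made-up sample

HREF = re.compile(r'href="(.*?)"')

def absolutize(match):
    # group(1) is the captured path; rebuild the attribute with a full URL.
    return 'href="%s"' % urljoin(BASE, match.group(1))

result = HREF.sub(absolutize, page)
print(result)
```

Each match is passed to absolutize in turn, and whatever it returns replaces that match in the output string.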

By now you should have a basic understanding of regular expressions, and you should be able to solve the problem posed at the beginning of the article.
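One possible solution to the opening problem is sketched here. It is deliberately simple and rests on an assumption: only addresses ending in com, cn or net are of interest. Restricting the ending this way also keeps the image name qrcode_app4@2x.png in the style block from being mistaken for an email address. The name EMAIL is hypothetical.

```python
import re

html = """
<style>
.qrcode-app {display: block; background: url(/pics/qrcode_app4@2x.png) no-repeat;}
</style>
<div class="reply-doc content">
<p class="">34613453@qq.com, thank you</p>
<p class="">30604259@qq.com trouble landlord</p>
</div>
<p class="">490010464@163.com<br/>Thank you</p>
"""

# Word characters, "@", word characters, a dot, then one of three endings.
# The list of endings is an assumption made for this sketch.
EMAIL = re.compile(r"\w+@\w+\.(com|cn|net)")

# finditer() yields match objects, so group() gives the whole address
# even though the pattern contains a group.
emails = [m.group() for m in EMAIL.finditer(html)]
print(emails)  # ['34613453@qq.com', '30604259@qq.com', '490010464@163.com']
```

Real-world email addresses are far more varied than this pattern allows, so treat it as a starting point for this particular page rather than a general validator.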

This concludes the basic introduction to regular expressions. Although the code examples use several functions from the re module, I have not formally introduced the module yet. To keep this article short, that is left for the next article, which covers the commonly used functions of re.

Summary

The above is all the content of this article. I hope the content of this article will help you in your study or work. If you have any questions, please leave a message. Thank you for your support.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
