This article illustrates the basic functions of a crawler by crawling the pictures in the travel section of the National Geographic China site. The initial address is:
National Geographic China: http://www.ngchina.com.cn/travel/
Get and analyze the Web page content
a. Analyze the Web page structure to determine the desired content
Open the Web page and right-click "View page source" to inspect the structure of the page; below is the section I intercepted.
You will find that the image data sits in the src="" attribute of <img> tags. We just need to find these tags and extract the links from them to get what we want.
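Since the intercepted snippet is not reproduced above, here is a hypothetical fragment of such page source (the URL is invented for illustration) and how the link in its src attribute can be pulled out:

```python
import re

# A hypothetical fragment of page source; the tag is real HTML,
# but the URL is made up for illustration.
html = '<li><img src="http://image.example.com/2018/0428/pic1.jpg" alt="travel"></li>'

# Extract the value of the src attribute of the <img> tag
links = re.findall(r'<img[^>]*src="([^"]+)"', html)
print(links)  # ['http://image.example.com/2018/0428/pic1.jpg']
```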
b. Get the content of the Web page
To extract the content, we first send a request to the server and fetch the page, then parse it, extract the image information, and organize the data for saving.
The author uses Python 3.6. There are two common ways to fetch Web page content: requests and urllib (in Python 3, the urllib and urllib2 of Python 2 were merged into urllib). For more on fetching page content, see: Crawler Basics: Getting Web page content with Python.
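As a sketch of the urllib alternative (standard library only, no third-party install), the same fetch could be written as follows; the function name is my own:

```python
from urllib.request import Request, urlopen

def crawl_urllib(url, headers):
    # Build a request carrying the same headers, then read and decode the body
    request = Request(url, headers=headers)
    with urlopen(request) as response:
        return response.read().decode()
```

requests tends to be more convenient (sessions, automatic encoding handling), while urllib avoids an extra dependency; either works for a simple crawler like this one.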
Now, we define a method crawl() to fetch the Web page:
import requests

def crawl(url, headers):
    with requests.get(url=url, headers=headers) as response:
        # Read the body of the response and decode it
        data = response.content.decode()
    return data
Call this method to get the contents of the Web page:
# Get the content of the specified page
url = 'http://www.ngchina.com.cn/travel/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}
content = crawl(url, headers)
print(content)
c. Write regular expressions that match the picture content
import re

# Write a regular expression that matches the picture links
pattern = r'src="(.*?\.jpg)"'
compile_re = re.compile(pattern)
# Match all content that fits the regular expression
imglist = compile_re.findall(content)
# The matched content may need further processing and filtering to get the data
# we finally need; for example, new links can be put back into the URL queue
# to wait for the next round of crawling.
# In short, handle it according to the requirements.
# De-duplicate
imglist = list(set(imglist))
# Remove pictures that do not qualify
imglist = [img for img in imglist if img.startswith('http')]
# Output
for img, i in zip(imglist, range(len(imglist))):
    print('{}:{}'.format(i, img))
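Regular expressions work for simple pages, but the standard library's html.parser is more robust against attribute order and whitespace. A minimal sketch of the same extraction; this parser class is my own, not from the article:

```python
from html.parser import HTMLParser

class ImgParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                # Keep only absolute http(s) links, like the filter above
                if name == 'src' and value and value.startswith('http'):
                    self.links.append(value)

parser = ImgParser()
parser.feed('<div><img src="http://example.com/a.jpg"><img src="/rel/b.jpg"></div>')
print(parser.links)  # only the absolute http link is kept
```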
'''
0:http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
1:http://image.ngchina.com.cn/2018/0130/20180130032001381.jpg
2:http://image.ngchina.com.cn/2018/0424/20180424010923371.jpg
...
37:http://image.ngchina.com.cn/2018/0419/20180419014117124.jpg
38:http://image.nationalgeographic.com.cn/2017/1127/20171127121516360.jpg
'''
So we have grabbed the picture links from the given address. Let us pick one:
http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
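Once a link is extracted, saving the picture is just a matter of fetching the bytes and writing them to a file. A minimal sketch, with a hypothetical function name; the local filename is taken from the last segment of the URL path:

```python
import os
from urllib.request import urlopen

def save_image(url, folder='.'):
    # Use the last segment of the URL path as the local filename
    filename = url.rsplit('/', 1)[-1]
    path = os.path.join(folder, filename)
    # Fetch the raw bytes and write them out in binary mode
    with urlopen(url) as response, open(path, 'wb') as f:
        f.write(response.read())
    return path
```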
d. Storage and the next round of crawling
After we crawl the specified content, we can store it in a database. If it is a link-type crawl, we can create a URL queue, add newly found links to it, and then traverse the queue round by round, handling each queued URL with whatever strategy the specific requirements call for. More crawler information can be found in: A First Look at Crawlers.
Addendum:
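The queue-driven rounds of crawling described above can be sketched with collections.deque; the fetch and link-extraction functions are placeholders to be filled in with the code from the earlier sections:

```python
from collections import deque

def crawl_all(start_url, fetch, extract_links, max_pages=10):
    """Breadth-first crawl: fetch a page, queue its new links, repeat."""
    queue = deque([start_url])
    seen = {start_url}     # URLs already queued, to avoid repeats
    pages = {}             # url -> fetched content
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        pages[url] = content
        for link in extract_links(content):
            if link not in seen:   # de-duplicate before queueing
                seen.add(link)
                queue.append(link)
    return pages
```

A quick way to exercise it is with an in-memory fake site, passing a lambda for fetch and a dictionary lookup for extract_links.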
When writing regular expressions, we can use an online regular expression tool to quickly check the match results, such as the Runoob regex tool. That site also provides a set of commonly used, ready-made regular expressions (phone numbers, QQ numbers, URLs, email addresses, and so on), which is very handy.
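Besides online tools, a pattern can be tested just as quickly in the Python REPL. For example, a simple email pattern like those such tools provide (deliberately simplified; real email validation needs a more careful expression):

```python
import re

# A simplified email pattern for illustration only
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

text = 'Contact us at editor@example.com or support@example.org for help.'
print(email_pattern.findall(text))  # ['editor@example.com', 'support@example.org']
```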