Crawler Basics: Using Regular Expressions to Extract Specified Content from a Web Page


This article illustrates the basic workflow of a crawler by scraping images from the travel section of the National Geographic China website, starting from the initial address:

National Geographic China: http://www.ngchina.com.cn/travel/

Get and analyze the Web page content

A. Analyze the Web page structure to determine the desired content

We open the Web page and right-click "View page source" to inspect the structure of the page; the relevant part looks like the hypothetical excerpt below.
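For illustration (a hypothetical fragment; the real page differs in detail), the markup of interest looks like:

<img src="http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg" alt="">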

We find that the image data sits in the src="" attribute of <img> tags; we only need to find these tags and extract the links from them to get what we want.

B. Get the content of the Web page

To extract the content, we first send a request to the server to fetch the page, then analyze it to extract the image information, and finally organize the data for storage.

The author uses Python 3.6. There are two common ways to fetch web page content: requests and urllib (in Python 3, urllib merges Python 2's urllib and urllib2). For more on fetching page content, see: Crawler Basics: Fetching Web Page Content with Python.
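For reference, a minimal sketch of the urllib alternative (the function name crawl_urllib is ours, for illustration):

from urllib import request

def crawl_urllib(url, headers):
    # Build a request carrying our headers, then read and decode the body
    req = request.Request(url, headers=headers)
    with request.urlopen(req) as response:
        return response.read().decode()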

Now we define a function, crawl(), to fetch the Web page:

import requests

def crawl(url, headers):
    with requests.get(url=url, headers=headers) as response:
        # Read the response body and decode it to text
        data = response.content.decode()
        return data
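Note that content.decode() assumes UTF-8. A slightly more defensive variant (our own addition, not part of the original article) adds a timeout, a status check, and encoding detection:

def crawl_safe(url, headers, timeout=10):
    with requests.get(url=url, headers=headers, timeout=timeout) as response:
        # Fail loudly on 4xx/5xx instead of parsing an error page
        response.raise_for_status()
        # Let requests guess the charset from the body, then decode
        response.encoding = response.apparent_encoding
        return response.text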

Call this method to get the contents of the Web page:

# Get the specified page content
url = 'http://www.ngchina.com.cn/travel/'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}

content = crawl(url, headers)
print(content)
C. Write a regular expression that matches the picture content
import re

# Write a regular expression matching the image links
pattern = r'src="(.*?\.jpg)"'
compile_re = re.compile(pattern)

# Find all substrings that match the regular expression
imglist = compile_re.findall(content)

# The matched results may need further processing and filtering before we
# get the data we finally want; for example, newly found links can be pushed
# onto the URL queue to wait for the next round of crawling.
# In short, handle the results according to your requirements.

# Deduplicate
imglist = list(set(imglist))

# Remove unqualified entries
imglist = [img for img in imglist if img.startswith('http')]

# Output the results
for i, img in enumerate(imglist):
    print('{}:{}'.format(i, img))

'''
0:http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
1:http://image.ngchina.com.cn/2018/0130/20180130032001381.jpg
2:http://image.ngchina.com.cn/2018/0424/20180424010923371.jpg
...
37:http://image.ngchina.com.cn/2018/0419/20180419014117124.jpg
38:http://image.nationalgeographic.com.cn/2017/1127/20171127121516360.jpg
'''
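To sanity-check the pattern, you can also run it against a small sample string (the snippet below is our own, for illustration):

sample = '<img src="http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg" alt="">'
print(re.findall(r'src="(.*?\.jpg)"', sample))
# ['http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg']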

So we have grabbed the image links from the given address. Let's pick one:

http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg
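To save such an image to disk, a minimal sketch (reusing requests and the headers defined above; the filename is simply the last path segment of the URL):

img_url = 'http://image.ngchina.com.cn/2018/0428/20180428110510703.jpg'
resp = requests.get(img_url, headers=headers)
filename = img_url.rsplit('/', 1)[-1]  # e.g. 20180428110510703.jpg
with open(filename, 'wb') as f:
    # Write the raw bytes of the image
    f.write(resp.content)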

Storage and the next round of crawling

After we have crawled the specified content, we can store it in a database. If we are crawling links, we can create a URL queue: newly discovered links are appended to the queue, which is then traversed round by round, with each queued URL processed according to a strategy chosen for the specific task. More crawler information can be found in: The Initial Crawler.
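As an illustration of the queue idea, a minimal sketch (crawl() and re are reused from above; the href pattern and the max_pages limit are our own simplifications):

from collections import deque

def crawl_loop(start_url, headers, max_pages=10):
    queue = deque([start_url])
    seen = {start_url}  # Avoid crawling the same URL twice
    while queue and max_pages > 0:
        url = queue.popleft()
        page = crawl(url, headers)
        # ... extract and store the data you care about here ...
        # Queue up newly discovered links for the next round
        for link in re.findall(r'href="(http.*?)"', page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1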

Addendum: when writing regular expressions, you can use an online regex tool to quickly check the result of a match, such as the Rookie regex tool. That page also collects ready-made regular expressions for common cases such as phone numbers, QQ numbers, URLs, and email addresses, which is very handy.
