Implementing a simple crawler in Python

Source: Internet
Author: User

When browsing the web, we often come across good-looking pictures that we would like to save, whether to use as desktop wallpaper or as design material.

The most common approach is to right-click and choose Save As. But for some pictures the right-click menu has no Save As option. Another way is to capture the picture with a screenshot tool, but that reduces its sharpness. Of course, if you are more resourceful, you can right-click and view the page source instead.

We can use Python to implement a simple crawler that downloads the content we want to the local machine. Let's look at how to do that.

One: fetch the entire page

To download the pictures, we first need to fetch the full content of the page.

getjpg.py

    #coding=utf-8
    # Python 2 code: in Python 3, urlopen lives in urllib.request
    import urllib

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        return html

    html = gethtml("http://tieba.baidu.com/p/2738151262")
    print html

The urllib module provides an interface for reading web page data, so we can read data over HTTP and FTP as if it were a local file. First, we define a gethtml() function:

The urllib.urlopen() function opens a URL and returns a file-like object. (The code here is Python 2; in Python 3 the same function is urllib.request.urlopen().)

The read() method reads the data behind the URL. We pass a URL to gethtml(), which downloads the entire page and returns it. Running the program prints the entire page.
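
For readers on Python 3, where urllib.urlopen() no longer exists, the fetch step can be sketched roughly as follows. This is a minimal sketch rather than the article's code: fetch_html and its encoding parameter are names made up for illustration, the page is assumed to be UTF-8, and a temporary local file stands in for a real URL so the example runs offline.

```python
import pathlib
import tempfile
import urllib.request


def fetch_html(url, encoding="utf-8"):
    """Return the body behind `url` decoded to text (hypothetical helper)."""
    with urllib.request.urlopen(url) as page:
        # read() returns bytes in Python 3, so decode before using it as text.
        return page.read().decode(encoding, errors="replace")


# Offline demonstration: urlopen also understands file:// URLs, so a
# temporary local file plays the role of the remote page here.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.html"
    sample.write_text("<html><body>hello</body></html>", encoding="utf-8")
    demo_html = fetch_html(sample.as_uri())

print(demo_html)
```

Calling fetch_html() with an http:// address works the same way, but needs network access and may require a different encoding depending on the site.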

Two: filter the desired data out of the page

Python has very powerful regular expression support, and we need to know a little about Python's regular expressions first:

http://www.cnblogs.com/fnng/archive/2013/05/20/3089816.html

Suppose we find a few beautiful wallpapers on Baidu Tieba and look at them with the browser's front-end inspection tools. We find the address of each image in the page source, for example: src="http://imgsrc.baidu.com/forum......jpg" pic_ext="jpeg"

Modify the code as follows:

    import re
    import urllib

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        return html

    def getimg(html):
        reg = r'src="(.+?\.jpg)" pic_ext'
        imgre = re.compile(reg)
        imglist = re.findall(imgre, html)
        return imglist

    html = gethtml("http://tieba.baidu.com/p/2460150866")
    print getimg(html)

We also created the getimg() function to filter the desired picture links out of the fetched page. The re module provides regular expression support:

re.compile() compiles a regular expression string into a regular expression object.

re.findall() returns every match of imgre (the compiled regular expression) in html. Because the pattern contains one capturing group, each result is just the captured picture URL.

Running the script prints the URLs of all the pictures found in the page.
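
The filtering step can be tried on its own, without downloading anything. The sketch below applies the pattern from getimg() to a made-up HTML fragment; the example.com URLs, the class attribute, and the names ARTICLE_PATTERN and STRICT_PATTERN are assumptions for illustration only. STRICT_PATTERN is a variation that is not in the original article: it replaces . with [^"] so a capture cannot run past the closing quote of the src attribute.

```python
import re

# Pattern from getimg(): lazily capture from src=" up to a .jpg that is
# immediately followed by the closing quote and a pic_ext attribute.
ARTICLE_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')

# Assumed variation: [^"] cannot match a double quote, so a capture can
# never extend beyond the src attribute it started in.
STRICT_PATTERN = re.compile(r'src="([^"]+?\.jpg)" pic_ext')

# Made-up markup standing in for a downloaded page.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a/001.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/logo.jpg" width="50">'
    '<img class="BDE_Image" src="http://example.com/a/002.jpg" pic_ext="jpeg">'
)

article_urls = ARTICLE_PATTERN.findall(sample_html)
strict_urls = STRICT_PATTERN.findall(sample_html)

print(strict_urls)
# ['http://example.com/a/001.jpg', 'http://example.com/a/002.jpg']
```

On this sample the article's pattern also returns two results, but the second one begins at the logo image and runs on into the following tag, because . is allowed to match quotes. Whether that matters depends on the markup of the page being crawled.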

Three: save the filtered data locally

Iterate over the filtered picture addresses with a for loop and save each picture locally. The code is as follows:

    #coding=utf-8
    import urllib
    import re

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        return html

    def getimg(html):
        reg = r'src="(.+?\.jpg)" pic_ext'
        imgre = re.compile(reg)
        imglist = re.findall(imgre, html)
        x = 0
        for imgurl in imglist:
            # Save each picture as 0.jpg, 1.jpg, ...
            urllib.urlretrieve(imgurl, '%s.jpg' % x)
            x += 1
        # Return the list so that the print below shows what was downloaded
        return imglist

    html = gethtml("http://tieba.baidu.com/p/2460150866")
    print getimg(html)

The core here is urllib.urlretrieve(), which downloads remote data directly to a local file.

The for loop iterates over the extracted picture links. To give the files tidy names, each one is renamed as it is saved: the x variable starts at 0 and is incremented by 1 for every picture, producing 0.jpg, 1.jpg, and so on. Because the file names are relative, the pictures are saved in the current working directory, which is the program's own directory if you run it from there.

After the program runs, the downloaded files appear in that directory.
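
The download loop can be sketched for Python 3 as follows, where the function is urllib.request.urlretrieve(). This is an illustrative sketch under stated assumptions, not the article's code: save_images and dest_dir are hypothetical names, and two fake image files reached through file:// URLs stand in for real picture links so the example runs offline.

```python
import os
import pathlib
import tempfile
import urllib.request


def save_images(img_urls, dest_dir="."):
    """Download each URL into dest_dir as 0.jpg, 1.jpg, ... and return the paths."""
    saved = []
    # enumerate() supplies the counter that the x variable tracks by hand.
    for x, img_url in enumerate(img_urls):
        target = os.path.join(dest_dir, "%s.jpg" % x)
        urllib.request.urlretrieve(img_url, target)
        saved.append(target)
    return saved


# Offline demonstration: fake "remote" pictures served via file:// URLs.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = pathlib.Path(tmp, "remote")
    out_dir = pathlib.Path(tmp, "local")
    src_dir.mkdir()
    out_dir.mkdir()

    urls = []
    for name in ("cat.jpg", "dog.jpg"):
        fake = src_dir / name
        fake.write_bytes(b"fake image bytes: " + name.encode("ascii"))
        urls.append(fake.as_uri())

    saved_paths = save_images(urls, str(out_dir))
    saved_names = sorted(os.listdir(str(out_dir)))
    first_bytes = (out_dir / "0.jpg").read_bytes()

print(saved_names)
# ['0.jpg', '1.jpg']
```

Passing an explicit dest_dir avoids depending on the current working directory. The Python documentation lists urlretrieve() as a legacy interface, so longer-lived code may prefer urlopen() combined with writing the bytes to a file.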
