A Python example of simple crawler functionality

Last Update:2017-01-18 Source: Internet

Author: User


When we browse the web, we often come across beautiful pictures that we would like to download and save, either to use as desktop wallpaper or as design material.

The most common approach is to right-click the picture and choose "Save as". But some pictures offer no "Save as" option on right-click. Capturing them with a screenshot tool is another way, but that reduces the clarity of the picture. In fact, there is a better option: right-click and view the page source code.

We can use Python to implement a simple crawler that fetches the content we want and saves it locally. Let's look at how to implement this in Python.

First, get the entire page data

To begin, we fetch the entire page that contains the pictures we want to download.
getjpg.py

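# coding=utf-8
# Python 2 code: in Python 3, urlopen() lives in urllib.request.
import urllib

def gethtml(url):
    # Open the URL and return the raw contents of the page.
    page = urllib.urlopen(url)
    html = page.read()
    return html

html = gethtml("http://tieba.baidu.com/p/2738151262")
print html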

The urllib module provides an interface for reading web page data, so we can read data over WWW and FTP much as we read local files. First, we define a gethtml() function:

The urllib.urlopen() method opens a URL.

The read() method reads the data at that URL. We pass a URL to the gethtml() function, which downloads the entire page. Running the program prints the entire page.
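For readers on Python 3, the same fetch step can be sketched as follows. This is a minimal sketch under a few assumptions, not the article's own code: urlopen() is taken from urllib.request (where it lives in Python 3), the name get_html and its encoding parameter are made up for illustration, and UTF-8 is only a guess at the page encoding. The demo uses a data: URL so that it runs without a network connection.

```python
import urllib.request


def get_html(url, encoding="utf-8"):
    """Download the resource at url and return it as text.

    get_html and the encoding parameter are illustrative names,
    not part of the original article.
    """
    # The with block closes the connection once the data is read.
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # bytes in Python 3
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")


# A data: URL keeps the demo offline; an http:// page URL is passed the same way.
print(get_html("data:text/plain,hello"))
```

Returning decoded text rather than raw bytes makes the result easier to search with a string pattern in the next step.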

Second, filter the data you want from the page
Python provides very powerful regular expression support, so we need to know a little about Python regular expressions first.

Suppose we find a few nice wallpapers on Baidu Tieba and inspect the page with the browser's developer tools. There we find the address of a picture, such as: src="http://imgsrc.baidu.com/forum......jpg" pic_ext="jpeg"

Modify the code as follows:

import re
import urllib

def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    # Capture each .jpg address that is followed by a pic_ext attribute.
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

html = gethtml("http://tieba.baidu.com/p/2460150866")
print getimg(html)

We also created the getimg() function to filter the desired picture links out of the page that was fetched. The re module provides the regular expression support:

re.compile() compiles a regular expression string into a regular expression object.

re.findall() returns every piece of the html text that matches imgre (the regular expression).

Running the script prints the URLs of all the pictures found in the page.
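The filtering step can be tried offline on a small sample string. The sketch below is an illustration under stated assumptions: SAMPLE_HTML, IMG_RE and get_img_urls are made-up names, the example.com addresses are placeholders, and the pattern is a slightly tightened variant of the one in the article. It uses [^"]+? instead of .+? so that a match cannot run past a closing quote into a later tag.

```python
import re

# Hypothetical markup that mimics the src/pic_ext pattern described above.
SAMPLE_HTML = (
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/logo.jpg">'
    '<img src="http://example.com/c.jpg" pic_ext="jpeg">'
)

# [^"]+? keeps the captured address inside one pair of quotes.
IMG_RE = re.compile(r'src="([^"]+?\.jpg)" pic_ext')


def get_img_urls(html):
    """Return every .jpg address that is followed by a pic_ext attribute."""
    return IMG_RE.findall(html)


print(get_img_urls(SAMPLE_HTML))
# ['http://example.com/a.jpg', 'http://example.com/c.jpg']
```

The logo image is skipped because it has no pic_ext attribute. With the looser .+? pattern, the match starting at the logo address would keep expanding until it reached the next .jpg that is followed by pic_ext, and the captured "address" would contain markup.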

Third, save the filtered data locally

We traverse the filtered picture addresses with a for loop and save each picture locally. The code is as follows:

# coding=utf-8
import urllib
import re

def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        # Save each picture as 0.jpg, 1.jpg, ... in the current directory.
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    # Return the list so that the print below shows the downloaded addresses.
    return imglist

html = gethtml("http://tieba.baidu.com/p/2460150866")

print getimg(html)

The core here is the urllib.urlretrieve() method, which downloads remote data directly to a local file.

A for loop iterates over the picture links. To make the file names more regular, each picture is renamed using the x variable, which is incremented by 1 on each pass. By default the files are saved in the current working directory, which is usually the directory the program is run from.

After the program runs, the downloaded files appear in that local directory.
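The save step can be sketched for Python 3 in the same spirit. This is a hedged illustration rather than the article's code: it assumes urllib.request.urlretrieve() (the Python 3 location of urlretrieve), and the function name save_images and its dest_dir parameter are hypothetical. enumerate() stands in for the hand-maintained x counter.

```python
import os
import urllib.request


def save_images(img_urls, dest_dir="."):
    """Download each address to dest_dir as 0.jpg, 1.jpg, ... and return the paths.

    save_images and dest_dir are illustrative names, not from the article.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    # enumerate() supplies the running number used in the file name.
    for index, img_url in enumerate(img_urls):
        path = os.path.join(dest_dir, "%s.jpg" % index)
        urllib.request.urlretrieve(img_url, path)
        saved.append(path)
    return saved
```

Passing the list of picture addresses from the filtering step, for example save_images(addresses, "wallpapers"), would write the files into a wallpapers directory instead of the current one.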

Thank you for reading. I hope this article helps you, and thank you for your support of this site!

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
