A simple Python crawler: downloading photos from a Baidu Tieba post

Source: Internet
Author: User

  

Using Python, we can implement a simple crawler that downloads the images we want to a local folder. (The Python version used is 3.6.0.)

I. Getting the entire page data

  

import urllib.request

def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

Description

Passing a URL to the gethtml() function downloads the entire page.
The urllib.request module provides an interface for reading web data; with it we can read pages from the web and FTP almost as if they were local files.
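
As a quick check, here is a minimal usage sketch of the function above (the thread URL is the one used in the full script later in this article):

html = gethtml('https://tieba.baidu.com/p/5248432620')
print(type(html))   # <class 'bytes'> -- page.read() returns raw bytes, not str
print(len(html))    # size of the downloaded page in bytes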

II. Filtering the data you want from the page

I found a few nice pictures in a Baidu Tieba thread and wanted to download them. In Firefox, right-click on one of the images and choose "Inspect Element"; this opens developer mode and jumps to the HTML code behind the image.

Now we look at what the image URLs have in common and write a regular expression to match them.

reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
# reference regular expression

Writing the code:

import re

def getimg(html):
    reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html.decode('utf-8'))
    return imglist

Description

re.compile() compiles a regular expression string into a regular expression object.

re.findall() returns every substring of the HTML that matches imgre (the compiled regular expression), as a list.

Run the script to get the URLs of all the images contained in the page.
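
To see what re.findall() returns here, this minimal sketch runs the regular expression against a hand-written img tag (the tag itself is made up for illustration):

import re

reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
sample = '<img class="BDE_Image" src="https://imgsa.baidu.com/forum/pic/item/abc.jpg" width="560">'
print(re.findall(reg, sample))
# ['https://imgsa.baidu.com/forum/pic/item/abc.jpg']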

III. Saving the filtered data locally

Write a save function:

import os

def saveFile(x):
    # path is a module-level variable; see the full script below
    if not os.path.isdir(path):
        os.makedirs(path)
    t = os.path.join(path, '%s.img' % x)
    return t
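
For example, with the function above and path set as in the full script, each image gets a numbered filename:

path = 'd:/workspace/python1/reptile/__pycache__/img'
print(saveFile(0))   # d:/workspace/python1/reptile/__pycache__/img\0.img (on Windows)
print(saveFile(1))   # ...\1.img, and so on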

Full code:

" "Created on July 15, 2017 @author:administrator" "Import urllib.request,osimport redef gethtml (URL): page=urllib.request.urlopen (URL) HTML=Page.read ()returnHtmlpath='d:/workspace/python1/reptile/__pycache__/img'def saveFile (x):ifNot Os.path.isdir (path): os.makedirs (path) T= Os.path.join (Path,'%s.img'%x)returnthtml=gethtml ('https://tieba.baidu.com/p/5248432620') print (HTML) print ('\ n') def getimg (HTNL): Reg=r'src= "(https://imgsa[^>]+\. (?: jpeg|jpg))"'Imgre=Re.compile (reg) Imglist=re.findall (Imgre,html.decode ('Utf-8')) x=0     forImgurlinchImglist:urllib.request.urlretrieve (Imgurl,savefile (x)) print (imgurl) x+=1        ifx== at:             Breakprint (x)returnimglistgetimg (HTML) print ('End')

The core is the urllib.request.urlretrieve() method, which downloads remote data directly to a local file.
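
In isolation, urlretrieve() takes the remote URL and a local filename, and returns a (filename, headers) tuple. A minimal sketch (the image URL is a placeholder):

import urllib.request

# Download one remote file to a local path; returns (local filename, headers).
filename, headers = urllib.request.urlretrieve(
    'https://imgsa.baidu.com/forum/pic/item/example.jpg',  # placeholder URL
    'example.jpg')
print(filename)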

Finally, one problem remains unresolved, and I'd like to ask readers for help.

When downloading more than 23 photos, the script throws an error:

urllib.error.HTTPError: HTTP Error 500: Internal Server Error

I don't know what causes it; any help would be appreciated.
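
One guess, not verified here, is that the server starts rejecting rapid anonymous requests. Below is a sketch that spaces out the downloads and sends a browser-like User-Agent header; since urlretrieve() does not accept headers, it uses urlopen() instead:

import time
import urllib.request

def download(imgurl, filename):
    # Some servers reject Python's default User-Agent; send a browser-like one.
    req = urllib.request.Request(imgurl, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp, open(filename, 'wb') as f:
        f.write(resp.read())
    time.sleep(1)  # pause between downloads so the requests are not back to back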
