A Baidu Tieba image crawler example in Python

Source: Internet
Author: User


Today, having some free time at home, I wrote a Tieba image download program. The tool I used is PyCharm, which is very practical. I started out with Eclipse, but it was inconvenient for working with class libraries and other tooling, so I finally switched to a dedicated Python development tool. The development environment is Python 2, because that is what I learned in college.

Step 1: Open a cmd window and run pip install lxml.
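Once the install finishes, a quick sanity check confirms that lxml can be imported and can parse HTML (a minimal sketch in Python 3 syntax, though the article itself targets Python 2):

```python
from lxml import etree

# Parse a tiny HTML document and run a trivial XPath query against it
doc = etree.HTML("<html><body><p>hello</p></body></html>")
text = doc.xpath("//p/text()")
print(text)  # -> ['hello']
```

If the import fails, the pip install did not land in the interpreter PyCharm is configured to use.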

Step 2: Install a Chrome extension (e.g. XPath Helper) that lets you inspect the rendered HTML as an XML tree and test XPath expressions for locating elements.

On the page, press Ctrl + Shift + X to open the extension and analyze the page.

For example:

Enter the XPath expression on the left side of the black box in the figure; the matching result is returned on the right. You can see that every post on the current page has been matched. How to write the XPath should be worked out by inspecting the elements on the right and looking for a pattern. Every website structures its markup differently, but with careful inspection you can always find a consistent rule.
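An expression like the one the crawler uses can also be tested locally with lxml before wiring it into the program. As a hedged illustration (Python 3 syntax, and a made-up HTML fragment standing in for a real Tieba list page):

```python
from lxml import etree

# A toy fragment imitating the markup of a Tieba list page (hypothetical)
html = """
<div class="t_con cleafix">
  <div><a href="/p/111">First post</a></div>
</div>
<div class="t_con cleafix">
  <div><a href="/p/222">Second post</a></div>
</div>
"""

content = etree.HTML(html)
# The same expression the crawler uses: every post link on the page
links = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
print(links)  # -> ['/p/111', '/p/222']
```

Iterating like this against a saved copy of the page is much faster than re-running the whole crawler after every tweak.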

Once you have found the rule and it matches, start writing the code.

As for the code, I have tried to comment every line for your convenience.

```python
# -*- coding: utf-8 -*-
import urllib
import urllib2
from lxml import etree


def loadPage(url):
    """Send a request for the given url and crawl every post on the list page.

    url: the list-page url to be crawled
    """
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) "
                             "AppleWebKit/535.11 (KHTML, like Gecko) "
                             "Chrome/17.0.963.56 Safari/535.11"}
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse the HTML document into a DOM model
    content = etree.HTML(html)
    # Return the list of all matched post links
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
    for link in link_list:
        # Build the full url of each post
        fulllink = "http://tieba.baidu.com" + link
        # Fetch every image link inside the post
        loadImage(fulllink)


def loadImage(link):
    """Collect the image links posted inside one thread."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    content = etree.HTML(html)
    # The image links posted by users in the thread
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    for link in link_list:
        print link
        writeImage(link)


def writeImage(link):
    """Download one image and write it to the local disk.

    link: the image url
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    # Raw image data
    image = urllib2.urlopen(request).read()
    # Use the last 10 characters of the url as the file name
    filename = link[-10:]
    # Write the file to the local disk
    with open("d:\\image\\" + filename, "wb") as f:
        f.write(image)
    print "downloaded successfully " + filename


def tiebaSpider(url, beginPage, endPage):
    """Scheduler: build the url of each list page and crawl it.

    beginPage: start page
    endPage:   end page
    """
    for page in range(beginPage, endPage + 1):
        # Tieba shows 50 posts per list page
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        print fullurl
        loadPage(fullurl)
    print "Thank you for using"


if __name__ == "__main__":
    kw = raw_input("Enter the name of the tieba to crawl: ")
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))
    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
```

Run:

We can see that the program runs successfully. Of course, my own process was not smooth sailing, and the code is for reference only.
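Since the code above targets Python 2, readers on Python 3 would use urllib.request instead of urllib2. A minimal sketch of the image-saving step under that assumption (the function names and output directory below are illustrative, not from the original):

```python
import os
import urllib.request


def filename_from_link(link):
    # Same convention as the article: the last 10 characters of the url
    return link[-10:]


def write_image(link, out_dir="images"):
    """Download one image url and save it under out_dir (network required)."""
    headers = {"User-Agent": "Mozilla/5.0"}
    request = urllib.request.Request(link, headers=headers)
    data = urllib.request.urlopen(request).read()
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_from_link(link))
    with open(path, "wb") as f:
        f.write(data)
    return path


print(filename_from_link("http://imgsa.baidu.com/forum/abcde12345.jpg"))  # -> e12345.jpg
```

Note that the last-10-characters naming scheme can collide when two image urls share a suffix; hashing the full url would be a safer file name.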

That is the whole of this Tieba image crawler example in Python. I hope it serves as a useful reference, and I hope you will continue to support us.
