Implementing a Tieba (post bar) image crawler in Python
Today I had some free time at home, so I wrote a program that downloads images from Baidu Tieba posts. The tool I used is PyCharm, which is very practical. I started out with Eclipse, but browsing class libraries and other conveniences were awkward there, so I finally switched to a dedicated Python development tool. The development environment is Python 2, because that is what I learned in college.
Step 1: Open a cmd window and run pip install lxml
Step 2: Install a Chrome extension that is used specifically to parse the page's HTML and locate elements with XPath.
On the target page, press Ctrl + Shift + X to open the extension for page analysis.
For example:
Type an XPath expression into the left side of the black box in the figure, and the matching results are returned on the right. You can see that all the posts on the current page have been matched. Working out how to write the XPath requires inspecting the elements on the right to find the pattern. Every website's structure is different, but if you look carefully you can always find a consistent rule.
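The same query can be tried out in plain lxml before crawling anything. A minimal sketch, using a hand-written HTML fragment whose class names mirror Tieba's markup (the snippet itself is made up for illustration):

```python
from lxml import etree

# A tiny fragment imitating Tieba's post list markup
html = """
<div class="t_con cleafix">
  <div><a href="/p/123456">first post</a></div>
</div>
<div class="t_con cleafix">
  <div><a href="/p/654321">second post</a></div>
</div>
"""

content = etree.HTML(html)
# The same expression the crawler uses: the href of every post link
link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
print(link_list)  # -> ['/p/123456', '/p/654321']
```

If the expression matches nothing here, it will match nothing on the live page either, so this is a cheap way to debug the XPath first.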
Once you have found the rule and the expression matches, you can start writing code.
As for the code, I have tried to comment almost every line for your convenience.
# -*- coding: utf-8 -*-
import urllib
import urllib2
from lxml import etree


def loadPage(url):
    """
    Purpose: send a request to the given url and fetch the server's response.
    url: the url to crawl
    """
    # headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    # Parse the HTML document into an HTML DOM model
    content = etree.HTML(html)
    # xpath() returns a list of all matching items
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
    # link_list = content.xpath('//a[@class="j_th_tit"]/@href')
    for link in link_list:
        # Combine into the full link of each post
        fulllink = "http://tieba.baidu.com" + link
        loadImage(fulllink)


# Retrieve every image link from each floor of a post
def loadImage(link):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse the page
    content = etree.HTML(html)
    # Collect the image links posted on every floor of the post
    # link_list = content.xpath('//div[@class="post_bubble_middle"]')
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    # Handle each image link
    for link in link_list:
        print link
        writeImage(link)


def writeImage(link):
    """
    Purpose: write the image data to a local file.
    link: an image link
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    # Raw image data
    image = urllib2.urlopen(request).read()
    # Use the last 10 characters of the link as the file name
    filename = link[-10:]
    # Write the data to a local disk file
    with open("d:\\image\\" + filename, "wb") as f:
        f.write(image)
    print "downloaded successfully: " + filename


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: the crawler scheduler; builds and processes the url of each page.
    url: the first part of the tieba url
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        # Each page shows 50 posts, so the offset steps by 50
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        print fullurl
        loadPage(fullurl)
    print "Thank you for using"


if __name__ == "__main__":
    kw = raw_input("Enter the name of the tieba to crawl: ")
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
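The scheduler's page arithmetic can be sanity-checked offline before hitting the site: Tieba lists 50 posts per page, so page n starts at offset (n - 1) * 50. A minimal sketch (the helper name build_page_urls is my own, not part of the program above; the try/except import lets it run on Python 2 or 3):

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

def build_page_urls(kw, begin_page, end_page):
    # Same url construction as tiebaSpider, as a pure function
    base = "http://tieba.baidu.com/f?" + urlencode({"kw": kw})
    return [base + "&pn=" + str((page - 1) * 50)
            for page in range(begin_page, end_page + 1)]

urls = build_page_urls("python", 1, 3)
print(urls[0])  # -> http://tieba.baidu.com/f?kw=python&pn=0
print(urls[2])  # -> http://tieba.baidu.com/f?kw=python&pn=100
```

Checking these urls in a browser is a quick way to confirm the offset logic before letting the crawler loop over them.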
Run:
We can see that the program runs successfully. Of course, my own process was not all smooth sailing, and the code is for reference only.