Implementing a Tieba (post bar) image crawler in Python


Today I had some free time at home, so I wrote a Tieba image download program. The tool I used is PyCharm, which is very practical. I started out with Eclipse, but it was inconvenient for working with class libraries and other everyday tasks, so I finally switched to a dedicated Python development tool. The development environment is Python 2, because that is what I learned in college.

Step 1: Open a cmd window and run pip install lxml.
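If the install succeeds, a quick sanity check is worthwhile before going further. The snippet below is my own sketch (the HTML string is made up, not real Tieba markup); it only confirms that lxml can be imported and can parse HTML:

# -*- coding: utf-8 -*-
# Sanity check: confirm lxml is installed and can parse HTML.
# The HTML string below is a made-up example, not real Tieba markup.
from lxml import etree

html = '<html><body><a class="j_th_tit" href="/p/123">demo post</a></body></html>'
dom = etree.HTML(html)
# Re-serialize the parsed DOM; if this prints, lxml is working
print etree.tostring(dom, pretty_print=True)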

Step 2: Install a Chrome plug-in that parses the page's HTML into a DOM and lets you locate elements with XPath.

Press Ctrl + Shift + X on the page to open the plug-in and analyze the page.

For example:

[Screenshot: the plug-in panel, with an XPath query entered on the left and the matched results shown on the right]

Type the XPath expression into the left side of the black box in the figure; the matching results are returned on the right. You can see that every post on the current page has been matched. To work out what XPath to write, inspect the elements in the browser's developer tools and look for a pattern. Every website is structured differently, but if you look carefully you can always find a consistent rule.
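You can also test such an expression offline with lxml instead of the browser plug-in. This is my own sketch; the HTML fragment is a simplified stand-in for Tieba's real list-page markup:

# -*- coding: utf-8 -*-
# Sketch: testing an XPath expression with lxml instead of the browser plug-in.
# The HTML below is a simplified stand-in for Tieba's real list-page markup.
from lxml import etree

html = """
<div class="t_con cleafix">
  <div><a href="/p/1000001">post one</a></div>
</div>
<div class="t_con cleafix">
  <div><a href="/p/1000002">post two</a></div>
</div>
"""
dom = etree.HTML(html)
# Same expression the crawler uses: every post link inside a t_con block
links = dom.xpath('//div[@class="t_con cleafix"]/div/a/@href')
print links  # ['/p/1000001', '/p/1000002']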

Once you have found the rule and confirmed the match, it is time to start writing code.

In the code below, I have tried to comment almost every line for your convenience.

# -*- coding: utf-8 -*-

import urllib
import urllib2
from lxml import etree


def loadPage(url):
    """
    Purpose: send a request to the given url and fetch the server response.
    url: the url to be crawled
    """
    # print url
    # headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}

    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    # Parse the HTML document into an HTML DOM model
    content = etree.HTML(html)
    # print content
    # xpath() returns a list of all matched items
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')

    # link_list = content.xpath('//a[@class="j_th_tit"]/@href')
    for link in link_list:
        # Build the full link of each post
        fulllink = "http://tieba.baidu.com" + link
        # print link
        loadImage(fulllink)


# Retrieve every image link in each post
def loadImage(link):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse the response
    content = etree.HTML(html)
    # Retrieve the set of image links posted on each floor of the post
    # link_list = content.xpath('//div[@class="post_bubble_middle"]')
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    # Handle the link of each image
    for link in link_list:
        print link
        writeImage(link)


def writeImage(link):
    """
    Purpose: write the image content to local disk.
    link: image link
    """
    # print "saving " + filename
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    # Raw image data
    image = urllib2.urlopen(request).read()
    # Take the last 10 characters of the link as the file name
    filename = link[-10:]
    # Write the data to a local disk file
    with open("d:\\image\\" + filename, "wb") as f:
        f.write(image)
    print "downloaded successfully " + filename


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: tieba crawler scheduler, builds and processes the URL of each page.
    url: the first part of the tieba url
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        # Tieba lists 50 posts per page, so page n starts at pn = (n - 1) * 50
        pn = (page - 1) * 50
        filename = "no." + str(page) + " page.html"
        fullurl = url + "&pn=" + str(pn)
        print fullurl
        loadPage(fullurl)
        # print html

    print "Thank you for using"


if __name__ == "__main__":
    kw = raw_input("Enter the name of the tieba to be crawled: ")
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
Run:

[Screenshot: program output showing the page URLs being crawled and the downloaded image file names]

As you can see, the program runs successfully. Of course, my own process was not all smooth sailing, and the code is for reference only.
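One pitfall worth noting: the script writes into d:\image but never creates that folder, so the open() call in writeImage() fails if the folder is missing. A minimal guard, my own addition rather than part of the original listing:

# -*- coding: utf-8 -*-
# Sketch: make sure the output folder exists before writeImage() runs.
# This guard is my own addition; the original script assumes d:\image exists.
import os

IMAGE_DIR = "d:\\image"
if not os.path.isdir(IMAGE_DIR):
    os.makedirs(IMAGE_DIR)  # create the folder (and any missing parents)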

 







 
