Implementing a Tieba (post bar) image crawler in Python
Today I had some free time at home, so I wrote a program that downloads images from Baidu Tieba posts. The tool I used is PyCharm, which is very practical. I started out with Eclipse, but browsing class libraries and other conveniences were awkward there, so I finally switched to a dedicated Python development tool. The development environment is Python 2, because that is what I learned in college.
Step 1: Open a cmd window and run pip install lxml
Step 2: Install a Chrome extension that is used specifically to parse the page's HTML and locate elements with XPath.
On the target page, press Ctrl + Shift + X to open the extension for page analysis.
For example:
Type an XPath expression into the left side of the black box in the figure, and the matching results are returned on the right. You can see that all the posts on the current page have been matched. Working out how to write the XPath requires inspecting the elements on the right to find the pattern. Every website's structure is different, but if you look carefully you can always find a consistent rule.
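The same query can be tried out in plain lxml before crawling anything. A minimal sketch, using a hand-written HTML fragment whose class names mirror Tieba's markup (the snippet itself is made up for illustration):

```python
from lxml import etree

# A tiny fragment imitating Tieba's post list markup
html = """
<div class="t_con cleafix">
  <div><a href="/p/123456">first post</a></div>
</div>
<div class="t_con cleafix">
  <div><a href="/p/654321">second post</a></div>
</div>
"""

content = etree.HTML(html)
# The same expression the crawler uses: the href of every post link
link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
print(link_list)  # -> ['/p/123456', '/p/654321']
```

If the expression matches nothing here, it will match nothing on the live page either, so this is a cheap way to debug the XPath first.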
Once you have found the rule and the expression matches, you can start writing code.
As for the code, I have tried to comment almost every line for your convenience.
# -*- coding: utf-8 -*-
import urllib
import urllib2
from lxml import etree


def loadPage(url):
    """
    Purpose: send a request to the given url and fetch the server's response.
    url: the url to crawl
    """
    # headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    # Parse the HTML document into an HTML DOM model
    content = etree.HTML(html)
    # xpath() returns a list of all matching items
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/a/@href')
    # link_list = content.xpath('//a[@class="j_th_tit"]/@href')
    for link in link_list:
        # Combine into the full link of each post
        fulllink = "http://tieba.baidu.com" + link
        loadImage(fulllink)


# Retrieve every image link from each floor of a post
def loadImage(link):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse the page
    content = etree.HTML(html)
    # Collect the image links posted on every floor of the post
    # link_list = content.xpath('//div[@class="post_bubble_middle"]')
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    # Handle each image link
    for link in link_list:
        print link
        writeImage(link)


def writeImage(link):
    """
    Purpose: write the image data to a local file.
    link: an image link
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    # Raw image data
    image = urllib2.urlopen(request).read()
    # Use the last 10 characters of the link as the file name
    filename = link[-10:]
    # Write the data to a local disk file
    with open("d:\\image\\" + filename, "wb") as f:
        f.write(image)
    print "downloaded successfully: " + filename


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: the crawler scheduler; builds and processes the url of each page.
    url: the first part of the tieba url
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        # Each page shows 50 posts, so the offset steps by 50
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        print fullurl
        loadPage(fullurl)
    print "Thank you for using"


if __name__ == "__main__":
    kw = raw_input("Enter the name of the tieba to crawl: ")
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
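The scheduler's page arithmetic can be sanity-checked offline before hitting the site: Tieba lists 50 posts per page, so page n starts at offset (n - 1) * 50. A minimal sketch (the helper name build_page_urls is my own, not part of the program above; the try/except import lets it run on Python 2 or 3):

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

def build_page_urls(kw, begin_page, end_page):
    # Same url construction as tiebaSpider, as a pure function
    base = "http://tieba.baidu.com/f?" + urlencode({"kw": kw})
    return [base + "&pn=" + str((page - 1) * 50)
            for page in range(begin_page, end_page + 1)]

urls = build_page_urls("python", 1, 3)
print(urls[0])  # -> http://tieba.baidu.com/f?kw=python&pn=0
print(urls[2])  # -> http://tieba.baidu.com/f?kw=python&pn=100
```

Checking these urls in a browser is a quick way to confirm the offset logic before letting the crawler loop over them.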
Run:
We can see that the program runs successfully. Of course, my own process was not all smooth sailing, and the code is for reference only.