Preface
The previous posts covered crawling a simple web page and disguising the crawler as a browser. This post implements a program that crawls all the images on the Douban homepage and saves them to a local path.
Partial Screenshot of the Homepage
This is only a partial screenshot. The complete crawler program is given below.
Crawler Program
Apart from the module that handles the images, the image-crawling program reuses the disguised-browser code.
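The disguise step can be looked at in isolation. The following is a minimal sketch, assuming the only disguise needed is a browser-like User-Agent header. The helper name build_request is hypothetical, and the sketch only builds the request object without sending anything over the network.

```python
import urllib.request

# A browser-like User-Agent string (the same one used in this example).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36"
)


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header.

    build_request is a hypothetical helper name used for illustration.
    """
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


req = build_request("https://www.douban.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the returned object to urllib.request.urlopen would send the request with this header instead of the default Python-urllib one.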
"" Bulk download the image of the homepage of the watercress using a disguised browser to crawl Douban station home image, save to the specified path folder "#导入所需的库import urllib.request,socket,re,sys,os# Definition file Save path TargetPath = "E:\\projects\\spider\\03_dbimages" def saveFile (path): #检测当前路径的有效性 if not Os.path.isdir (TargetPath): os.mkdir (TargetPath) #设置每个图片的路径 pos = path.rindex ('/') t = Os.path.join (targetpath,path[pos+1:]) Return t# Use if __name__ = = ' __main__ ' to determine if the. py file is running directly in the # URL = "https://www.douban.com/" headers = { ' user-agent ': ' mozilla/5.0 (Windows NT 10.0; WOW64) applewebkit/537.36 (khtml, like Gecko) ' chrome/51.0.2704.63 safari/537.36 ' }req = Urllib.request.Request (Url=url, headers=headers) res = Urllib.request.urlopen (req) data = Res.read () for link,t in set ( Re.findall (R ' (https:[^s]*? ( jpg|png|gif)) ', str (data)): print (link) try: Urllib.request.urlretrieve (Link,savefile) except: print (' failed ')
Crawl Results
(1) Printed output
(2) List of downloaded images
These can be compared with the images on the Douban homepage.
Python3 Crawler Example (3): Crawling the Douban Homepage Images