This template is mainly meant for multi-threaded downloading of image sets, but with a few changes it also works for ordinary crawling of sites that need no anti-crawler workarounds. The attachment includes an example directory.
The details differ for every site and every URL. Every line marked with (*) is a do-it-yourself part that you must adapt, which takes some basic HTML knowledge, so please use the template flexibly.
Added resumable downloads (an interrupted run can be continued), and the folder name is now taken from the image set's address;
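One common way to make downloads resumable is to skip any file that already exists on disk. The sketch below assumes that approach and is not taken from the template itself; `download_if_missing` and the `fetch` callable are hypothetical names used only for illustration.

```python
import os


def download_if_missing(path, fetch):
    """Write the bytes returned by fetch() to path, unless path already exists.

    Returns True if the file was written, False if it was skipped.
    `fetch` is any zero-argument callable returning bytes (an assumption of
    this sketch), so the network call only happens when it is needed.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False  # already saved by an earlier run, nothing to do

    data = fetch()

    # Write under a temporary name first, then rename, so an interrupted run
    # never leaves a half-written file that looks finished.
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)
    return True
```

In the template this could wrap the image request, for example `download_if_missing(name + '.jpg', lambda: requests.get(page_url, headers=headers).content)`, so that a rerun only fetches the pictures that are still missing.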
Every line of the Python source code is followed by a detailed comment:
import requests  ## see http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
from bs4 import BeautifulSoup  ## see http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id55
import os  ## for writing data to local disk
import urllib.request  ## opening an image address directly sometimes gives 403 Forbidden; the image only displays after the related page has been opened, so the page is opened first; can be omitted
import re  ## regular expressions, used for matching formats
from multiprocessing import Pool  ## pool of parallel workers (these are processes, not threads)

headers = {'User-Agent': "Mozilla/5.0", "Referer": "Gallery home page"}  ## browser request headers; hotlink protection sometimes rejects requests made directly from Python, so we pretend to be a browser

def run(url):  ## (*) URL of one category page of the gallery
    start_html = requests.get(url, headers=headers)  ## request the HTML of this URL
    soup = BeautifulSoup(start_html.text, 'lxml')  ## parse the fetched page with BeautifulSoup ('lxml' is the chosen parser; see the official documentation)
    all_a = soup.find('div', class_="class name of the main content div").find_all('a')  ## (*) find all picture links in the main body of the page
    path = url.split('/')[-2]  ## (*) the last part of the URL is usually the category name and can serve as the folder name
    if not os.path.exists("Storage Total directory" + "/" + path):  ## if the folder does not exist yet, create it
        os.makedirs("Storage Total directory" + "/" + path)  ## create the storage folder
    os.chdir("Storage Total directory" + "/" + path)  ## switch to the folder created above
    for a in all_a:
        href = a["href"]  ## (*) URL of the image set's page; can be omitted
        elem = a.img['src']  ## (*) address of this picture
        folder = elem.split('/')[-2]  ## (*) name of the image set
        length = a.next_sibling.next_sibling.get_text()
        max_span = int(length[-17:-14])  ## (*) number of pages in the image set
        html = requests.get(href, headers=headers, allow_redirects=False)  ## visit the image set's page and block redirects (redirects are another form of hotlink protection)
        u = urllib.request.urlopen(href)  ## actually open the page; can be omitted
        for page in range(1, max_span + 1):
            page_url = elem[:-5] + str(page) + ".jpg"  ## (*) image address format; you need to work this out for each site
            print(page_url)  ## (*) print the image address; can be omitted
            img_html = requests.get(page_url, headers=headers, allow_redirects=False)  ## request the image address
            name = folder + '-' + str(page)  ## (*) picture name format: set name + picture number
            with open(name + '.jpg', 'ab') as f:  ## write the picture to disk
                f.write(img_html.content)  ## binary (media) files must be written from .content

if __name__ == '__main__':  ## needed so that worker processes do not rerun this block on import
    urls = {'url1', 'url2', 'url3'}  ## the URLs of the categories
    pool = Pool()  ## number of worker processes; pass a number to set it (defaults to the CPU count)
    for url in urls:
        pool.apply_async(run, args=(url,))  ## args must be a tuple, hence the trailing comma
    pool.close()
    pool.join()
    print('All pictures are finished')
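The (*) step that builds each image address is the part most likely to need adapting. As a minimal sketch, assuming the cover image URL ends in a single digit followed by ".jpg" (which is what the `elem[:-5]` slice in the template implies), the page URLs can be generated as below. The function name `build_page_urls` and the example.com address are hypothetical.

```python
def build_page_urls(cover_url, page_count, suffix_len=5):
    """Return one image URL per page of a set.

    Assumes cover_url ends in something like "1.jpg" (suffix_len characters),
    and that page N of the set lives at the same address with N in its place.
    """
    base = cover_url[:-suffix_len]  # strip the trailing "1.jpg"-style suffix
    return [base + str(page) + ".jpg" for page in range(1, page_count + 1)]


# Hypothetical cover address, used only to show the pattern.
for page_url in build_page_urls("http://example.com/sets/abc/1.jpg", 3):
    print(page_url)
```

If a site pads its numbers (01.jpg, 02.jpg) or uses another extension, the slice length and the formatting of `page` have to change accordingly, which is why this step carries the (*) mark.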
One comment for every line of Python code, and a large number of beautiful image sets, all waiting for newcomers to take up the challenge!