Title: Crawling goddess pictures with Python3, cracking the hotlinking problem
Date: 2018-04-22 08:26:00
Tags: [Python3, beauty, picture grabbing, crawler, hotlinking]
Comments: true

Preface
In fact, there is no essential difference between grabbing pictures and grabbing novels; the steps are the same.
But when reading the images I ran into a hotlink-protection problem, and solving it took by far the longest time.
Environment
Language: Python3
Operating system: macOS 10.12.6
Custom toolkit: Soup_tool
Its dependencies are as follows:

```python
from urllib import request
from urllib.parse import quote
from bs4 import BeautifulSoup
import os
import threading
import re
import ssl
```
Version 0.1: crawl all the images of a single gallery URL
Analysis
First, open a single gallery page:
https://www.nvshens.com/g/24816
You can see the parts I've labeled.
Using Chrome's inspect tool, you can see that the current page contains the 3 images we want.
The URL of the second picture can be used as a template:
just replace 001 with 002, 003, ..., 044.
As for the total number of pictures, count them on the page.
This lets you crawl everything without handling pagination.
With that, the basic analysis is complete.
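The numbering rule above can be sketched as a small helper (`build_image_urls` is a hypothetical name for illustration; the actual script builds these URLs inline inside its download loop):

```python
def build_image_urls(template_url, count, suffix='jpg'):
    """Generate the 001.jpg ... NNN.jpg links from the template URL.

    Hypothetical helper; the article's code constructs these inline.
    """
    # zfill left-pads with zeros: 1 -> '001', 44 -> '044'
    return [template_url + str(i).zfill(3) + '.' + suffix
            for i in range(1, count + 1)]

urls = build_image_urls('https://img.onvshen.com:85/gallery/25366/24816/', 44)
# urls[0] ends with '001.jpg', urls[-1] ends with '044.jpg'
```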
Hands-on practice
Thanks to experience accumulated from crawling novel sites, I had already written a tool class: it mainly uses request for HTTP connections, BeautifulSoup for parsing web pages, and ssl to handle HTTPS.
The tool code is not posted here; a GitHub address for the project is given at the end.
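Since the tool code is not shown, here is a minimal sketch of what the MyThread wrapper might look like, reconstructed purely from how it is called later (func, args tuple, name); the real implementation lives in the GitHub repo:

```python
import threading

class MyThread(threading.Thread):
    """Minimal thread wrapper: run func(*args) and keep the return value.

    Hypothetical reconstruction; the actual Soup_tool version is in the
    project's GitHub repository.
    """
    def __init__(self, func, args, name=''):
        super().__init__(name=name)
        self.func = func
        self.args = args
        self.result = None

    def run(self):
        # store the return value so callers can read it after join()
        self.result = self.func(*self.args)

# same calling shape as MyThread(self.readPagetoTxt, (...), name)
t = MyThread(lambda a, b: a + b, (1, 2), 'adder')
t.start()
t.join()
```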
1. First, initialize and create the class

```python
class Capture:
```

Reference the custom tool classes:

```python
from Soup_tool import Soup
from Soup_tool import MyThread
```
Then define some initialization parameters:

```python
def __init__(self):
    # homepage of the site's content
    self.index_page_url = 'http://www.nvshens.com'
    # gallery page; :key will be replaced by the gallery id
    self.one_page_url = 'https://www.nvshens.com/g/:key/'
    # root folder
    self.folder_path = 'nvshens/'
    # sleep time for each thread
    self.sleep_time = 2
    # file suffix
    self.file_hz = '.img'
```
2. Read and analyze the gallery home page based on the key
Next, we make access to this page dynamic. In the URL
https://www.nvshens.com/g/24816
24816 is the search key.
Define a method readPageFromSearch:

```python
def readPageFromSearch(self, search_key):
    """
    Read the search page based on the input key
    :param search_key:
    :return:
    """
```
Inside the method, first create the root directory:

```python
# create folder /nvshens/search_key
path = self.folder_path + search_key
Soup.create_folder(path)
```
Then open the first page of the gallery and parse it with soup:

```python
# open page 1 of the gallery
page_url = self.one_page_url.replace(':key', search_key)
print(page_url)
soup_html = Soup.get_soup(page_url)
```
From the soup, find the div with id dinfo, then the span inside it, get its text, and strip off the "张照片" ("photos") suffix to obtain the total number of pictures:

```python
text_page = soup_html.find("div", {'id': 'dinfo'}).find('span').get_text()
print('text_page', text_page)
last = text_page.replace('张照片', '')
item_size = int(last)
```
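The count-stripping step can be checked in isolation (the sample text '44张照片' is made up for illustration):

```python
# the span's text looks like '44张照片' ("44 photos");
# stripping the suffix leaves the bare number
text_page = '44张照片'
item_size = int(text_page.replace('张照片', ''))
print(item_size)  # 44
```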
Next we need the template URL. The first picture doesn't follow the pattern, so it has to be fetched separately. First, look at the pattern:

```python
# 1st: https://img.onvshen.com:85/gallery/25366/24816/0.jpg
# 2nd: https://img.onvshen.com:85/gallery/25366/24816/001.jpg
# 3rd: https://img.onvshen.com:85/gallery/25366/24816/002.jpg
```
So the plan is clear: after grabbing the first image, use Soup's find_next_sibling method to get the next tag node:

```python
# the 1st image
image_one = soup_html.find("ul", {'id': 'hgallery'}).find('img')
image_one_url = image_one.get('src')
print('image_one_url', image_one_url)
# the 2nd image's link serves as the template
image_two = image_one.find_next_sibling()
image_two_url = image_two.get('src')
print('image_two_url', image_two_url)
```
Then, take the second picture's URL, split on "/" and keep the right-most segment to get "001.jpg"; split that on "." to get the suffix and learn whether it is jpg or png; finally, strip the segment from the URL to get the template:

```python
# https://img.onvshen.com:85/gallery/25366/24816/001.jpg
img_hz = image_two_url.split("/")[-1]   # '001.jpg'
# 'jpg'
file_hz = img_hz.split('.')[1]
# https://img.onvshen.com:85/gallery/25366/24816/
img_mod_url = image_two_url.replace(img_hz, '')
```
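The splitting logic can be verified against the example URL from the page:

```python
# same steps as above, applied to the sample link
url = 'https://img.onvshen.com:85/gallery/25366/24816/001.jpg'
img_hz = url.split('/')[-1]            # right-most segment: '001.jpg'
file_hz = img_hz.split('.')[1]         # suffix: 'jpg'
img_mod_url = url.replace(img_hz, '')  # template with the file name removed
```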
3. Read the picture links with multiple threads
Define the readPageByThread method, passing in everything gathered above:
- maximum number of pictures: item_size
- file storage directory: path
- template URL: img_mod_url
- file suffix: file_hz

```python
# multithreaded reading: each image download is its own thread
def readPageByThread(self, item_size, path, img_mod_url, file_hz):
    """
    :param item_size: maximum number of pictures
    :param path: file storage directory
    :param img_mod_url: template url
    :param file_hz: file suffix
    :return:
    """
```
Loop up to item_size, using the zfill method to left-pad the index with zeros (note: the threads list must be initialized before the loop):

```python
# loop over the image links
threads = []
for item in range(1, item_size):
    # left-pad with zeros: 1 -> 001, 2 -> 002, ..., 114 -> 114
    page = str(item + 1).zfill(3)
    new_page_url = img_mod_url + page + '.' + file_hz
    new_path = path + '/' + page + '.' + file_hz
    print(new_path, '---', new_page_url)
```
Using the custom thread wrapper, collect the threads and pass the parameters into the readPagetoTxt method:

```python
    t = MyThread(self.readPagetoTxt,
                 (new_page_url, new_path, self.sleep_time),
                 self.readPagetoTxt.__name__)
    threads.append(t)
```
Start the threads, then join to block until they all finish:

```python
for t in threads:
    t.start()
for t in threads:
    t.join()
print('all end', ctime())  # requires: from time import ctime
```
4. Read the picture contents and write them out
This is the crux of grabbing the images. I searched through a lot of material that evening and found this method:

```python
urllib.request.urlretrieve
```

I tested it myself: it has no effect at all on cracking the hotlink protection.
So what does actually crack the hotlink protection? I found another developer's article:
Go language grequests+goquery simple crawler, using multi-threaded concurrent crawling
It contains a piece of code like this:

```go
Headers: map[string]string{
    "Referer":    "http://www.zngirls.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
}
```
These headers set not only User-Agent but also Referer. Eh? What is that? Let me give it a try.
I set Referer to our index_page_url (http://www.nvshens.com), and sure enough it worked. Why?
Referer indicates the source of a request: the site from which the web server is being asked for the resource. By setting Referer to http://www.nvshens.com, we make the request look as if it came from the site itself.
For details, see this article: What is an HTTP Referer?
Of course, this works only because this site's developer uses Referer alone as the anti-hotlinking check; if they checked something else instead, it would have to be cracked all over again.
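With the Referer requirement understood, even the stdlib urlretrieve could be made to work by installing a global opener that carries the extra headers. This is a sketch with assumed header values, shown only as an alternative; the article's own fix (below) passes the headers via request.Request instead:

```python
from urllib import request

# build an opener whose headers are attached to every request it makes
opener = request.build_opener()
opener.addheaders = [
    ('Referer', 'http://www.nvshens.com'),
    ('User-Agent', 'Mozilla/5.0'),
]
# install_opener makes urlretrieve (and urlopen) use these headers globally
request.install_opener(opener)
# request.urlretrieve(image_url, file_name)  # would now send the Referer
```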
Well, on to our code: add a Referer property to the headers in the Soup_tool class:

```python
_HEAD2 = {
    # Referer: set to the site whose images you are grabbing;
    # adding this header defeats the hotlink check
    "Referer": "",
    'Accept-language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

@staticmethod
def open_url(query_url, referer=''):
    Soup._HEAD2['Referer'] = referer
    req = request.Request(quote(query_url, safe='/:?='), headers=Soup._HEAD2)
    webpage = request.urlopen(req)
    html = webpage.read()
    return html

@staticmethod
def write_img(query_url, file_name, referer):
    content = Soup.open_url(query_url, referer)
    with open(file_name, 'wb') as f:
        f.write(content)
```
Back to our Capture class:

```python
# use Request with the extra headers to read the image link and write it out;
# the most important part is adding the Referer
```
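The readPagetoTxt method itself isn't shown; judging from how it is invoked with (url, path, sleep_time), it plausibly boils down to something like this sketch (the function shape and the injected write_img callback are assumptions to keep the sketch self-contained; the real code in the repo calls Soup.write_img directly):

```python
import time

def read_page_to_file(write_img, page_url, file_path, referer, sleep_time=0):
    """Download one image, passing the site's homepage as Referer, then sleep.

    Hypothetical stand-in for Capture.readPagetoTxt; write_img is injected
    so the sketch can run without the Soup_tool class or a network.
    """
    write_img(page_url, file_path, referer)
    # throttle: each thread sleeps so the server isn't hammered
    time.sleep(sleep_time)

calls = []
read_page_to_file(lambda u, p, r: calls.append((u, p, r)),
                  'https://img.example/001.jpg',
                  'nvshens/24816/001.jpg',
                  'http://www.nvshens.com')
```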
Postscript
There are also v0.2 and v0.3 versions.
These are v0.2's analysis targets:
https://www.nvshens.com/gallery/
https://www.nvshens.com/gallery/dudou/
The analytical approach is the same, so I won't elaborate; see the source for yourself.
Here's what the downloaded results look like.
Finally, here is the code link:
GitHub
https://github.com/kiok1210/nvshens_img
Reference documents:
- Go language grequests+goquery simple crawler, using multi-threaded concurrent crawling
- What is an HTTP Referer?
- Python3 climbed goddess picture, cracked hotlinking problem