Scraping goddess pictures with Python3, and cracking the hotlinking problem

Source: Internet
Author: User

Title: Scraping goddess pictures with Python3, and cracking the hotlinking problem

Date:2018-04-22 08:26:00

Tags: [Python3, beauty, picture grabbing, crawler, hotlinking]

Comments: true

Preface

In essence, there is no difference between scraping pictures and scraping novels; the steps are the same.

But when reading the pictures I ran into a hotlinking problem, and that problem took the longest time to solve.

Environment

Language: Python3

Operating system: macOS 10.12.6

Custom toolkit: Soup_tool

Its dependencies are as follows:

    from urllib import request
    from urllib.parse import quote
    from bs4 import BeautifulSoup
    import os
    import threading
    import re
    import ssl
Version 0.1: crawl all specified images from a single URL

Crawl analysis

First, open a single goddess photo gallery:
https://www.nvshens.com/g/24816

You can see the parts I've labeled in the screenshot.

Using Chrome's inspect feature, you can see that the current page has 3 of the images we want.

The URL format of the second picture can be used as a template.

Just replace 001 with 002, 003, ..., 044 in turn.

As for how many pictures there are in total, just look at the count I marked in the screenshot.

This way you can crawl everything without paging.
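To make the pattern concrete, here is a minimal illustrative sketch (not from the original post) that builds the full list of image URLs from the template; the base URL and the count of 44 are the example values from the analysis above:

    # build all image URLs from the observed pattern
    base_url = 'https://img.onvshen.com:85/gallery/25366/24816/'
    urls = [base_url + '0.jpg']  # the first image is named 0.jpg
    # the rest are zero-padded: 001.jpg ... 044.jpg
    urls += [base_url + str(i).zfill(3) + '.jpg' for i in range(1, 45)]
    print(urls[:3])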

Now, the basic analysis is complete.

Hands-on practice

Because of experience accumulated from crawling novel sites earlier, I had already written a tool class. It mainly uses request to open connections, BeautifulSoup to parse web pages, and ssl to work around HTTPS problems.

The tool code is not posted here; the GitHub address for this project is given at the end.
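Since the tool class isn't shown, here is a minimal sketch of what the two helpers used below, Soup.get_soup and Soup.create_folder, might look like; this is a hypothetical reconstruction, and the real implementation lives in the GitHub repository:

    import os
    import ssl
    from urllib import request
    from bs4 import BeautifulSoup

    class Soup:
        @staticmethod
        def get_soup(url):
            # hypothetical reconstruction: fetch a page and parse it with BeautifulSoup
            context = ssl._create_unverified_context()  # sidestep HTTPS certificate errors
            html = request.urlopen(url, context=context).read()
            return BeautifulSoup(html, 'html.parser')

        @staticmethod
        def create_folder(path):
            # hypothetical reconstruction: create the folder if it doesn't already exist
            if not os.path.exists(path):
                os.makedirs(path)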

1. First, initialize and create the class

    class Capture:

Reference the custom tool classes:

    from Soup_tool import Soup
    from Soup_tool import MyThread

Then define some parameters for initialization

    def __init__(self):
        self.index_page_url = 'http://www.nvshens.com'
        # home page of a gallery
        self.one_page_url = 'https://www.nvshens.com/g/:key/'
        # root folder
        self.folder_path = 'nvshens/'
        # sleep time for each thread
        self.sleep_time = 2
        # file suffix
        self.file_hz = '.img'
2. Read and parse the gallery home page based on the key

Next, we make access to this page dynamic. In the URL

https://www.nvshens.com/g/24816

24816 is the search key.

Define a method readPageFromSearch:

    def readPageFromSearch(self, search_key):
        """
        Read the search page based on the input key
        :param search_key:
        :return:
        """

In this method, the first thing to do is create the root directory:

        # create folder /nvshens/search_key
        path = self.folder_path + search_key
        Soup.create_folder(path)

Then open the first page of the gallery and parse it with soup:

        # open page 1 of the gallery
        page_url = self.one_page_url.replace(':key', search_key)
        print(page_url)
        soup_html = Soup.get_soup(page_url)

From the soup, find the div whose id is dinfo, then find the span inside it and get its text; strip off the characters '张照片' ('photos') to get the total number of pictures:

        # the span text looks like "44张照片"; strip the suffix to get the number
        text_page = soup_html.find("div", {'id': 'dinfo'}).find('span').get_text()
        print('text_page', text_page)
        last = text_page.replace('张照片', '')
        item_size = int(last)

Then we need the template. The first picture can't be ignored either, so we grab it first. Let's look at the pattern:

    # picture 1
    https://img.onvshen.com:85/gallery/25366/24816/0.jpg
    # picture 2
    https://img.onvshen.com:85/gallery/25366/24816/001.jpg
    # picture 3
    https://img.onvshen.com:85/gallery/25366/24816/002.jpg

So now we know what to do: after taking the first one, use soup's find_next_sibling method to get the next tag node:

        # picture 1
        image_one = soup_html.find("ul", {'id': 'hgallery'}).find('img')
        image_one_url = image_one.get('src')
        print('image_one_url', image_one_url)
        # the link of picture 2 serves as the template
        image_two = image_one.find_next_sibling()
        image_two_url = image_two.get('src')
        print('image_two_url', image_two_url)

Then, from the URL of the second picture, first split on '/' and take the right-most part, getting '001.jpg'; then split that on '.' to get the suffix, so we know whether it's jpg or png:

        # https://img.onvshen.com:85/gallery/25366/24816/001.jpg
        # -> 001.jpg
        img_hz = image_two_url.split("/")[-1]
        # -> jpg
        file_hz = img_hz.split('.')[1]
        # -> https://img.onvshen.com:85/gallery/25366/24816/
        img_mod_url = image_two_url.replace(img_hz, '')
3. Read the picture links with multiple threads

Define the readPageByThread method, passing in the values computed above as parameters:

    • maximum number of pictures: item_size
    • file storage directory: path
    • template URL: img_mod_url
    • file suffix: file_hz

    # multi-threaded reading: each image download runs in its own thread
    def readPageByThread(self, item_size, path, img_mod_url, file_hz):
        """
        :param item_size: maximum number of pictures
        :param path: file storage directory
        :param img_mod_url: template url
        :param file_hz: file suffix
        :return:
        """
        threads = []  # collect the threads so they can be joined later

Loop up to item_size, using the zfill method to pad the number with zeros on the left:

        # loop over and open each image link
        for item in range(1, item_size):
            # left-pad with zeros: 1 -> 001, 2 -> 002, ..., 114 -> 114
            page = str(item + 1).zfill(3)
            new_page_url = img_mod_url + page + '.' + file_hz
            new_path = path + '/' + page + '.' + file_hz
            print(new_path, '---', new_page_url)

Using the custom multithreading class, collect each thread and pass the parameters into the readPagetoTxt method:

            t = MyThread(self.readPagetoTxt,
                         (new_page_url, new_path, self.sleep_time),
                         self.readPagetoTxt.__name__)
            threads.append(t)

Start the threads, then join them so the main thread blocks until all downloads finish:

        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # ctime comes from the time module (from time import ctime)
        print('all end', ctime())
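MyThread comes from the custom toolkit and isn't shown in this post; below is a minimal sketch consistent with the way it is called above. The constructor signature is an assumption on my part:

    import threading

    class MyThread(threading.Thread):
        # hypothetical reconstruction: wrap a target function, its arguments, and a thread name
        def __init__(self, func, args, name=''):
            super().__init__(name=name)
            self.func = func
            self.args = args

        def run(self):
            # invoke the wrapped function with the stored arguments
            self.func(*self.args)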
4. Read the picture content and write it out

This is the crux of the image scraping. I spent the evening searching through a lot of material and found the following method:

urllib.request.urlretrieve

I tested it myself: it does nothing to crack the hotlinking protection.
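For reference, this is the kind of call that was tried (the URL and path are example values from earlier); because urlretrieve sends no Referer header, the site rejects the request or serves a placeholder instead of the real image:

    from urllib import request

    # plain urlretrieve sends no Referer, so the hotlink check blocks it
    request.urlretrieve('https://img.onvshen.com:85/gallery/25366/24816/001.jpg',
                        'nvshens/24816/001.jpg')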

So how do you actually crack hotlinking? I found another developer's article:

A simple crawler in Go with grequests + goquery, crawling concurrently with multiple threads

It contains a piece of code like this:

    Headers: map[string]string{
        "Referer":    "http://www.zngirls.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"}})

These headers set not only the User-Agent but also a Referer. Eh? What's that? Let me give it a try.

I set Referer to our index_page_url (http://www.nvshens.com), and sure enough, it worked. Why is that?

It turns out that Referer indicates the source of a request, i.e. the site from which the request to the web server was made. By setting Referer to http://www.nvshens.com, we make the request look as if it came from the site itself.

For details, see that developer's article 'What is an HTTP Referer?'

Of course, this works only because this site's developers rely solely on the Referer to detect hotlinking; if they checked something other than the Referer, it would have to be cracked all over again.

Now let's write our code: add a Referer property to the headers in the Soup_tool class.

    _HEAD2 = {
        # Referer: the site whose images are being scraped; adding this header cracks the hotlink protection
        "Referer": "",
        'Accept-language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }

    @staticmethod
    def open_url(query_url, referer=''):
        Soup._HEAD2['Referer'] = referer
        req = request.Request(quote(query_url, safe='/:?='), headers=Soup._HEAD2)
        webpage = request.urlopen(req)
        html = webpage.read()
        return html

    @staticmethod
    def write_img(query_url, file_name, referer):
        content = Soup.open_url(query_url, referer)
        with open(file_name, 'wb') as f:
            f.write(content)
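A quick usage sketch; the URL and path are example values from earlier in the post:

    # download one image, passing the gallery site as the Referer
    Soup.write_img('https://img.onvshen.com:85/gallery/25366/24816/001.jpg',
                   'nvshens/24816/001.jpg',
                   'http://www.nvshens.com')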

Back to our Capture class:

        # use Request with the added headers to read the image link and write it out; the key point is adding the Referer
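Only that comment survives in this excerpt; here is a minimal sketch of what readPagetoTxt might look like, assuming it simply wraps Soup.write_img and sleeps between downloads. The body is my reconstruction, not the original code:

    from time import sleep

    def readPagetoTxt(self, page_url, path, sleep_time):
        # write the image, passing index_page_url as the Referer to defeat the hotlink check
        Soup.write_img(page_url, path, self.index_page_url)
        # sleep so the threads don't hammer the server
        sleep(sleep_time)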
Postscript

There are also v0.2 and v0.3 versions.

This is v0.2's analysis:

https://www.nvshens.com/gallery/

https://www.nvshens.com/gallery/dudou/

The analysis approach is the same, so I won't elaborate; check the source code yourself.

Here is what the downloaded result looks like.

Finally, the code link:

GitHub

https://github.com/kiok1210/nvshens_img

Reference documents:

A simple crawler in Go with grequests + goquery, crawling concurrently with multiple threads

What is an HTTP Referer?

Scraping goddess pictures with Python3, and cracking the hotlinking problem
