Crawler Combat (5): A little treat! Fetching the pictures on Meizitu with Python


"Insert picture, sister figure Home"

Ha, this is as far as I dare go with the preview.
Today's post is a little treat: with today's code, you can fill your hard drive right up ~
Here we go!

Step one: How to get a picture

If we know the URL of an image, how do we actually download it?
Let's look at the simplest way:
"Insert picture, single page URL"

We fetch the image content and write it to a file as a binary stream.
Being lazy this time, all the pictures are saved to the current directory.

import requests

url = 'http://i.meizitu.net/2017/11/24a02.jpg'
pic = requests.get(url).content      # raw bytes of the image
pic_name = url.split('/')[-1]        # last part of the URL becomes the file name
with open(pic_name, 'wb') as f:      # write the binary stream to disk
    f.write(pic)
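As a small optional safeguard (not something the original code does), you can check the response before writing it to disk, so a blocked or failed request doesn't leave a broken file behind. A minimal sketch using the same example URL:

import requests

url = 'http://i.meizitu.net/2017/11/24a02.jpg'
resp = requests.get(url)
# only save the file if the request succeeded and actually returned an image
if resp.status_code == 200 and resp.headers.get('Content-Type', '').startswith('image'):
    with open(url.split('/')[-1], 'wb') as f:
        f.write(resp.content)
else:
    print('Download failed or blocked, status:', resp.status_code)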

Unexpectedly, something goes wrong: the picture we get back looks like this:
"Insert picture, anti-crawler-anti-theft chain"

It seems every site is rolling out anti-crawler measures these days, so let's try adding some camouflage to the request.
Presumably too many hot-blooded young people have been using this site to practice their crawlers...
After some digging, it turns out that adding a Referer header lets the download go through. The Referer should be the image's parent address, i.e. the URL of the page that embeds the image.
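To show the key change in isolation, here is a minimal sketch (both URLs are just examples): the request carries a browser-like User-Agent plus a Referer pointing at the page that embeds the image.

import requests

img_url = 'http://i.meizitu.net/2017/11/24a02.jpg'    # example image address
page_url = 'http://www.mzitu.com/109942/2'            # the page that embeds it
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Referer': page_url,  # without this the site serves its anti-hotlinking placeholder
}
pic = requests.get(img_url, headers=headers).content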
OK, with that in place we can get the picture. The full code is below, with one more change: instead of opening the image address directly, we open the single-image page that contains it.

import requests
import re

pic_parent = 'http://www.mzitu.com/109942/2'

def save_one_pic(pic_parent):
    pic_path = pic_parent.split('/')[-2]
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    headers['Referer'] = pic_parent  # newly added header, otherwise you won't get the picture
    html = requests.get(pic_parent, headers=headers).text
    # the original pattern was stripped when the article was published;
    # something like this matches the image src on the single-image page
    pattern = re.compile(r'<img src="(.*?)"')
    pic_url = re.findall(pattern, html)[0]  # match the image address on the single page
    pic = requests.get(pic_url, headers=headers).content
    pic_name = pic_path + '_' + pic_url.split('/')[-1]
    with open(pic_name, 'wb') as f:
        f.write(pic)

if __name__ == '__main__':
    save_one_pic(pic_parent)
Step two: Get all of a model's pictures and save them

In step one we learned how to save the picture from a single page.
But each model usually has dozens of photos. How do we get all of a model's photos in one go?
"Insert picture, Model multi-page"

It's actually quite convenient: just as before, we only need to tweak the URL slightly:
http://www.mzitu.com/110107/3
Only the last number changes.
Of course, we first need the total number of pictures for the model, which we can again match with a regular expression.
The code is as follows:

def get_one_volume_pic(pic_volume_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_volume_url, headers=headers).text
    # the page count sits in the pager, right before the "next page" link
    # ("Next Page" below stands in for the site's actual pager text)
    pattern = re.compile(r"(.*)<span>(\d+?)</span></a><a href='(.*?)'><span>Next Page")
    max_no = int(re.findall(pattern, html)[0][-2])  # second group of the first match is the page count
    first_name = pic_volume_url.split('/')[-1]
    # print(first_name)
    # print(max_no)
    print('--Start saving:', first_name)
    p = Pool()
    p.map(save_one_pic, [pic_volume_url + '/' + str(i) for i in range(1, max_no + 1)])
    # for i in range(max_no + 1):
    #     url = pic_volume_url + str(i)
    #     save_one_pic(url)
    print('--', first_name, ', save complete')
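A quick usage sketch (the album URL is just an example): because the function uses multiprocessing.Pool, the call should sit under an if __name__ == '__main__': guard, especially on Windows, so the worker processes don't re-run the module on import.

from multiprocessing import Pool  # needed by get_one_volume_pic

if __name__ == '__main__':
    # download every picture of one model; the URL is an example album address
    get_one_volume_pic('http://www.mzitu.com/110107')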
Step three: Get the pictures of all the models on one listing page

"Insert Picture, homepage multi-page"

http://www.mzitu.com/page/2/
Each listing page contains a collection of models, and we want to grab the entry URL of every model on the page so that step two can be used to download their pictures.
"Insert picture, model URL entry"

# save page number page_no of the base listing page
def get_one_volume_all_pic(page_no):
    url = base_url + str(page_no)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(url, headers=headers).text
    pattern = re.compile(r'<li><a href="(.*?)" target')
    url_list = re.findall(pattern, html)
    # print(url_list)
    print('Start saving page', page_no, '!')
    for url in url_list:
        get_one_volume_pic(url)
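Usage is straightforward; for example, to save everything linked from page 2 of the listing (assuming base_url and the functions above are already defined):

base_url = 'http://www.mzitu.com/page/'  # listing pages live under /page/<n>

if __name__ == '__main__':
    get_one_volume_all_pic(2)  # walk listing page 2 and download every model found on it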
Step four: Get the total number of listing pages

Since the number of pictures is huge, we'd better not try to download everything at once.
To download a specified number of listing pages, we first need to know how many pages there are.
The method is similar to the one in step two; refer back to it for details.

base_url = 'http://www.mzitu.com/page/'  # a page number needs to be appended

def get_max_page(base_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(base_url + '1', headers=headers).text
    # print(html)
    pattern = re.compile(r"<a class='page-numbers'(.*?)</span>(\d+?)<span class=(.*?)></span></a>")
    max_no = re.findall(pattern, html)[-1][-2]  # the last pager link holds the biggest page number
    # print(max_no)
    return max_no
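Keep in mind that re.findall returns strings, so the value coming back from get_max_page needs an int() before it can be used in a range; a small usage sketch:

max_no = int(get_max_page(base_url))  # convert the matched digits to an integer
print('The site currently has', max_no, 'listing pages')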
All code

Finally finished. The code isn't long, but the constant errors and debugging still took quite a bit of time.
For now all the pictures are saved in the current folder; a follow-up improvement would be to create a separate folder for each volume so things look tidier (see the sketch after the full listing below).

import requests
import re
from multiprocessing import Pool

base_url = 'http://www.mzitu.com/page/'  # a page number needs to be appended


def save_one_pic(pic_parent):
    pic_path = pic_parent.split('/')[-2]
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_parent, headers=headers).text
    # the original pattern was stripped when the article was published;
    # something like this matches the image src on the single-image page
    pattern = re.compile(r'<img src="(.*?)"')
    pic_url = re.findall(pattern, html)[0]
    headers['Referer'] = pic_parent  # added header, otherwise you won't get the picture
    pic = requests.get(pic_url, headers=headers).content
    pic_name = pic_path + '_' + pic_url.split('/')[-1]
    with open(pic_name, 'wb') as f:
        f.write(pic)
    print('------saved successfully:', pic_name)


def get_one_volume_pic(pic_volume_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_volume_url, headers=headers).text
    # ("Next Page" stands in for the site's actual pager text)
    pattern = re.compile(r"(.*)<span>(\d+?)</span></a><a href='(.*?)'><span>Next Page")
    max_no = int(re.findall(pattern, html)[0][-2])
    first_name = pic_volume_url.split('/')[-1]
    # print(first_name)
    # print(max_no)
    print('--Start saving:', first_name)
    p = Pool()
    p.map(save_one_pic, [pic_volume_url + '/' + str(i) for i in range(1, max_no + 1)])
    # for i in range(max_no + 1):
    #     url = pic_volume_url + str(i)
    #     save_one_pic(url)
    print('--', first_name, ', save complete')


# save page number page_no of the base listing page
def get_one_volume_all_pic(page_no):
    url = base_url + str(page_no)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(url, headers=headers).text
    pattern = re.compile(r'<li><a href="(.*?)" target')
    url_list = re.findall(pattern, html)
    # print(url_list)
    print('Start saving page', page_no, '!')
    for url in url_list:
        get_one_volume_pic(url)


def get_max_page(base_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(base_url + '1', headers=headers).text
    # print(html)
    pattern = re.compile(r"<a class='page-numbers'(.*?)</span>(\d+?)<span class=(.*?)></span></a>")
    max_no = re.findall(pattern, html)[-1][-2]
    # print(max_no)
    return max_no


if __name__ == '__main__':
    max_no = get_max_page(base_url)
    print('There are currently {0} pages of galleries in total!'.format(max_no))
    no = int(input('Please enter how many pages you want to download: '))
    for i in range(1, no + 1):
        get_one_volume_all_pic(i)
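As for the folder-per-volume improvement mentioned above, one possible approach (a sketch that reuses the same reconstructed image pattern, not the post's final code) is to derive a directory name from the volume URL and create it with os.makedirs before writing:

import os
import re
import requests

def save_one_pic_in_folder(pic_parent):
    volume_id = pic_parent.split('/')[-2]   # e.g. '109942' from .../109942/2
    os.makedirs(volume_id, exist_ok=True)   # one folder per volume
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_parent, headers=headers).text
    pic_url = re.findall(r'<img src="(.*?)"', html)[0]  # same reconstructed pattern as above
    headers['Referer'] = pic_parent
    pic = requests.get(pic_url, headers=headers).content
    pic_name = os.path.join(volume_id, pic_url.split('/')[-1])
    with open(pic_name, 'wb') as f:
        f.write(pic)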

Take a look at the results!
"Insert picture, result"

The content is a bit too spicy to preview here; go grab Python, run the code, and try to fill up your hard drive ~~~
