Crawler Combat (5): A little treat! Fetching the pictures on Meizitu with Python


"Insert picture, sister figure Home"

Ha, this is as far as I dare go with the preview.
Today's post is a little treat: with today's code, you can fill your hard drive right up ~
Here we go!

Step one: How to get a picture

If we know the URL of an image, how do we actually download it?
Let's look at the simplest way:
"Insert picture, single page URL"

We fetch the image content and write it to a file as a binary stream.
Being lazy this time, all the pictures are saved to the current directory.

import requests

url = 'http://i.meizitu.net/2017/11/24a02.jpg'
pic = requests.get(url).content      # raw bytes of the image
pic_name = url.split('/')[-1]        # last part of the URL becomes the file name
with open(pic_name, 'wb') as f:      # write the binary stream to disk
    f.write(pic)
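As a small optional safeguard (not something the original code does), you can check the response before writing it to disk, so a blocked or failed request doesn't leave a broken file behind. A minimal sketch using the same example URL:

import requests

url = 'http://i.meizitu.net/2017/11/24a02.jpg'
resp = requests.get(url)
# only save the file if the request succeeded and actually returned an image
if resp.status_code == 200 and resp.headers.get('Content-Type', '').startswith('image'):
    with open(url.split('/')[-1], 'wb') as f:
        f.write(resp.content)
else:
    print('Download failed or blocked, status:', resp.status_code)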

Unexpectedly, something goes wrong: the picture we get back looks like this:
"Insert picture, anti-crawler-anti-theft chain"

It seems every site is rolling out anti-crawler measures these days, so let's try adding some camouflage to the request.
Presumably too many hot-blooded young people have been using this site to practice their crawlers...
After some digging, it turns out that adding a Referer header lets the download go through. The Referer should be the image's parent address, i.e. the URL of the page that embeds the image.
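To show the key change in isolation, here is a minimal sketch (both URLs are just examples): the request carries a browser-like User-Agent plus a Referer pointing at the page that embeds the image.

import requests

img_url = 'http://i.meizitu.net/2017/11/24a02.jpg'    # example image address
page_url = 'http://www.mzitu.com/109942/2'            # the page that embeds it
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Referer': page_url,  # without this the site serves its anti-hotlinking placeholder
}
pic = requests.get(img_url, headers=headers).content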
OK, with that in place we can get the picture. The full code is below, with one more change: instead of opening the image address directly, we open the single-image page that contains it.

import requests
import re

pic_parent = 'http://www.mzitu.com/109942/2'

def save_one_pic(pic_parent):
    pic_path = pic_parent.split('/')[-2]
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    headers['Referer'] = pic_parent  # newly added header, otherwise you won't get the picture
    html = requests.get(pic_parent, headers=headers).text
    # the original pattern was stripped when the article was published;
    # something like this matches the image src on the single-image page
    pattern = re.compile(r'<img src="(.*?)"')
    pic_url = re.findall(pattern, html)[0]  # match the image address on the single page
    pic = requests.get(pic_url, headers=headers).content
    pic_name = pic_path + '_' + pic_url.split('/')[-1]
    with open(pic_name, 'wb') as f:
        f.write(pic)

if __name__ == '__main__':
    save_one_pic(pic_parent)
Step two: Get all of a model's pictures and save them

In step one we learned how to save the picture from a single page.
But each model usually has dozens of photos. How do we get all of a model's photos in one go?
"Insert picture, Model multi-page"

It's actually quite convenient: just as before, we only need to tweak the URL slightly:
http://www.mzitu.com/110107/3
Only the last number changes.
Of course, we first need the total number of pictures for the model, which we can again match with a regular expression.
The code is as follows:

def get_one_volume_pic(pic_volume_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_volume_url, headers=headers).text
    # the page count sits in the pager, right before the "next page" link
    # ("Next Page" below stands in for the site's actual pager text)
    pattern = re.compile(r"(.*)<span>(\d+?)</span></a><a href='(.*?)'><span>Next Page")
    max_no = int(re.findall(pattern, html)[0][-2])  # second group of the first match is the page count
    first_name = pic_volume_url.split('/')[-1]
    # print(first_name)
    # print(max_no)
    print('--Start saving:', first_name)
    p = Pool()
    p.map(save_one_pic, [pic_volume_url + '/' + str(i) for i in range(1, max_no + 1)])
    # for i in range(max_no + 1):
    #     url = pic_volume_url + str(i)
    #     save_one_pic(url)
    print('--', first_name, ', save complete')
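A quick usage sketch (the album URL is just an example): because the function uses multiprocessing.Pool, the call should sit under an if __name__ == '__main__': guard, especially on Windows, so the worker processes don't re-run the module on import.

from multiprocessing import Pool  # needed by get_one_volume_pic

if __name__ == '__main__':
    # download every picture of one model; the URL is an example album address
    get_one_volume_pic('http://www.mzitu.com/110107')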
Step three: Get the pictures of all the models on one listing page

"Insert Picture, homepage multi-page"

http://www.mzitu.com/page/2/
Each listing page contains a collection of models, and we want to grab the entry URL of every model on the page so that step two can be used to download their pictures.
"Insert picture, model URL entry"

# save page number page_no of the base listing page
def get_one_volume_all_pic(page_no):
    url = base_url + str(page_no)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(url, headers=headers).text
    pattern = re.compile(r'<li><a href="(.*?)" target')
    url_list = re.findall(pattern, html)
    # print(url_list)
    print('Start saving page', page_no, '!')
    for url in url_list:
        get_one_volume_pic(url)
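Usage is straightforward; for example, to save everything linked from page 2 of the listing (assuming base_url and the functions above are already defined):

base_url = 'http://www.mzitu.com/page/'  # listing pages live under /page/<n>

if __name__ == '__main__':
    get_one_volume_all_pic(2)  # walk listing page 2 and download every model found on it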
Step four: Get the total number of listing pages

Since the number of pictures is huge, we'd better not try to download everything at once.
To download a specified number of listing pages, we first need to know how many pages there are.
The method is similar to the one in step two; refer back to it for details.

base_url = 'http://www.mzitu.com/page/'  # a page number needs to be appended

def get_max_page(base_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(base_url + '1', headers=headers).text
    # print(html)
    pattern = re.compile(r"<a class='page-numbers'(.*?)</span>(\d+?)<span class=(.*?)></span></a>")
    max_no = re.findall(pattern, html)[-1][-2]  # the last pager link holds the biggest page number
    # print(max_no)
    return max_no
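Keep in mind that re.findall returns strings, so the value coming back from get_max_page needs an int() before it can be used in a range; a small usage sketch:

max_no = int(get_max_page(base_url))  # convert the matched digits to an integer
print('The site currently has', max_no, 'listing pages')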
All code

Finally finished. The code isn't long, but the constant errors and debugging still took quite a bit of time.
For now all the pictures are saved in the current folder; a follow-up improvement would be to create a separate folder for each volume so things look tidier (see the sketch after the full listing below).

import requests
import re
from multiprocessing import Pool

base_url = 'http://www.mzitu.com/page/'  # a page number needs to be appended


def save_one_pic(pic_parent):
    pic_path = pic_parent.split('/')[-2]
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_parent, headers=headers).text
    # the original pattern was stripped when the article was published;
    # something like this matches the image src on the single-image page
    pattern = re.compile(r'<img src="(.*?)"')
    pic_url = re.findall(pattern, html)[0]
    headers['Referer'] = pic_parent  # added header, otherwise you won't get the picture
    pic = requests.get(pic_url, headers=headers).content
    pic_name = pic_path + '_' + pic_url.split('/')[-1]
    with open(pic_name, 'wb') as f:
        f.write(pic)
    print('------saved successfully:', pic_name)


def get_one_volume_pic(pic_volume_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_volume_url, headers=headers).text
    # ("Next Page" stands in for the site's actual pager text)
    pattern = re.compile(r"(.*)<span>(\d+?)</span></a><a href='(.*?)'><span>Next Page")
    max_no = int(re.findall(pattern, html)[0][-2])
    first_name = pic_volume_url.split('/')[-1]
    # print(first_name)
    # print(max_no)
    print('--Start saving:', first_name)
    p = Pool()
    p.map(save_one_pic, [pic_volume_url + '/' + str(i) for i in range(1, max_no + 1)])
    # for i in range(max_no + 1):
    #     url = pic_volume_url + str(i)
    #     save_one_pic(url)
    print('--', first_name, ', save complete')


# save page number page_no of the base listing page
def get_one_volume_all_pic(page_no):
    url = base_url + str(page_no)
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(url, headers=headers).text
    pattern = re.compile(r'<li><a href="(.*?)" target')
    url_list = re.findall(pattern, html)
    # print(url_list)
    print('Start saving page', page_no, '!')
    for url in url_list:
        get_one_volume_pic(url)


def get_max_page(base_url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(base_url + '1', headers=headers).text
    # print(html)
    pattern = re.compile(r"<a class='page-numbers'(.*?)</span>(\d+?)<span class=(.*?)></span></a>")
    max_no = re.findall(pattern, html)[-1][-2]
    # print(max_no)
    return max_no


if __name__ == '__main__':
    max_no = get_max_page(base_url)
    print('There are currently {0} pages of galleries in total!'.format(max_no))
    no = int(input('Please enter how many pages you want to download: '))
    for i in range(1, no + 1):
        get_one_volume_all_pic(i)
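As for the folder-per-volume improvement mentioned above, one possible approach (a sketch that reuses the same reconstructed image pattern, not the post's final code) is to derive a directory name from the volume URL and create it with os.makedirs before writing:

import os
import re
import requests

def save_one_pic_in_folder(pic_parent):
    volume_id = pic_parent.split('/')[-2]   # e.g. '109942' from .../109942/2
    os.makedirs(volume_id, exist_ok=True)   # one folder per volume
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    html = requests.get(pic_parent, headers=headers).text
    pic_url = re.findall(r'<img src="(.*?)"', html)[0]  # same reconstructed pattern as above
    headers['Referer'] = pic_parent
    pic = requests.get(pic_url, headers=headers).content
    pic_name = os.path.join(volume_id, pic_url.split('/')[-1])
    with open(pic_name, 'wb') as f:
        f.write(pic)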

Take a look at the results!
"Insert picture, result"

The content is a bit too spicy to preview here; go grab Python, run the code, and try to fill up your hard drive ~~~
