Python3 Web Crawler (10): A World Full of Handsome, Muscular Men (Crawling Handsome-Guy Pictures)

Source: Internet
Author: User

Please credit the author and source when reposting: http://blog.csdn.net/c406495762
Operating platform: Windows
Python version: Python 3.x
IDE: Sublime Text 3

    • Preface
    • Preliminary Knowledge
    • Practice
      • 1 Background
      • 2 Requests Installation
      • 3 Crawling Single-Page Target Links
      • 4 Crawling Multi-Page Target Links
      • 5 Single Photo Download
      • 6 Overall Code
    • Summary

1 Preface

Until now, I felt there were already plenty of "crawl the sister pictures" tutorials online, so I never wrote a hands-on tutorial about crawling images. Recently, a friend who follows my crawler tutorials said he hoped I would write one, so today let's talk about how to crawl pictures. In fact, compared with crawlers that require packet-capture analysis, crawling pictures is quite simple: once we find the address of an image, we can download it. Everyone else's image-crawling tutorials scrape "sister pictures" — the "Jiandan" (fried egg) site, the "Meizitu" (sister picture) site — a dazzling, endless stream of girl pictures. My health is getting worse by the day as it is. So, out of consideration for everyone's wellbeing, today we will not crawl sister pictures; I will crawl "handsome-guy pictures" instead! (PS: I will not admit that I am secretly checking whether any beautiful programmers start following me!)

2 Preliminary knowledge

To learn something new along the way, this crawler tutorial uses the third-party requests library rather than Python 3's built-in urllib.request. requests is a powerful library built on top of urllib3.

The basic methods of the requests library are as follows:

Official quick-start tutorial (Chinese): http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

Because the official "Quickstart" tutorial is already well organized, and this tutorial only needs the simplest call, requests.get(), I won't describe the library's usage further here. See the official tutorial for details; anyone with a urllib2 background will find it easy to pick up.

3 Practice

3.1 Background

We will crawl handsome-guy pictures from the "Shuai A" (handsome) site!

URL: http://www.shuaia.net/index.html

Take a look at what the site looks like:

3.2 Requests Installation

In cmd, install the third-party requests library with either of the following commands:

pip install requests

Or:

easy_install requests
3.3 Crawling Single-Page Target Links

By inspecting the page elements, it is easy to see that the address of each target page is stored in the href attribute of the <a> tag whose class attribute is "item-img". At this point someone might ask: why not use the src attribute of the <img> tag inside it? Because that image is the thumbnail shown on the home page; a picture saved from that address is too small and unclear. In the spirit of "HD and uncensored", the thumbnail is not what we want. So we first collect the target page addresses — the pages we reach by clicking a thumbnail — and then find the full-size image address inside each of those pages.

Code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    url = 'http://www.shuaia.net/index.html'
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
    req = requests.get(url=url, headers=headers)
    req.encoding = 'utf-8'
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    targets_url = bf.find_all(class_='item-img')
    list_url = []
    for each in targets_url:
        list_url.append(each.img.get('alt') + '=' + each.get('href'))
    print(list_url)

We save the crawled information to a list, joining each image name and target page address with "=". Running the code gives:
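The "=" separator is what lets the overall code later split each entry back into a name and an address. Here is a stdlib-only sketch of that round trip (the entry value is made up for illustration):

```python
# Each saved entry has the form "<image name>=<target page url>".
entry = 'Example portrait=http://www.shuaia.net/example/1.html'

# Split on the first '=' only, so a URL that happened to contain '='
# (e.g. in a query string) would still come back in one piece.
name, url = entry.split('=', 1)
print(name)  # Example portrait
print(url)   # http://www.shuaia.net/example/1.html
```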

3.4 Crawling Multi-Page Target Links

Turning to the second page, it is easy to see that the address changes to www.shuaia.net/index_2.html, and likewise for the third, fourth, fifth page and so on.
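That page-number rule can be sketched on its own before the full listing — a hypothetical helper, stdlib only:

```python
def page_url(num):
    # Page 1 uses the plain index filename; every later page
    # follows the index_<n>.html pattern described above.
    if num == 1:
        return 'http://www.shuaia.net/index.html'
    return 'http://www.shuaia.net/index_%d.html' % num

print(page_url(1))  # http://www.shuaia.net/index.html
print(page_url(5))  # http://www.shuaia.net/index_5.html
```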

Code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    list_url = []
    for num in range(1, 20):
        if num == 1:
            url = 'http://www.shuaia.net/index.html'
        else:
            url = 'http://www.shuaia.net/index_%d.html' % num
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        req = requests.get(url=url, headers=headers)
        req.encoding = 'utf-8'
        html = req.text
        bf = BeautifulSoup(html, 'lxml')
        targets_url = bf.find_all(class_='item-img')
        for each in targets_url:
            list_url.append(each.img.get('alt') + '=' + each.get('href'))
    print(list_url)

We crawl a modest amount here: the target links of the first 19 pages:

3.5 Single Photo Download

Open a target page address and inspect the elements again. As you can see, the image address is stored in the src attribute of the img tag nested inside the div whose class attribute is "wr-single-content-list".

Code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import requests
import os

target_url = 'http://www.shuaia.net/rihanshuaige/2017-05-18/1294.html'
filename = 'Jang Keun Suk shooting biker handsome portrait' + '.jpg'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
img_req = requests.get(url=target_url, headers=headers)
img_req.encoding = 'utf-8'
img_html = img_req.text
img_bf_1 = BeautifulSoup(img_html, 'lxml')
img_url = img_bf_1.find_all('div', class_='wr-single-content-list')
img_bf_2 = BeautifulSoup(str(img_url), 'lxml')
img_url = 'http://www.shuaia.net' + img_bf_2.div.img.get('src')
if 'images' not in os.listdir():
    os.makedirs('images')
urlretrieve(url=img_url, filename='images/' + filename)
print('Download done!')
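The string concatenation above works because the site's src values are site-relative. As a side note, urllib.parse.urljoin is a sturdier way to build the full URL (the src value below is made up for illustration):

```python
from urllib.parse import urljoin

page = 'http://www.shuaia.net/rihanshuaige/2017-05-18/1294.html'
src = '/uploads/allimg/example.jpg'  # illustrative relative src value

# urljoin resolves the relative src against the page URL, and would
# also pass an already-absolute src through unchanged.
full = urljoin(page, src)
print(full)  # http://www.shuaia.net/uploads/allimg/example.jpg
```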

We save the picture into an images directory under the directory where the script lives:
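Incidentally, os.makedirs(..., exist_ok=True) is a sturdier way to ensure the directory exists than checking os.listdir(): it is idempotent and does not depend on the current working directory. A small sketch, run inside a temporary directory so it leaves nothing behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'images')
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(target)

print(created)  # True
```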

3.6 Overall Code

We now have the link to every picture and can download them. Integrating the code, we again download a modest amount: the pictures from the first 2 pages.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import requests
import os
import time

if __name__ == '__main__':
    list_url = []
    for num in range(1, 3):
        if num == 1:
            url = 'http://www.shuaia.net/index.html'
        else:
            url = 'http://www.shuaia.net/index_%d.html' % num
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        req = requests.get(url=url, headers=headers)
        req.encoding = 'utf-8'
        html = req.text
        bf = BeautifulSoup(html, 'lxml')
        targets_url = bf.find_all(class_='item-img')
        for each in targets_url:
            list_url.append(each.img.get('alt') + '=' + each.get('href'))
    print('Link acquisition complete')

    for each_img in list_url:
        img_info = each_img.split('=')
        target_url = img_info[1]
        filename = img_info[0] + '.jpg'
        print('Downloading: ' + filename)
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        img_req = requests.get(url=target_url, headers=headers)
        img_req.encoding = 'utf-8'
        img_html = img_req.text
        img_bf_1 = BeautifulSoup(img_html, 'lxml')
        img_url = img_bf_1.find_all('div', class_='wr-single-content-list')
        img_bf_2 = BeautifulSoup(str(img_url), 'lxml')
        img_url = 'http://www.shuaia.net' + img_bf_2.div.img.get('src')
        if 'images' not in os.listdir():
            os.makedirs('images')
        urlretrieve(url=img_url, filename='images/' + filename)
        time.sleep(1)
    print('Download done!')

The results of the operation are as follows:

The final downloaded Image:

4 Summary

Aren't the pictures handsome? Are you satisfied?

This crawling method is simple but slow. The server has anti-crawler measures, so we cannot crawl too fast: each image download needs a 1-second delay, otherwise the server disconnects us. There are solutions, of course, but since they are not the focus of this article, I will elaborate on them when the opportunity arises.
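The fixed time.sleep(1) in the overall code is the simplest form of throttling. A slightly more flexible sketch (a hypothetical helper, not from the original article) enforces a minimum interval between requests instead, sleeping only for whatever time remains:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last = None

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Sleep only for the remainder of the interval.
                time.sleep(remaining)
        self.last = time.monotonic()

throttle = Throttle(min_interval=0.1)  # short interval just for the demo
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # call once before each download
duration = time.monotonic() - start
print('total: %.2f s' % duration)  # roughly two full intervals
```

This way, time already spent downloading and parsing counts toward the interval, so the crawler stays polite without sleeping longer than necessary.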

The principle of crawling pictures is always the same. If you would rather crawl pictures of girls, you can go to the "Jiandan" (fried egg) site; satisfaction guaranteed.

PS: If you found this chapter helpful, you are welcome to follow, comment, and upvote!
