If a crawler needs to show off its speed, I think downloading pictures is the way to do it. I originally wanted to download pictures from Jiandan, where the pictures are all high quality, and my draft was nearly finished, but when I looked again today, the entrance to its picture section had been closed.
As for why it was closed, you can search Baidu for the reasons XXX Daily was shut down; I won't go into it here. This time I chose to download copyright-free high-definition pictures, because people working in self-media are very afraid of infringement, and hunting for copyright-free pictures has practically become a daily task. So this time I chose this site:
https://unsplash.com/
So what difference does async actually make?
(The run on the right uses async, the one on the left does not; since this was just a test, downloading 12 images was enough.)
As you can see, the async version takes almost one-sixth the time of the version without async. Impressive, right? For reference, below is a sketch of the synchronous baseline such a comparison could use; then let's analyze how to crawl the site.
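This baseline is an assumption on my part (the post only shows the async version); it just downloads a list of image URLs one by one with requests and times the whole run:

import time

import requests

def download_sync(img_links):
    # img_links: list of (name, url) pairs, matching the format used later
    start = time.time()
    for name, url in img_links:
        content = requests.get(url).content  # blocking download, one at a time
        with open(name + '.jpg', 'wb') as f:
            f.write(content)
    print('Synchronous run took %.2f s' % (time.time() - start))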
1. Find the landing page
The site's home page shows a bunch of pictures, and scrolling down automatically loads more, which is clearly AJAX loading. No need to worry; we have covered dynamic loading before, so open the developer tools and see what kind of request it makes.
Scrolling down, this request is easy to spot. It is a GET request with status code 200, and the URL is https://unsplash.com/napi/photos?page=3&per_page=12&order_by=latest. There are three parameters; it is easy to see that page is the page number and is the only parameter that changes, while the others stay the same.
The response is JSON; in each photo's links field, download is the image's download link. Now everything is clear.
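As a quick sanity check before writing the spider (a sketch; the field names follow what the developer tools showed for this endpoint):

import requests

params = {'page': 3, 'per_page': 12, 'order_by': 'latest'}
response = requests.get('https://unsplash.com/napi/photos', params=params)
photos = response.json()               # a list of 12 photo objects
print(photos[0]['id'])                 # the photo's id
print(photos[0]['links']['download'])  # the direct download link

With that confirmed, here is the code.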
2. Code section
async def __get_content(self, link):
    async with aiohttp.ClientSession() as session:
        response = await session.get(link)
        content = await response.read()
        return content
This is the method that fetches a picture's content. aiohttp.ClientSession is used almost exactly like requests.Session, except that reading the response body becomes read(), and it must be awaited.
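To make the parallel concrete, here is a side-by-side sketch of the two styles (illustrative; url is a placeholder):

import aiohttp
import requests

def fetch_sync(url):
    # requests: .content gives the raw body bytes directly
    return requests.get(url).content

async def fetch_async(url):
    # aiohttp: opening the session, making the request, and reading
    # the body are all awaited
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)
        return await response.read()

Calling the async version requires an event loop, e.g. asyncio.get_event_loop().run_until_complete(fetch_async(url)), exactly as the full code below does.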
Here is the complete code
import requests, os, time
import aiohttp, asyncio

class Spider(object):
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        self.num = 1
        if 'pictures' not in os.listdir('.'):
            os.mkdir('pictures')
        self.path = os.path.join(os.path.abspath('.'), 'pictures')
        os.chdir(self.path)  # enter the download directory

    async def __get_content(self, link):
        # fetch the raw bytes of one picture
        async with aiohttp.ClientSession() as session:
            response = await session.get(link)
            content = await response.read()
            return content

    def __get_img_links(self, page):
        # request one page of photo metadata from the API found above
        url = 'https://unsplash.com/napi/photos'
        data = {
            'page': page,
            'per_page': 12,
            'order_by': 'latest'
        }
        response = requests.get(url, params=data)
        if response.status_code == 200:
            return response.json()
        else:
            print('Request failed with status code %s' % response.status_code)
            return []  # empty list so the caller's loop simply skips this page

    async def __download_img(self, img):
        # img is an (id, download_link) pair
        content = await self.__get_content(img[1])
        with open(img[0] + '.jpg', 'wb') as f:
            f.write(content)
        print('Downloaded picture %s successfully' % self.num)
        self.num += 1

    def run(self):
        start = time.time()
        for x in range(1, 101):  # download 100 pages of pictures, or change the number of pages yourself
            links = self.__get_img_links(x)
            tasks = [asyncio.ensure_future(self.__download_img((link['id'], link['links']['download'])))
                     for link in links]
            loop = asyncio.get_event_loop()
            loop.run_until_complete(asyncio.wait(tasks))
            if self.num >= 10:  # for speed testing; comment this out to download more pictures
                break
        end = time.time()
        print('Total runtime: %s' % (end - start))

def main():
    spider = Spider()
    spider.run()

if __name__ == '__main__':
    main()
As you can see, fewer than 50 lines of code are enough to download pictures from the whole site. You have to marvel at how powerful Python is ~~~