Advanced crawling: download one high-definition image per second with Python. Fast enough?

Source: Internet
Author: User

If a crawler is going to show off its speed, I think downloading images is the best demonstration. I originally planned to download images from Jandan (煎蛋), where the pictures are high quality, and my draft was almost finished. When I checked again today, however, the entrance to its girl-picture board had been closed.

As for why it was closed, you can look up the reason yourself (search for the XXX daily shutdown, or use Baidu); I won't go into it here. Instead, this time I chose to download copyright-free high-definition images. People who run self-media accounts are very afraid of infringement, and hunting for copyright-free images has become part of their daily work, so I picked this site:

https://unsplash.com/

So what's the difference between using async and not using async?

(Right: the asynchronous version. Left: the synchronous version. Since this is only a test, downloading 12 images is enough.)

As you can see, the asynchronous program finishes in roughly one sixth of the time taken by the synchronous one. Impressive, isn't it? Let's analyze how to crawl the site.
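To see where the speedup comes from without touching the network, here is a minimal sketch. The names fake_download, sequential, concurrent and timed are hypothetical, and asyncio.sleep merely stands in for waiting on a real image download, so the timings only illustrate the principle and are not the measurements from this article.

```python
import asyncio
import time


async def fake_download(i, delay=0.1):
    # Hypothetical stand-in for one image download:
    # sleeping simulates waiting on the network.
    await asyncio.sleep(delay)
    return i


async def sequential(n):
    # Wait for each download to finish before starting the next one.
    return [await fake_download(i) for i in range(n)]


async def concurrent(n):
    # Start all downloads at once and wait for them together.
    return await asyncio.gather(*(fake_download(i) for i in range(n)))


def timed(coro_func, n):
    # Run one of the coroutines above and return (result, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coro_func(n))
    return result, time.perf_counter() - start


if __name__ == '__main__':
    _, t_seq = timed(sequential, 12)
    _, t_con = timed(concurrent, 12)
    print('sequential: %.2fs, concurrent: %.2fs' % (t_seq, t_con))
```

In the sequential version the waits add up, while in the concurrent version they overlap, so the total is close to the time of a single wait. Real downloads are also limited by bandwidth and by the server, so the gain is smaller than in this idealized case.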

1. Find the target page

The site's home page shows a stream of images and loads more automatically as you scroll down, so it is clearly loading them via AJAX. No need to worry: we have covered dynamic loading before, so open the developer tools and see what kind of request it sends.

After scrolling down, the request is easy to spot. It is a GET request with status code 200, and its URL is https://unsplash.com/napi/photos?page=3&per_page=12&order_by=latest. There are three parameters: page is clearly the page number and is the only one that changes, while the other two stay the same.

The response is JSON, and the download field under links in each item is the image download link. Now that everything is clear, on to the code.
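As a small sketch of how the three query parameters combine into the request URL, the snippet below builds the URL for a few pages with the standard library. The helper name build_page_url and the constant API_URL are hypothetical; only the endpoint and the parameter names come from the request observed above.

```python
from urllib.parse import urlencode

API_URL = 'https://unsplash.com/napi/photos'


def build_page_url(page, per_page=12, order_by='latest'):
    # Only `page` changes between requests; the other two stay fixed.
    params = {'page': page, 'per_page': per_page, 'order_by': order_by}
    return API_URL + '?' + urlencode(params)


for page in range(1, 4):
    print(build_page_url(page))
```

In practice requests does this encoding itself when the dictionary is passed through its params argument, which is what the complete code relies on.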
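The following sketch shows how the id and the download link could be pulled out of such a response. The sample JSON is made up for illustration (real items carry many more fields, and the example.com URLs are placeholders), and extract_downloads is a hypothetical helper name.

```python
import json

# Hypothetical, heavily trimmed sample of the JSON list the endpoint returns.
SAMPLE_RESPONSE = '''
[
  {"id": "abc123", "links": {"download": "https://example.com/abc123/download"}},
  {"id": "def456", "links": {"download": "https://example.com/def456/download"}}
]
'''


def extract_downloads(photos):
    # Return (id, download link) pairs, skipping items without a download link.
    pairs = []
    for photo in photos:
        link = photo.get('links', {}).get('download')
        if link:
            pairs.append((photo['id'], link))
    return pairs


photos = json.loads(SAMPLE_RESPONSE)
print(extract_downloads(photos))
```

The id makes a convenient file name, which is how the complete code uses it.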

2. Code section

    async def __get_content(self, link):
        # Fetch the raw bytes of one image
        async with aiohttp.ClientSession() as session:
            response = await session.get(link)
            content = await response.read()
            return content

This is the method that fetches the content of an image. aiohttp.ClientSession is used in much the same way as requests.Session, except that the raw response body is obtained with await response.read() (instead of the content attribute in requests).
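Because read() returns raw bytes rather than text, the file has to be opened in binary mode when saving. A minimal sketch, using made-up bytes instead of a real download; save_image and fake_content are hypothetical names:

```python
import os
import tempfile


def save_image(content, name, directory):
    # Write raw image bytes to <directory>/<name>.jpg in binary mode.
    path = os.path.join(directory, name + '.jpg')
    with open(path, 'wb') as f:
        f.write(content)
    return path


# Made-up payload: a JPEG-style header followed by padding, not a real image.
fake_content = b'\xff\xd8\xff\xe0' + b'\x00' * 16

with tempfile.TemporaryDirectory() as tmp:
    path = save_image(fake_content, 'example', tmp)
    print(os.path.getsize(path))
```

Opening the file in text mode ('w') would raise a TypeError as soon as bytes are written to it.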

Here is the complete code

    import asyncio
    import os
    import time

    import aiohttp
    import requests


    class Spider(object):
        def __init__(self):
            self.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
            self.num = 1
            if 'picture' not in os.listdir('.'):
                os.mkdir('picture')
            self.path = os.path.join(os.path.abspath('.'), 'picture')
            os.chdir(self.path)  # enter the download directory

        async def __get_content(self, link):
            # Fetch the raw bytes of one image
            async with aiohttp.ClientSession() as session:
                response = await session.get(link)
                content = await response.read()
                return content

        def __get_img_links(self, page):
            # Request one page of the photo list and return the parsed JSON
            url = 'https://unsplash.com/napi/photos'
            data = {
                'page': page,
                'per_page': 12,
                'order_by': 'latest'
            }
            response = requests.get(url, params=data, headers=self.headers)
            if response.status_code == 200:
                return response.json()
            else:
                print('Request failed with status code %s' % response.status_code)
                return []  # nothing to download for this page

        async def __download_img(self, img):
            # img is an (id, download link) tuple
            content = await self.__get_content(img[1])
            with open(img[0] + '.jpg', 'wb') as f:
                f.write(content)
            print('Downloaded picture %s successfully' % self.num)
            self.num += 1

        async def __download_page(self, links):
            # Download all images of one page concurrently
            tasks = [self.__download_img((link['id'], link['links']['download'])) for link in links]
            await asyncio.gather(*tasks)

        def run(self):
            start = time.time()
            for x in range(1, 101):  # download 100 pages; change the page count as needed
                links = self.__get_img_links(x)
                asyncio.run(self.__download_page(links))
                if self.num >= 10:  # for speed testing only; comment this out to download more
                    break
            end = time.time()
            print('Total run time: %s seconds' % (end - start))


    def main():
        spider = Spider()
        spider.run()


    if __name__ == '__main__':
        main()

As you can see, fewer than 50 lines of code are enough to download the images of an entire website. You have to admire how powerful Python is!

