If a crawler needs to show off its speed, I think downloading pictures is the way to do it. I originally wanted to download pictures from Jiandan, where the pictures are all high quality, and my draft was nearly finished, but when I looked again today, the entrance to its picture section had been closed.
As for why it was closed, you can search Baidu for the reasons XXX Daily was shut down; I won't go into it here. This time I chose to download copyright-free high-definition pictures, because people working in self-media are very afraid of infringement, and hunting for copyright-free pictures has practically become a daily task. So this time I chose this site:
https://unsplash.com/
So what difference does async actually make?
(The run on the right uses async, the one on the left does not; since this was just a test, downloading 12 images was enough.)
As you can see, the async version takes almost one-sixth the time of the version without async. Impressive, right? For reference, below is a sketch of the synchronous baseline such a comparison could use; then let's analyze how to crawl the site.
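This baseline is an assumption on my part (the post only shows the async version); it just downloads a list of image URLs one by one with requests and times the whole run:

import time

import requests

def download_sync(img_links):
    # img_links: list of (name, url) pairs, matching the format used later
    start = time.time()
    for name, url in img_links:
        content = requests.get(url).content  # blocking download, one at a time
        with open(name + '.jpg', 'wb') as f:
            f.write(content)
    print('Synchronous run took %.2f s' % (time.time() - start))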
1. Find the landing page
The site's home page shows a bunch of pictures, and scrolling down automatically loads more, which is clearly AJAX loading. No need to worry; we have covered dynamic loading before, so open the developer tools and see what kind of request it makes.
Scrolling down, this request is easy to spot. It is a GET request with status code 200, and the URL is https://unsplash.com/napi/photos?page=3&per_page=12&order_by=latest. There are three parameters; it is easy to see that page is the page number and is the only parameter that changes, while the others stay the same.
The response is JSON; in each photo's links field, download is the image's download link. Now everything is clear.
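As a quick sanity check before writing the spider (a sketch; the field names follow what the developer tools showed for this endpoint):

import requests

params = {'page': 3, 'per_page': 12, 'order_by': 'latest'}
response = requests.get('https://unsplash.com/napi/photos', params=params)
photos = response.json()               # a list of 12 photo objects
print(photos[0]['id'])                 # the photo's id
print(photos[0]['links']['download'])  # the direct download link

With that confirmed, here is the code.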
2. Code section
async def __get_content(self, link):
    async with aiohttp.ClientSession() as session:
        response = await session.get(link)
        content = await response.read()
        return content
This is the method that fetches a picture's content. aiohttp.ClientSession is used almost exactly like requests.Session, except that reading the response body becomes read(), and it must be awaited.
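To make the parallel concrete, here is a side-by-side sketch of the two styles (illustrative; url is a placeholder):

import aiohttp
import requests

def fetch_sync(url):
    # requests: .content gives the raw body bytes directly
    return requests.get(url).content

async def fetch_async(url):
    # aiohttp: opening the session, making the request, and reading
    # the body are all awaited
    async with aiohttp.ClientSession() as session:
        response = await session.get(url)
        return await response.read()

Calling the async version requires an event loop, e.g. asyncio.get_event_loop().run_until_complete(fetch_async(url)), exactly as the full code below does.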
Here is the complete code
import requests, os, time
import aiohttp, asyncio

class Spider(object):
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        self.num = 1
        if 'pictures' not in os.listdir('.'):
            os.mkdir('pictures')
        self.path = os.path.join(os.path.abspath('.'), 'pictures')
        os.chdir(self.path)  # enter the download directory

    async def __get_content(self, link):
        # fetch the raw bytes of one picture
        async with aiohttp.ClientSession() as session:
            response = await session.get(link)
            content = await response.read()
            return content

    def __get_img_links(self, page):
        # request one page of photo metadata from the API found above
        url = 'https://unsplash.com/napi/photos'
        data = {
            'page': page,
            'per_page': 12,
            'order_by': 'latest'
        }
        response = requests.get(url, params=data)
        if response.status_code == 200:
            return response.json()
        else:
            print('Request failed with status code %s' % response.status_code)
            return []  # empty list so the caller's loop simply skips this page

    async def __download_img(self, img):
        # img is an (id, download_link) pair
        content = await self.__get_content(img[1])
        with open(img[0] + '.jpg', 'wb') as f:
            f.write(content)
        print('Downloaded picture %s successfully' % self.num)
        self.num += 1

    def run(self):
        start = time.time()
        for x in range(1, 101):  # download 100 pages of pictures, or change the number of pages yourself
            links = self.__get_img_links(x)
            tasks = [asyncio.ensure_future(self.__download_img((link['id'], link['links']['download'])))
                     for link in links]
            loop = asyncio.get_event_loop()
            loop.run_until_complete(asyncio.wait(tasks))
            if self.num >= 10:  # for speed testing; comment this out to download more pictures
                break
        end = time.time()
        print('Total runtime: %s' % (end - start))

def main():
    spider = Spider()
    spider.run()

if __name__ == '__main__':
    main()
As you can see, fewer than 50 lines of code are enough to download pictures from the whole site. You have to marvel at how powerful Python is ~~~