How do I grab the latest emoticons from the Doutula (bucket chart) emoticon site?

Source: Internet
Author: User
Tags: xpath

One: Target

The first time I used the Scrapy framework I ran into a lot of pits, but by persevering, searching, and tweaking the code the problems could be solved. This crawl targets the latest-images page of the site www.doutula.com/photo/list, as practice in using the Scrapy framework and in using a random User-Agent to avoid getting banned. The Doutula emoticon packs are updated daily, and roughly 50,000 emoticons can be crawled to disk in total. To save time, I grabbed a little more than 10,000 images.

Two: scrapy Introduction

Scrapy is an application framework written to crawl web sites and extract structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.

Use procedure

  • Create a Scrapy project

  • Define the Items to extract

  • Write a spider to crawl the site and extract the Items

  • Write an Item Pipeline to store the extracted Items (that is, the data); a minimal sketch of these steps follows
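
To make these steps concrete, here is a minimal, hedged sketch (the names and the quotes.toscrape.com practice site are illustrative placeholders, not the emoticon project built later in this article):

import scrapy

# Step 2: declare the fields you want to extract.
class QuoteItem(scrapy.Item):
    text = scrapy.Field()

# Step 3: a spider that crawls pages and yields the items.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("span.text::text").extract():
            yield QuoteItem(text=text)

Step 1 is the scrapy startproject / genspider commands shown in section four, and step 4 (an Item Pipeline) is sketched after the component list below.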

The following diagram shows the architecture of Scrapy, including its components and an overview of the data flow in the system (shown by the green arrows). A brief description of each component follows.


[Figure: Scrapy architecture diagram]


Components

    • Scrapy Engine
      The engine is responsible for controlling the flow of data among all components of the system and for triggering events when the corresponding actions occur.

    • Scheduler
      The scheduler accepts requests from the engine and enqueues them, so that it can hand them back later when the engine asks for them.

    • Downloader
      The downloader is responsible for fetching page data and handing it to the engine, which in turn hands it to the spiders.

    • Spiders
      Spiders are classes written by Scrapy users to parse responses and extract items (that is, the scraped items) or additional URLs to follow. Each spider is responsible for handling one specific site (or a few sites).

    • Item Pipeline
      The Item Pipeline is responsible for processing the items extracted by the spiders. Typical tasks are cleanup, validation, and persistence (for example, saving to a database). A minimal pipeline is sketched after this list.

    • Downloader middlewares
      The downloader middlewares are specific hooks between the engine and the downloader that process the responses passed from the downloader to the engine. They provide a simple mechanism to extend Scrapy's functionality by inserting custom code.

    • Spider middlewares
      The spider middlewares are specific hooks between the engine and the spiders that process the spiders' input (responses) and output (items and requests). They provide a simple mechanism to extend Scrapy's functionality by inserting custom code.
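
The crawler later in this article saves its images inside the spider rather than in a pipeline, but as a hedged sketch of where that logic would normally live, an Item Pipeline for this task might look roughly like this (the class name and the pipeline registration are assumptions for illustration; img_url and name are the item fields defined later in the article):

import os

import requests

class SaveImagePipeline(object):
    """Illustrative Item Pipeline: persist each crawled image to disk."""

    def process_item(self, item, spider):
        os.makedirs('doutu', exist_ok=True)    # make sure the output folder exists
        r = requests.get(item['img_url'])      # download the image
        filename = os.path.join('doutu', item['name'] + item['img_url'][-4:])
        with open(filename, 'wb') as f:
            f.write(r.content)
        return item                            # hand the item on unchanged

# It would be enabled in settings.py with something like:
# ITEM_PIPELINES = {'ScrapyDoutu.pipelines.SaveImagePipeline': 300}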

Three: Example analysis

1. Entering the latest emoticons from the site's home page and then clicking through to the second page shows that the URL changes to https://www.doutula.com/photo/list/?page=2, so the pages differ only in the page number. The spider's start_urls entry point is therefore defined as follows, crawling pages 1 to 19 of emoticon images. To download more pages, extend the range.

start_urls = ['https://www.doutula.com/photo/list/?page={}'.format(i) for i in range(1, 20)]

2. Open the browser's developer mode and analyze the page structure; you will see the structure below. Right-clicking and copying the XPath gives the full path to the a tag contents. a[1] means the first a; dropping the [1] selects all of the a tags.

*[@id = "Pic-detail"]/div/div[1]/div[2]/a

[Screenshot: the page structure in the browser developer tools]


It is worth noting that there are two kinds of emoticons here: jpg images and gif animations. Taking the src of the first img below the a tag would give the wrong image address, so we instead crawl the value of the img's data-original attribute. Below the a tag there is also a p tag whose text is the picture's caption; we crawl that too and use it as the picture's file name.

The picture's link is 'http:' + content.xpath('//img/@data-original') and the picture's name is content.xpath('//p/text()').
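
Here is a hedged sketch of that extraction on invented markup (the snippet below is illustrative only, not the site's real page), showing why data-original rather than src carries the usable address and how the p text becomes the file name:

from scrapy import Selector

# Invented markup for illustration; the real list page is more complex.
html = '''
<a href="/detail/1">
  <img src="loading.gif" data-original="//ws2.sinaimg.cn/bmiddle/demo1.jpg">
  <p>first caption</p>
</a>
<a href="/detail/2">
  <img src="loading.gif" data-original="//ws2.sinaimg.cn/bmiddle/demo2.gif">
  <p>second caption</p>
</a>
'''

sel = Selector(text=html)
for a in sel.xpath('//a'):      # all a tags; adding [1] would keep only the first
    img_url = 'http:' + a.xpath('.//img/@data-original').extract_first()
    name = a.xpath('.//p/text()').extract_first()
    print(img_url, name)
# http://ws2.sinaimg.cn/bmiddle/demo1.jpg first caption
# http://ws2.sinaimg.cn/bmiddle/demo2.gif second caption

Note that this sketch extracts relative to each a tag with .//; the spider code below instead uses page-wide expressions combined with an index, which is explained in the notable points after it.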


Four: The code in practice

Full code address: github.com/rieuse/learnpython
1. First, use the command line to create a new Scrapy project and then generate a crawler inside it:

scrapy startproject ScrapyDoutu
cd ScrapyDoutu\ScrapyDoutu\spiders
scrapy genspider doutula doutula.com

2. Open items.py in the ScrapyDoutu folder and change it to the following code, which defines the items we crawl.

import scrapy


class DoutuItem(scrapy.Item):
    img_url = scrapy.Field()
    name = scrapy.Field()

3. Open doutula.py in the spiders folder and change it to the following code; this is the main crawler program.

# -*- coding: utf-8 -*-
import os

import requests
import scrapy

from ScrapyDoutu.items import DoutuItem


class Doutu(scrapy.Spider):
    name = "doutu"
    allowed_domains = ["doutula.com", "sinaimg.cn"]
    # crawl list pages 1 to 39; extend the range for more
    start_urls = ['https://www.doutula.com/photo/list/?page={}'.format(i) for i in range(1, 40)]

    def parse(self, response):
        i = 0
        for content in response.xpath('//*[@id="pic-detail"]/div/div[1]/div[2]/a'):
            i += 1
            item = DoutuItem()
            item['img_url'] = 'http:' + content.xpath('//img/@data-original').extract()[i]
            item['name'] = content.xpath('//p/text()').extract()[i]
            try:
                if not os.path.exists('doutu'):
                    os.makedirs('doutu')
                r = requests.get(item['img_url'])
                filename = 'doutu\\{}'.format(item['name']) + item['img_url'][-4:]
                with open(filename, 'wb') as fo:
                    fo.write(r.content)
            except Exception:
                print('Error')
            yield item

There are several notable points in this code:

    • Because the image addresses are hosted on sinaimg.cn, that domain has to be added to the allowed_domains list.

    • content.xpath('//img/@data-original').extract()[i]: extract() returns a plain Python list of everything the expression matched, and [i], combined with the i counter in the loop, picks out the next tag's content on each iteration. Without the index, all of the matched values would end up in a single item field. (A small sketch follows this list.)

    • filename = 'doutu\\{}'.format(item['name']) + item['img_url'][-4:] builds the picture's file name; item['img_url'][-4:] takes the last four characters of the image address (for example .jpg or .gif) so that each file keeps the suffix of its own format.

    • The last point: if the XPath does not match correctly, log lines such as <GET http://*****> (referer: None) appear.
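
As promised above, a small hedged sketch (invented markup, not the real page) of what extract() actually returns: an absolute XPath evaluated inside the loop still searches the whole page, so every match comes back in one Python list, and [i] is needed to pick out the entry for the current iteration.

from scrapy import Selector

# Invented two-entry page, just to show what extract() returns.
html = ('<a><img data-original="//img.example/1.jpg"><p>one</p></a>'
        '<a><img data-original="//img.example/2.gif"><p>two</p></a>')
sel = Selector(text=html)

for a in sel.xpath('//a'):
    # The absolute path ignores the current a tag and matches page-wide:
    print(a.xpath('//img/@data-original').extract())
# Both iterations print the same full list:
# ['//img.example/1.jpg', '//img.example/2.gif']

A relative path such as a.xpath('.//img/@data-original'), as in the earlier sketch, is a common alternative that avoids the manual indexing.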

4. Configure settings.py. If you want to crawl a little faster, set CONCURRENT_REQUESTS larger and DOWNLOAD_DELAY smaller, or 0.

# -*- coding: utf-8 -*-
BOT_NAME = 'ScrapyDoutu'

SPIDER_MODULES = ['ScrapyDoutu.spiders']
NEWSPIDER_MODULE = 'ScrapyDoutu.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'ScrapyDoutu.middlewares.RotateUserAgentMiddleware': 400,
}

ROBOTSTXT_OBEY = False  # do not follow the site's robots.txt policy
CONCURRENT_REQUESTS = 16  # downloader concurrency ceiling (16 is Scrapy's default); raise it to crawl faster
DOWNLOAD_DELAY = 0.2  # time to wait before downloading the next page from the same site; limits crawl speed and eases server pressure
COOKIES_ENABLED = False  # disable cookies

5. Configure the user-agent handling in middlewares.py to work together with the settings above, so that a random UA is picked for each download; this gives some protection against being banned. Add the following code to what is already in the file. More entries can be added to user_agent_list.

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)
            request.headers.setdefault('User-Agent', ua)

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

6. The code is now complete, so let's run it!
scrapy crawl doutu
You can then watch the images download while the User-Agent changes from request to request.


[Screenshot: the crawler running, with images downloading and the User-Agent rotating]

Five: Summary

Learning to use Scrapy I ran into a lot of pits, but a powerful search engine meant I never felt stuck on my own. Scrapy feels both very powerful and a lot of fun, and I will keep learning its other features.
