Using Scrapy-splash to crawl the dynamic page generated by JS

Source: Internet
Author: User
Tags: docker run

At present, many parts of a page are generated with JS in order to speed up page loading. This is a big problem for a Scrapy crawler: Scrapy has no JS engine, so what it crawls is the static page, and content on dynamic pages generated by JS cannot be obtained.

Solution:

    • Use third-party middleware that provides a JS rendering service: Scrapy-splash, etc.
    • Use WebKit or a WebKit-based library

Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API. Splash is implemented in Python on top of Twisted and Qt: Twisted gives the service asynchronous processing capability, so multiple WebKit rendering requests can be handled concurrently.
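Before wiring Splash into Scrapy, you can exercise its HTTP API directly. A minimal sketch of building a `render.html` request URL (assuming a Splash instance is listening on `http://localhost:8050`, as started in the Docker steps below):

```python
from urllib.parse import urlencode

def render_html_url(splash_base, target_url, wait=0.5):
    # Build a URL for Splash's render.html endpoint; 'wait' tells
    # Splash how long to let the page's JS run before returning HTML.
    return f"{splash_base}/render.html?{urlencode({'url': target_url, 'wait': wait})}"

# With Splash running you could then fetch the rendered HTML, e.g.:
#   import urllib.request
#   html = urllib.request.urlopen(
#       render_html_url('http://localhost:8050', 'http://example.com')).read()
```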

Here's how to use Scrapy-splash:

  1. Install the scrapy-splash library with pip:
    $ pip install scrapy-splash

  2. Scrapy-splash uses the Splash HTTP API, so a Splash instance is required. Splash is typically run with Docker, so you need to install Docker.

  3. Install Docker and start it after installation.

  4. Pull the image:
    $ docker pull scrapinghub/splash

  5. Run scrapinghub/splash with Docker:
    $ docker run -p 8050:8050 scrapinghub/splash

  6. Configure the Splash service (the following operations are all in settings.py):

    1) Add Splash server address:

     SPLASH_URL = 'http://localhost:8050'

    2) Add the Splash middleware to DOWNLOADER_MIDDLEWARES:

     DOWNLOADER_MIDDLEWARES = {
         'scrapy_splash.SplashCookiesMiddleware': 723,
         'scrapy_splash.SplashMiddleware': 725,
         'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
     }

    3) Enable SplashDeduplicateArgsMiddleware:

     SPIDER_MIDDLEWARES = {
         'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
     }

    4) Set a custom DUPEFILTER_CLASS:

     DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    5) Set a custom cache storage backend:

     HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
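Putting steps 1) through 5) together, the settings.py fragment as a whole looks like this (SPLASH_URL assumes the Docker container from step 5 is running locally):

```python
# settings.py -- scrapy-splash configuration in one place
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```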

  7. Example
    Get HTML content:

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.body is a result of render.html call;
            # it contains HTML processed by a browser.
            # ...

