Many modern pages generate much of their content with JavaScript to speed up page loading. This is a big problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches the static HTML and cannot obtain content generated dynamically by JS.
Solutions:
- Use third-party middleware that provides a JS rendering service, such as scrapy-splash.
- Use WebKit, or a library based on WebKit.
Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. Splash is implemented in Python on top of Twisted and Qt; Twisted gives the service asynchronous processing capability, so it can drive multiple WebKit renderers concurrently.
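Because Splash is just an HTTP service, it can be exercised directly, without Scrapy. A minimal sketch, assuming a Splash instance listening on localhost:8050 (the default port used later in this article): it builds the URL for Splash's render.html endpoint, which returns the page's HTML after a browser has executed its JavaScript.

```python
from urllib.parse import urlencode

def build_render_url(target_url, splash_url="http://localhost:8050", wait=0.5):
    """Build a Splash render.html request URL for the given page.

    render.html returns the HTML after the browser has executed the
    page's JavaScript; `wait` is how long Splash waits before returning.
    """
    params = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{params}"

# Fetching this URL (e.g. with urllib.request) yields the rendered HTML.
print(build_render_url("http://example.com"))
# → http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```
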
Here's how to use Scrapy-splash:
Install the scrapy-splash library with pip:
$ pip install scrapy-splash
scrapy-splash uses the Splash HTTP API, so a running Splash instance is required. Splash is typically run with Docker, so install Docker first and start it.
Pull the image:
$ docker pull scrapinghub/splash
Run scrapinghub/splash with Docker:
$ docker run -p 8050:8050 scrapinghub/splash
Configure the Splash service (the following settings all go in settings.py):
1) Add the Splash server address:
SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3) Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4) Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
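Putting the five steps together, a complete Splash section of settings.py might look like this (a sketch; the middleware order values are the ones given in the steps above):

```python
# settings.py -- Splash configuration collected from the steps above.

# 1) Address of the running Splash instance.
SPLASH_URL = 'http://localhost:8050'

# 2) Downloader middlewares: Splash cookie handling, the Splash middleware
#    itself, and HttpCompressionMiddleware placed after them (order 810).
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# 3) Deduplicate identical Splash arguments between requests.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# 4) Duplicate filter that is aware of Splash request arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# 5) HTTP cache storage that is aware of Splash request arguments.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
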
Example
Get HTML content:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is the result of the render.html call;
        # it contains HTML processed by a browser.
        # ...
That is all it takes to use scrapy-splash to crawl dynamic pages generated by JS.