Scrapy Custom Crawler: Crawling JavaScript (by Yi Tang)

Source: Internet
Author: User

Many websites use JavaScript: page content is generated dynamically by JS, JS events triggered on the page change its content or open links, and some websites do not work at all without JS, instead returning something like "Please enable JavaScript in your browser."

There are four options for JavaScript support:
1. Write code that simulates the relevant JS logic.
2. Drive a real browser through an automation interface, such as Selenium, which is widely used in testing.
3. Use a headless browser; there are various WebKit-based ones such as CasperJS and PhantomJS.
4. Combine a JS execution engine and implement a lightweight browser yourself. This is quite difficult.

For a simple, limited crawl task, being able to simulate the JS logic in code is the preferred option. For example, on the DuckDuckGo search engine, paging is triggered by JS. Simulating that looked difficult, but then I noticed the second form on the page, which seems able to page through results with a plain submit. I tried it, and it does.
When simulating the relevant JS logic in code, first try browsing with JS disabled to see whether you can still get what you need; some pages degrade gracefully without JS, others do not. If not, observe the JS logic with Chrome's console or Firebug. If it is an Ajax call, you can replay it with urllib2 (the requests library is recommended); if it modifies the DOM, you can apply the corresponding changes with lxml. Whatever the JS does, the Python code simulates that behaviour.
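To make option 1 concrete, here is a minimal sketch of both cases, using requests and lxml; the endpoint, form index, and field names are illustrative, not taken from any real site:

    # Replay the behaviour observed in the browser instead of executing JS.
    import requests
    from lxml import html

    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'

    # Case 1: the JS fetches data via Ajax -- call the endpoint directly
    # (a hypothetical JSON endpoint found in the browser's network panel).
    data = session.get('http://example.com/api/items', params={'page': 2}).json()

    # Case 2: paging is really a form submit -- fill the form and post it.
    page = html.fromstring(session.get('http://example.com/search').text)
    form = page.forms[1]                    # e.g. the second form on the page
    form.fields['q'] = 'scrapy'             # illustrative field name
    resp = session.post(form.action or 'http://example.com/search',
                        data=dict(form.form_values()))
    print(resp.status_code)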

You can also choose Selenium and the like. The drawback is very low efficiency: first measure how long Selenium takes to start a browser instance and decide whether that is acceptable to you. It is usually on the order of seconds, and having the browser render the page on top of that is slower still. If the efficiency is acceptable, this option is fine too.
Another problem with this option is that Selenium, driving a visible browser, does not work on servers without a desktop environment.
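Measuring that startup cost takes only a few lines; a sketch (Firefox here stands in for whatever browser you drive):

    # Time Selenium's browser startup and a single page render.
    import time
    from selenium import webdriver

    start = time.time()
    driver = webdriver.Firefox()            # startup alone is typically seconds
    print('startup: %.1fs' % (time.time() - start))

    start = time.time()
    driver.get('http://example.com/')       # a full render is slower still
    print('render: %.1fs, title: %s' % (time.time() - start, driver.title))
    driver.quit()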

When the scale is not small, simulating the JS is not feasible, Selenium is too slow, or the crawl must run without a desktop environment, use a headless browser. The general situation of several headless browsers is as follows:
1. CasperJS, PhantomJS: not Python; invoked through the command line; the functionality basically suffices. It is recommended to check first whether these two meet your needs, as they are the more mature options. PhantomJS also has an unofficial WebDriver protocol implementation, so headless operation can also be achieved through Selenium + PhantomJS, as sketched below.
2. ghost.py, spynner, and the like: WebKit wrapped for Python. Personally I find the spynner code messy; the Ghost code quality is good, but it has bugs. I have read through a few libraries of this kind and modified one myself.
The details of this option are discussed further below.
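A minimal sketch of the Selenium + PhantomJS route from item 1 (assuming the phantomjs binary is on your PATH):

    # Drive PhantomJS through its unofficial WebDriver implementation:
    # fully headless, so it also runs on servers with no desktop environment.
    from selenium import webdriver

    driver = webdriver.PhantomJS()          # talks to GhostDriver
    driver.get('http://example.com/')
    print(driver.title)                     # DOM is available after JS has run
    rendered = driver.page_source           # rendered HTML; feed to lxml etc.
    driver.quit()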

Finally, there is the option of implementing a lightweight, interface-free browser yourself on top of a JS execution engine. Choose it only if what you need to crawl is extremely important to you. If you have this idea, look at PyV8: among V8's sample code there is a simple browser model built on V8. Yes, just a model, not fully usable; you have to fill in some of its methods yourself. To do this you need to implement, on top of the JS engine (V8) and an HTTP library (urllib2): 1. when a page is opened, fetch the JS code it contains; 2. build a browser model, including the DOM tree and the various events; 3. execute the JS. There may be other details besides.
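To get a feel for the engine layer, here is a minimal PyV8 sketch: executing JS and exposing a Python object as the JS global scope. This covers only step 3 above; the DOM and event model are the parts you would have to build yourself:

    # Run JS with PyV8, exposing Python methods to the script --
    # the seed around which a browser model would be built.
    import PyV8

    class Global(PyV8.JSClass):
        def log(self, msg):                 # callable from JS as log(...)
            print(msg)

    ctxt = PyV8.JSContext(Global())
    ctxt.enter()
    ctxt.eval("log('hello from v8'); var x = 1 + 2;")
    print(ctxt.eval("x"))                   # -> 3
    ctxt.leave()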
Online you can find a PPT about a price-comparison shopping crawler; that crawler also uses only the third option. It reportedly uses WebKit with Scrapy, and additionally bases Scrapy's scheduling queue on Redis to run distributed.

How to implement it:

First, a bit of background: Scrapy is built on Twisted, an asynchronous networking framework, so stay aware of potentially blocking operations. Note, though, that there is a setting (CONCURRENT_ITEMS) that controls the degree of parallelism of the item pipeline. Presumably the pipeline is not expected to block; it may be executed in a thread pool (not verified). Pipelines are generally used to persist the scraped items (writing to a database or a file), so you need not worry that this time-consuming operation will block the whole framework, nor implement the write asynchronously inside the pipeline.
The other parts of the framework are all asynchronous. Simply put, a request generated by the spider is handed to the scheduler for downloading, and the spider resumes execution; when the downloader finishes, the response is handed back to the spider for parsing.

The reference examples found online write the JS support into a DownloaderMiddleware, and the code snippet on the Scrapy official site does the same. Done that way, the entire framework blocks, and the crawler's working mode becomes download-parse-download-parse instead of downloading in parallel. For small-scale crawls where efficiency is not a concern, that is hardly a problem.
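For reference, the middleware variant looks roughly like this; it is a sketch, and the synchronous render inside process_request is exactly what blocks the whole framework:

    # A DownloaderMiddleware that renders pages with Selenium + PhantomJS.
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsRenderMiddleware(object):
        def __init__(self):
            self.driver = webdriver.PhantomJS()

        def process_request(self, request, spider):
            self.driver.get(request.url)    # synchronous: blocks the reactor
            body = self.driver.page_source.encode('utf-8')
            # Returning a Response here short-circuits the normal download.
            return HtmlResponse(request.url, body=body, encoding='utf-8',
                                request=request)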
A better approach is to write the JS support into Scrapy's downloader. There is one such implementation online (using Selenium + PhantomJS), but it supports only GET requests.
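A sketch of that downloader-side variant: a custom download handler registered under DOWNLOAD_HANDLERS, pushing the blocking render into Twisted's thread pool so the reactor stays free. Class and module names here are illustrative, not the implementation mentioned above:

    # settings.py (hypothetical module path):
    # DOWNLOAD_HANDLERS = {'http': 'myproject.handlers.JsDownloadHandler'}
    from twisted.internet import threads
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsDownloadHandler(object):
        def __init__(self, settings):
            self.driver = webdriver.PhantomJS()

        def download_request(self, request, spider):
            # Return a Deferred; the render runs in the thread pool, so the
            # reactor can keep scheduling other downloads in parallel.
            return threads.deferToThread(self._render, request)

        def _render(self, request):
            # Note: a real implementation would pool browser instances;
            # a single shared driver is not safe under concurrent renders.
            self.driver.get(request.url)    # GET only, as noted above
            body = self.driver.page_source.encode('utf-8')
            return HtmlResponse(request.url, body=body, encoding='utf-8',
                                request=request)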

There are various details to deal with when adapting WebKit into Scrapy's downloader.

