Scrapy Custom Crawler: Crawling JavaScript (by Yi Tang)

Source: Internet
Author: User

Many websites use JavaScript: page content is generated dynamically by JS, JS events triggered on the page change its content or open links, and some websites do not work at all without JS, instead returning something like "Please enable JavaScript in your browser."

There are four options for JavaScript support:
1. Write code that simulates the relevant JS logic.
2. Drive a real browser through an automation interface, such as Selenium, which is widely used in testing.
3. Use a headless browser; there are various WebKit-based ones such as CasperJS and PhantomJS.
4. Combine a JS execution engine and implement a lightweight browser yourself. This is quite difficult.

For a simple, limited crawl task, being able to simulate the JS logic in code is the preferred option. For example, on the DuckDuckGo search engine, paging is triggered by JS. Simulating that looked difficult, but then I noticed the second form on the page, which seems able to page through results with a plain submit. I tried it, and it does.
When simulating the relevant JS logic in code, first try browsing with JS disabled to see whether you can still get what you need; some pages degrade gracefully without JS, others do not. If not, observe the JS logic with Chrome's console or Firebug. If it is an Ajax call, you can replay it with urllib2 (the requests library is recommended); if it modifies the DOM, you can apply the corresponding changes with lxml. Whatever the JS does, the Python code simulates that behaviour.
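To make option 1 concrete, here is a minimal sketch of both cases, using requests and lxml; the endpoint, form index, and field names are illustrative, not taken from any real site:

    # Replay the behaviour observed in the browser instead of executing JS.
    import requests
    from lxml import html

    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'

    # Case 1: the JS fetches data via Ajax -- call the endpoint directly
    # (a hypothetical JSON endpoint found in the browser's network panel).
    data = session.get('http://example.com/api/items', params={'page': 2}).json()

    # Case 2: paging is really a form submit -- fill the form and post it.
    page = html.fromstring(session.get('http://example.com/search').text)
    form = page.forms[1]                    # e.g. the second form on the page
    form.fields['q'] = 'scrapy'             # illustrative field name
    resp = session.post(form.action or 'http://example.com/search',
                        data=dict(form.form_values()))
    print(resp.status_code)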

You can also choose Selenium and the like. The drawback is very low efficiency: first measure how long Selenium takes to start a browser instance and decide whether that is acceptable to you. It is usually on the order of seconds, and having the browser render the page on top of that is slower still. If the efficiency is acceptable, this option is fine too.
Another problem with this option is that Selenium, driving a visible browser, does not work on servers without a desktop environment.
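Measuring that startup cost takes only a few lines; a sketch (Firefox here stands in for whatever browser you drive):

    # Time Selenium's browser startup and a single page render.
    import time
    from selenium import webdriver

    start = time.time()
    driver = webdriver.Firefox()            # startup alone is typically seconds
    print('startup: %.1fs' % (time.time() - start))

    start = time.time()
    driver.get('http://example.com/')       # a full render is slower still
    print('render: %.1fs, title: %s' % (time.time() - start, driver.title))
    driver.quit()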

When the scale is not small, simulating the JS is not feasible, Selenium is too slow, or the crawl must run without a desktop environment, use a headless browser. The general situation of several headless browsers is as follows:
1. CasperJS, PhantomJS: not Python; invoked through the command line; the functionality basically suffices. It is recommended to check first whether these two meet your needs, as they are the more mature options. PhantomJS also has an unofficial WebDriver protocol implementation, so headless operation can also be achieved through Selenium + PhantomJS, as sketched below.
2. ghost.py, spynner, and the like: WebKit wrapped for Python. Personally I find the spynner code messy; the Ghost code quality is good, but it has bugs. I have read through a few libraries of this kind and modified one myself.
The details of this option are discussed further below.
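A minimal sketch of the Selenium + PhantomJS route from item 1 (assuming the phantomjs binary is on your PATH):

    # Drive PhantomJS through its unofficial WebDriver implementation:
    # fully headless, so it also runs on servers with no desktop environment.
    from selenium import webdriver

    driver = webdriver.PhantomJS()          # talks to GhostDriver
    driver.get('http://example.com/')
    print(driver.title)                     # DOM is available after JS has run
    rendered = driver.page_source           # rendered HTML; feed to lxml etc.
    driver.quit()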

Finally, there is the option of implementing a lightweight, interface-free browser yourself on top of a JS execution engine. Choose it only if what you need to crawl is extremely important to you. If you have this idea, look at PyV8: among V8's sample code there is a simple browser model built on V8. Yes, just a model, not fully usable; you have to fill in some of its methods yourself. To do this you need to implement, on top of the JS engine (V8) and an HTTP library (urllib2): 1. when a page is opened, fetch the JS code it contains; 2. build a browser model, including the DOM tree and the various events; 3. execute the JS. There may be other details besides.
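To get a feel for the engine layer, here is a minimal PyV8 sketch: executing JS and exposing a Python object as the JS global scope. This covers only step 3 above; the DOM and event model are the parts you would have to build yourself:

    # Run JS with PyV8, exposing Python methods to the script --
    # the seed around which a browser model would be built.
    import PyV8

    class Global(PyV8.JSClass):
        def log(self, msg):                 # callable from JS as log(...)
            print(msg)

    ctxt = PyV8.JSContext(Global())
    ctxt.enter()
    ctxt.eval("log('hello from v8'); var x = 1 + 2;")
    print(ctxt.eval("x"))                   # -> 3
    ctxt.leave()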
Online you can find a PPT about a price-comparison shopping crawler; that crawler also uses only the third option. It reportedly uses WebKit with Scrapy, and additionally bases Scrapy's scheduling queue on Redis to run distributed.

How to implement it:

First, a bit of background: Scrapy is built on Twisted, an asynchronous networking framework, so stay aware of potentially blocking operations. Note, though, that there is a setting (CONCURRENT_ITEMS) that controls the degree of parallelism of the item pipeline. Presumably the pipeline is not expected to block; it may be executed in a thread pool (not verified). Pipelines are generally used to persist the scraped items (writing to a database or a file), so you need not worry that this time-consuming operation will block the whole framework, nor implement the write asynchronously inside the pipeline.
The other parts of the framework are all asynchronous. Simply put, a request generated by the spider is handed to the scheduler for downloading, and the spider resumes execution; when the downloader finishes, the response is handed back to the spider for parsing.

The reference examples found online write the JS support into a DownloaderMiddleware, and the code snippet on the Scrapy official site does the same. Done that way, the entire framework blocks, and the crawler's working mode becomes download-parse-download-parse instead of downloading in parallel. For small-scale crawls where efficiency is not a concern, that is hardly a problem.
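For reference, the middleware variant looks roughly like this; it is a sketch, and the synchronous render inside process_request is exactly what blocks the whole framework:

    # A DownloaderMiddleware that renders pages with Selenium + PhantomJS.
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsRenderMiddleware(object):
        def __init__(self):
            self.driver = webdriver.PhantomJS()

        def process_request(self, request, spider):
            self.driver.get(request.url)    # synchronous: blocks the reactor
            body = self.driver.page_source.encode('utf-8')
            # Returning a Response here short-circuits the normal download.
            return HtmlResponse(request.url, body=body, encoding='utf-8',
                                request=request)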
A better approach is to write the JS support into Scrapy's downloader. There is one such implementation online (using Selenium + PhantomJS), but it supports only GET requests.
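A sketch of that downloader-side variant: a custom download handler registered under DOWNLOAD_HANDLERS, pushing the blocking render into Twisted's thread pool so the reactor stays free. Class and module names here are illustrative, not the implementation mentioned above:

    # settings.py (hypothetical module path):
    # DOWNLOAD_HANDLERS = {'http': 'myproject.handlers.JsDownloadHandler'}
    from twisted.internet import threads
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JsDownloadHandler(object):
        def __init__(self, settings):
            self.driver = webdriver.PhantomJS()

        def download_request(self, request, spider):
            # Return a Deferred; the render runs in the thread pool, so the
            # reactor can keep scheduling other downloads in parallel.
            return threads.deferToThread(self._render, request)

        def _render(self, request):
            # Note: a real implementation would pool browser instances;
            # a single shared driver is not safe under concurrent renders.
            self.driver.get(request.url)    # GET only, as noted above
            body = self.driver.page_source.encode('utf-8')
            return HtmlResponse(request.url, body=body, encoding='utf-8',
                                request=request)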

There are various details to deal with when adapting WebKit into Scrapy's downloader.

