Client crawler framework based on the DevTools protocol + Chromium headless

Source: Internet
Author: User

The previous approach was to use PhantomJS with an HTML page containing a nested iframe pointing at the target site URL, plus some simple performance optimizations and cross-domain DOM manipulation.

PhantomJS implements the following core requirements:
(1) Headless mode. However, the PhantomJS kernel is based on an old version of QtWebKit; compared with the latest Chromium code it is far too old, and many features are unavailable (although most domestic websites probably haven't started using these yet?), such as Service Worker, CSS Custom Properties, Web Components, and so on.
(2) PhantomJS can wait until the page load completes and then inject JS for execution. The injection needs to support the following core features:
    2.1 The injected JS code can import external JS resources, generally by dynamically inserting a <script> element and setting its src attribute;
    2.2 The injected JS code can control page scrolling automatically, or directly call the relevant JS event handler functions according to the specific website's code;
    2.3 Of course, PhantomJS should also allow user input events to be triggered directly from outside, e.g. locating an element by a CSS selector path and then triggering its click event. The browser itself usually does not expose this, but browser automation layers such as WebDriver, originally designed for automated web testing, do support it.
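The three injection features above are just ordinary in-page JS. A minimal sketch of the snippets an automation channel would evaluate inside the target page (the URL and selector below are placeholder examples, not from any real site):

```python
# In-page JS snippets for the three core injection features. These strings are
# meant to be evaluated inside the target page by the automation layer; the
# URL and the CSS selector below are placeholders for illustration.

# 2.1: import an external JS resource by inserting a <script> element.
IMPORT_SCRIPT = """
var s = document.createElement('script');
s.src = 'https://example.com/helper.js';
document.head.appendChild(s);
"""

# 2.2: scroll the page programmatically (one viewport height per call).
AUTO_SCROLL = "window.scrollBy(0, window.innerHeight);"

# 2.3: locate an element by CSS selector and trigger its click handler.
CLICK_BY_SELECTOR = """
var el = document.querySelector('#load-more');
if (el) { el.click(); }
"""
```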

Now the question is: does the DevTools protocol support these core requirements or not?

DevTools is a protocol based on WebSocket communication, with request/response data in JSON format. It can, for example, connect a desktop Chrome browser to Chrome on an Android phone for remote debugging. But what if I don't need that UI at all? Here the client crawler framework needs automated processing, not UI-driven user interaction.
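The wire format is simple enough to sketch directly. Each request is a JSON object with an id, a method name, and parameters (the WebSocket transport itself is omitted here; this assumes a browser started with a remote-debugging endpoint):

```python
import json

def cdp_message(msg_id, method, params=None):
    """Build one DevTools protocol request as a JSON string."""
    return json.dumps({"id": msg_id, "method": method, "params": params or {}})

# Example: evaluate a JS expression in the page via the Runtime domain.
request = cdp_message(1, "Runtime.evaluate", {"expression": "document.title"})
# A successful response echoes the same id: {"id": 1, "result": {...}}
```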

Checklist: verify one by one whether the DevTools protocol supports the three core feature requirements above.
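As far as I can tell, each requirement maps onto an existing protocol domain. The method names below come from the public protocol documentation; how completely headless mode implements each of them still needs to be verified against a real build:

```python
# Sketch: mapping the core requirements onto DevTools protocol methods.
checklist = {
    # (2.1) inject JS after load: evaluate an expression that inserts <script>
    "inject_js": "Runtime.evaluate",
    # (2.2) drive scrolling: plain JS evaluation, or synthesized wheel events
    "auto_scroll": "Input.dispatchMouseEvent",
    # (2.3) trigger user input from outside: locate a node by CSS selector,
    # then synthesize mouse events at its coordinates
    "locate_element": "DOM.querySelector",
    "click_element": "Input.dispatchMouseEvent",
    # knowing when the page load has completed
    "load_finished": "Page.loadEventFired",
}
```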

The headless top-level module in the Chromium kernel appears to have been added in version M49, providing multi-process, headless-mode, DevTools-enabled automation support.

The headless module's build configuration removes the dependency on the GPU. (But it is unclear to what extent it achieves true headless mode. If it is fully supported, it should be possible to compile a Windows command-line executable that relies only on system APIs for networking, file I/O, and multithreading/IPC, with no UI involvement at all.)
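If the headless build really does work as a pure command-line executable, launching it for automation might look roughly like this. The binary name is an assumption (the headless target's output), and `--disable-gpu` reflects the removed GPU dependency discussed above:

```python
import subprocess

# Sketch: launching a headless Chromium build with the DevTools endpoint open.
# The binary path/name is an assumption; adjust for your own build output.
cmd = [
    "headless_shell",               # assumed headless build binary
    "--disable-gpu",                # GPU dependency removed in headless config
    "--remote-debugging-port=9222", # expose the DevTools WebSocket endpoint
    "https://example.com/",         # initial page to load
]
# subprocess.Popen(cmd)  # commented out: requires the binary to exist locally
```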

Problems (TODO):
(1) Need to fetch the latest Chromium kernel code via git, and build it with the new GN instead of GYP.
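For reference, the GN-based build of a headless target roughly follows these steps. The command names come from the Chromium build tooling; the exact target name and GN args may differ between versions, so treat this as a checklist rather than a recipe:

```python
# Sketch of the GN-based build steps for the headless target, kept as a
# checklist. Exact targets and args may differ between Chromium versions.
build_steps = [
    "fetch chromium",                       # depot_tools: get the source
    "gclient sync",                         # pull dependencies
    "gn gen out/Headless",                  # generate ninja files with GN
    "ninja -C out/Headless headless_shell", # build the headless binary
]
```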

PS: The client crawler framework I originally envisioned would provide a new set of script bindings for the page DOM tree (for example using Lua), or it could be implemented as a crawler-specific DSL (task description language). The advantages:

(1) No need to write cumbersome JS code.
(2) The page's native JS code would be completely unaware that a crawler is harvesting data, whereas externally injected JS can, in theory, never be fully hidden.

But the downsides are:
(1) Implementation is difficult; it would probably require modifying the kernel code to add a new set of binding APIs, and I don't think I've ever heard of binding a scripting language other than JavaScript to the DOM.
(2) This new scripting language would still be restricted. For example, it should not modify the DOM; and even so, given that the page's native JS may be modifying the DOM concurrently, read/write conflicts must be avoided. It should provide an API for locating elements by some syntax (CSS selectors can be reused); it should support custom serialized export of DOM Range objects; and it needs to allow user input events to be injected programmatically (script automation), so that the page's native JS code "believes" that scroll, gesture, click and other UI interactions actually occurred.
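To make the DSL idea concrete, a crawler task description might look something like this. Everything here is hypothetical; the field names are invented purely for illustration, not an existing format:

```python
# Hypothetical crawler-task description for the imagined DSL; all field names
# are invented for illustration and do not correspond to any real format.
task = {
    "url": "https://example.com/articles",
    "wait_for": "load",                   # run only after page load completes
    "inject_events": [                    # synthesized UI input, so native JS
        {"type": "scroll", "times": 3},   # "believes" real interaction occurred
        {"type": "click", "selector": ".more"},
    ],
    "extract": {                          # read-only DOM access via selectors
        "title": "h1",
        "links": "a[href]",
    },
    "export": "json",                     # serialized export of DOM ranges
}
```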

The client crawler also needs to be able to load an "ad filtering/interception" function, which could be implemented by modifying the headless code. But the existing html + iframe method runs into a problem when handling the 36Kr website: it filters out all <script> elements when setting the iframe's srcdoc attribute from an HTML string, which causes the page content not to render into the DOM properly. The reason is that 36Kr now renders with React.js plus JSON data.
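The script-stripping problem can be illustrated in a few lines. The regex here is a simplification of whatever the real filter does (a production filter would use an HTML parser), but it shows why a React-rendered page breaks:

```python
import re

# Simplified illustration of the srcdoc filtering problem: stripping every
# <script> element from the HTML string before assigning it to the iframe.
def strip_scripts(html):
    """Remove <script>...</script> blocks (simplified; real filters parse HTML)."""
    return re.sub(r"<script\b[^>]*>.*?</script>", "", html, flags=re.S | re.I)

page = '<div id="root"></div><script>renderReactApp()</script>'
cleaned = strip_scripts(page)
# For a React-rendered site the remaining markup is an empty shell, so nothing
# meaningful ever appears in the DOM once the render script is gone.
```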

In fact, exporting the crawled data directly from that JSON data would not be impossible either...
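If the JSON payload that React renders from can be fetched directly, the crawler can skip the DOM entirely. A sketch, with an invented payload shape (the real 36Kr API format would need to be inspected):

```python
import json

# Sketch: exporting crawler data straight from the JSON that the page's JS
# would otherwise render. The payload shape below is invented for illustration.
payload = json.loads("""
{"items": [
    {"title": "Article A", "url": "/a"},
    {"title": "Article B", "url": "/b"}
]}
""")

# Flatten the structure into exportable records, no DOM involved.
records = [(item["title"], item["url"]) for item in payload["items"]]
```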

