A passive crawler task framework based on the browser kernel
Existing browser-based client testing frameworks actively control the browser through components such as chromedriver, but active control has drawbacks:
- When the next page is loaded, the previous page may still have JS code executing, network-layer connections blocked, or the UI thread blocked;
- JS code injected through the WebView interface may fail to receive notifications because of unexpected errors in various situations;
- There is no reliable way to query the browser for whether the current task has completed.
Here is a simple idea and process for passive control based on a modified Chromium kernel:
- When the browser launches, it obtains a 'crawler task description' from a given task web service URL. The description contains the following information:
  - target_url: the URL of the page to crawl
  - task_id: optional (used if the target_url attribute is not enough to uniquely describe the task)
  - inject_js: the client crawler JS code to inject
  - post_notify: an HTTP POST address to call when the task is completed
(This means the task description needs to be saved as a record in a database, including whether it has been issued and whether it has been completed.)
The key point is the idea of "distributed microservice architecture + stateless application + asynchronous completion notification".
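The task description above could be modelled roughly as follows. This is only a minimal sketch: the post does not define the wire format, so the lowercase JSON keys, the `CrawlerTask` class, and the `parse_task` helper are all assumptions made for illustration, and `inject_js` is treated as an opaque string (it could be a script URL or the script text).

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlerTask:
    """Hypothetical in-memory form of one 'crawler task description'."""
    target_url: str
    inject_js: str
    post_notify: str
    task_id: Optional[str] = None

    @property
    def key(self) -> str:
        # task_id is optional; fall back to target_url as the unique key.
        return self.task_id if self.task_id is not None else self.target_url


def parse_task(payload: str) -> CrawlerTask:
    """Parse a JSON task description and reject incomplete ones."""
    data = json.loads(payload)
    required = ("target_url", "inject_js", "post_notify")
    missing = [name for name in required if not data.get(name)]
    if missing:
        raise ValueError("task description is missing: " + ", ".join(missing))
    return CrawlerTask(
        target_url=data["target_url"],
        inject_js=data["inject_js"],
        post_notify=data["post_notify"],
        task_id=data.get("task_id"),
    )
```

Rejecting incomplete descriptions up front means the browser never starts a task it has no way to report as finished.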
- Next, the browser opens target_url; when loadfinished fires, it downloads inject_js (which can be cached on the main UI process side) and injects it into the page;
- The injected client crawler JS code starts analyzing the DOM tree of the target page, extracts the content, and sends it to post_notify with an Ajax POST;
- The post_notify server accepts the crawled data, updates the record in the 'crawler task description' table, and returns an OK confirmation;
- The injected JS code then notifies the browser to access the task service URL and fetch the next 'crawler task description';
- JS control may be unreliable and leave the browser in a 'zombie' state, so a watchdog timer can be started on the main UI process side to force a state reset.
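The server side of this loop (issue a task, accept the completion POST, hand out the next task) can be sketched as below. This is an illustration under assumptions, not the author's implementation: `TaskStore`, `TaskRecord`, and their method names are hypothetical, and an in-memory dict stands in for the database table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaskRecord:
    target_url: str
    issued: bool = False      # handed out to a browser
    completed: bool = False   # completion POST received
    result: Optional[str] = None


class TaskStore:
    """In-memory stand-in for the task table; a real service would use a database."""

    def __init__(self) -> None:
        self._records: Dict[str, TaskRecord] = {}

    def add(self, task_id: str, target_url: str) -> None:
        self._records[task_id] = TaskRecord(target_url)

    def next_task(self) -> Optional[str]:
        # Hand out the first task that has not been issued yet (insertion order).
        for task_id, record in self._records.items():
            if not record.issued:
                record.issued = True
                return task_id
        return None

    def notify(self, task_id: str, result: str) -> str:
        # Called when the injected JS POSTs its crawl data to post_notify.
        record = self._records.get(task_id)
        if record is None or not record.issued:
            raise KeyError("unknown or unissued task: " + task_id)
        record.completed = True
        record.result = result
        return "OK"

    def is_completed(self, task_id: str) -> bool:
        return self._records[task_id].completed
```

Because all progress lives in the record, the browser itself stays stateless: any instance can ask for the next task, and a task that was issued but never completed is visible in the table.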
Modification points in the kernel (TODO):
- Store the current 'crawler task description' as a C++ data structure on Chromium's main UI process side;
- Add a new application/json-crawler-task MIME type to parse the JSON-formatted crawler task description (the main difficulty);
- Download, cache, and inject the inject_js code in the loadfinished callback;
- Watchdog timer (may not be required);
- How can a browser-based RSS client crawler be implemented? A generic XML document is not HTML, so JS code cannot be injected and executed.
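The watchdog timer item can be illustrated with a small sketch. In the real design it would live in C++ on Chromium's main UI process side; the Python version below only shows the arm, cancel, and force-reset logic. The `Watchdog` class, its method names, and the timeout values are assumptions for illustration.

```python
import threading
from typing import Callable, Optional


class Watchdog:
    """Calls on_timeout if a task is not finished within timeout_s seconds."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], None]) -> None:
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def start(self) -> None:
        # Arm (or re-arm) the timer when a new task begins.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout_s, self._on_timeout)
            self._timer.daemon = True
            self._timer.start()

    def stop(self) -> None:
        # Disarm the timer when the task completes normally.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

A caller would invoke `start()` when a task description is received and `stop()` once the next task is requested; if neither happens in time, `on_timeout` can reset the browser state and fetch a new task.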
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.