With the massive popularity of so-called Web 2.0 applications driven by Ajax, current search engines face a series of major new challenges, and a new "crawler" mechanism is urgently needed. The reasons are as follows:
- Ajax applications overturn the assumption behind earlier "crawlers", which are built on the pure HTTP request/response protocol. Under that assumption, every page resource is reached and triggered directly through a hyperlink, so an existing "crawler" only needs to simulate the user's hyperlink request, parse the response page, analyze its content and semantics, and extract further hyperlinks to "crawl".
- Ajax stands for Asynchronous JavaScript and XML. Its biggest difference from earlier applications is that requests and responses driven by hyperlinks largely give way to asynchronous requests and responses driven by JavaScript. Existing "crawlers" have no semantic understanding of JavaScript, so it is difficult for them to simulate and trigger asynchronous JavaScript calls or to parse the callback logic and the content those calls return.
- "Crawlers" for traditional Web applications assume by default that the DOM structure of each page is relatively static. Ajax applications overturn this precondition as well: in response to user operations, JavaScript can change the DOM structure substantially, and in some cases all of the content on the page is read directly from the server and rendered dynamically by JavaScript.
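The gap described above can be illustrated with a minimal Python sketch of a link-following extractor. The page markup, the `LinkExtractor` class, and the `loadComments()` handler are hypothetical names invented for this illustration; they do not come from any real crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, as a protocol-driven crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip javascript: pseudo-links; they are not fetchable URLs.
                if name == "href" and value and not value.startswith("javascript:"):
                    self.links.append(value)


# Hypothetical page: one ordinary hyperlink and one Ajax-driven control.
PAGE = """
<a href="/news.html">News</a>
<a href="javascript:void(0)" onclick="loadComments()">Comments</a>
<div id="comments"></div>
"""

extractor = LinkExtractor()
extractor.feed(PAGE)
print(extractor.links)  # only the ordinary hyperlink is found
```

The extractor finds `/news.html` but nothing for the comments, because that content only appears after JavaScript runs and fills the empty `div`.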
In response to these changes and new challenges, Shreeraj Shah published a paper titled "Crawling Ajax-driven Web 2.0 Applications" on Infosec Writers.
In the paper, the author points out that traditional "crawler" engines are mostly protocol-driven, whereas a new "crawler" engine needs to be event-driven. To drive events in the new engine, three key issues must be considered:
- Analysis and interpretation of JavaScript interaction
- Dispatching and handling of DOM events
- Semantic extraction of dynamic DOM content
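As a rough illustration of the first two issues, the Python sketch below lists the elements that carry inline event handlers and statically scans script text for `XMLHttpRequest` `open` calls. This is a crude stand-in chosen for brevity, not the approach taken in the paper: a real event-driven crawler would fire the events in a browser and read the resulting DOM. The page markup, the `EventTargetFinder` class, and the regular expression are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class EventTargetFinder(HTMLParser):
    """Records inline event handlers and the text of <script> blocks."""

    def __init__(self):
        super().__init__()
        self.targets = []   # (tag, event attribute, handler source)
        self.scripts = []   # raw JavaScript text
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            # Attributes such as onclick or onload mark candidate events.
            if name.startswith("on") and value:
                self.targets.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


# Hypothetical Ajax page used only for this illustration.
PAGE = """
<script>
function loadComments() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/comments.xml", true);
  xhr.send(null);
}
</script>
<a href="javascript:void(0)" onclick="loadComments()">Comments</a>
"""

finder = EventTargetFinder()
finder.feed(PAGE)

# Naive pattern for xhr.open("METHOD", "url", ...) with literal arguments.
XHR_OPEN = re.compile(r"""\.open\(\s*["'](GET|POST)["']\s*,\s*["']([^"']+)["']""")
endpoints = XHR_OPEN.findall("".join(finder.scripts))

print(finder.targets)
print(endpoints)
```

A static scan like this breaks as soon as the URL is built at run time, which is one reason the engine has to interpret the JavaScript and dispatch the events rather than only read the source.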
In addition, the author uses rbNarcissus, Watir, and the Ruby language to demonstrate these problems and possible solutions. If you are interested, the paper is worth reading carefully; I will not go into the details here.
This crawler problem is not limited to Ajax. It is just as common in other next-generation Web technologies such as Flash/Flex, WPF/E, and XUL applications. For Flash/Flex the problem may be even more serious, because all ActionScript code is compiled before it is executed. This improves efficiency, but it undoubtedly poses a serious and continuing challenge for crawlers!