A new "crawler" mechanism for Ajax applications

Source: Internet
Author: User

With the massive popularity of so-called Web 2.0 applications, typified by Ajax-driven sites, today's search engines face a series of major new challenges, and a new "crawler" mechanism is urgently needed. The reasons are as follows:

  • Ajax applications overturn the assumption behind earlier "crawlers", which are built on the plain HTTP request/response protocol: that every page resource is reached through a hyperlink. Under that assumption, a "crawler" only needs to simulate the user's hyperlink request, parse the response page, and analyze its content, semantics, and outgoing hyperlinks in order to keep crawling.
  • Ajax stands for Asynchronous JavaScript and XML. Its biggest difference from earlier applications is that hyperlink-driven request/response is largely replaced by asynchronous request/response driven by JavaScript. Existing "crawlers" do not understand JavaScript semantics, so they struggle to simulate and trigger asynchronous JavaScript calls, or to parse the callback logic and the content those calls return.
  • Traditional Web "crawlers" assume that the DOM structure of each page is relatively static. Ajax applications overturn this assumption as well: in response to user actions, JavaScript can change the DOM substantially, and in some cases all of the page's content is fetched from the server and rendered dynamically by JavaScript.
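
The gap described in the points above can be illustrated with a minimal Python sketch of a hyperlink-only "crawler" step. The two pages, the `loadNews` function, and the `/ajax/news.xml` URL are hypothetical, invented only for this illustration; the extractor uses the standard-library `html.parser` module and is not taken from any particular search engine.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, as a protocol-driven crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every hyperlink target found in the given HTML text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical pages, invented for illustration only.
STATIC_PAGE = '<a href="/news.html">News</a> <a href="/about.html">About</a>'
AJAX_PAGE = (
    '<div id="content"></div>'
    '<a href="#" onclick="loadNews(); return false;">News</a>'
    '<script>function loadNews() {'
    ' var r = new XMLHttpRequest();'
    ' r.open("GET", "/ajax/news.xml", true);'
    ' r.send(null); }</script>'
)

static_links = extract_links(STATIC_PAGE)  # two real URLs to follow
ajax_links = extract_links(AJAX_PAGE)      # only "#"; the XHR URL is never seen
print(static_links)
print(ajax_links)
```

On the static page the extractor finds both URLs. On the Ajax-style page it finds only the placeholder `#`, because the real resource is requested from inside a JavaScript function that a hyperlink-only "crawler" never runs.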

In response to these changes and challenges, Shreeraj Shah published a paper titled "Crawling Ajax-driven Web 2.0 Applications" on Infosec Writers.

In the paper, the author points out that traditional "crawler" engines are mostly protocol-driven, whereas a new "crawler" engine needs to be event-driven. Building an event-driven engine requires addressing three key issues:

  1. Analyzing and interpreting JavaScript interactions
  2. Dispatching and interpreting DOM events
  3. Extracting the semantics of dynamic DOM content
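
As a rough illustration of the first two issues, the Python sketch below scans a page for inline DOM event handlers (the events an event-driven "crawler" would have to fire) and for `XMLHttpRequest` `open` calls in inline scripts. This is only a static approximation under stated assumptions: the sample page, `loadNews`, and `/ajax/news.xml` are hypothetical, and the regular expression is a crude stand-in for real JavaScript parsing. It is not the approach used in the paper, and the third issue (extracting dynamically generated DOM content) would require actually executing the script in a browser or JavaScript engine.

```python
import re
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records inline event handlers (on* attributes) and inline script text."""

    def __init__(self):
        super().__init__()
        self.events = []    # (tag, event attribute, handler code)
        self.scripts = []   # raw text found inside <script> elements
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name.startswith("on") and value:
                self.events.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


# Crude pattern for calls such as r.open("GET", "/some/url", ...).
# Assumption: literal string arguments; a real engine would parse the script.
XHR_OPEN = re.compile(r"""\.open\(\s*["'](\w+)["']\s*,\s*["']([^"']+)["']""")


def discover(html):
    """Return (event handlers to fire, XHR endpoints seen in inline scripts)."""
    collector = EventCollector()
    collector.feed(html)
    collector.close()
    endpoints = XHR_OPEN.findall("".join(collector.scripts))
    return collector.events, endpoints


# Hypothetical page, invented for illustration only.
PAGE = (
    '<div id="content"></div>'
    '<a href="#" onclick="loadNews(); return false;">News</a>'
    '<script>function loadNews() {'
    ' var r = new XMLHttpRequest();'
    ' r.open("GET", "/ajax/news.xml", true);'
    ' r.send(null); }</script>'
)

events, endpoints = discover(PAGE)
print(events)
print(endpoints)
```

The collected `events` list is the set of handlers an event-driven engine would need to trigger, and `endpoints` lists the asynchronous requests that could follow from them. Pattern matching like this misses URLs that are built at run time, which is why the issues above call for interpreting the JavaScript and dispatching real DOM events.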

The author also uses Ruby together with rbNarcissus and Watir to demonstrate these problems and possible solutions. Interested readers can consult the paper for the details, which are not covered here.

This crawling problem is not limited to Ajax. It is just as common in other next-generation Web application technologies such as Flash/Flex, WPF/E, and XUL. For Flash/Flex the problem may be even more serious: all ActionScript code is compiled before it runs, which improves efficiency but undoubtedly makes the "crawler's" job far harder.
