Introduction to Spider Technology for Simulating IE (Firefox)

Source: Internet
Author: rushed out of the universe
Time: 2007-5-21
Note: Please credit the author when reprinting.
Spider technology falls into two main parts: simulating a browser (IE, Firefox, etc.) and analyzing pages. The latter arguably does not belong to the spider at all. The first part is essentially an engineering problem, whose build time is fairly predictable; the second is an algorithmic problem, whose effort is hard to estimate.
1. Why simulate IE?
Some readers will ask: why simulate IE at all? Using the libraries that IE itself provides is much simpler. Such a reader probably comes from academia and only needs this capability for experiments (many page analysis algorithms are also designed on top of the IE libraries). IE has two major problems: 1) page access is slow; 2) page processing takes too long. The first arises because IE fetches everything the page needs for display (images, Flash, JS, and so on); although the downloads are multi-threaded, even fast downloads consume too much bandwidth. The second arises because IE must lay out the page (for example, working out where a table is displayed), which also takes a lot of time. Layout does have one benefit: it makes the absolute position of each element easy to obtain, which some algorithms, such as vision-based page segmentation, rely on. A production spider cannot afford to waste that much time and bandwidth, so you need to build your own IE-simulating spider.
2. What components does a simulated IE contain?
For a text-based search engine, text is the only thing that matters. With that in mind, we need to support at least the following IE-like functions: 1) cookies; 2) page encodings; 3) JS. Cookie support lets us access resources that require a user name and password (many forums require logging in first). Page encoding should not be a significant issue, but because many pages are careless about their encoding, it becomes one; fortunately, it is not a hard one. JS support is the most important step in simulating IE, because JavaScript is widely used on most pages in China, and Ajax will surely be everywhere in the future. JS is a step that cannot be skipped.
In addition, CSS is commonly used in page design. However, since CSS is only a layout specification and has little to do with the text actually displayed, we ignore it for now.
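As a rough illustration of the cookie part, the sketch below keeps name=value pairs per host and replays them on later requests. The class name `CookieJar`, its methods, and the host `forum.example.com` are assumptions made for this example; attributes such as Path and Expires are ignored, so this is not a complete cookie implementation.

```javascript
// Minimal cookie jar sketch: stores name=value pairs per host.
// Attributes such as Path, Expires and Secure are deliberately ignored here.
class CookieJar {
  constructor() {
    this.store = new Map(); // host -> Map(name -> value)
  }

  // Record one Set-Cookie header value received from `host`.
  setCookie(host, setCookieHeader) {
    const pair = setCookieHeader.split(";")[0]; // "name=value" comes first
    const eq = pair.indexOf("=");
    if (eq < 1) return; // ignore malformed headers
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (!this.store.has(host)) this.store.set(host, new Map());
    this.store.get(host).set(name, value);
  }

  // Build the Cookie request header to send back to `host`.
  cookieHeader(host) {
    const cookies = this.store.get(host);
    if (!cookies) return "";
    return [...cookies].map(([n, v]) => `${n}=${v}`).join("; ");
  }
}

// Hypothetical login response from a forum, followed by a later request.
const jar = new CookieJar();
jar.setCookie("forum.example.com", "sid=abc123; Path=/; HttpOnly");
jar.setCookie("forum.example.com", "user=spider; Max-Age=3600");
console.log(jar.cookieHeader("forum.example.com")); // sid=abc123; user=spider
```

Keying the jar by host keeps a login on one forum from leaking into requests for another site.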
3. How to simulate IE?
Cookie and page encoding support are matters of detail and are not discussed here. For JavaScript, we probably have only one option: Rhino (or its C counterpart, SpiderMonkey). Rhino is Mozilla's well-known open-source JavaScript engine, written in Java. It provides the basic library for interpreting and running JavaScript. Its official website is http://www.mozilla.org/rhino/. Rhino supports scripts that conform to the ECMA standard.
In fact, there is another option, which also uses Mozilla's open source code: extract the JS processing code from the Firefox source. That is, remove Gecko and the other rendering and security components from Firefox and keep the rest. Unfortunately, our earlier efforts in this direction ended in failure. A related project is mozillahtmlparser (http://sourceforge.net/project/showfiles.php?group_id=186646), which extracts the HTML parser from the Firefox source code. Note that, judging from the Firefox source, its HTML processing does not include script analysis (the script tag appears to be ignored).
4. What are the shortcomings of Rhino?
Rhino only provides basic parsing and execution of JS that conforms to the ECMA standard, which leaves it with many shortcomings. At least:
1) Rhino does not include the browser host objects
The browser host objects are the DOM objects defined by the HTML 4.0 standard. For example, Rhino cannot run the following call directly:
document.writeln("http://lotusroots.bokee.com> out of the universe");
because Rhino does not provide a document object.
The HTML standard defines many host objects, and we have to implement all of them ourselves.
2) The ECMA standard conflicts with JS as it is actually written
There are many conflicts of this kind. A typical example is that float is a reserved word in ECMA, while the CSS specification defines a float property. As a result, a page may contain code like this:
style.float = "left"
This line cannot be compiled by Rhino (Rhino first 'compiles' a script and then executes it).
3) Rhino's documentation is seriously lacking
As is well known, a major problem with open-source projects is the extreme lack of documentation. No document explains the core architecture or all of the interfaces. For Rhino, only a few simple example files can be found. To learn anything deeper, you have to read the code and its comments line by line.
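For the reserved-word conflict in point 2), one possible workaround (not described in this article, so treat it as an assumption) is to rewrite dot access into bracket access before the script reaches the engine. The function name `quoteReservedProperties` and the word list are hypothetical, and a plain regular expression like this does not understand strings or comments, so it is only a sketch of the idea.

```javascript
// Hypothetical preprocessing step: turn `obj.float` into `obj["float"]`
// so that an engine treating these words as reserved can still compile the script.
// The word list is illustrative, not exhaustive.
const RESERVED_AS_PROPERTY = ["float", "class", "char", "int"];

function quoteReservedProperties(source) {
  // Match a dot, optional whitespace, then one of the listed words as a whole word.
  const pattern = new RegExp(
    "\\.\\s*(" + RESERVED_AS_PROPERTY.join("|") + ")\\b",
    "g"
  );
  return source.replace(pattern, '["$1"]');
}

console.log(quoteReservedProperties('style.float = "left"'));
// style["float"] = "left"
```

Bracket access with a string key is valid under the ECMA standard regardless of whether the word is reserved, which is why the rewritten line can be compiled.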
5. What are the host objects?
The HTML standard (the DOM standard) only defines the objects that most browsers must support (none of which Rhino provides), such as HTMLDocument and HTMLTableElement. Of these, the DOM1 standard is currently the most widely used, parts of the DOM2 standard are used on some pages, and almost no pages use the DOM3 standard.
In addition, many objects are not part of these standards but are very common, for example the XMLHttpRequest object that underpins Ajax.
To support the JS features of mainstream browsers fully, we recommend following IE's behavior, especially that of IE 7.0, because this version supports almost all of the DOM standards as well as most of the widely accepted extensions used by Mozilla (such as XMLHttpRequest). Considering the most common cases, at least the following host objects must be supported:
Attribute, clientinformation, form, framewindow, history, document, input, location, Navigator, option, screen, script, select, style, table, tbody, TD, TR, textarea, text, title, window, xmldocument, XMLHttpRequest, etc.
In our experience, supporting these objects is enough to handle more than 90% of Chinese pages.
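To make the idea of supplying host objects yourself more concrete, here is a minimal sketch in plain JavaScript. It does not use Rhino's embedding API; instead it passes hand-written stand-ins to a page script as named parameters. The names `createHost` and `runPageScript`, the stub properties, and the URL are all assumptions made for this example.

```javascript
// Hypothetical stand-ins for a few host objects; a real module would
// implement far more of the DOM than this.
function createHost(pageUrl) {
  const written = []; // text emitted through document.write/writeln
  const document = {
    title: "",
    write(text) { written.push(String(text)); },
    writeln(text) { written.push(String(text) + "\n"); },
  };
  const navigator = { appName: "Microsoft Internet Explorer" };
  const location = { href: pageUrl };
  const window = { document, navigator, location };
  return { window, document, navigator, location, written };
}

// Run a page script with the host objects passed in as named parameters,
// then return everything the script wrote into the document.
function runPageScript(source, host) {
  const run = new Function("window", "document", "navigator", "location", source);
  run(host.window, host.document, host.navigator, host.location);
  return host.written.join("");
}

const host = createHost("http://example.com/index.html");
const output = runPageScript(
  'document.writeln("Hello from " + location.href);',
  host
);
console.log(output); // Hello from http://example.com/index.html (plus a newline)
```

Capturing what the script writes matters for a text-oriented spider, because text produced by document.write is part of what the user would see on the page.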
6. Practical results
In practice, file downloads need to be kept to a minimum, so JS files must be cached. That way, each visit only needs to download the page itself. After the page is downloaded, run its JS scripts and its startup event scripts, and initialization is complete. To obtain the links in the page, you also need to process the code in href and onclick.
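The two ideas in this paragraph, caching JS files and reading links out of href and onclick, might be sketched as follows. Everything here is an assumption for illustration: `ScriptCache`, `collectLinks`, and the `download` callback are hypothetical names, and the onclick handling only notices navigation through `location.href`, `location.assign`, `window.location` and `window.open`.

```javascript
// Cache for external JS files, keyed by URL. `download` is a hypothetical
// function supplied by the caller that returns the script text for a URL.
class ScriptCache {
  constructor(download) {
    this.download = download;
    this.scripts = new Map();
  }
  get(url) {
    if (!this.scripts.has(url)) this.scripts.set(url, this.download(url));
    return this.scripts.get(url);
  }
}

// Collect link targets from one element: the href value plus any navigation
// that its onclick code performs against stub window/location objects.
function collectLinks(href, onclick) {
  const links = [];
  if (href && !href.startsWith("javascript:")) links.push(href);
  if (onclick) {
    const location = {
      set href(url) { links.push(String(url)); },
      assign(url) { links.push(String(url)); },
    };
    const window = {
      get location() { return location; },
      set location(url) { links.push(String(url)); },
      open(url) { links.push(String(url)); },
    };
    try {
      new Function("window", "location", onclick)(window, location);
    } catch (err) {
      // The handler needs host objects this sketch does not provide; skip it.
    }
  }
  return links;
}

let downloads = 0;
const cache = new ScriptCache((url) => {
  downloads += 1;
  return "/* script from " + url + " */";
});
cache.get("http://example.com/common.js");
cache.get("http://example.com/common.js"); // served from the cache
console.log(downloads); // 1

console.log(collectLinks("list.html", "window.open('detail.html?id=7')").join(", "));
// list.html, detail.html?id=7
```

Running the onclick code against stubs, rather than pattern-matching its text, follows the article's approach of executing page scripts to find out what they do.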
We use the browser simulation module we built to access and traverse the deep web. With it, we only need to configure a searcher, and everything else is done automatically. (A generic searcher can also be built automatically, but generic searchers are rare in China; the searchers on the Google and Baidu pages are generic ones.)
During testing, we found that executing the JS scripts took less than 100 ms, so several pages can be processed per second.
Simply put, the point of building a simulated IE browser module is to save manual effort, which is necessary when many websites have to be traversed.
