JavaScript processing in search engine page analysis

Source: Internet
Author: User

When building a search engine, or when doing page analysis and data extraction, you often run into pages full of JavaScript. This is a nuisance, because a considerable amount of the page content is written out by these scripts: ordinary DOM parsing never sees that text, and extraction fails.

Of course, if the page template is fixed, it is not hard to build an information extraction template for that specific page: analyze by hand where the target information sits, then write a template for it. For general web search, however, this is not realistic. It so happened that a couple of days ago I discussed this problem with friends, and two ideas came out of it, which are offered here for reference.

1. Build a simplified JavaScript interpreter and execute the script fragments.

Writing a complete JavaScript interpreter is hard, but a simplified one is easy. We do not need complex libraries; it is enough to implement the basic JavaScript syntax plus the few functions that are involved in text output.

The goal is not to execute the JavaScript fully, but to combine the strings in the script according to its program logic and finally produce the script's complete text output. This is naturally imperfect: because many functions are not implemented, the generated string will not match the real output exactly. But barring surprises there should not be many omissions, since every function that actually outputs strings is implemented, and the strings can be combined following the logic that emits them.
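To make this concrete, here is a minimal sketch (my illustration, not code from the article) in Python. Instead of a real interpreter it handles only the degenerate case: it finds document.write()/writeln() calls, keeps the string literals in their arguments, and joins them. The function name extract_written_text and the document.write assumption are mine.

    import re

    # Minimal sketch: approximate the text a script would emit via
    # document.write()/document.writeln(), keeping only string literals.
    WRITE_CALL = re.compile(r"document\.write(?:ln)?\s*\(([^)]*)\)")
    STRING_LIT = re.compile(r"""(['"])(.*?)\1""", re.S)

    def extract_written_text(script: str) -> str:
        """Collect the literal parts of every document.write argument."""
        pieces = []
        for call in WRITE_CALL.finditer(script):
            args = call.group(1)
            # In an expression like "a" + someVar + "b", keep only "a" and "b";
            # variables and function calls are simply skipped.
            pieces.extend(m.group(2) for m in STRING_LIT.finditer(args))
        return "".join(pieces)

    if __name__ == "__main__":
        snippet = 'var city = "Beijing"; document.write("<p>Weather in " + city + ": sunny</p>");'
        print(extract_written_text(snippet))  # -> <p>Weather in : sunny</p>

A real simplified interpreter would also track variable assignments and simple concatenation, but even this crude version recovers text that DOM parsing alone never sees.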

Where the script branches on dynamic conditions that cannot be determined offline, such as the browser type, you can emit the results of both branches. Of course the two texts should not be merged into one; put a separator we can recognize between them.
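Continuing the same hypothetical sketch (it reuses the extract_written_text helper from the previous block), one way to realize this is to extract the text of each branch of an if/else separately and join the two with a marker the later extraction stage understands:

    import re  # extract_written_text is assumed to be defined as in the previous sketch

    # When a condition (browser type, date, ...) cannot be evaluated offline,
    # keep the output of both branches, separated by an agreed-upon marker.
    BRANCH_SEP = "\x1f"  # ASCII unit separator; any marker we can recognize later works

    IF_ELSE = re.compile(r"if\s*\([^)]*\)\s*\{(.*?)\}\s*else\s*\{(.*?)\}", re.S)

    def extract_with_branches(script: str) -> str:
        """Emit the text of both branches when the condition cannot be decided."""
        m = IF_ELSE.search(script)
        if not m:
            return extract_written_text(script)
        then_text = extract_written_text(m.group(1))
        else_text = extract_written_text(m.group(2))
        return then_text + BRANCH_SEP + else_text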

The advantage of this approach is performance: the interpreter can be very compact, and because the JavaScript is not fully executed it runs fast. The drawback is that a simplified interpreter produces results that differ from the real output. In general, though, it yields more information rather than less (the results of the different branches are emitted together), which is usually good enough for search engine page analysis.

2. Use an HTML rendering engine to parse the complete page, then extract data from the rendered result.

Use a browser's HTML rendering engine, such as Gecko (Firefox) or Trident (mshtml.dll, used by IE), to fully parse and render the page, and then analyze what the engine produced.
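The article has in mind embedding one of these engines directly. As a rough modern stand-in for the same idea (my substitution, not something the article describes), a headless browser driven from Python can do the rendering and hand back the text that the scripts generated:

    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    def rendered_text(url: str) -> str:
        """Return the page's visible text after its scripts have executed."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            text = page.inner_text("body")  # script-generated content is already in the DOM
            browser.close()
            return text

    if __name__ == "__main__":
        print(rendered_text("https://example.com")[:500])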

The result obtained this way is closest to what is actually displayed, because it is the page's real parsing result. The drawback is performance: every element of the page is fully processed, and much of that work is wasted when all you want is the text. If you have to analyze pages in large volume, you need to weigh the cost.
