For a long time, webmasters have chosen to use JavaScript to implement dynamic behavior on web pages, for a variety of reasons: speeding up page response, reducing site traffic, hiding links, or embedding ads. Because early search engines had no corresponding processing capability, indexing such pages was often problematic: valuable resources might fail to be collected, or the pages might be misjudged as cheating.
The purpose of introducing JavaScript parsing is to solve both of these problems; the result is to give the search engine a clearer view of what users actually see when they open the page. For example, some sites strip information such as user comments and ratings out of the HTML page and display it dynamically, via JavaScript or even Ajax, when the page is opened. An early search engine processing such a page would find content missing, which in turn affects the judgment of the page's value for indexing.
Introducing JavaScript parsing requires weighing its impact on the system in terms of design and implementation, parsing speed, and other aspects. This article analyzes, through several typical cases, how to design and implement a JavaScript parsing system for web pages, and briefly introduces the function of such a system and its influence on other parts of a search engine.
One, Discovering page links
In general, page links appear in HTML as <a> tags, with the link URL in the tag's href attribute. But some sites choose more "dynamic" approaches, of which two are common: one is to dynamically write or adjust <a> tags; the other is to trigger an event when the user clicks, changing the default way the link opens.
1. Dynamically writing or adjusting link tags
Abstractly speaking, how a web page achieves this effect, and indeed the other effects described later in this article, is much like putting an elephant into a fridge: three steps. Find the target to write or modify (find the elephant), prepare the content to write or modify (open the refrigerator door), and perform the write or modification (put the elephant in).
Mapped onto JavaScript, these three steps call three kinds of standard browser functions: page element location, data preparation, and page modification. The work of JavaScript parsing is therefore to provide these same functions, so that as the webmaster's JavaScript code calls them, the corresponding content and behavior are discovered naturally.
By this point, the functions that need to be implemented are almost settled, including the simpler ones:
document.getElementById // positioning
document.getElementsByTagName // positioning
document.getElementsByClassName // positioning
node.[firstChild/nextSibling/previousSibling/parentNode] // positioning
document.[createElement/createTextNode] // creating links
node.[appendChild/insertBefore/innerHTML=?] // writing content
element.getAttribute, element.setAttribute // setting attributes
element.href = ? // setting attributes
As for what is written, it might be stored within the JavaScript itself as an array, or it might be loaded dynamically with Ajax. The former is a built-in feature of the JavaScript language and needs no elaboration here; the latter is a separate topic, discussed later in this article.
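As an illustration, here is a minimal sketch of the kind of page code this covers (the container id and URLs are hypothetical); every call in it maps onto one of the functions listed above:

    // Locate the target container (positioning).
    var container = document.getElementById('content');
    // Prepare the link data (data preparation).
    var urls = ['/page1.html', '/page2.html'];
    // Write new <a> tags into the page (page modification).
    for (var i = 0; i < urls.length; i++) {
        var link = document.createElement('a');
        link.setAttribute('href', urls[i]);
        link.appendChild(document.createTextNode('link ' + i));
        container.appendChild(link);
    }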
2. Triggering an event on click to change the default link-opening behavior
Pages do this for various reasons: some to hide links, some to implement pop-up windows, some to splice the URL together in code, and some to check whether the link should be opened at all. But all of these reasons correspond to the same implementation method: adding a click event.
There are three ways to add a click event (illustrated in the sketch after this list):
Set the href attribute of an <a> tag to the form "javascript:func(...)"
Set the onclick attribute of an <a> tag, in the form onclick="js_code"
Call an event-binding function, such as my_link_node.addEventListener('click', func, false)
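A minimal sketch of the three forms side by side (element ids, function names, and URLs are hypothetical):

    <a href="javascript:openLink('/target1.html')">link 1</a>
    <a href="#" onclick="openLink('/target2.html'); return false">link 2</a>
    <a href="#" id="link3">link 3</a>
    <script>
    function openLink(url) { window.open(url); }
    // The third form: bind the handler through the DOM API.
    document.getElementById('link3').addEventListener('click', function () {
        location.href = '/target3.html';
    }, false);
    </script>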
Supporting these three methods is itself fairly simple; the points to note are how to trigger such click events and how to intercept the target URL once they are triggered.
To trigger the events, first collect all possible click events, then trigger them in turn. But before actually triggering each one, check whether it still exists, because an earlier click event may well have deleted the current one.
To intercept the URL, first implement the relevant page-jump functions, namely location.href = ?, window.open, and so on. Then, by setting a series of flags, connect each click with the page jump it causes, thereby obtaining the target URL.
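Inside the parsing system, that interception might look roughly like the following sketch (the recording structure and all names are assumptions for illustration, not any real engine's API):

    // Host-side stubs provided by the parsing system.
    var discoveredLinks = [];      // jump targets collected for the indexer
    var currentClickSource = null; // flag connecting a click to its jump

    window.open = function (url) {
        discoveredLinks.push({ source: currentClickSource, target: url });
    };
    // Assignment to location.href would be hooked the same way via a setter.

    // Trigger every collected click handler in turn.
    function simulateClicks(clicks) {
        for (var i = 0; i < clicks.length; i++) {
            // An earlier handler may have removed this node from the page.
            if (!document.contains(clicks[i].node)) continue;
            currentClickSource = clicks[i].node;
            clicks[i].handler.call(clicks[i].node); // run the page's click code
            currentClickSource = null;
        }
    }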
Two, Dynamic page content
Dynamic page content is a way to speed up page loading and increase a site's technical flexibility. By splitting out the changeable content (comments, ratings, and so on), the page divides into a static part and a dynamic part: the static content can be cached to speed up page display and reduce site traffic, while the dynamic content is simple in format and easy to generate, and also saves traffic.
On the other hand, dynamic content is also an important vehicle for loading ads and for content cheating, the most common case being dynamically written iframes, which were well hidden from early search engines.
At the technical level, the work required for dynamic page content largely coincides with "dynamically writing or adjusting <a> tags" in the previous section, with the classic document.write method added.
document.write is one of the earliest JavaScript functions; it writes a piece of HTML code directly into the page and is still widely used today. Early search engines supported this method, but the support was largely limited to character matching: it handled only the most straightforward case of writing a literal JavaScript string and was powerless against even slightly more complex text stitching. For JavaScript parsing, however, such code simply conforms to the language specification, so it can be fully supported, handling text stitching, conditional judgment, obfuscated code, and so on.
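For example, character matching can extract the URL from the first call below but not from the second, while an actual JavaScript engine evaluates both to the same HTML (the URL is hypothetical):

    // The straightforward form: character matching can find the URL.
    document.write('<a href="/promo.html">promo</a>');

    // The stitched form: only real evaluation recovers the same result.
    var h = '/pro' + 'mo' + '.ht' + 'ml';
    document.write('<a href="' + h + '">promo</a>');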
One thing worth discussing here is nested document.write: writing a script tag through document.write, where that script in turn contains another document.write. This kind of pattern is not uncommon in jump-cheating pages, and handling it takes more than JavaScript parsing alone; the HTML parser must also support processing nested HTML writes, which is not analyzed here.
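A minimal sketch of the nested pattern (the URL is hypothetical); note that the script tag itself is split so the outer script block is not terminated early:

    <script>
    // The outer write emits a new script tag...
    document.write('<scr' + 'ipt>' +
        // ...whose body performs another write when it is parsed.
        'document.write("<a href=\\"/hidden.html\\">hidden</a>");' +
        '</scr' + 'ipt>');
    </script>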
Through the above methods, the main information of a web page, as well as ads and other auxiliary information, is exposed, giving a better understanding of the webmaster's intent.
Three, Web page jumps
Page jumps are in some cases a necessary choice for achieving a page effect, but they are also used for cheating. Technically, most jumps take one of the following two forms:
Calling a page-jump function directly
Checking the search engine's UA, Referer, etc., and then calling a page-jump function
To recognize these, the core is to implement the page-jump functionality: the location object. Since this is technically the only distinct JavaScript jump mechanism, it will eventually be invoked no matter how the page's JavaScript code is obfuscated. Therefore, although the jump code on different pages looks enormously varied, recognizing it is simple.
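A typical cheating pattern of the second form might look like the following sketch (the UA substring and URL are hypothetical); however the strings are stitched or obfuscated, the jump ultimately flows through the location object:

    // Show one page to the crawler, send real visitors elsewhere.
    if (navigator.userAgent.indexOf('Spider') === -1) {
        // String stitching does not hide the final call.
        location.href = ['/ca', 'sino', '.html'].join('');
    }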
Four, About Ajax
Ajax is a very common web technology; essentially, while a web page is being displayed, a piece of data (perhaps HTML, perhaps something else) is fetched dynamically from the network, processed, and then displayed.
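A minimal sketch of the technique (the endpoint and element id are hypothetical):

    // Fetch comment data while the page is displayed, then show it.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/comments?page=42', true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            document.getElementById('comments').innerHTML = xhr.responseText;
        }
    };
    xhr.send();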
For this technology, the fundamental work is not implementing the XMLHttpRequest object but dealing with the impact on the search engine's crawler architecture. As is well known, a crawler works by fetching pages, traversing their links, and fetching again; its effort is concentrated on scheduling and crawl-pressure control, and the fetcher itself is relatively simple, usually without the ability to execute JavaScript after fetching a page and then fetch the Ajax data. Supporting Ajax therefore requires a technical upgrade on the fetcher side.
Analysis of the fetcher is beyond the scope of this article; interested readers can consult other relevant literature.
Summary
Through the preceding case analysis, we have summed up the basic work required to implement JavaScript parsing; with a certain amount of supporting infrastructure added, it constitutes a fairly complete system. Here we sort it out once more, dividing it into three parts:
1. Embed a JavaScript language engine in the HTML parser; for the engine, a mature open-source option such as V8 or SpiderMonkey can be chosen.
2. Implement the required functional functions, referring to the relevant W3C HTML and DOM specifications.
3. As a direct corollary, the so-called .js files must be crawled and included, since they are the source code that JavaScript parsing needs to parse.
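For part 2, the functional functions are host objects registered with the embedded engine. From the page script's point of view, a stubbed document might look roughly like the following sketch (the names and the recording structure are assumptions for illustration):

    // Minimal stand-in DOM exposed to page scripts by the parsing system.
    var foundLinks = [];  // what the indexer ultimately consumes

    var document = {
        elementsById: {},  // populated by the HTML parser
        getElementById: function (id) {
            return this.elementsById[id] || null;
        },
        createElement: function (tag) {
            var node = { tagName: tag, attributes: {}, children: [] };
            node.setAttribute = function (name, value) {
                this.attributes[name] = value;
                // Link discovery happens as a side effect of the page's own calls.
                if (tag === 'a' && name === 'href') foundLinks.push(value);
            };
            node.appendChild = function (child) { this.children.push(child); };
            return node;
        }
    };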
The features described in this article are only some of the more common JavaScript features; for a search engine to really see the actual page, further functionality must be implemented, along with support for resources such as HTML, CSS, and images.
Finally, for webmasters who want to use JavaScript, this article offers the following suggestions:
1. Do not use overly complex JavaScript techniques, which are not conducive to inclusion by search engines.
2. Do not block the crawling of .js files; otherwise the ability of JavaScript parsing will be limited.
3. Divide the static and dynamic parts of the site sensibly.