[Original] Some exploration of how the Google search engine captures JS content


Yesterday I posted on Weibo about Google indexing JS, Ajax, and Flash content. A few friends were quite interested, so I would like to go into some detail here.

 

First, read this article. The premise is that the reader has some understanding of how a search engine's indexing works; some of the basics are described in detail in my book.

Here is a brief review of the general, common indexing process (some details are omitted and only the key steps are kept; we only look at the process itself, without discussing special behaviors or whether they are reasonable). A rough code sketch of this loop follows the list:

 

    1. You submit your website, or the search engine notices it on its own (through external links on other sites).
    2. The crawler first reads your robots.txt file (if there is one) to determine which content must be skipped and which content may be crawled directly.
    3. By default, crawling starts from the homepage (or a specified page). The request is equivalent to opening the page in a browser and fetching the raw text (a 200 response); in general the crawler does not deliberately request related resources such as JS and CSS at this stage.
    4. If a 301 or 302 redirect is encountered, the target page is handled again from step 1. There is usually a limit on the number of redirects, mainly to avoid endless loops (as it happens, some of Google's own feature pages were in such a state yesterday). Very few search engines fail to support 30x redirects.
    5. After fetching the content, the crawler identifies and records the important parts such as the title, keywords, description, H1 headings, and navigation, and deals with character-encoding issues (countless details omitted here), then stores the page in the index (assuming there are no violations or tricks). At the same time, it extracts the links in the page (generally the href attribute of <a> tags; the anchor text inside <a>...</a> also counts as a lightweight keyword). If a page returns a 40x or 50x status, it is not indexed and the crawler skips to the next one.
    6. For each extracted link, the crawler checks whether it is within the scope of the site; if so, it returns to step 1 and continues crawling.
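
To make this concrete, here is a minimal sketch of such a crawl loop in JavaScript (Node 18 or newer, which provides fetch globally). It only illustrates the steps above and is not Google's actual implementation: the start URL is a placeholder, link extraction is a plain regex over the HTML (no JS execution), and robots.txt handling and politeness rules are left out.

    // Minimal crawl loop: fetch a page, record its title, extract <a href> links
    // with a plain regex (no JS execution), and queue links that stay on the site.
    const seen = new Set();

    async function crawl(startUrl, maxPages = 50) {
      const site = new URL(startUrl).host;
      const queue = [startUrl];

      while (queue.length > 0 && seen.size < maxPages) {
        const url = queue.shift();
        if (seen.has(url)) continue;
        seen.add(url);

        // fetch() follows 301/302 redirects on its own, up to a limit (step 4).
        const res = await fetch(url, { redirect: "follow" });
        if (!res.ok) continue; // 40x/50x: skip, nothing gets indexed (step 5)

        const html = await res.text();
        // "Indexing" reduced to printing the <title>; a real engine would also
        // store the description, headings, body text, and so on.
        const title = (html.match(/<title>([^<]*)<\/title>/i) || [])[1];
        console.log(url, "->", title);

        // Steps 5-6: pull links out of <a href="..."> and keep same-site ones.
        for (const m of html.matchAll(/<a[^>]+href=["']([^"'#]+)["']/gi)) {
          try {
            const link = new URL(m[1], url); // resolve relative URLs
            if (link.host === site) queue.push(link.href);
          } catch (e) { /* ignore malformed hrefs */ }
        }
      }
    }

    crawl("https://example.com/").catch(console.error); // placeholder start page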

Now, let's get down to business.

What made me wonder whether Google can actually index JS and Ajax content was the web page thumbnail feature launched by Google: I accidentally noticed that information a site fetched through Ajax (returned as JSON and then processed) was displayed normally in the thumbnail. That got me interested.

So I made the following assumptions and ran the following experiments (some of them started months ago and not all the tests are finished yet, but some preliminary conclusions can already be drawn for further verification):

 

  1. Assumption: Google can crawl the content inside JS (a prerequisite for executing Ajax).
    Experiment: put an Ajax call whose target URL appears as one complete string directly in the script (requested with both raw XMLHttpRequest and jQuery.get()), with that URL appearing nowhere else on the entire site or anywhere else on the web (see the sketch after this list).
    Result: the URL (that is, the corresponding page) was indexed.
    Conclusion: 1) Besides the <a> tags and the rest of the HTML, the content Google crawls also covers, or at least does not avoid, the JS on the page.
  2. Assumption: Google can execute some simple JS.
    Experiment: split the URL in the test code above, for example split "http://www.senpar.com/Home.xhtml?Id=123&add=1" into "http://www.senpar.com/Home.xhtml?Id=123" + "&add=1", or even wrap part of it in a call such as "http://www.senpar.com/Home.xhtml?Id=123" + getAddParams(). This does not change the resulting string in JS, but it defeats regular expressions and similar simple matching rules: a robot that cannot execute JS can only pick up the first half of the string, or may even treat the whole expression, plus signs and all, as the URL (note: on this site the standalone http://www.senpar.com/Home.xhtml?Id=123 returns a 404 error). This variant is also shown in the sketch after this list.
    Result: judging from the index status and crawl reports in Google Webmaster Tools, Google treated http://www.senpar.com/Home.xhtml?Id=123 as a 404 (not found), and the actual URL was not indexed.
    Conclusion: 1) Google's robots cannot really execute and parse JS; the pages that appear to be "captured from JS" are simply URLs extracted with simple rules and then requested. 2) Since JS cannot be executed, fetching Ajax content is essentially out of the question for the crawler. 3) The Ajax-loaded content that shows up in thumbnails is probably handled by a separate component that can execute JS and perform the Ajax calls, which amounts to rendering the page with a browser engine (technically not a problem). On closer observation the rendering looks quite similar to Chrome's (which can be probed with a series of hacks).

  3. Assumption: Google can also index Flash content.
    Experiment: (this test was done a long time ago, but this time JS was added to the mix) place two pieces of Flash content on the page, each containing one unique piece of text and one unique link; one is loaded via JS and the other is embedded directly with <object> markup (a sketch of the two embedding methods also follows this list).
    Result: neither Flash file is rendered in the Google thumbnail (the area is blank or shows the "Flash plug-in not installed" placeholder), but the text content inside the Flash files was indexed.
    Conclusion: 1) The browser engine that generates the thumbnails does not have the Flash plug-in installed (or disables it on purpose). 2) Given the SWF file's URL, Google can fetch and analyze the Flash file separately.
    Inference: if the SWF address introduced via JS is split up, the Flash content will probably not be indexed; the principle is the same as for the Ajax URL in experiment 2.
  4. Assumption: Google can index Silverlight content.
    Experiment: place a Silverlight file on the page (loaded through <object>; JS loading was not tested here) and set a distinctive "Silverlight not installed" fallback image.
    Result: the thumbnail shows the "Silverlight not installed" image, and the content inside the Silverlight application is not indexed.
    Conclusion: 1) The browser engine that generates the thumbnails does not have the Silverlight plug-in installed. 2) Google's crawler does not recognize the Silverlight XAP package (in fact, unzipping it and crawling the XAML inside would not be hard; the process would be similar to crawling Flash).
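
Below is a rough reconstruction of the test code from experiments 1 and 2 (my own sketch based on the description above, not the original source; it assumes jQuery is loaded and that elements with these ids exist on the page). The senpar.com URLs and the getAddParams() helper are the ones mentioned in the text.

    // Experiment 1: the target URL appears as one complete string literal.
    // This URL ended up being indexed by Google.
    $.get("http://www.senpar.com/Home.xhtml?Id=123&add=1", function (data) {
      document.getElementById("result1").innerHTML = data;
    });

    // The same request with raw XMLHttpRequest, again with the full URL string.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "http://www.senpar.com/Home.xhtml?Id=123&add=1", true);
    xhr.onload = function () {
      document.getElementById("result2").innerHTML = xhr.responseText;
    };
    xhr.send();

    // Experiment 2: the URL only exists after concatenation (or a helper call).
    // A robot that merely scans the source for URL-like strings sees only the
    // first half, which by itself returns 404 on this site -- exactly what
    // Google Webmaster Tools reported.
    function getAddParams() {
      return "&add=1";
    }
    $.get("http://www.senpar.com/Home.xhtml?Id=123" + getAddParams(), function (data) {
      document.getElementById("result3").innerHTML = data;
    });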

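And here is a sketch of the two embedding methods compared in experiment 3 (file names and element ids are hypothetical): one SWF is embedded with static <object> markup, the other is inserted by JS so that its URL never appears in the static HTML.

    <!-- Variant A: static <object> markup; the SWF URL is visible in the HTML. -->
    <object type="application/x-shockwave-flash" data="nav-a.swf" width="400" height="80">
      <param name="movie" value="nav-a.swf">
      Fallback text shown when the Flash plug-in is not installed.
    </object>

    <!-- Variant B: the same kind of element, created by JS at load time. -->
    <div id="flashHolder"></div>
    <script>
      var obj = document.createElement("object");
      obj.type = "application/x-shockwave-flash";
      obj.data = "nav-b.swf";
      obj.width = "400";
      obj.height = "80";
      document.getElementById("flashHolder").appendChild(obj);
    </script>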
 

These four experiments are done for now; some results are still to be confirmed. Additions are welcome.

 

From the conclusions above we can also draw a few small lessons. For example, when you want a large number of links to be followed and indexed but do not want them visible on the page (or you want to load them asynchronously via Ajax), and you neither want to risk being flagged for cheating with display:none nor want to resort to iframes, you can put them in JS, as long as each URL appears as one complete string. Conversely, to keep certain URLs away from the crawler, break the strings up, or simply use a robots.txt Disallow rule (a minimal example follows). In addition, if your traffic comes mainly from Google, you do not need to be so afraid of simple Flash navigation; just keep an eye on it.
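
For the robots.txt route, a minimal example looks like this (the /ajax/ path is just a stand-in for whatever endpoints you want to keep out of the index):

    User-agent: *
    Disallow: /ajax/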

 

The aim of this article is to get a more accurate picture of some of the less obvious behaviors of search robots, and to help you do SEO better. If there are omissions or errors, please point them out and add to the discussion :)

 
