[Original] Some exploration of how the Google search engine captures JS content


Yesterday I posted on Weibo about Google indexing JS, Ajax, and Flash content. A few friends were quite interested, so I would like to go into some detail here.

 

First, read this article. The premise is that the reader has some understanding of how a search engine's indexing works; some of the basics are described in detail in my book.

Here is a brief review of the general, common indexing process (some details are omitted and only the key steps are kept; we only look at the process itself, without discussing special behaviors or whether they are reasonable). A rough code sketch of this loop follows the list:

 

    1. You submit your website, or the search engine notices it on its own (through external links on other sites).
    2. The crawler first reads your robots.txt file (if there is one) to determine which content must be skipped and which content may be crawled directly.
    3. By default, crawling starts from the homepage (or a specified page). The request is equivalent to opening the page in a browser and fetching the raw text (a 200 response); in general the crawler does not deliberately request related resources such as JS and CSS at this stage.
    4. If a 301 or 302 redirect is encountered, the target page is handled again from step 1. There is usually a limit on the number of redirects, mainly to avoid endless loops (as it happens, some of Google's own feature pages were in such a state yesterday). Very few search engines fail to support 30x redirects.
    5. After fetching the content, the crawler identifies and records the important parts such as the title, keywords, description, H1 headings, and navigation, and deals with character-encoding issues (countless details omitted here), then stores the page in the index (assuming there are no violations or tricks). At the same time, it extracts the links in the page (generally the href attribute of <a> tags; the anchor text inside <a>...</a> also counts as a lightweight keyword). If a page returns a 40x or 50x status, it is not indexed and the crawler skips to the next one.
    6. For each extracted link, the crawler checks whether it is within the scope of the site; if so, it returns to step 1 and continues crawling.
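
To make this concrete, here is a minimal sketch of such a crawl loop in JavaScript (Node 18 or newer, which provides fetch globally). It only illustrates the steps above and is not Google's actual implementation: the start URL is a placeholder, link extraction is a plain regex over the HTML (no JS execution), and robots.txt handling and politeness rules are left out.

    // Minimal crawl loop: fetch a page, record its title, extract <a href> links
    // with a plain regex (no JS execution), and queue links that stay on the site.
    const seen = new Set();

    async function crawl(startUrl, maxPages = 50) {
      const site = new URL(startUrl).host;
      const queue = [startUrl];

      while (queue.length > 0 && seen.size < maxPages) {
        const url = queue.shift();
        if (seen.has(url)) continue;
        seen.add(url);

        // fetch() follows 301/302 redirects on its own, up to a limit (step 4).
        const res = await fetch(url, { redirect: "follow" });
        if (!res.ok) continue; // 40x/50x: skip, nothing gets indexed (step 5)

        const html = await res.text();
        // "Indexing" reduced to printing the <title>; a real engine would also
        // store the description, headings, body text, and so on.
        const title = (html.match(/<title>([^<]*)<\/title>/i) || [])[1];
        console.log(url, "->", title);

        // Steps 5-6: pull links out of <a href="..."> and keep same-site ones.
        for (const m of html.matchAll(/<a[^>]+href=["']([^"'#]+)["']/gi)) {
          try {
            const link = new URL(m[1], url); // resolve relative URLs
            if (link.host === site) queue.push(link.href);
          } catch (e) { /* ignore malformed hrefs */ }
        }
      }
    }

    crawl("https://example.com/").catch(console.error); // placeholder start page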

Now, let's get down to business.

What made me wonder whether Google can actually index JS and Ajax content was the web page thumbnail feature launched by Google: I accidentally noticed that information a site fetched through Ajax (returned as JSON and then processed) was displayed normally in the thumbnail. That got me interested.

So I made the following assumptions and ran the following experiments (some of them started months ago and not all the tests are finished yet, but some preliminary conclusions can already be drawn for further verification):

 

  1. Assumption: Google can crawl the content inside JS (a prerequisite for executing Ajax).
    Experiment: put an Ajax call whose target URL appears as one complete string directly in the script (requested with both raw XMLHttpRequest and jQuery.get()), with that URL appearing nowhere else on the entire site or anywhere else on the web (see the sketch after this list).
    Result: the URL (that is, the corresponding page) was indexed.
    Conclusion: 1) Besides the <a> tags and the rest of the HTML, the content Google crawls also covers, or at least does not avoid, the JS on the page.
  2. Assumption: Google can execute some simple JS.
    Experiment: split the URL in the test code above, for example split "http://www.senpar.com/Home.xhtml?Id=123&add=1" into "http://www.senpar.com/Home.xhtml?Id=123" + "&add=1", or even wrap part of it in a call such as "http://www.senpar.com/Home.xhtml?Id=123" + getAddParams(). This does not change the resulting string in JS, but it defeats regular expressions and similar simple matching rules: a robot that cannot execute JS can only pick up the first half of the string, or may even treat the whole expression, plus signs and all, as the URL (note: on this site the standalone http://www.senpar.com/Home.xhtml?Id=123 returns a 404 error). This variant is also shown in the sketch after this list.
    Result: judging from the index status and crawl reports in Google Webmaster Tools, Google treated http://www.senpar.com/Home.xhtml?Id=123 as a 404 (not found), and the actual URL was not indexed.
    Conclusion: 1) Google's robots cannot really execute and parse JS; the pages that appear to be "captured from JS" are simply URLs extracted with simple rules and then requested. 2) Since JS cannot be executed, fetching Ajax content is essentially out of the question for the crawler. 3) The Ajax-loaded content that shows up in thumbnails is probably handled by a separate component that can execute JS and perform the Ajax calls, which amounts to rendering the page with a browser engine (technically not a problem). On closer observation the rendering looks quite similar to Chrome's (which can be probed with a series of hacks).

  3. Assumption: Google can also index Flash content.
    Experiment: (this test was done a long time ago, but this time JS was added to the mix) place two pieces of Flash content on the page, each containing one unique piece of text and one unique link; one is loaded via JS and the other is embedded directly with <object> markup (a sketch of the two embedding methods also follows this list).
    Result: neither Flash file is rendered in the Google thumbnail (the area is blank or shows the "Flash plug-in not installed" placeholder), but the text content inside the Flash files was indexed.
    Conclusion: 1) The browser engine that generates the thumbnails does not have the Flash plug-in installed (or disables it on purpose). 2) Given the SWF file's URL, Google can fetch and analyze the Flash file separately.
    Inference: if the SWF address introduced via JS is split up, the Flash content will probably not be indexed; the principle is the same as for the Ajax URL in experiment 2.
  4. Assumption: Google can index Silverlight content.
    Experiment: place a Silverlight file on the page (loaded through <object>; JS loading was not tested here) and set a distinctive "Silverlight not installed" fallback image.
    Result: the thumbnail shows the "Silverlight not installed" image, and the content inside the Silverlight application is not indexed.
    Conclusion: 1) The browser engine that generates the thumbnails does not have the Silverlight plug-in installed. 2) Google's crawler does not recognize the Silverlight XAP package (in fact, unzipping it and crawling the XAML inside would not be hard; the process would be similar to crawling Flash).
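
Below is a rough reconstruction of the test code from experiments 1 and 2 (my own sketch based on the description above, not the original source; it assumes jQuery is loaded and that elements with these ids exist on the page). The senpar.com URLs and the getAddParams() helper are the ones mentioned in the text.

    // Experiment 1: the target URL appears as one complete string literal.
    // This URL ended up being indexed by Google.
    $.get("http://www.senpar.com/Home.xhtml?Id=123&add=1", function (data) {
      document.getElementById("result1").innerHTML = data;
    });

    // The same request with raw XMLHttpRequest, again with the full URL string.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "http://www.senpar.com/Home.xhtml?Id=123&add=1", true);
    xhr.onload = function () {
      document.getElementById("result2").innerHTML = xhr.responseText;
    };
    xhr.send();

    // Experiment 2: the URL only exists after concatenation (or a helper call).
    // A robot that merely scans the source for URL-like strings sees only the
    // first half, which by itself returns 404 on this site -- exactly what
    // Google Webmaster Tools reported.
    function getAddParams() {
      return "&add=1";
    }
    $.get("http://www.senpar.com/Home.xhtml?Id=123" + getAddParams(), function (data) {
      document.getElementById("result3").innerHTML = data;
    });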

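And here is a sketch of the two embedding methods compared in experiment 3 (file names and element ids are hypothetical): one SWF is embedded with static <object> markup, the other is inserted by JS so that its URL never appears in the static HTML.

    <!-- Variant A: static <object> markup; the SWF URL is visible in the HTML. -->
    <object type="application/x-shockwave-flash" data="nav-a.swf" width="400" height="80">
      <param name="movie" value="nav-a.swf">
      Fallback text shown when the Flash plug-in is not installed.
    </object>

    <!-- Variant B: the same kind of element, created by JS at load time. -->
    <div id="flashHolder"></div>
    <script>
      var obj = document.createElement("object");
      obj.type = "application/x-shockwave-flash";
      obj.data = "nav-b.swf";
      obj.width = "400";
      obj.height = "80";
      document.getElementById("flashHolder").appendChild(obj);
    </script>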
 

These four experiments are done for now; some results are still to be confirmed. Additions are welcome.

 

From the conclusions above we can also draw a few small lessons. For example, when you want a large number of links to be followed and indexed but do not want them visible on the page (or you want to load them asynchronously via Ajax), and you neither want to risk being flagged for cheating with display:none nor want to resort to iframes, you can put them in JS, as long as each URL appears as one complete string. Conversely, to keep certain URLs away from the crawler, break the strings up, or simply use a robots.txt Disallow rule (a minimal example follows). In addition, if your traffic comes mainly from Google, you do not need to be so afraid of simple Flash navigation; just keep an eye on it.
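
For the robots.txt route, a minimal example looks like this (the /ajax/ path is just a stand-in for whatever endpoints you want to keep out of the index):

    User-agent: *
    Disallow: /ajax/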

 

The aim of this article is to get a more accurate picture of some of the less obvious behaviors of search robots, and to help you do SEO better. If there are omissions or errors, please point them out and add to the discussion :)

 
