How search engines index Web pages


For SEO (search engine optimization), getting the pages of a site indexed and included by search engines promptly and comprehensively should be the first task; it is the basic precondition for every other SEO strategy. However, the importance of this step is often overestimated. For example, people frequently point to the number of pages Google has indexed for their site, several thousand or even tens of thousands, as proof that their SEO work has succeeded. Objectively speaking, though, merely being indexed and included has little practical value; such pages can easily remain buried in the vastness of the Web. What matters far more is getting a page onto the first few SERPs (search results pages) for specific search terms. Many people believe that having as many pages as possible included in the search engine's index can do no harm, since more indexed pages mean more chances of exposure, even if the ultimate effect remains doubtful.

That said, if a site's SEO effort focuses on how quickly and efficiently its pages are indexed and included, that is understandable. To achieve this, we need to understand how search engines collect and index Web pages. Below we take Google as an example and walk through the process by which a search engine collects and indexes pages, in the hope that it will be useful. Other search engines such as Yahoo!, Live Search, and Baidu may differ in the details, but the basic strategy should be similar.

1. Collecting the URLs of pages to be indexed

The number of Web pages on the Internet is astronomical, and countless new pages are added every day; a search engine must first find the objects to be indexed.

Specifically for Google, there is some dispute over whether Googlebot is really divided into a Deepbot and a Freshbot, and whether those are the right names; the names themselves are not important. At least for now, the mainstream view is that a significant portion of Google's robots do not actually index pages, and we will call them Freshbot here. Their task is to scan the Internet every day in order to discover and maintain a large list of URLs for Deepbot to use; in other words, when they visit and read a page, the goal is not to index it but to find all the links it contains. This sounds inefficient and therefore somewhat hard to believe, but we can judge it by a simple observation: Freshbot has no "exclusivity" when scanning a page, meaning that several robots located in different Google data centers may visit the same page within a short period, such as a day or even an hour. Deepbot, by contrast, does not behave this way when indexing and caching pages: Google restricts that work to the robots of a single data center, and two data centers never index the same version of a page at the same time. If that account is correct, then the fact that server access logs often show Googlebot hitting the same page from different IPs within a very short time can be taken as evidence that Freshbot exists. It also means that frequent Googlebot visits to a site are not necessarily cause for early celebration: the robots may not be indexing pages at all, merely scanning for URLs.

The information Freshbot records includes the URL of the page, a time stamp (the timestamp of the page's creation or last update), and the page's head information. (Note: this last point is disputed; many believe that Freshbot does not read anything from the target page and leaves that work to Deepbot. I lean toward the former version, because the URL list Freshbot submits to Deepbot already excludes pages that the site has blocked from indexing, in order to improve efficiency. Besides robots.txt, a considerable share of such blocking is implemented with a "noindex" robots meta tag, and it is hard to see how that could be honored without reading at least the head of the target page.) If a page is inaccessible, for example because of a network outage or server failure, Freshbot notes the URL and retries it later, but does not add it to the list submitted to Deepbot until the URL becomes accessible.
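To make the robots.txt and "noindex" discussion above more concrete, here is a minimal Python sketch of the two exclusion checks a crawler would apply before adding a URL to its list. It only illustrates the general mechanism, not Google's implementation; the example URL is hypothetical.

```python
# Sketch of the two exclusion checks mentioned above: robots.txt and the
# "noindex" robots meta tag. Illustrative only; the URL is hypothetical.
import re
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

def is_blocked_by_robots(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the site's robots.txt forbids this user agent from fetching url."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return not rp.can_fetch(user_agent, url)

def has_noindex_meta(url: str) -> bool:
    """Fetch the page and look for <meta name="robots" content="...noindex...">."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read(65536).decode("utf-8", errors="replace")  # the head is enough
    pattern = r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex'
    return re.search(pattern, html, re.IGNORECASE) is not None

if __name__ == "__main__":
    url = "https://www.example.com/some-page.html"  # hypothetical URL
    if is_blocked_by_robots(url) or has_noindex_meta(url):
        print("Skip: page asks not to be indexed")
    else:
        print("Eligible for the URL list handed to the indexing crawler")
```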

Overall, Freshbot places relatively little load on server bandwidth and resources. Finally, Freshbot classifies the recorded information by priority and submits it to Deepbot. The main priority classes are the following:

A: Newly created pages;
B: Old pages with a new time stamp, i.e., pages that have been updated;
C: Pages using 301/302 redirects;
D: Complex dynamic URLs, such as dynamic URLs with multiple parameters; Google may need extra work to analyze their content correctly. (As Google's ability to handle dynamic pages has improved, this class may have been dropped);
E: Other file types, such as links to PDF or DOC files; indexing these files may also require extra work;
F: Old pages with an old time stamp, i.e., pages that have not been updated. Note that the timestamp here is matched not against the date shown in Google's search results but against the date stored in Google's index database;
G: Erroneous URLs, i.e., pages that return a 404 response.

Priority descends in order from A to G. It is worth stressing that priority is relative: for two equally new pages, the priority can differ greatly depending on the quality and number of links pointing to them, with pages linked from relevant, authoritative sites receiving higher priority. Moreover, the priorities above apply only to pages within the same site; different sites themselves carry different weight. In other words, for pages on an authoritative site, even the lowest-priority 404 URLs may be handled ahead of the highest-priority new pages on many other sites.
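As a rough illustration of how such a priority list might be consumed, the sketch below feeds the A to G classes into a simple priority queue. The class letters follow the list above; everything else (the queue design, the URLs) is hypothetical and only shows the ordering idea, not how Google actually schedules Deepbot.

```python
# Minimal sketch: the A-G classes above drive the order in which URLs are
# handed to the indexing crawler. Illustrative only, not Google's design.
import heapq
import itertools

# Lower number = higher priority, matching the A (highest) to G (lowest) order.
PRIORITY = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}

class UrlQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable

    def add(self, url: str, category: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[category], next(self._counter), url))

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

queue = UrlQueue()
queue.add("https://example.com/old-page.html", "F")   # old page, old timestamp
queue.add("https://example.com/new-page.html", "A")   # newly created page
queue.add("https://example.com/moved.html", "C")      # 301/302 redirect

print(queue.pop())  # the new page (class A) comes out first
```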

2. Indexing and inclusion of Web pages

The next step is the actual indexing and inclusion of the pages. As the introduction above suggests, the URL list submitted by Freshbot is very large, and depending on language, site location, and similar factors, the indexing work for a particular site is assigned to a specific data center. Because of the sheer volume of data, the whole indexing process may take weeks or even longer to complete.

As noted above, Deepbot indexes higher-priority sites and pages first: the higher the priority, the sooner a page enters the Google index database and eventually appears on the search results pages. For a new page, once this stage begins, the page can show up in Google's index even before the whole process is finished. Many readers will have seen, when searching with "site:admin5.com", pages labeled as supplemental results that show only a URL, or a title and URL without a description; this is the normal state of a page at this stage. Once Google has actually read, analyzed, and cached the page, it escapes from the supplemental results and is displayed with its normal information, provided the page has enough links, especially links from authoritative sites, and the index contains no record whose content is identical or nearly identical to it (duplicate-content filtering).
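The duplicate-content filtering mentioned above can be pictured with a very small sketch: compare a candidate page with an already-indexed page by the overlap of their word "shingles". This is only a toy illustration of the idea; the actual filter Google uses is not public, and the sample texts are made up.

```python
# Toy illustration of duplicate-content filtering via word-shingle overlap.
# Not Google's algorithm; sample texts are invented.
import re

def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word windows in the text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

page_a = "How search engines collect and index web pages, explained step by step."
page_b = "How search engines collect and index web pages, explained in detail."

# A new page whose similarity to an indexed page is very high would be treated
# as duplicate content and kept out of, or demoted in, the main index.
print(f"similarity = {similarity(page_a, page_b):.2f}")
```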

As for dynamic URLs, although Google now claims to handle them without obstacles, observation still shows that a dynamic URL is far more likely to end up in the supplemental results than a static one, and it usually takes more, and more valuable, links for such a page to escape them.
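One concrete reason dynamic URLs mean extra work is that the same content is often reachable under many parameter variants, so a crawler has to normalize them before it can tell they are one page. The sketch below shows that normalization in miniature; the stripped parameter names and the URLs are just common illustrative examples, not a list any particular engine uses.

```python
# Sketch of URL canonicalization for dynamic URLs: drop session-style
# parameters and sort the rest so variant URLs collapse to one page.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"sessionid", "sid", "phpsessid"}  # illustrative examples

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_PARAMS]
    query.sort()  # stable ordering so ?a=1&b=2 and ?b=2&a=1 collapse to one URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))

u1 = "http://example.com/item.php?id=42&cat=3&sessionid=abc123"
u2 = "http://example.com/item.php?cat=3&id=42"
print(canonicalize(u1) == canonicalize(u2))  # True: treated as the same page
```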

For class "F" above, pages that have not been updated, Deepbot compares the page's timestamp with the date recorded in the Google index database; if they match, the page information shown in the search results may simply not be refreshed, the point being only to make sure the latest version is the one held in the index. For class "G", the 404 URLs, Deepbot looks for the corresponding record in the index and, if one exists, deletes it.
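A hedged sketch of that bookkeeping for classes F and G, using an ordinary conditional HTTP request (If-Modified-Since) against a toy in-memory index. It only illustrates the idea of re-checking timestamps and purging 404 URLs; the index store and URL are invented, and this is not how Google's index actually works.

```python
# Sketch of class F (not updated) and class G (404) handling with a
# conditional GET. The "index" dict and URL are hypothetical.
import urllib.error
import urllib.request

index = {
    # url -> timestamp we indexed last time, sent back as If-Modified-Since
    "https://example.com/old-page.html": "Mon, 01 Jan 2024 00:00:00 GMT",
}

def revisit(url: str) -> None:
    req = urllib.request.Request(url)
    if url in index:
        req.add_header("If-Modified-Since", index[url])  # class F check
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            index[url] = resp.headers.get("Last-Modified", index.get(url, ""))
            print("changed or new: re-index", url)
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("not modified: keep the indexed copy of", url)   # class F
        elif err.code == 404:
            index.pop(url, None)                                    # class G
            print("gone: removed", url, "from the index")
        else:
            print("retry later:", url, err.code)

revisit("https://example.com/old-page.html")
```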

3. Synchronization between data centers

As mentioned earlier, Deepbot's indexing of a page is done by one particular data center; multiple data centers do not read the page at the same time to obtain its latest version. So once the indexing process is complete, a data synchronization step is needed to propagate the latest version of the page to all the data centers.

This is the famous Google Dance. Since the BigDaddy update, however, synchronization between data centers is no longer concentrated in a specific time window but happens continuously and with much lower latency. Small differences between data centers still exist, but they are minor and persist only briefly.

As for improving the efficiency with which search engines index your pages, the process described above suggests that to get your pages included as quickly as possible, you should optimize at least the following aspects.

Increase the number and quality of backlinks to the site. Links from authoritative sites let your site and pages be "seen" by search engines at the earliest opportunity. This is, of course, a cliché, but as the introduction shows, to get pages indexed efficiently you must first let the search engine find them, and links are the only path by which a search engine discovers a page (the "only" is slightly controversial; see the Sitemaps point below). From this perspective, submitting your site to search engines is neither necessary nor particularly meaningful; obtaining links from external sites is the fundamental way to get included, and high-quality links are also the key factor in lifting a page out of the supplemental results.

Web design should follow the principle of being "search engine friendly": design and optimize pages from the perspective of the search engine spider, and make sure internal links are "visible" to it. Compared with the difficulty of obtaining links from external sites, planning internal links sensibly is a more economical and effective way to improve indexing and inclusion efficiency, unless, of course, the site is not included by search engines at all.

If your site uses dynamic URLs, or its navigation menus rely on JavaScript, then when you run into obstacles getting pages included, this is the first place to look.
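A minimal sketch of what "visible to the search engine spider" means in practice: a crawler that only parses HTML collects ordinary <a href> links and never sees navigation that exists only in JavaScript. The markup below is a made-up example.

```python
# Sketch: an HTML-parsing crawler finds plain <a href> links, while a link
# produced only by a JavaScript handler never enters its URL list.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

html = """
<a href="/products.html">Products</a>
<span onclick="location.href='/hidden-by-js.html'">Hidden menu item</span>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/products.html'] -- the JS-only link is invisible
```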

Use Sitemaps. In fact, many people believe that one of the main reasons Google could retire Freshbot is the wide adoption of the Sitemaps (XML) protocol: Google can simply read the sitemap a site provides instead of having Freshbot scan the Web laboriously. This argument does make some sense. Whether Google uses Sitemaps directly as Deepbot's index list or merely as a roadmap for Freshbot's scanning is unclear, but it is an indisputable fact that Sitemaps improve the efficiency with which a site is indexed and included. For example, SEO Exploration has run the following tests:

Two pages received the same links; one was added to the sitemap and the other was not. The page in the sitemap was included very quickly, while the other page was only included after a long time.

An "island" page with no links pointing to it was added to the sitemap; after a while it was also indexed by Google, though it appeared only in the supplemental results.

Of course, the fact that pages not listed in a sitemap can still be indexed by Google shows that Google continues to use Freshbot or a similar mechanism, which is easy to understand: there are still so many sites that do not use Sitemaps, and Google cannot shut them out.

For more information about Sitemaps, please refer to "Google Sitemaps: Google's back door". Note that the Sitemaps protocol has since become an industry standard, supported not only by Google but also by the other mainstream search engines, including Yahoo!, Live Search, and Ask.
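To close the Sitemaps point, here is a minimal sketch of producing a sitemap.xml that follows the Sitemaps protocol. The URLs and dates are invented, and submitting or referencing the file (for example from robots.txt or a webmaster console) is left out.

```python
# Minimal sketch of generating a sitemap.xml per the Sitemaps protocol
# (https://www.sitemaps.org/protocol.html). URLs and dates are made up.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.ElementTree(urlset)

pages = [
    ("https://www.example.com/", "2008-01-01"),
    ("https://www.example.com/new-article.html", "2008-01-15"),
]

tree = build_sitemap(pages)
tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```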
