Part 1: Getting your web pages into the search engine index


How many pages are indexed on my site?

If you want to know how many of your site's pages are indexed, start with a simple test. Go to Google or another search engine you like and search for your company name. If the name is a common one (such as AAA Plumbing or Acme Industries), add your region (AAA Plumbing Peoria) or the company's best-known product (Acme Industries sheet metal), and check whether your site is found.

If your web site is not found at all, there are usually two reasons for a site not being indexed:

    • The site is new. If the web site was just created and no site already in the search index links to it, the search engines simply have not found it yet. In this case, you only need to get a few other sites to link to yours.
    • The site is banned. If a search engine decides that your site uses unethical ("black hat") SEO techniques, all of your pages may be deleted from its index. If you find yourself in this unhappy situation, hire a search marketing specialist to analyze the site and find out where it violates the engine's guidelines. After you fix the problems, ask the search engines for "forgiveness".

If you are lucky, entering the company name in a search engine finds at least one page of your web site. Any given search engine, though, typically indexes only some of your pages, and it is far better if almost all of them are indexed. The more pages that go unindexed, the more likely your site's potential visitors will be directed to your competitors (whose pages are indexed).

Inclusion Rate

First, compute the inclusion ratio: the percentage of pages indexed by search engines out of the total number of pages on your site. The ideal inclusion rate is of course 100%, but slightly lower is still satisfactory. If less than 50% of your pages are in the search index, take the problem seriously.

To calculate the inclusion rate, divide the number of your pages in the search engine's index by the total number of pages on your site. If your web site is relatively small, estimating the total page count is easy, but for large sites it can be hard to find out how many pages there are. For a large site, you can estimate the number of pages in several ways:

    • Ask the web administrator. The administrator has probably been asked this question before and may already have researched it.
    • Count the documents in the content management system. Generally, each document produces one unique page, so this gives an estimate of the page count.
    • Use a tool. Programs such as OptiSpider or Xenu can crawl the site and report how many pages they find (see References).

After estimating the size of your web site, you need to find out how many of its pages are indexed. Google, Yahoo! Search, and MSN Search all provide the site: operator, which reports the information you need. Enter site: followed by your domain name (such as site:kodak.com) and view the returned results. An even more convenient option is the free MarketLeap Search Engine Saturation tool (see References), which displays the number of pages any site has in each major search index.
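
A worked example, with hypothetical numbers: if a crawl of your site with Xenu reports 5,000 pages and site:example.com shows 3,800 of them in Google's index, your inclusion rate is 3,800 / 5,000 = 76%, which is tolerable. If only 2,000 pages were indexed, the rate would be 40%, low enough to demand investigation.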



Crawler paths

What should you do if the calculated inclusion rate is poor? First, review how search engines index pages. A search engine uses specially designed programs, called spiders or crawlers, to examine the pages on a site.

Crawlers collect the HTML of each page and record its links to other pages, so that they can then collect the HTML of those pages too. As you can imagine, given enough time a crawler eventually finds every page on the web (at least every page that some other page links to). Fetching a page, finding all the links on it, and then fetching the pages they point to is called "crawling the web".

Because crawlers work this way, creating links to every page simplifies their job of indexing your site; we call these techniques spider paths (or crawler paths). Your site may already have the most important kind of crawler path: a site map. If a site contains only a few pages, the site map can simply list and link to every page on the site.
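
For a small site, the site map can be nothing more than a plain page of HTML anchor links, which is exactly what crawlers follow best. A minimal sketch, with hypothetical page names:

    <html>
    <head><title>Site Map</title></head>
    <body>
      <h1>Site Map</h1>
      <!-- One plain anchor link per page; no JavaScript, forms, or Flash -->
      <ul>
        <li><a href="/index.html">Home</a></li>
        <li><a href="/products.html">Products</a></li>
        <li><a href="/support.html">Support</a></li>
        <li><a href="/contact.html">Contact Us</a></li>
      </ul>
    </body>
    </html>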

However, a site map should not contain more than about 100 links, so a large site's map must instead link to category pages, which in turn link to the site's other pages. The largest web sites are often divided into sub-sites for different countries; these call for a special kind of site map, called a country map, that lists each country and links to the home page of each country's site. Crawlers like this technique very much. (For more information, see the examples of large site maps in References.)

A site map helps only after a crawler reaches your site, but there are more active ways to get pages indexed. Google and Yahoo! both provide inclusion programs for indexing pages. Google's beta program, called Sitemaps (see References), is free of charge and provides several ways to tell Google's crawler where your pages are; you can even ask Google to update the index of some of your pages more frequently. Yahoo! provides a paid inclusion program, Site Match (see References), which promises to re-index your pages within 48 hours. (Google makes no commitment about timing.)
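
Google's Sitemaps program works by reading an XML file that lists your URLs. The sketch below uses the sitemap protocol's published format; the URLs and dates are hypothetical, and you should check the program's documentation for the exact schema version it expects:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- sitemap.xml: tells the crawler which URLs exist and how often they change -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2006-01-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>http://www.example.com/products.html</loc>
        <changefreq>monthly</changefreq>
      </url>
    </urlset>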

RSS feeds provide another way to get new pages indexed quickly when they are published. Use Ping-O-Matic! (see References) to notify the search engines that a new entry has been added to your RSS feed; the entry is often indexed within a day or two.
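
Behind the scenes these pings are weblogUpdates.ping XML-RPC calls, which take two string parameters: the site's name and its URL. A sketch of the request, assuming Ping-O-Matic's XML-RPC endpoint at rpc.pingomatic.com and a hypothetical blog (verify the endpoint before relying on it):

    POST / HTTP/1.0
    Host: rpc.pingomatic.com
    Content-Type: text/xml

    <?xml version="1.0"?>
    <methodCall>
      <methodName>weblogUpdates.ping</methodName>
      <params>
        <!-- Parameter 1: the site's name; parameter 2: its URL -->
        <param><value>Example Widgets Blog</value></param>
        <param><value>http://www.example.com/blog/</value></param>
      </params>
    </methodCall>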



Keeping crawler paths clear

A hiking club may send trailblazers ahead to scout and mark the route, but someone must clear those trails regularly or they become overgrown and impassable. The same is true of crawler paths: unless you check them frequently, they can become blocked.

If you ignore how crawlers work, a crawler path can easily turn into a crawler trap. Pages that work well for people can stop crawlers cold. Crawlers are automated, so they do not fill in a registration form the way a human visitor does. If following a link on your site requires anything more than an ordinary HTML anchor tag, the link may be invisible to the crawler.

This means that JavaScript, Flash, frames, and cookies can all cause problems. If your pages cannot be displayed without these technologies, crawlers will not index them; and if these technologies are required to follow a link, crawlers cannot move forward along it.

Crawlers see only the HTML code, much like a screen reader used by a visually impaired visitor. To understand what a crawler sees, disable your browser's support for cookies, JavaScript, and images while viewing the page, or use the text-mode Lynx browser or the Lynx Viewer (see References). If a page displays fully in Lynx, it is likely to be indexable; pages that do not display, or display only partially, are not easily found by search engines.
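
If you have Lynx installed, two quick commands show what a crawler sees (the URL here is hypothetical):

    # Dump the page as plain text, roughly what a crawler can read
    lynx -dump http://www.example.com/products.html

    # Show only the links a crawler could follow from the page
    lynx -dump -listonly http://www.example.com/products.html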

Even if you avoid these troublesome technologies, your pages may still impede crawlers. Crawlers are stricter than browsers about the correctness of HTML code, since browsers must be more forgiving. A page that looks fine in a browser can still block a crawler, leaving it unable to see part of the page or causing it to misread it. You can find these errors with an HTML validation service (see References) or with the Firefox browser.

You must also pay attention to crawlers' limits on page size. Most crawlers index only the first 100,000 characters of a page. That sounds like a lot, but if you embed JavaScript programs and style sheets in the page, or publish an entire user manual as a single PDF file, the limit is reached quickly. Consider splitting the manual into one PDF per chapter and moving all JavaScript and style sheet code into external files.
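
Moving embedded scripts and styles out of the page is a small change in the document head (the file names are hypothetical):

    <head>
      <!-- Instead of inline <style> and <script> blocks, reference
           external files so they do not count against the page size -->
      <link rel="stylesheet" type="text/css" href="/styles/site.css">
      <script type="text/javascript" src="/scripts/site.js"></script>
    </head>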



Welcoming crawlers

Once the crawler paths are clear, make sure crawlers feel welcome. The most obvious advice: make sure the site is up and responding whenever a crawler arrives. Because you never know when a crawler will visit, frequent downtime (scheduled "maintenance windows" included) is risky. If a crawler finds the site down, it may conclude the site is dead and move on to other sites.

A site that responds very slowly is almost as bad as one that is down, because crawlers run on a schedule. Slow sites get fewer pages indexed and are revisited less often, because in the same amount of time the crawler can process more pages elsewhere.

Even if your site is rarely down and responds quickly, a badly written robots instruction can still turn crawlers away. A robots.txt file can block crawlers from particular pages, from directories, or from the entire site, so a mistaken directive can evict crawlers altogether. In addition, each page can carry a robots meta tag that tells crawlers whether to index the page and whether to follow its links (see References).
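
Here is a sketch of both controls, with hypothetical paths. First, a robots.txt at the site root that blocks a single directory; note how close the harmless version is to one that evicts crawlers from the whole site:

    User-agent: *
    # Block only the /internal/ directory
    Disallow: /internal/
    # Danger: "Disallow: /" (just a slash) would block the ENTIRE site

And the per-page robots meta tag, placed in the page's <head>; this example tells crawlers neither to index the page nor to follow its links:

    <meta name="robots" content="noindex,nofollow">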



Retaining crawlers

Even if your site welcomes crawlers today, there is no guarantee they will not abandon it later.

One problem that deters crawlers is long, dynamic URLs. Many dynamic URLs use parameters to select the content to display, such as the French description of product 2372 from the Canadian product catalog. Crawlers are wary of such dynamic sites because the combinations of parameters are nearly infinite, and a crawler does not want to get lost in the site. When a crawler sees a URL with more than 1,000 characters, or with more than two parameters, it will usually skip the page.

If your site has these problematic URLs, consult your Web server's documentation to learn how to change the URL format so crawlers stay happy. For example, Apache can rewrite URLs with its mod_rewrite module (see References); other Web servers have similar features.
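
A minimal mod_rewrite sketch, assuming the module is enabled and a hypothetical catalog.php script: crawlers see a short, parameter-free URL while the server still runs the dynamic page.

    # In an .htaccess file at the document root
    RewriteEngine On

    # Crawler-friendly URL:      /products/ca/fr/2372.html
    # is served internally from: /catalog.php?country=ca&lang=fr&product=2372
    RewriteRule ^products/([a-z]+)/([a-z]+)/([0-9]+)\.html$ /catalog.php?country=$1&lang=$2&product=$3 [L]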

So-called session identifiers can also scare crawlers away. Some programmers add a parameter to the URL to track the current visitor (often recognizable as "id=" followed by a unique alphanumeric code). Crawlers hate this technique because it makes hundreds of thousands of different URLs display the same content. Programmers should store this information in the Web application server's session layer or in a cookie instead. (But as discussed earlier, displaying the page must not require the cookie, or the crawler cannot index it.)

After cleaning up dynamic pages, watch for another practice that can trip crawlers: redirection, a technique that tells browsers and crawlers that the requested URL has changed. For example, if your company changes its name, it may also change the web site's domain name, and redirection can forward all visitors from the old URL to the new one. However, only one redirection method works reliably for crawlers: server-side redirection, also known as a 301 redirect (see References). Other redirection techniques, such as meta refresh redirects and JavaScript redirects, work for browsers, but crawlers cannot follow them, which prevents the redirected pages from being indexed.
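
On Apache, a server-side 301 redirect is a single directive from the standard mod_alias module (the domain names here are hypothetical):

    # In the OLD site's configuration: permanently forward every path
    # to the same path on the new domain, with a 301 status code
    Redirect permanent / http://www.new-example.com/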
