How to talk to search engine crawlers


Crawl strategy: Deciding clearly which pages need to be downloaded, which do not, and which should be downloaded first can save a lot of unnecessary crawling.
Update policy: Monitor list pages to discover new pages, periodically check pages for expiration, and so on.
Extraction policy: How do we extract what we want from a web page, meaning not only the final target content but also the next URLs to crawl?
Crawl frequency: We need to download a website at a reasonable rate, but without losing efficiency.
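To make the crawl strategy idea concrete from the crawler's side, here is a minimal sketch that drops URLs it should not download and fetches priority URLs first. The URL patterns, the function names and the two priority levels are hypothetical and chosen only for illustration; a real crawler would have its own rules.

```python
import heapq
import re

# Hypothetical rules: which URLs to skip and which to fetch first.
SKIP_PATTERNS = [re.compile(r"/login"), re.compile(r"\.(jpg|png|zip)$")]
PRIORITY_PATTERNS = [re.compile(r"/list/"), re.compile(r"/news/")]


def classify(url):
    """Return None to skip a URL, otherwise a priority (lower is fetched sooner)."""
    if any(p.search(url) for p in SKIP_PATTERNS):
        return None
    if any(p.search(url) for p in PRIORITY_PATTERNS):
        return 0
    return 1


def crawl_order(urls):
    """Return the URLs worth downloading, priority pages first."""
    frontier = []
    for order, url in enumerate(urls):
        priority = classify(url)
        if priority is not None:
            # 'order' keeps discovery order among URLs of equal priority.
            heapq.heappush(frontier, (priority, order, url))
    ordered = []
    while frontier:
        ordered.append(heapq.heappop(frontier)[2])
    return ordered
```

The points below are about influencing this kind of decision from the site owner's side.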

Let me think about the topic "How to talk to a crawler," which is mainly about catering to the "crawl strategy" mentioned above.

1, Talking to the crawler through robots.txt: When a search engine discovers a new site, the first thing it requests is, in principle, the robots.txt file. You can use the Allow/Disallow syntax to tell the search engine which files and directories may be crawled and which may not.

For a detailed introduction to robots.txt, see "About /robots.txt". Also note that the order in which the Allow and Disallow rules appear can make a difference.
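The effect of rule order can be tried out with Python's standard-library urllib.robotparser. This is only a sketch under assumptions: the rules and paths are made up, and the CPython parser, as far as I know, applies the first rule that matches. Other crawlers may resolve conflicts differently; Google, for example, documents that it uses the most specific matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical robots.txt files with the same rules in a different order.
ALLOW_FIRST = [
    "User-agent: *",
    "Allow: /private/public-page.html",
    "Disallow: /private/",
]
DISALLOW_FIRST = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /private/public-page.html",
]


def is_allowed(robots_lines, path, agent="*"):
    """Parse robots.txt lines and ask whether 'agent' may fetch 'path'."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    page = "/private/public-page.html"
    print("Allow first:   ", is_allowed(ALLOW_FIRST, page))
    print("Disallow first:", is_allowed(DISALLOW_FIRST, page))
```

With a first-match parser, putting the narrower Allow line before the broader Disallow line is the safer order.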

2, Talking to the crawler through the META tag: For example, sometimes we do not want a site's list pages to be indexed by search engines but still want them to be crawled. In that case you can tell the crawler so with a robots META tag. Other common values include NOARCHIVE, NOSNIPPET, NOODP, and so on.
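For the "do not index, but do crawl" case, the usual tag is, I assume, something like <meta name="robots" content="noindex,follow">; the exact tag in the original article is not preserved, so treat this as an assumption. The sketch below shows how a crawler might read such a tag with Python's html.parser. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated and case insensitive.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html):
    """True unless the page asks not to be indexed."""
    return "noindex" not in robots_directives(html)


def may_follow(html):
    """True unless the page asks that its links not be followed."""
    return "nofollow" not in robots_directives(html)
```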

3, Talking to the crawler through rel="nofollow": Guo Ping recently wrote an article about rel="nofollow", "How to use nofollow well", which is well worth reading. I believe you will find it very inspiring.
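As a small illustration of what rel="nofollow" means to a crawler, the sketch below separates the links on a page into those it may follow and those marked nofollow. It uses Python's html.parser; the class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Split <a href> links into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens, e.g. rel="nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollowed
```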

4, Talking to the crawler through rel="canonical": The Google Webmaster Tools help has a very detailed introduction to rel="canonical", titled "In-depth understanding of rel=canonical".
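To show what the crawler does with this hint, here is a sketch that reads a page's <link rel="canonical"> and resolves it against the page URL, so that several URL variants of the same page map to one address. The names and the example URLs are hypothetical, and real search engines treat the canonical link as a hint rather than a command.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(page_url, html):
    """Return the canonical URL declared by the page, or page_url if none."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url
    # The href may be relative, so resolve it against the page URL.
    return urljoin(page_url, parser.canonical)
```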

5, Talking to the crawler through a sitemap: The most common formats are the XML sitemap and the HTML sitemap. An XML sitemap can be split into several files or compressed. In addition, the sitemap address can be written into the robots.txt file.
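Splitting, compressing and announcing an XML sitemap can be sketched as follows. The example follows the sitemaps.org format (at most 50,000 URLs per file, gzip compression allowed, a "Sitemap:" line in robots.txt); the function names and example URLs are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit in the sitemaps.org protocol
XML_HEADER = b'<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemap(urls):
    """Return one XML sitemap (bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return XML_HEADER + ET.tostring(urlset, encoding="utf-8")


def split_sitemaps(urls, chunk_size=MAX_URLS_PER_FILE):
    """Split a long URL list into several sitemap files."""
    return [
        build_sitemap(urls[i:i + chunk_size])
        for i in range(0, len(urls), chunk_size)
    ]


def build_index(sitemap_urls):
    """Return a sitemap index (bytes) pointing at the split sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return XML_HEADER + ET.tostring(index, encoding="utf-8")


def compress(data):
    """Gzip a sitemap so it can be served as sitemap.xml.gz."""
    return gzip.compress(data)


def robots_sitemap_line(sitemap_url):
    """Line to add to robots.txt so crawlers can find the sitemap."""
    return f"Sitemap: {sitemap_url}"
```

A site with a handful of pages needs only one uncompressed file; splitting and the index file matter once the URL count passes the per-file limit.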

6, Talking to search engines through webmaster tools: The one we use most is Google Webmaster Tools, where you can set Googlebot's crawl rate, block links you do not want crawled, control sitelinks, and so on. Bing and Yahoo also have webmaster tools, and Baidu has the Baidu Webmaster Platform, which has been in beta for more than a year and cannot be registered for without an invitation code.

In addition, this leads to a concept I have long paid close attention to: the site inclusion ratio. The inclusion ratio is the number of pages a search engine has indexed divided by the actual number of pages on the site. The higher the ratio, the more smoothly the search engine is crawling the site.
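The ratio is simple arithmetic, sketched here with made-up numbers: a site with 10,000 real pages of which a search engine has indexed 7,500 has an inclusion ratio of 0.75. The function name and the figures are hypothetical.

```python
def inclusion_ratio(indexed_pages, actual_pages):
    """Pages indexed by the search engine divided by pages that really exist."""
    if actual_pages <= 0:
        raise ValueError("actual_pages must be positive")
    return indexed_pages / actual_pages


if __name__ == "__main__":
    # Hypothetical figures: 7,500 indexed pages out of 10,000 real pages.
    print(f"{inclusion_ratio(7500, 10000):.0%}")
```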

For the time being, the goal is to explore how to effectively improve a site's inclusion in search engines.

This is only meant to get the discussion started; you are welcome to add to it!

Note:

A web crawler (also called a web spider) is a computer program that crawls and downloads web pages from the Internet according to a certain logic and algorithm. It is an important part of a search engine.

Author of this article: Bruce.



