Wuhan SEO: An Analysis of How Search Engine Spiders Work


Wuhan SEO would like to talk today about how search engine spiders work. Let's start with the basic principle of a search engine: it saves copies of web pages from the Internet onto its own servers, and when a user searches for a term, the engine looks for relevant content on those servers. In other words, only pages that have been saved on the search engine's servers can be found in a search. Which pages get saved there? Only the pages captured by the search engine's web crawler, which is what we call the search engine spider. The whole process is divided into crawling and fetching.

 First, the Spider

The program a search engine uses to crawl and visit web pages is called a spider, or sometimes a robot. A spider visits pages much as we do with a browser: it requests a page and, if access is allowed, downloads it. The difference is that, to improve coverage and speed, a search engine runs many spiders in parallel to crawl and fetch pages.

When a spider visits a site, it first requests the robots.txt file in the root directory of the website. If robots.txt prohibits search engines from crawling certain files or directories, a well-behaved spider will comply with the protocol and not crawl the banned URLs.
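The robots.txt check described above can be sketched with Python's standard-library parser. The rules and URLs below are hypothetical, for illustration only:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: everything is allowed except /private/.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# A compliant spider consults the parser before fetching each URL.
print(parser.can_fetch("MySpider", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("MySpider", "https://example.com/private/data.html"))  # False
```

A real spider would first download robots.txt from the site's root directory (for example with `parser.set_url(...)` and `parser.read()`) rather than parse a hard-coded string.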

Like browsers, search engine spiders identify themselves with an agent name. Webmasters can find these agent names in the server's log files and use them to tell which search engine spiders have visited the site.
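Spotting a spider in a log file amounts to matching known agent names against the user-agent field. A minimal sketch, assuming a log line in the common Apache combined format and a small, non-exhaustive list of spider names:

```python
# Hypothetical access-log line (Apache combined format) from a Googlebot visit.
log_line = ('66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
            '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
            '+http://www.google.com/bot.html)"')

# A small sample of well-known spider agent names (not exhaustive).
SPIDER_AGENTS = ["Googlebot", "Baiduspider", "bingbot"]

def detect_spider(line):
    """Return the spider's name if the line's user agent matches a known spider."""
    for name in SPIDER_AGENTS:
        if name.lower() in line.lower():
            return name
    return None

print(detect_spider(log_line))  # Googlebot
```

Note that user agents can be spoofed; search engines publish ways (such as reverse DNS lookups) to verify that a visit really came from their spiders.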

  Second, Tracking Links

In order to crawl as many pages as possible, search engine spiders follow the links on each page, moving from one page to the next, just like a spider crawling across a web.

The whole Internet is made up of websites and pages linked to one another. Because the link structure of sites and pages is extremely complex, spiders need a crawling strategy to traverse all the pages on the web.

The two simplest crawling strategies are depth-first and breadth-first.

1. Depth-First

Depth-first means that when the spider finds a link, it follows that link forward, then the next link on the resulting page, and so on until there are no further links. Only then does it return to the first page and follow the next link forward.
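Depth-first crawling can be sketched on a toy link graph (the URLs below are made up for illustration). The spider follows one chain of links to its end before backtracking:

```python
# A hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "/home": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl_depth_first(start):
    """Visit pages depth-first: follow each chain of links to its end first."""
    visited, stack = [], [start]
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.append(page)
        # Push links in reverse so the page's first link is crawled first.
        stack.extend(reversed(LINKS.get(page, [])))
    return visited

print(crawl_depth_first("/home"))  # ['/home', '/a', '/a1', '/a2', '/b', '/b1']
```

Notice that the spider exhausts everything under /a before it ever touches /b.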

2. Breadth-First

Breadth-first, from an SEO perspective, means that when the spider finds multiple links on a page, it does not follow a single link all the way forward. Instead, it first crawls every link on the page (the first layer), then follows the links found on those pages to reach the second layer, and so on, layer by layer.
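Breadth-first crawling can be sketched on the same kind of toy link graph (hypothetical URLs); the only change from depth-first is swapping the stack for a queue, so each layer is finished before the next begins:

```python
from collections import deque

# A hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "/home": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl_breadth_first(start):
    """Visit pages layer by layer: all links on a page before going deeper."""
    visited, queue = [], deque([start])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.append(page)
        queue.extend(LINKS.get(page, []))
    return visited

print(crawl_breadth_first("/home"))  # ['/home', '/a', '/b', '/a1', '/a2', '/b1']
```

Here both first-layer links, /a and /b, are crawled before any second-layer page.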

In theory, given enough time, either depth-first or breadth-first would let a spider crawl the entire Internet. In practice, nothing is infinite: a spider's bandwidth and time are limited, so it cannot crawl every page. In fact, even the largest search engines crawl and index only a small fraction of the Internet.

3. Attracting Spiders

Since spiders cannot crawl every page, they crawl only the important ones. Which pages count as important? Several factors matter:

(1) Website and page weight

(2) Page update frequency

(3) Inbound links

(4) Click distance from the home page

4. The Address Library

The search engine builds an address library, which is a good way to avoid crawling too much or crawling the same pages repeatedly. It records both pages that have been discovered but not yet crawled and pages that have already been crawled.

The URLs in the address library have several sources:

(1) Seed websites entered manually.

(2) Links the spider parses out of the HTML of crawled pages. Each new URL is compared against the data already in the address library; if it is not there, it is saved to the library of addresses to visit.

(3) URLs that webmasters submit through forms the search engine provides for that purpose.
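The address library and source (2) above can be sketched together: a "crawled" set, a frontier queue seeded manually, and a parser that feeds genuinely new links into the frontier. This is an assumed minimal structure for illustration, not how any real engine stores its data:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

crawled = set()                 # pages already crawled
frontier = deque(["/seed"])     # discovered but not yet crawled; seeded manually

def record_page(url, html):
    """Mark url as crawled and queue any genuinely new links found in html."""
    crawled.add(url)
    extractor = LinkExtractor()
    extractor.feed(html)
    for link in extractor.links:
        # Only URLs in neither set are added, avoiding repeated crawling.
        if link not in crawled and link not in frontier:
            frontier.append(link)

# Crawl the seed page; its hypothetical HTML links to a new page and to itself.
url = frontier.popleft()
record_page(url, '<a href="/new">new</a> <a href="/seed">self</a>')
print(list(frontier))  # ['/new'] -- the self-link was filtered out
```

The duplicate check is what keeps the spider from looping forever on pages that link back to each other.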

That is about it for search engines. Although this only scratches the surface of real search engine technology, it is enough for SEO practitioners. Original address: http://www.yidunseo.com/blog/gzfs.html. This is the second article on this Wuhan SEO training trainee blog; understanding this much should make optimizing our own sites that much easier!


