Abstract: A search engine stores copies of Web pages from the Internet on its own servers. When a user searches for a term, the engine looks for relevant content on its own servers, so only pages that have been saved there can be found.
A search engine stores copies of Web pages from the Internet on its own servers. When a user searches for a term, the engine looks up relevant content on its own servers, which means that only pages saved on the search engine's servers can appear in results. Which pages get saved there? Only the pages fetched by the search engine's Web crawler, also known as the search engine spider. The whole process is divided into crawling and fetching.
First, the spider
The program a search engine uses to crawl and visit Web pages is called a spider, or a robot. A spider visits pages much as we do with a browser: it requests a page and, if access is allowed, fetches it. One difference is that, to improve coverage and speed, a search engine runs many spiders in parallel to crawl pages.
When a spider visits any site, it first accesses the robots.txt file in the root directory of the site. If the robots.txt file prohibits search engines from crawling certain files or directories, spiders will comply with the protocol and not crawl the banned URLs.
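The robots.txt check described above can be sketched with Python's standard-library `urllib.robotparser`. The rules and URLs below are hypothetical; a real spider would fetch robots.txt over HTTP, while here it is parsed from a string so the sketch is self-contained.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (hypothetical rules) that a spider might fetch
# from a site's root directory before crawling the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The spider checks each URL against the rules before fetching it.
print(parser.can_fetch("MySpider", "https://example.com/index.html"))        # True
print(parser.can_fetch("MySpider", "https://example.com/private/data.html")) # False
```

A compliant spider simply skips any URL for which `can_fetch` returns `False`.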
Like browsers, search engine spiders identify themselves with a user-agent name. Webmasters can find these agent names in the server's log files and so work out which search engine spiders have visited.
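Identifying spiders in a log file amounts to matching the quoted user-agent field against known agent names. The log lines below are made-up examples in the common Apache/Nginx combined format; the agent-name list is a small illustrative sample, not an exhaustive one.

```python
import re

# Hypothetical access-log lines; the last quoted field is the user agent.
log_lines = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.168.1.7 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"',
]

# Agent-name substrings of some well-known spiders (sample list).
SPIDER_AGENTS = ("Googlebot", "bingbot", "Baiduspider", "YandexBot")

results = []
for line in log_lines:
    agent = re.findall(r'"([^"]*)"', line)[-1]   # last quoted field
    results.append(any(name in agent for name in SPIDER_AGENTS))

print(results)  # [True, False]
```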
Second, following links
In order to crawl as many pages of the Web as possible, a search engine spider follows the links on each page, moving from one page to the next like a spider crawling across a web.
The whole Internet is made up of Web sites and pages linked to one another. Because this link structure is extremely complex, spiders need a crawling strategy in order to traverse all of the pages on the Web.
The two simplest crawling strategies are depth-first and breadth-first.
1. Depth-first
Depth-first means that when the spider finds a link, it follows that link forward, then a link on the next page, and so on, until no further links remain; only then does it return to the earlier page and follow its next link forward.
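Depth-first crawling can be sketched with a stack. The link graph below is a hypothetical in-memory map of page to outgoing links, standing in for real HTTP fetches and HTML parsing, so the traversal order is easy to follow.

```python
# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def crawl_depth_first(start):
    visited, order = set(), []
    stack = [start]
    while stack:
        page = stack.pop()           # follow the most recently found link first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # push links in reverse so the first link on the page is crawled next
        stack.extend(reversed(LINKS[page]))
    return order

print(crawl_depth_first("A"))  # ['A', 'B', 'D', 'C']
```

Starting at A, the spider follows B forward to D until no links remain, and only then returns for C, exactly as described above.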
2. Breadth-first
From an SEO perspective, breadth-first means that when the spider finds multiple links on a page, it does not follow any single link all the way forward. Instead, it first crawls every first-level link on the page, then follows the links found on those second-level pages to reach a third level, and so on.
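Breadth-first crawling uses a queue instead of a stack: every link on the current page is fetched before moving a level deeper. The same hypothetical link graph is reused here for comparison.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def crawl_breadth_first(start):
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        page = queue.popleft()       # oldest discovered page first
        order.append(page)
        for link in LINKS[page]:
            if link not in visited:  # record each page only once
                visited.add(link)
                queue.append(link)
    return order

print(crawl_breadth_first("A"))  # ['A', 'B', 'C', 'D']
```

Here both of A's links (B and C) are crawled before D, the page one level deeper, in contrast to the depth-first order.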
Theoretically, given enough time, either depth-first or breadth-first would let the spider crawl the entire Internet. In practice, nothing is infinite: the spider's bandwidth and time are both limited, so it cannot crawl every page. In fact, even the largest search engines crawl and index only a small part of the Internet.
3. Attracting spiders
Since spiders cannot crawl every page, they crawl only the pages that matter most. Which pages, then, are considered important?
(1) Site and page weight
(2) How frequently the page is updated
(3) Incoming links
(4) Click distance from the home page
4. Address Library
The search engine builds an address library to avoid crawling pages too often or repeatedly. It records pages that have been discovered but not yet crawled, as well as pages that have already been crawled.
The URLs in the address library have several sources:
(1) Seed sites entered manually.
(2) When spiders crawl a page, they parse new link URLs out of the HTML and compare them against the address library; any URL not already in the library is saved to the to-be-visited list.
(3) Forms provided by the search engine that let webmasters submit their site URLs.
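The address library described above can be sketched as a small class combining a to-crawl queue with a set of known URLs, so no page is fetched twice. The class name, URLs, and methods are illustrative, not part of any real crawler.

```python
from collections import deque

class AddressLibrary:
    """Toy address library: tracks URLs discovered but not yet crawled,
    and URLs already crawled, so nothing is fetched twice."""

    def __init__(self, seeds):
        self.to_crawl = deque(seeds)   # source (1): manually entered seed sites
        self.known = set(seeds)
        self.crawled = set()

    def add(self, url):
        # sources (2) and (3): links parsed from HTML, or webmaster submissions;
        # only URLs not already in the library are queued.
        if url not in self.known:
            self.known.add(url)
            self.to_crawl.append(url)

    def next_url(self):
        url = self.to_crawl.popleft()
        self.crawled.add(url)
        return url

lib = AddressLibrary(["https://example.com/"])
lib.add("https://example.com/page1")
lib.add("https://example.com/page1")   # duplicate: ignored
print(lib.next_url())                  # https://example.com/
print(len(lib.to_crawl))               # 1
```

The `known` set is what lets the library reject URLs it has already seen, which is exactly the comparison step described in source (2).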
That covers the basics of how search engines crawl the Web. It only scratches the surface of real search engine technology, but it is enough for SEO practitioners.