Abstract: A search engine stores copies of Web pages from the Internet on its own servers. When a user searches for a term, the engine looks for relevant content on its own servers, so only pages that have been saved there can be found.
A search engine stores copies of Web pages from the Internet on its own servers. When a user searches for a term, the engine looks up relevant content on its own servers, which means that only pages saved on the search engine's servers can appear in results. Which pages get saved there? Only the pages fetched by the search engine's Web crawler, also known as the search engine spider. The whole process is divided into crawling and fetching.
First, the spider
The program a search engine uses to crawl and visit Web pages is called a spider, or a robot. A spider visits pages much as we do with a browser: it requests a page and, if access is allowed, fetches it. One difference is that, to improve coverage and speed, a search engine runs many spiders in parallel to crawl pages.
When a spider visits any site, it first accesses the robots.txt file in the root directory of the site. If the robots.txt file prohibits search engines from crawling certain files or directories, spiders will comply with the protocol and not crawl the banned URLs.
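The robots.txt check described above can be sketched with Python's standard-library `urllib.robotparser`. The rules and URLs below are hypothetical; a real spider would fetch robots.txt over HTTP, while here it is parsed from a string so the sketch is self-contained.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (hypothetical rules) that a spider might fetch
# from a site's root directory before crawling the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The spider checks each URL against the rules before fetching it.
print(parser.can_fetch("MySpider", "https://example.com/index.html"))        # True
print(parser.can_fetch("MySpider", "https://example.com/private/data.html")) # False
```

A compliant spider simply skips any URL for which `can_fetch` returns `False`.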
Like browsers, search engine spiders identify themselves with a user-agent name. Webmasters can find these agent names in the server's log files and so work out which search engine spiders have visited.
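Identifying spiders in a log file amounts to matching the quoted user-agent field against known agent names. The log lines below are made-up examples in the common Apache/Nginx combined format; the agent-name list is a small illustrative sample, not an exhaustive one.

```python
import re

# Hypothetical access-log lines; the last quoted field is the user agent.
log_lines = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.168.1.7 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"',
]

# Agent-name substrings of some well-known spiders (sample list).
SPIDER_AGENTS = ("Googlebot", "bingbot", "Baiduspider", "YandexBot")

results = []
for line in log_lines:
    agent = re.findall(r'"([^"]*)"', line)[-1]   # last quoted field
    results.append(any(name in agent for name in SPIDER_AGENTS))

print(results)  # [True, False]
```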
Second, following links
In order to crawl as many pages of the Web as possible, a search engine spider follows the links on each page, moving from one page to the next like a spider crawling across a web.
The whole Internet is made up of Web sites and pages linked to one another. Because this link structure is extremely complex, spiders need a crawling strategy in order to traverse all of the pages on the Web.
The two simplest crawling strategies are depth-first and breadth-first.
1. Depth-first
Depth-first means that when the spider finds a link, it follows that link forward, then a link on the next page, and so on, until no further links remain; only then does it return to the earlier page and follow its next link forward.
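Depth-first crawling can be sketched with a stack. The link graph below is a hypothetical in-memory map of page to outgoing links, standing in for real HTTP fetches and HTML parsing, so the traversal order is easy to follow.

```python
# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def crawl_depth_first(start):
    visited, order = set(), []
    stack = [start]
    while stack:
        page = stack.pop()           # follow the most recently found link first
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # push links in reverse so the first link on the page is crawled next
        stack.extend(reversed(LINKS[page]))
    return order

print(crawl_depth_first("A"))  # ['A', 'B', 'D', 'C']
```

Starting at A, the spider follows B forward to D until no links remain, and only then returns for C, exactly as described above.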
2. Breadth-first
From an SEO perspective, breadth-first means that when the spider finds multiple links on a page, it does not follow any single link all the way forward. Instead, it first crawls every first-level link on the page, then follows the links found on those second-level pages to reach a third level, and so on.
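Breadth-first crawling uses a queue instead of a stack: every link on the current page is fetched before moving a level deeper. The same hypothetical link graph is reused here for comparison.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def crawl_breadth_first(start):
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        page = queue.popleft()       # oldest discovered page first
        order.append(page)
        for link in LINKS[page]:
            if link not in visited:  # record each page only once
                visited.add(link)
                queue.append(link)
    return order

print(crawl_breadth_first("A"))  # ['A', 'B', 'C', 'D']
```

Here both of A's links (B and C) are crawled before D, the page one level deeper, in contrast to the depth-first order.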
Theoretically, given enough time, either depth-first or breadth-first would let the spider crawl the entire Internet. In practice, nothing is infinite: the spider's bandwidth and time are both limited, so it cannot crawl every page. In fact, even the largest search engines crawl and index only a small part of the Internet.
3. Attracting spiders
Since spiders cannot crawl every page, they crawl only the pages that matter most. Which pages, then, are considered important?
(1) Site and page weight
(2) How frequently the page is updated
(3) Incoming links
(4) Click distance from the home page
4. Address Library
The search engine builds an address library to avoid crawling pages too often or repeatedly. It records pages that have been discovered but not yet crawled, as well as pages that have already been crawled.
The URLs in the address library have several sources:
(1) Seed sites entered manually.
(2) When spiders crawl a page, they parse new link URLs out of the HTML and compare them against the address library; any URL not already in the library is saved to the to-be-visited list.
(3) Forms provided by the search engine that let webmasters submit their site URLs.
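The address library described above can be sketched as a small class combining a to-crawl queue with a set of known URLs, so no page is fetched twice. The class name, URLs, and methods are illustrative, not part of any real crawler.

```python
from collections import deque

class AddressLibrary:
    """Toy address library: tracks URLs discovered but not yet crawled,
    and URLs already crawled, so nothing is fetched twice."""

    def __init__(self, seeds):
        self.to_crawl = deque(seeds)   # source (1): manually entered seed sites
        self.known = set(seeds)
        self.crawled = set()

    def add(self, url):
        # sources (2) and (3): links parsed from HTML, or webmaster submissions;
        # only URLs not already in the library are queued.
        if url not in self.known:
            self.known.add(url)
            self.to_crawl.append(url)

    def next_url(self):
        url = self.to_crawl.popleft()
        self.crawled.add(url)
        return url

lib = AddressLibrary(["https://example.com/"])
lib.add("https://example.com/page1")
lib.add("https://example.com/page1")   # duplicate: ignored
print(lib.next_url())                  # https://example.com/
print(len(lib.to_crawl))               # 1
```

The `known` set is what lets the library reject URLs it has already seen, which is exactly the comparison step described in source (2).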
That covers the basics of how search engines crawl the Web. It only scratches the surface of real search engine technology, but it is enough for SEO practitioners.