How does a search engine work? - Search engine technology

Source: Internet
Author: User
I often meet people who do not know how a search engine collects its information. They know what a search engine is, and they understand the importance of a search engine's index, but the moment the conversation turns to something like a crawler, they get stuck.

No more awkward moments. This article aims to clear up whatever you are unsure about when it comes to search engines. After all, if you want your site listed in a search engine, you had better understand how one works.
Three parts working as one
A crawler-based search engine consists of three parts: the crawler, the index, and the ranking software. Each part has its own job, and together the three produce the content we see on search engine results pages (SERPs).
The hungry crawler
Also known as a web spider or robot, a search engine crawler is an automated program that reads web pages and follows any links to other pages on a site. We often refer to this as the site being "crawled" or "indexed". Three very hungry, very active crawlers roam the web: Googlebot (Google), Slurp (Yahoo!), and MSNBot (MSN Search).
Crawlers begin their journey across the web with a list of page URLs that have previously been added to their index (database). As they visit those pages, they copy the code and content, and any new pages (links) they discover are added to the index as well. You can think of this process as the pages being indexed; we will come back to it later.
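The discovery step described above can be sketched in a few lines of Python. This is a toy illustration, not how any real crawler is built: it only shows how links are pulled out of a fetched page and queued for later visits. The URLs and the sample page are made up.

```python
import html.parser

class LinkExtractor(html.parser.HTMLParser):
    """Collects the href of every <a> tag -- the way a crawler
    discovers new URLs on a page it has just fetched."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(page_html):
    parser = LinkExtractor()
    parser.feed(page_html)
    return parser.links

# The crawler keeps a "frontier" of URLs still to visit, seeded
# from its existing index; every page it fetches contributes new
# links back to that frontier.
frontier = ["http://example.com/"]  # hypothetical seed URL
page = '<html><body><a href="/about">About</a> <a href="/blog">Blog</a></body></html>'
for url in extract_links(page):
    frontier.append(url)
```

In a real crawler the frontier would be deduplicated and prioritized, and each URL would actually be fetched over the network before its links are extracted.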
The crawler returns to the sites in its index on a regular cycle to look for any updates. How often it comes back is up to the search engine, but site owners can use a file named "robots.txt" to control which parts of the site crawlers may visit (and, for some engines, how often). A search engine checks this file before crawling a site.
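Python's standard library can parse robots.txt rules directly, which makes it easy to see how a crawler interprets the file. The rules below are a made-up example; a real robots.txt always lives at the site root (e.g. http://example.com/robots.txt).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is kept out of /private/
# and asked to wait 10 seconds between requests; everyone else
# may crawl everything.
rules = """
User-agent: Googlebot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))         # True
print(parser.crawl_delay("Googlebot"))                                        # 10
```

A well-behaved crawler calls something like `can_fetch` before every request and skips any URL the site owner has disallowed.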
The ever-growing index
The index is like a huge catalogue, holding a copy of every web page and file the crawler finds. When a page changes, the catalogue is updated with the new information. To give a sense of how large these indexes are, the most recent figure published by Google is 8 billion web pages.
It can take a while for the crawler to find new pages and changes and add them to the index, so a page may have been crawled but not yet indexed. Until a page has been added to the index, it cannot be found through the search engine.
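The relationship between pages and the searchable index can be illustrated with a toy inverted index, the data structure most search engines build on: each term maps to the set of pages containing it, and only pages that have been added can be found. The page names here are invented for illustration.

```python
from collections import defaultdict

# Toy inverted index: term -> set of pages containing that term.
index = defaultdict(set)

def add_page(url, text):
    """Add a crawled page to the index, term by term."""
    for term in text.lower().split():
        index[term].add(url)

def search(term):
    """Return every indexed page containing the term."""
    return sorted(index.get(term.lower(), set()))

add_page("a.html", "search engines crawl the web")
add_page("b.html", "the web is big")

print(search("web"))  # both pages mention "web"
```

A page that has been crawled but not yet passed through `add_page` simply does not appear in any result, which is exactly the crawled-but-not-indexed gap described above.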
