How Search Engines Work

You surf the Web (World Wide Web) every day, but how much do you know about search engines? How do they work, and which ones do you use? This article explains how search engines work.

I. Classification of search engines

Any system that gathers Web page data, builds a database from it, and provides a query interface can be called a search engine. Based on how they work, search engines fall into two basic categories: full-text search engines and directory indexes (catalogs).
At present, full-text search engines and directory indexes are tending to merge and overlap. Some originally pure full-text search engines now also offer directory search; Google, for example, borrows the Open Directory to provide category-based queries. Conversely, older directory indexes such as Yahoo! have expanded their search scope by partnering with Google and other search engines. In the default search mode, some directory-style search engines first return matching sites from their own directories, as domestic portals such as Sohu, Sina, and NetEase do, while others default to Web search, as Yahoo does.

To carry out search engine optimization, you first have to understand how search engines work; only then can you optimize effectively, make your site more search-engine friendly, and earn good rankings.
A search engine consists of three main parts: the spider (crawler) program, the index, and the ranking program. Here we take Google as an example.
Spider Program
Google's spider program is divided into a main spider and secondary spiders. When Google fully updates its database or includes a new site, it sends the main spider to index the site comprehensively (adding new pages, re-evaluating page levels, and so on). For Google's routine daily updates, it sends secondary spiders to maintain already-indexed content: when a page changes, a secondary spider revisits it and crawls the content again. The spider revisits the sites in its index on a fixed cycle, looking for updates; how often it returns is determined by the search engine. Site owners can, however, control which pages the crawler may access by using a file called robots.txt, which the search engine checks before crawling a site any further.
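To make the robots.txt mechanism concrete, here is a minimal Python sketch (using the standard library's urllib.robotparser) of the check a well-behaved spider performs before crawling a page; the domain, path, and user-agent string are placeholders, not any engine's actual code.

    # Minimal sketch: checking whether a crawler may fetch a URL, using the
    # standard-library robots.txt parser. The URL and user agent are examples only.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # downloads and parses the site's robots.txt file

    # A well-behaved spider runs this check before crawling each page.
    allowed = rp.can_fetch("Googlebot", "https://www.example.com/private/page.html")
    print("Googlebot may crawl this page:", allowed)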
Index
The index is like a huge catalog of Web pages: it lists everything the spider program has crawled. According to Google's own figures, the index currently covers some 8 billion pages, and updating it is quite time-consuming, with a typical update cycle of about one month. For a new site this means the spider may already have crawled your pages without them being listed yet; pages are first placed in a supplemental index and only moved into the main index at the next update. During this period Google evaluates the site, and it may temporarily show a fairly good ranking, but this is not its real ranking; only after the next index update does it settle into a true ranking. That explains why a new site can be indexed yet show no ranking, or why a new site may start out ranking very well and then drop or disappear over time.
As for whether the spider has crawled your pages and when it visited your site, see the related article on viewing server logs.
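If you do have access to your server logs, a small script like the sketch below can show whether Google's spider has visited. The log path and the assumption that the crawler's User-Agent string contains "Googlebot" depend on your server's configuration.

    # Minimal sketch: scanning a web server access log for Googlebot visits.
    # The log path and its format are assumptions; adjust them to your server.
    LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path

    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" in line:   # the crawler identifies itself in the User-Agent field
                print(line.rstrip())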
Ranking Program
Google judges each indexed site with its own proprietary program: it classifies and scores every site, analyzes the content of each page, and identifies its keywords. When a user enters a keyword search, results are ranked and displayed according to this pre-analyzed index.
A site's keywords, classification, and ranking are all determined automatically by the program, without any manual intervention. This is meant to keep Google fair and impartial, so that what users see is the most genuine and relevant content.
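The core data structure behind this automatic analysis is an index that maps keywords to the pages containing them. The toy Python sketch below illustrates the idea with an inverted index; real engines of course use far more elaborate structures and signals.

    # Toy illustration: an inverted index maps each keyword to the pages that
    # contain it, so a query can be answered without rescanning every page.
    from collections import defaultdict

    pages = {
        "page1.html": "search engines crawl the web and build an index",
        "page2.html": "a directory index is compiled by human editors",
    }

    inverted_index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            inverted_index[word].add(url)

    # Looking up a keyword returns the set of pages that mention it.
    print(inverted_index["index"])   # {'page1.html', 'page2.html'}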

A full-text search engine builds its database with software called a "network robot" or "spider" (crawler), which automatically follows links across the network to collect large amounts of Web page content and then analyzes and organizes it according to fixed rules. Google and Baidu are typical full-text search engine systems.
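The following is a minimal Python sketch of that crawling loop: fetch a page, extract its links, and follow them. It uses only the standard library, the start URL is a placeholder, and it omits the politeness rules (robots.txt, crawl delays) that a real spider must obey.

    # Minimal sketch of a "spider": fetch pages, extract links, follow them
    # breadth-first up to a small limit.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # unreachable page or malformed URL: skip it
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                full = urljoin(url, link)   # resolve relative links
                if full.startswith("http"):
                    queue.append(full)
        return seen

    print(crawl("https://www.example.com/"))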

A classification directory, by contrast, builds its database from Web site information collected by hand, as with Yahoo! China and the directories of domestic portals such as Sohu, Sina, and NetEase. Some navigation sites on the Internet, such as "Home of Web Sites", can also be counted in this category.

Full-text search engines and classification directories each have their strengths. Because a full-text search engine relies on software, its database can be very large, but its query results are often less precise. A directory relies on manual collection and curation of sites, so it can return more accurate results, but its coverage is very limited. To complement each other, many search engines now offer both kinds of query: the general query is usually labeled "All Sites" or "Whole Web", as with the full-text search of Google and Baidu (www.baidu.com), while the directory query is labeled "Category Directory" or "Classified Sites", as with Sina Search and Yahoo! China Search (http://cn.search.yahoo.com/dirsrch/).

Beyond these two basic types, their combination on the Internet has also produced other search services, which we will also call search engines here. There are two main kinds:

⒈ Meta search engine. Such search engines generally have no network robot or database of their own; their results come from invoking, controlling, and optimizing the results of many other independent search engines, merged and displayed in a unified format on a single interface. Although a meta search engine has no "network robot" or "spider" and no independent index database, it does develop its own technology for submitting queries, proxying search interfaces, and presenting results. An example is the Metafisher meta search engine (http://www.hsfz.net/fish/), which calls and integrates data from Google, Yahoo!, AllTheWeb, Baidu, Openfind, and many other engines.
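A rough sketch of the meta-search idea in Python is shown below: the same query goes to several engines and the result lists are merged into one ranking. The fetch_results function and its sample data are stand-ins; no real engine's API is being called.

    # Hedged sketch of meta search: query several engines, merge the results.
    def fetch_results(engine, query):
        # Placeholder: return (url, score) pairs as if they came from that engine.
        sample = {
            "engine_a": [("http://example.com/1", 0.9), ("http://example.com/2", 0.7)],
            "engine_b": [("http://example.com/2", 0.8), ("http://example.com/3", 0.6)],
        }
        return sample.get(engine, [])

    def meta_search(query, engines):
        merged = {}
        for engine in engines:
            for url, score in fetch_results(engine, query):
                # Simple merge rule: keep the best score seen for each URL.
                merged[url] = max(merged.get(url, 0.0), score)
        return sorted(merged.items(), key=lambda item: item[1], reverse=True)

    print(meta_search("search engine", ["engine_a", "engine_b"]))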

⒉ Integrated search engine (all-in-one search page). An integrated search engine uses Web technology to link a number of independent search engines on one page. The user selects or specifies an engine, enters the query once, and several engines are queried at the same time, with each engine's results displayed on its own page. An example is the "Internet Swiss Army Knife" (http://free.okey.net/%7efree/search1.htm).

Full-Text Search Engine
In the section on search engine classification, we mentioned that a full-text search engine extracts information from Web sites to build its database. Its automatic information collection works in two ways. The first is regular crawling: every so often (for Google, traditionally about every 28 days), the search engine actively sends out its "spider" program to scan Internet sites within a certain range of IP addresses; whenever it finds a new site, it automatically extracts the site's information and URL and adds them to its database.

The other is submitted-site crawling: the site owner actively submits the URL to the search engine, which then sends a "spider" to the site within a certain period (from two days to several months), scans it, and stores the relevant information in its database for users to query. Because search engine indexing rules have changed considerably in recent years, submitting a URL no longer guarantees that your site will enter the database, so the best approach is to obtain some external links, giving search engines more opportunities to discover your site and include it automatically.

When a user searches by keyword, the search engine looks through its database. If it finds sites whose content matches the query, it applies a special algorithm, usually based on factors such as keyword match degree, keyword position and frequency on the page, and link quality, to compute each page's relevance and ranking, and then returns the links to the user in order of relevance.
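As a toy illustration of such ranking factors, the Python sketch below scores pages by keyword frequency and position; the weights are arbitrary examples, not any search engine's real formula.

    # Toy scoring function: combine keyword frequency and position in the text.
    def relevance(page_text, keyword):
        words = page_text.lower().split()
        keyword = keyword.lower()
        if keyword not in words:
            return 0.0
        frequency = words.count(keyword) / len(words)        # how often the keyword appears
        position_bonus = 1.0 / (words.index(keyword) + 1)    # earlier occurrences weigh more
        return frequency + 0.5 * position_bonus

    pages = {
        "a.html": "search engines rank pages by relevance",
        "b.html": "pages about cooking rarely mention search",
    }
    ranked = sorted(pages, key=lambda url: relevance(pages[url], "search"), reverse=True)
    print(ranked)   # a.html should rank above b.html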


Directory Index
A directory index differs from a full-text search engine in several ways.

First, a search engine indexes sites automatically, whereas a directory index depends entirely on manual work. After a user submits a site, directory editors personally visit and review it, and then decide whether to accept it based on a set of evaluation criteria, and even on the editors' subjective impressions.

Second, when a search engine includes a site, it will generally be listed successfully as long as the site itself does not violate the relevant rules. A directory index sets a much higher bar, and sometimes a site may not be accepted even after repeated submissions. Premium directories such as Yahoo! are especially hard to get into. (Because Yahoo! is the hardest to get listed in, and it is a key battleground for online marketing, we devote a separate section later to techniques for getting listed in Yahoo!.)

In addition, when registering with a search engine we generally do not have to think about how the site is classified, whereas when submitting to a directory index the site must be placed in the most appropriate category.

Finally, a search engine extracts the information about each site automatically from its pages, so from the user's point of view we have more autonomy, whereas a directory index requires the site information to be filled in manually and imposes various restrictions on it. What's more, if the staff decide that the category or site description you submitted is inappropriate, they can adjust it at any time, and of course they will not discuss it with you beforehand.

A directory index, as the name implies, stores sites under corresponding categories, so when looking for information the user can either search by keyword or browse down through the hierarchy of categories. With keyword search, the returned results look like a search engine's, ranked by relevance, but with a larger human factor. When browsing the hierarchy, the ranking of sites within a category is determined by the order of their titles (with exceptions).
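The toy Python sketch below illustrates this dual access path: the same hand-edited directory can be browsed by category or searched by keyword over listing titles. The categories and sites are made-up examples.

    # Toy illustration of a hand-edited directory: browse by category or
    # search listing titles by keyword.
    directory = {
        "Computers": {
            "Search Engines": ["Google", "Baidu"],
            "Programming": ["Python.org"],
        },
        "News": {
            "Portals": ["Sina", "Sohu", "NetEase"],
        },
    }

    def browse(path):
        node = directory
        for category in path:
            node = node[category]
        return node

    def keyword_search(keyword):
        hits = []
        for subcategories in directory.values():
            for sites in subcategories.values():
                hits.extend(s for s in sites if keyword.lower() in s.lower())
        return hits

    print(browse(["Computers", "Search Engines"]))   # ['Google', 'Baidu']
    print(keyword_search("so"))                      # ['Sohu']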

  
