Some common concepts about search engines

Source: Internet
Author: User

A search engine usually refers to a full-text search engine that collects tens of millions to billions of web pages on the Internet and indexes every word (that is, every keyword) on those pages to build an index database. When a user searches for a keyword, every web page whose content contains that keyword is retrieved as a search result. A complex ranking algorithm then orders these results by their relevance to the search keyword, and how well a page is optimized influences where it ends up in that ranking.
In the background, a search engine runs programs that collect web page information. The collected information is generally keywords or phrases that characterize the site's content (including the web page itself, the URL of the page, the code that makes up the page, and the links into and out of the page). An index of this information is then stored in the database.

The system architecture and operating mode of a search engine draw heavily on experience from the design of information retrieval systems, with many adaptations for the characteristics of World Wide Web data and users. The core document-processing and query-processing flows are similar to those of a traditional information retrieval system, but the complexity of the data objects a search engine handles forces it to adjust its system structure to meet the demands of both the data and the user queries.

The working principle of a search engine is actually very simple. If you intend to become a qualified SEO engineer, understanding this principle is the foundation.

A search engine is roughly divided into four parts: the first is the spider, the second is the data analysis system, the third is the index system, and the fourth is the query system; of course, these are only the four basic parts.
Each part involves many technical points, which will be analyzed one by one as needed later.

Baidu Encyclopedia notes that the core data structure of a search engine is the inverted file, also called the inverted index (to be analyzed later). An inverted index lets you find records by a non-primary attribute value (also called a secondary key), and files organized this way are called inverted files, that is, secondary indexes. An inverted file contains all the secondary key values and, for each of them, lists the primary key values of all related records. This structure is mainly used for complex queries. Unlike traditional SQL queries, the preprocessing stage of the data a search engine collects requires an efficient data structure to support external retrieval services, and the most effective such structure is the inverted file. An inverted file can be defined as a structure that uses the keywords of documents as the index and the documents themselves as the index targets (similar to an ordinary book, where the index entries are keywords and the pages of the book are the index targets).
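To make the idea concrete, here is a minimal sketch in Python of an inverted index over a toy in-memory corpus; the documents, keywords, and variable names are invented for illustration and do not come from any particular engine. Document IDs play the role of primary keys and keywords act as the secondary keys.

```python
from collections import defaultdict

# A tiny made-up corpus: document ID (primary key) -> document text
documents = {
    1: "search engines build an inverted index of web pages",
    2: "an inverted index maps keywords to the pages that contain them",
    3: "web spiders collect pages so the index can be built",
}

# keyword (secondary key) -> set of document IDs that contain it
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# Looking up a keyword returns every document that contains it,
# without scanning the documents themselves.
print(sorted(inverted_index["inverted"]))   # -> [1, 2]
```

The point of the structure is that a query never scans the documents: it jumps straight from the keyword to the list of matching document IDs.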

A full-text search engine's "web robot" or "web spider" is a piece of software that traverses the web: it can scan websites within a certain range of IP addresses and, by following links from one page to another, collect web page information. To keep the collected data up to date, it periodically returns to pages it has already crawled. Pages collected by web robots or spiders must then be analyzed by other programs: a large amount of computation based on some relevance algorithm builds the web page index before the pages can be added to the index database. The full-text search engine we usually see is actually just the retrieval interface of the search engine system. When you enter a keyword, the search engine finds in its large database the index entries of all web pages that match the keyword and presents them according to certain ranking rules. Different search engines have different web index databases and different ranking rules, so querying different engines with the same keyword yields different results.

Like full-text search engines, the whole working process of a classified directory is also divided into three parts: collecting information, analyzing information, and answering queries; however, for classified directories the collection and analysis are done mainly by hand. A classified directory generally has dedicated editors responsible for collecting website information. As the number of included sites grows, website administrators usually submit their site information to the directory, and the directory's editors review each submission to decide whether to include the site. If it is approved, the editors also analyze the site's content and place it in the appropriate category and directory. All these sites are likewise stored in an "index database". When querying, you can search either by keyword or by browsing the category directory. If you search by keyword, the returned results look like those of a full-text search engine: websites sorted by degree of relevance. Note, however, that keyword queries in a classified directory only match the website name, URL, description, and similar fields, and each result is only the URL of the website's home page rather than a specific page. A classified directory is like a telephone book: websites are arranged by their nature into categories and subcategories, down to the detailed address of each site. You can usually also find sites without using keywords at all, simply by drilling down through the relevant directories (note: you find related websites, not specific pages on those websites; within a directory, sites are generally ranked alphabetically by title or by inclusion date).
A search engine does not really search the Internet; it actually searches a pre-organized web index database.

In the true sense, a search engine usually refers to a full-text search engine that collects tens of millions to billions of web pages on the Internet, indexes every word (that is, every keyword) in those pages, and builds an index database from them. When a user searches for a keyword, every page whose content contains the keyword is retrieved as a result, and a complex ranking algorithm orders these results by their relevance to the search keyword.

In addition to indexing the content of a page, search engines now use hyperlink analysis: they also index the URLs, anchor text, and even the text surrounding the links that point to a page. Therefore, even if a web page A does not contain a phrase such as "devil satan", if another page B links to A with the anchor text "devil satan", users searching for "devil satan" can still find page A. Moreover, the more pages (C, D, E, F, ...) link to page A with the anchor text "devil satan", and the better the source pages (B, C, D, E, F, ...) of those links are, the more relevant page A is considered when users search for "devil satan", and the higher it ranks.
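As a rough illustration of this idea (not any engine's actual algorithm), the sketch below credits the anchor text of each link to the link's target page, so a page can match a keyword it never contains. The page names and link data are made up.

```python
from collections import defaultdict

# Hypothetical link data: (source page, target page, anchor text of the link)
links = [
    ("B", "A", "devil satan"),
    ("C", "A", "devil satan"),
    ("D", "A", "pictures"),
]

# keyword -> {target page: number of links crediting it with that keyword}
anchor_index = defaultdict(lambda: defaultdict(int))
for source, target, anchor in links:
    for word in anchor.lower().split():
        anchor_index[word][target] += 1

# Page A is indexed under "devil" even if A's own content never uses the word;
# the more links carry that anchor text, the stronger the signal.
print(dict(anchor_index["devil"]))   # -> {'A': 2}
```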

The principle of a search engine can be viewed as three steps: crawling web pages from the Internet → creating an index database → searching and sorting in the index database.

Crawling web pages from the Internet
A dedicated spider program automatically collects web pages from the Internet: it accesses the Internet, follows the URLs on each page to other pages, repeats this process, and gathers all the pages it has crawled, as sketched below.
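The following is a hedged sketch of that loop using only the Python standard library: it starts from a placeholder seed URL, fetches pages, extracts links, and follows them breadth-first up to a small page limit. A real spider would add politeness delays, robots.txt handling, deduplication by content, and far more robust error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, returning {url: html} for the pages fetched."""
    seen, queue, pages = set(), deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                           # skip pages that fail to load
        pages[url] = html                      # hand the page to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # follow links to new pages
    return pages
```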

Create an index database
The analysis and indexing program analyzes the collected pages, extracts the relevant information (including the URL, encoding type, the keywords contained in the page content and their positions, generation time, size, and links to other pages), performs a large amount of complex computation based on some relevance algorithm to obtain the relevance (or importance) of each page for every keyword in its content and hyperlinks, and then uses this information to build the web index database.
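A toy illustration of this step, with an invented relevance measure (plain term frequency) standing in for a real relevance algorithm, might record page metadata and per-keyword scores like this; the URL and field names are placeholders.

```python
from collections import Counter, defaultdict
import time

def index_page(url, text, index, page_info):
    """Record simple page metadata and a term-frequency 'relevance' per keyword."""
    words = text.lower().split()
    page_info[url] = {
        "size": len(text),            # page size in characters
        "indexed_at": time.time(),    # when the page was indexed
        "word_count": len(words),
    }
    for word, count in Counter(words).items():
        # relevance of this page for this keyword: plain term frequency
        index[word][url] = count / len(words)

index, page_info = defaultdict(dict), {}
index_page("http://example.com/a",
           "search engines index web pages and index keywords",
           index, page_info)
print(index["index"])   # -> {'http://example.com/a': 0.25}
```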

Search and sort in the index database
When you enter a keyword to search, the retrieval program finds all the pages matching that keyword in the web index database. Because the relevance of every matching page for the keyword has already been computed, the results only need to be sorted by the precomputed relevance values: the higher the relevance, the higher the ranking.
Finally, the page generation system assembles the URLs and content summaries of the search results and returns them to the user.
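Continuing the toy index above, a minimal sketch of the query step might look like the following; the index layout, field names, and summary logic are assumptions for illustration only.

```python
def search(keyword, index, page_texts, top_n=10):
    """Look up the keyword's posting list and sort pages by precomputed relevance."""
    postings = index.get(keyword.lower(), {})          # {url: relevance score}
    ranked = sorted(postings.items(), key=lambda item: item[1], reverse=True)
    results = []
    for url, score in ranked[:top_n]:
        summary = page_texts.get(url, "")[:80]         # naive content summary
        results.append({"url": url, "relevance": score, "summary": summary})
    return results

# Example call against a toy index and toy page texts:
toy_index = {"index": {"http://example.com/a": 0.25, "http://example.com/b": 0.10}}
toy_texts = {
    "http://example.com/a": "search engines index web pages and index keywords",
    "http://example.com/b": "a page that mentions the word index once",
}
print(search("index", toy_index, toy_texts))
```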
A search engine spider generally needs to revisit all web pages periodically (different search engines have different cycles, which may be days, weeks, or months, and pages of different importance may be refreshed at different frequencies). The web index database is updated accordingly to reflect changes in page content, add new pages, remove dead links, and re-rank according to changes in page content and links. In this way, the specific content of pages and their changes are reflected in users' query results.

Although there is only one Internet, different search engines have different capabilities and preferences, so they crawl different sets of pages and apply different ranking algorithms. The databases of large search engines store indexes of hundreds of millions to billions of web pages, amounting to thousands or even tens of thousands of gigabytes of data. Yet even the largest search engines, whose index databases exceed two billion pages, cover less than 30% of ordinary web pages, and the data overlap between different search engines is generally under 70%. An important reason for using several different search engines is that each can find content the others cannot. Moreover, a large amount of content on the Internet is never crawled by search engines at all and therefore cannot be found through them.

You should keep this concept in mind: a search engine can only search the content stored in its own web page index database. You should also keep this one in mind: if something should be in the search engine's web index database but you cannot find it, the problem lies with your search skills, and learning search techniques can greatly improve your ability to find things.

The above content is a bit messy and will be sorted and optimized later.

Dml@2013.2.17
