General Principles of Search Engines: Search Engine Technology
Source: Internet
Author: User
A search engine does not actually search the Internet in real time; it searches a prebuilt database of indexed web pages.
A true search engine usually means a full-text search engine: one that collects tens of millions to billions of web pages from the Internet, indexes every word (that is, every keyword) on those pages, and builds an index database. When a user looks up a keyword, every page whose content contains that keyword is returned as a search result. A ranking algorithm then orders these results by their relevance to the search keyword.
Search engines today generally use hyperlink analysis. In addition to the content of the indexed page itself, they analyze the URL, the anchor text, and even the surrounding text of every link that points to the page. So even if page A never contains the phrase "Demon Satan", a user who searches for "Demon Satan" can still find page A if another page B links to it with the anchor text "Demon Satan". Moreover, the more pages (C, D, E, F, ...) that point to page A with a link named "Demon Satan", or the higher the quality of the source pages giving those links (B, C, D, E, F, ...), the more relevant page A is considered for the query "Demon Satan", and the higher it ranks.
The working principle of a search engine can be viewed as three steps: crawl web pages from the Internet → build an index database → search and rank in the index database.
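The anchor-text effect described above can be illustrated with a small sketch. It is not how any particular search engine works; the link list, the page bodies, and the scoring rule (one point per body match plus one point per inbound link with a matching anchor) are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical toy link graph: (source page, anchor text, target page).
LINKS = [
    ("B", "Demon Satan", "A"),
    ("C", "Demon Satan", "A"),
    ("D", "holiday photos", "A"),
]

# Hypothetical page bodies; page A never mentions the phrase itself.
PAGES = {
    "A": "a page about folklore and mythology",
    "B": "links to interesting pages",
}


def build_anchor_index(links):
    """Map each lowercased anchor phrase to {target page: inbound link count}."""
    index = defaultdict(lambda: defaultdict(int))
    for _source, anchor, target in links:
        index[anchor.lower()][target] += 1
    return index


def search(phrase, pages, anchor_index):
    """Return pages matching the phrase in body or inbound anchor text, best first."""
    phrase = phrase.lower()
    scores = defaultdict(int)
    for url, body in pages.items():
        if phrase in body.lower():
            scores[url] += 1  # assumed weight for a body match
    for url, count in anchor_index.get(phrase, {}).items():
        scores[url] += count  # assumed weight: one point per matching inbound link
    return sorted(scores, key=lambda u: (-scores[u], u))


anchor_index = build_anchor_index(LINKS)
print(search("Demon Satan", PAGES, anchor_index))
```

Page A is found even though its own text lacks the phrase, and each additional matching inbound link raises its score. The sketch counts links only; weighting links by the quality of their source pages is left out.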
Crawl Web pages from the Internet
A spider program automatically collects web pages from the Internet: it visits a page, follows every URL found on that page to other pages, repeats the process, and gathers all the pages it crawls.
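The crawl loop can be sketched as a breadth-first traversal. To keep the example runnable without network access, the `FAKE_WEB` dictionary and the `fetch` function below are hypothetical stand-ins for real HTTP requests; a production spider would also need politeness delays, robots.txt handling, and error handling, none of which is shown here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Stand-in for an HTTP GET; returns None for unknown URLs."""
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first crawl from seed, returning {url: html} for fetched pages."""
    seen = {seed}
    queue = deque([seed])
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return collected


pages = crawl("http://example.test/")
print(sorted(pages))
```

The `seen` set keeps the spider from fetching the same URL twice when pages link back to each other, and `max_pages` bounds the crawl.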
Setting up an index database
An indexing program analyzes the collected web pages and extracts the relevant page information (including the page URL, the encoding type, the keywords contained in the page content, keyword positions, timestamp, size, links to other pages, and so on). Using a relevance algorithm, it then computes the relevance (or importance) of each page for every keyword in the page content and in its hyperlinks, and uses this information to build the web page index database.
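A minimal sketch of such an index is an inverted index that records, for each keyword, the pages and positions where it occurs. The sample documents are hypothetical, and plain term frequency is used here only as a crude stand-in for the relevance algorithm, which the text does not specify.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> plain text content.
DOCS = {
    "http://example.test/a": "search engines index web pages",
    "http://example.test/b": "a spider crawls web pages and follows links",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return {word: {url: [positions]}} recording where each word occurs."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index


def term_frequency(index, word, url, docs):
    """Crude relevance stand-in: occurrences of word divided by page length in words."""
    positions = index.get(word, {}).get(url, [])
    length = len(tokenize(docs[url]))
    return len(positions) / length if length else 0.0


index = build_index(DOCS)
print(index["pages"])
print(term_frequency(index, "web", "http://example.test/a", DOCS))
```

Because the positions and scores are computed once at indexing time, a later query only has to look them up rather than rescan the pages.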
Searching and ranking in the index database
When a user enters a keyword, the search program finds all pages in the web page index database that match the keyword. Because the relevance of each of these pages to the keyword has already been computed, the results only need to be sorted by the existing relevance values: the higher the relevance, the higher the ranking.
Finally, the page generation system assembles the link address and content summary of each search result into a results page and returns it to the user.
A search engine's spider generally revisits all pages periodically (the cycle differs from one search engine to another and may be days, weeks, or months; pages of different importance may also be revisited at different frequencies). It updates the index database to reflect changes in page content, adds new pages, removes dead links, and re-ranks pages according to changes in content and link relationships. In this way, both the content of web pages and changes to it are reflected in users' query results.
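The query step can be sketched as a lookup in a precomputed index followed by a sort and a summary excerpt. The `INDEX` scores, the `PAGE_TEXT` contents, and the 30-character excerpt window are all assumptions chosen for illustration.

```python
# Hypothetical precomputed index: keyword -> {url: relevance score}.
INDEX = {
    "spider": {"http://example.test/a": 0.8, "http://example.test/b": 0.3},
    "index": {"http://example.test/b": 0.6},
}

# Hypothetical stored page text used to build summaries.
PAGE_TEXT = {
    "http://example.test/a": "A spider is a program that collects web pages automatically.",
    "http://example.test/b": "The index maps every keyword to pages; a spider feeds it.",
}


def search(keyword, index):
    """Return (url, score) pairs for keyword, highest relevance first."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


def summarize(text, keyword, width=30):
    """Return a short excerpt of text around the first keyword match."""
    position = text.lower().find(keyword.lower())
    if position < 0:
        return text[: 2 * width]
    start = max(0, position - width)
    end = min(len(text), position + len(keyword) + width)
    return text[start:end]


def results_page(keyword, index, page_text):
    """Combine ranked URLs with their summaries, as a list of dicts."""
    return [
        {"url": url, "score": score, "summary": summarize(page_text[url], keyword)}
        for url, score in search(keyword, index)
    ]


for result in results_page("spider", INDEX, PAGE_TEXT):
    print(result["url"], result["score"], result["summary"])
```

No relevance is computed at query time; the sort simply reuses the scores stored in the index, which is what keeps the lookup fast.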
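One way to picture the periodic refresh is a pass over the index database that revisits only the pages whose interval has elapsed. The interval table, the entry layout (`last_visit`, `importance`, `hash`), and the use of a content hash to detect changes are assumptions for this sketch, not a description of any real search engine.

```python
import hashlib

# Hypothetical revisit intervals in days, keyed by an importance label.
REVISIT_DAYS = {"high": 1, "medium": 7, "low": 30}


def is_due(last_visit_day, importance, today):
    """True when the page's revisit interval has elapsed."""
    return today - last_visit_day >= REVISIT_DAYS[importance]


def fingerprint(content):
    """Hash page content so a change can be detected cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def refresh(index_db, fetch, today):
    """Revisit due pages: drop dead links and record pages whose content changed."""
    changed, removed = [], []
    for url in list(index_db):  # copy the keys because entries may be deleted
        entry = index_db[url]
        if not is_due(entry["last_visit"], entry["importance"], today):
            continue
        content = fetch(url)
        if content is None:  # dead link
            del index_db[url]
            removed.append(url)
            continue
        new_hash = fingerprint(content)
        if new_hash != entry["hash"]:
            entry["hash"] = new_hash
            changed.append(url)  # these pages would be re-indexed and re-ranked
        entry["last_visit"] = today
    return changed, removed


# Hypothetical state: page a changed, page b disappeared, page c is not yet due.
live = {"http://example.test/a": "new text", "http://example.test/c": "same"}
db = {
    "http://example.test/a": {"last_visit": 0, "importance": "high", "hash": fingerprint("old text")},
    "http://example.test/b": {"last_visit": 0, "importance": "high", "hash": fingerprint("gone")},
    "http://example.test/c": {"last_visit": 0, "importance": "low", "hash": fingerprint("same")},
}
changed, removed = refresh(db, live.get, today=7)
print(changed, removed)
```

Discovering newly published pages is the job of the crawl itself and is not covered by this refresh pass.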