[Disclaimer: All rights reserved. Reprinting is welcome, but please do not use this article for commercial purposes. Contact email: feixiaoxing @ 163.com]
Everyone is familiar with search engines. I, for one, use Baidu no fewer than a few dozen times a day. When it comes to looking up information, Baidu has helped me a great deal, although there are also plenty of search results that I am not satisfied with. So are you curious about how a search engine is actually built? The answer is both simple and complicated.
As you know, a web search returns web pages. The naive idea is that if a page contains the keyword, that page is what we need. In practice it cannot work this way. There is a huge number of pages on the Internet, and scanning all of them for a keyword in under one second is basically impossible. Merely downloading billions of pages is hard enough, let alone traversing them for every query. Therefore, search results are always computed from data that has been pre-processed in advance. With this basic concept in mind, the rest is easier to explain.
Building a simple search engine is not complicated. The key steps fall into three parts: webpage traversal and download, Chinese word segmentation and indexing, and result lookup and ranking.
(1) Webpage download and traversal
To have your own search engine, you first need your own data, that is, a collection of web pages. A major feature of web pages is the hyperlink: by following hyperlinks we can keep crawling from one page to others. This raises several questions. How should we crawl? Should we traverse depth-first or breadth-first? How do we determine which pages have already been visited and which have not? And what changes if we use multiple threads to crawl?
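As a rough illustration of the traversal question, here is a minimal sketch of a breadth-first crawl with a visited table. It assumes the web is already available as a small in-memory link graph; the names Page, MAX_PAGES, MAX_LINKS and crawl_bfs are hypothetical and chosen only for this example. A real crawler would download each URL and would need a hash table of visited URLs instead of a fixed array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 16   /* hypothetical upper bound for this toy example */
#define MAX_LINKS 4    /* hypothetical maximum number of links per page */

/* A page is reduced to the list of pages it links to (-1 ends the list). */
typedef struct {
    int links[MAX_LINKS];
} Page;

/* Breadth-first traversal starting from page `start`.
 * Writes the visit order into `order` (room for page_count entries)
 * and returns the number of pages reached. */
int crawl_bfs(const Page *web, int page_count, int start, int *order)
{
    int visited[MAX_PAGES] = {0};   /* 1 = already queued or visited */
    int queue[MAX_PAGES];           /* each page is queued at most once */
    int head = 0, tail = 0, count = 0;

    assert(page_count > 0 && page_count <= MAX_PAGES);
    assert(start >= 0 && start < page_count);

    visited[start] = 1;
    queue[tail++] = start;

    while (head < tail) {
        int cur = queue[head++];
        order[count++] = cur;       /* a real crawler would download here */

        for (int i = 0; i < MAX_LINKS; i++) {
            int next = web[cur].links[i];
            if (next < 0)
                break;              /* end of this page's link list */
            if (next < page_count && !visited[next]) {
                visited[next] = 1;  /* mark when queued, so it is never queued twice */
                queue[tail++] = next;
            }
        }
    }
    return count;
}
```

Replacing the queue with a stack would turn the same loop into a depth-first traversal. Pages that no crawled page links to are simply never reached, which is why a crawler usually starts from several seed pages.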
(2) Chinese word segmentation and inverted files
Having your own pages is only the basic prerequisite. Next, we need to split the text on each page, breaking every sentence into individual words. This is what is called Chinese word segmentation. There are many ways to do it. The basic approach is to match the text against a dictionary, either from left to right or from right to left, or to choose the split that yields the fewest words. The number of web pages keeps growing, but the vocabulary of a language is more or less fixed; in Chinese there are roughly 100,000 words and phrases. Other languages are similar. English text is already made up of separate words, and by the standard of the Oxford dictionary it has at most about 100,000 words. This is what motivates the inverted file format. Put simply, we no longer ask which words a page contains, but rather which pages a word appears in. This is the basic idea of inverted files.
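The two ideas in this section can be sketched together: left-to-right dictionary matching that always takes the longest word (often called forward maximum matching), feeding a table that maps each word to the pages containing it. This is only a toy under stated assumptions. Unspaced ASCII text stands in for Chinese text so that no multi-byte handling is needed, the six-word dictionary is made up, and the names DICT, postings, longest_match, index_document and lookup are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_DOCS  8    /* toy limit, assumed for this sketch */
#define DICT_SIZE 6

/* Hypothetical mini dictionary; a real one would hold about 100,000 entries. */
static const char *DICT[DICT_SIZE] = {
    "search", "engine", "web", "page", "webpage", "index"
};

/* Inverted table: postings[w][d] != 0 means word w appears in document d. */
static unsigned char postings[DICT_SIZE][MAX_DOCS];

/* Return the index of the longest dictionary word that is a prefix of
 * `text`, or -1 when no entry matches. */
int longest_match(const char *text)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < DICT_SIZE; i++) {
        size_t len = strlen(DICT[i]);
        if (len > best_len && strncmp(text, DICT[i], len) == 0) {
            best = i;
            best_len = len;
        }
    }
    return best;
}

/* Segment `text` from left to right and record every word found
 * in the inverted table under document number `doc`. */
void index_document(const char *text, int doc)
{
    assert(doc >= 0 && doc < MAX_DOCS);

    while (*text != '\0') {
        int w = longest_match(text);
        if (w < 0) {
            text++;                     /* unknown character: skip it */
        } else {
            postings[w][doc] = 1;
            text += strlen(DICT[w]);    /* continue after the matched word */
        }
    }
}

/* Copy the numbers of the documents containing `word` into `out`
 * (room for MAX_DOCS entries) and return how many there are. */
int lookup(const char *word, int *out)
{
    int n = 0;

    for (int w = 0; w < DICT_SIZE; w++) {
        if (strcmp(word, DICT[w]) != 0)
            continue;
        for (int d = 0; d < MAX_DOCS; d++)
            if (postings[w][d])
                out[n++] = d;
    }
    return n;
}
```

For example, after index_document("webpageindex", 1), a lookup of "webpage" returns document 1 but a lookup of "page" returns nothing, because the longest match consumed "webpage" as a single word. This is exactly the kind of ambiguity that makes segmentation strategy matter. A linear scan over the dictionary is used only for brevity; a trie or hash table would be the usual choice at real dictionary sizes.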
(3) Index search and ranking
With inverted files built, we can store the index in a database, and every search result is assembled by merging entries from this database. Once the matching pages are found, how do we decide which should be ranked first and which later? Here PageRank is an algorithm that has to be mentioned. Its basic idea is that what most people consider good is probably what you want as well: a web page is important because it is often referenced by other pages. By analogy, a paper that has been cited many times is unlikely to be of poor quality. Of course, other factors also affect ranking, such as word frequency, page date, site weight, keywords purchased by customers, cheating (spam) pages, and title text.
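To make the "important because often referenced" idea concrete, here is a small sketch of the iterative form of PageRank on a fixed four-page link matrix. It is a simplified textbook version, not the ranking used by any real search engine. The damping factor 0.85, the fixed iteration count, and the names PAGE_COUNT, DAMPING and pagerank are assumptions made for this example.

```c
#include <assert.h>

#define PAGE_COUNT 4    /* number of pages in this toy graph (assumed) */
#define DAMPING    0.85 /* commonly quoted damping factor; an assumption here */

/* links[i][j] != 0 means page i links to page j.
 * Fills rank[] with scores that sum to 1 after `iterations` rounds. */
void pagerank(int links[PAGE_COUNT][PAGE_COUNT], double rank[PAGE_COUNT],
              int iterations)
{
    double next[PAGE_COUNT];
    int out_degree[PAGE_COUNT];

    assert(iterations > 0);

    /* Start with equal rank everywhere and count outgoing links. */
    for (int i = 0; i < PAGE_COUNT; i++) {
        rank[i] = 1.0 / PAGE_COUNT;
        out_degree[i] = 0;
        for (int j = 0; j < PAGE_COUNT; j++)
            if (links[i][j])
                out_degree[i]++;
    }

    for (int it = 0; it < iterations; it++) {
        /* Rank held by pages without outgoing links is shared by everyone. */
        double dangling = 0.0;
        for (int i = 0; i < PAGE_COUNT; i++)
            if (out_degree[i] == 0)
                dangling += rank[i];

        for (int j = 0; j < PAGE_COUNT; j++) {
            double sum = dangling / PAGE_COUNT;
            /* Each page i passes an equal share of its rank to every page it links to. */
            for (int i = 0; i < PAGE_COUNT; i++)
                if (links[i][j])
                    sum += rank[i] / out_degree[i];
            next[j] = (1.0 - DAMPING) / PAGE_COUNT + DAMPING * sum;
        }
        for (int j = 0; j < PAGE_COUNT; j++)
            rank[j] = next[j];
    }
}
```

If pages 0, 1 and 3 all link to page 2, and page 2 links only to page 0, then page 2 ends up with the highest score, page 0 comes second because it is referenced by an important page, and pages 1 and 3, which nobody links to, come last. In a real engine this score would be only one input to ranking, next to the other factors listed above.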
I chose this topic for my blog out of personal interest; it has nothing to do with my work. The style of the series is the same as in my previous posts: it is written one topic at a time, with C code interspersed to show how each function is implemented. It would be my greatest pleasure if it helps you.