First, we need to distinguish between a search engine and search in general. In most applications, we only need to search a database. A search engine, by contrast, is a relatively independent system that provides a relatively complete service.
A commercial-grade search engine generally requires the following technologies:
1. Full-text search engine
The engine matches the keywords entered by the user against the full text of the indexed information according to certain combination rules, and returns index numbers ranked by relevance so that the page program can display the retrieved data page by page. A professional search engine places strict requirements on its full-text engine. First, it must return the search result within 1 second (not counting the page display time). Second, the first 100 results must be the ones that best meet the user's needs. Because a search engine is a system with high concurrency and load, the efficiency of each search task is critical. Users rarely go beyond the fifth page of results, so the first 100 results are particularly important to the quality of a search engine.
At present, Lucene is one of the best open-source full-text search engines.
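To make the idea of relevance-ranked, paged results concrete, here is a minimal sketch of an in-memory inverted index. It does not use Lucene and is not how Lucene scores documents: the class name `TinyIndex`, the term-frequency score, and the page size are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index: term -> {doc_id: term frequency}. Illustrative only."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        # Count each lowercase word once per document.
        counts = Counter(re.findall(r"\w+", text.lower()))
        for term, tf in counts.items():
            self.postings[term][doc_id] = tf

    def search(self, query, page=1, page_size=20):
        """Return one page of document ids, most relevant first."""
        scores = Counter()
        for term in re.findall(r"\w+", query.lower()):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf  # assumed score: summed term frequency
        # Highest score first; ties broken by document id for a stable order.
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        start = (page - 1) * page_size
        return ranked[start:start + page_size]


index = TinyIndex()
index.add(1, "IBM laptop with a large hard drive")
index.add(2, "Replacement hard drive, hard drive cables")
index.add(3, "Desk lamp")
print(index.search("hard drive"))
```

Only the requested page is returned to the page program, which is why the ordering of the first few pages matters far more than the total number of matches.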
2. Chinese word segmentation technology
Unlike English, Chinese has no spaces to separate words, so Chinese word segmentation is a key technology in Chinese search engines. To improve efficiency, a full-text search engine generally stores Chinese text as a single-character index, which reduces the size of the index database and improves indexing efficiency. Its built-in retrieval algorithm, however, is not suitable for Chinese search. We can test some search engines with a query such as "IBM laptop hard drive" written as one unbroken Chinese phrase: engines without built-in Chinese word segmentation return poor results, or even no results at all. With Chinese word segmentation built in, the query can be split into "IBM", "laptop" and "hard drive", and the closest results are then retrieved by combining the terms with an "or" relationship. An ideal Chinese word segmentation technique is not judged on segmentation accuracy alone. The segmentation a search engine needs differs from professional-grade segmentation for machine translation or artificial intelligence: it deals with relatively short phrases, so we need to strike a balance between accuracy and performance.
3. Information Collection Crawlers
As the means of information collection, crawlers must be efficient. Normally, multiple crawlers work together as a group: each follows links along a path, sorts and parses the valid data it obtains, and finally submits it to the system for indexing and storage. Because implementations differ, I will not go into detail here. Here are some of the key points:
Crawlers should have their own unique and fixed names so that target websites can identify them in traffic statistics and analysis.
Crawlers must follow certain rules. The international convention is for a website to place a robots.txt file in its root directory to tell crawlers what they may do.
Check strictly for circular paths and crawler traps to avoid paralyzing the target website.
Restrict crawling times to avoid the peak hours of the target website.
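The four points above can be sketched as a single check that a crawler runs before fetching a URL. This is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleCrawler`, the depth limit, the off-peak window, and the helper names are assumptions for illustration, and the robots.txt text is parsed from a string so the example needs no network access.

```python
from datetime import time
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical fixed crawler name
MAX_DEPTH = 5                  # assumed cap on link depth, a guard against traps
QUIET_START, QUIET_END = time(1, 0), time(6, 0)  # assumed off-peak window


def load_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def in_crawl_window(now):
    """True when the given time of day falls inside the off-peak window."""
    return QUIET_START <= now < QUIET_END


def should_fetch(url, depth, visited, rules, now):
    """Apply the time, loop, depth and robots.txt checks to one URL."""
    url, _fragment = urldefrag(url)  # treat page#a and page#b as the same page
    if not in_crawl_window(now):
        return False
    if depth > MAX_DEPTH or url in visited:
        return False
    if not rules.can_fetch(USER_AGENT, url):
        return False
    visited.add(url)
    return True


rules = load_rules("User-agent: *\nDisallow: /private/\n")
visited = set()
night = time(2, 30)
print(should_fetch("http://example.com/a", 1, visited, rules, night))
print(should_fetch("http://example.com/a#top", 2, visited, rules, night))
print(should_fetch("http://example.com/private/x", 1, visited, rules, night))
print(should_fetch("http://example.com/b", 1, visited, rules, time(14, 0)))
```

Only the first call is allowed. The second is a repeat of an already visited page, the third is blocked by robots.txt, and the fourth falls outside the crawl window. A visited set and a depth cap catch only simple loops, so real crawler traps need further heuristics.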
This article is from http://www.rainsts.net