Technologies required by commercial search engines

Source: Internet
Author: User

First, we need to distinguish between a search engine and a search function. In most applications, all we need is to search a database. A search engine, by contrast, is a relatively independent system that provides a relatively complete service.

A commercial-grade search engine generally requires the following technologies:

1. Full-text search engine

A full-text search engine looks up the keywords entered by the user in the full-text data, combining them according to certain rules, and returns index numbers ranked by relevance so that the page program can display the retrieved data page by page. For a professional search engine, the requirements on the full-text engine are strict. First, it must return the search result within 1 second (not counting page display time). Second, the first 100 results must be the ones that best meet the user's needs. Because a search engine is a high-concurrency, high-load system, the efficiency of each individual search is critical. And since users rarely page beyond the 5th page, the first 100 results largely determine the perceived quality of the search engine.

At present, Lucene is generally regarded as the best open-source full-text search engine.
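
To make the flow described above concrete (match keywords, rank by relevance, return index numbers one page at a time), here is a minimal sketch in Python. It is not how Lucene works internally; the class name `TinyIndex`, the TF-IDF style score, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        # Index the lower-cased, whitespace-separated terms of one document.
        for term, freq in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = freq
        self.doc_count += 1

    def search(self, query, page=1, page_size=10):
        # Sum a TF-IDF style score over the query terms ("or" semantics).
        scores = defaultdict(float)
        for term in set(query.lower().split()):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, freq in docs.items():
                scores[doc_id] += freq * idf
        # Highest score first; ties broken by doc_id so the order is stable.
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        start = (page - 1) * page_size
        return ranked[start:start + page_size]


# Hypothetical sample data.
index = TinyIndex()
index.add(1, "IBM laptop hard drive")
index.add(2, "laptop bag")
index.add(3, "desktop hard drive enclosure")
first_page = index.search("laptop hard drive", page=1, page_size=2)
```

Only the document identifiers are returned, so the page program can fetch and render just the records for the requested page. A production engine would add persistent storage, caching, and a far more refined scoring model to meet the one-second target under load.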

2. Chinese word segmentation technology

Because Chinese text lacks the natural word boundaries that English has, Chinese word segmentation is a key technology in Chinese search engines. To improve efficiency, a full-text search engine generally stores Chinese text as a single-character index, which reduces the size of the index database and speeds up indexing, but its built-in retrieval algorithm is then poorly suited to Chinese search. We can test a search engine with a query such as "IBM laptop hard drive": without built-in Chinese word segmentation the results are far from ideal, and sometimes nothing is returned at all. With Chinese word segmentation built in, the query can be split into "IBM", "laptop", and "hard drive", and combining these terms with an "or" relationship yields the closest matches. An ideal Chinese word segmentation technique for search is not judged on segmentation accuracy alone. Unlike the professional-grade segmentation used for machine translation or artificial intelligence, a search engine segments relatively short pieces of text, so we need to strike a balance between accuracy and performance.

3. Information collection crawlers

Crawlers are the means of information collection, so their efficiency is particularly important. Normally, multiple crawlers work together as a group: each follows its own path, sorts and parses the valid data it obtains, and finally submits it to the system to be indexed and stored. Because implementations differ, I will not go into detail here. The key points are:

Crawlers should have their own unique, fixed names so that target websites can identify them in traffic statistics and analysis.
Crawlers must follow certain rules. The international convention is for a website to place a robots.txt file in its root directory that tells crawlers what they may and may not fetch.
Check strictly for circular paths and crawler traps to avoid paralyzing the target website.
Restrict crawling to certain hours to avoid the target website's peak operating time.
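
The four points above can be sketched as a small guard that a crawler consults before each request. This is a minimal sketch using Python's standard `urllib.robotparser`; the crawler name `ExampleSearchBot`, the class `CrawlGuard`, the depth limit, and the 01:00 to 06:00 crawl window are all hypothetical values, and the window check assumes it does not span midnight.

```python
from datetime import time
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSearchBot"  # hypothetical unique, fixed crawler name


class CrawlGuard:
    """Decides whether one URL on a single target site may be fetched right now."""

    def __init__(self, robots_lines, max_depth=5,
                 window_start=time(1, 0), window_end=time(6, 0)):
        # Parse robots.txt content that was already downloaded from the site root.
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        self.max_depth = max_depth        # crude defence against crawler traps
        self.window_start = window_start  # assumed off-peak window start
        self.window_end = window_end      # assumed off-peak window end
        self.seen = set()                 # visited URLs, to break circular paths

    def should_fetch(self, url, depth, now):
        url, _fragment = urldefrag(url)   # "#top" variants count as the same page
        if not (self.window_start <= now <= self.window_end):
            return False                  # outside the allowed crawl hours
        if depth > self.max_depth:
            return False                  # too deep: possible trap
        if url in self.seen:
            return False                  # already visited: possible cycle
        if not self.robots.can_fetch(USER_AGENT, url):
            return False                  # disallowed by robots.txt
        self.seen.add(url)
        return True


# Hypothetical robots.txt content for the target site.
guard = CrawlGuard(["User-agent: *", "Disallow: /private/"])
allowed = guard.should_fetch("https://example.com/index.html", 0, time(2, 30))
```

The same `USER_AGENT` string should also be sent in the `User-Agent` header of every request so that the target site's logs show a consistent name. A depth limit and a visited set catch only the simplest traps; real crawlers usually add URL normalization and per-site request budgets.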

 

This article is from http://www.rainsts.net
