Technologies required by commercial search engines

Last Update:2018-12-07 Source: Internet

Author: User


First, we need to distinguish between a search engine and search in general. In most cases, all we need is to search a database. A search engine, by contrast, is a relatively independent system that provides a fairly complete service.

A commercial-grade search engine generally requires the following technologies:

1. Full-text search engine

A full-text search engine matches the keywords entered by the user, combined according to certain rules, against the full-text information and returns index numbers ranked by relevance, so that the page program can display the retrieved data page by page. A professional search engine places strict requirements on full-text search. First, search results must be returned within 1 second (not counting page display time). Second, the first 100 results must be the ones that best meet the user's needs. Because a search engine is a high-concurrency, high-load system, the efficiency of each search task is critical. Users rarely page past the fifth page, so the first 100 results are particularly important to the quality of a search engine.
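The flow described above (match keywords, rank by relevance, return results one page at a time) can be sketched with a small in-memory inverted index. This is only an illustration in plain Python, not how Lucene or any particular engine works: the function names `build_index` and `search`, the sample documents, and the use of summed term frequency as the relevance score are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (a toy inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def search(index, query, page=1, page_size=10):
    """Return one page of document ids, ranked by summed term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


# Hypothetical sample data, for illustration only.
docs = {
    1: "laptop hard drive replacement guide",
    2: "hard drive prices",
    3: "laptop battery care",
}
index = build_index(docs)
print(search(index, "laptop hard drive", page=1, page_size=2))
print(search(index, "laptop hard drive", page=2, page_size=2))
```

Only the ranked document ids are returned, so the page program can fetch and render just the records for the requested page. A real engine would use a stronger relevance model and a persistent index.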

At present, Lucene is generally regarded as the best open-source full-text search engine.

2. Chinese word segmentation technology

Because Chinese text lacks the natural word boundaries that spaces provide in English, Chinese word segmentation is a key technology in Chinese search engines. To improve efficiency, a full-text search engine generally stores Chinese text as a single-character index, which reduces the size of the index database and improves indexing efficiency. However, its built-in retrieval algorithm is not well suited to Chinese search. We can test some search engines with a keyword such as "IBM laptop hard drive": on engines without built-in Chinese word segmentation, the results are poor, and sometimes no results are returned at all. With built-in Chinese word segmentation, the keyword can be split into "IBM", "laptop", and "hard drive", and combining these terms with an "or" relationship yields the closest retrieval results. An ideal Chinese word segmentation technique is not judged on segmentation accuracy alone. The segmentation a search engine needs differs from the professional-grade segmentation used in machine translation or artificial intelligence: the text it has to segment is relatively short, so we need to strike a balance between accuracy and performance.

3. Information collection crawlers

Crawlers are the means of information collection, so their efficiency is particularly important. Normally, multiple crawlers work together as a group: they follow links along a path, sort and parse the valid data they obtain, and finally submit it to the system for indexing and storage. Because implementations differ, I will not go into detail here. These are some of the key points:

Crawlers should have their own unique, fixed names so that target websites can identify them in traffic statistics and analysis.
Crawlers must follow certain rules. The international convention is for a website to place a robots.txt file in its root directory to tell crawlers what they may and may not do.
Strictly check for circular paths and crawler traps to avoid paralyzing the target website.
Restrict crawling to certain hours to avoid the peak operating times of the target website.
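The four points above can be combined into a single check that runs before each fetch. The sketch below uses Python's standard `urllib.robotparser` to honor robots.txt; the crawler name `ExampleSearchBot`, the depth cap, the off-peak window, and the helper names `load_rules` and `should_fetch` are hypothetical values chosen for illustration, not recommendations from the article.

```python
from datetime import time
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSearchBot"  # hypothetical fixed crawler name
MAX_DEPTH = 5                    # assumed cap on link depth to limit crawler traps
CRAWL_START, CRAWL_END = time(1, 0), time(5, 0)  # assumed off-peak window


def load_rules(robots_lines):
    """Parse the lines of a robots.txt file that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules


def should_fetch(url, depth, now, rules, visited):
    """Return True only if every politeness rule allows fetching this URL."""
    url = urldefrag(url).url  # drop "#fragment" so the same page is not revisited
    if not (CRAWL_START <= now <= CRAWL_END):
        return False  # outside the allowed crawl hours
    if depth > MAX_DEPTH or url in visited:
        return False  # possible trap or circular path
    if not rules.can_fetch(USER_AGENT, url):
        return False  # disallowed by robots.txt
    visited.add(url)
    return True


rules = load_rules(["User-agent: *", "Disallow: /private/"])
visited = set()
print(should_fetch("http://example.com/a#top", 1, time(2, 30), rules, visited))
print(should_fetch("http://example.com/a", 1, time(2, 30), rules, visited))
print(should_fetch("http://example.com/private/x", 1, time(2, 30), rules, visited))
print(should_fetch("http://example.com/b", 1, time(12, 0), rules, visited))
```

Only the first call is allowed: the second URL was already visited, the third is disallowed by robots.txt, and the fourth falls outside the crawl window. A depth cap and a visited set are only a basic defense; real crawler traps often need further safeguards, such as per-site page limits.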

This article is from http://www.rainsts.net

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
