Getting Started with app back-end search

Source: Internet
Author: User
Tags: solr

Life online is now inseparable from search: whenever you run into a problem or want to learn something, you search for it and find the answer.

In an app, the most common search scenario is searching for users. With only hundreds or thousands of users, you can get away with a fuzzy LIKE query. But once the data grows to millions or tens of millions of rows, a LIKE query will bog the database down. Beyond a certain scale, you have to turn to dedicated search technology.

1. A simple search example

There are three rows of data:

(1) Nearly 80% of shareholders lost over 10% in the past two weeks.

(2) All-in on the Chinese dream.

(3) Shareholders got stuck with losses three times in two days.

Suppose we need to find, among the three rows above, every row containing the keyword "shareholder".

The straightforward approach is to examine each row of data in turn:

Scanning the first row from start to end finds the keyword "shareholder".

Scanning the second row from start to end finds no "shareholder".

Scanning the third row from start to end finds the keyword "shareholder".

Based on these scans, the first and third rows contain the keyword "shareholder".
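This scan can be sketched in a few lines of Python (a toy stand-in for a LIKE-style scan over the three example rows, lightly normalized to lower case):

```python
# Brute-force search: scan every row from start to end, the way a
# SQL LIKE query would. Fine for a handful of rows, slow for millions.
def naive_search(rows, keyword):
    """Return the IDs of rows whose text contains the keyword."""
    keyword = keyword.lower()
    return [row_id for row_id, text in rows if keyword in text.lower()]

rows = [
    (1, "Nearly 80% of shareholders lost over 10% in the past two weeks"),
    (2, "All-in on the Chinese dream"),
    (3, "Shareholders got stuck with losses three times in two days"),
]

print(naive_search(rows, "shareholder"))  # [1, 3]
```

Note that the running time grows linearly with both the number of rows and their length, which is exactly the problem the next section addresses.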

2. Fundamentals of Search Technology

With the procedure above, every search has to scan every row of data from beginning to end.

If you need to find a keyword among millions or tens of millions of rows, you can imagine how inefficient this is.

Now look at a search engine. Here is the result of searching for the keyword "shareholder":

Figure 1

The search engine directly displays all of the data containing the keyword "shareholder".

How does it quickly find the information containing a given keyword within such a huge volume of data?

The key to implementing fast search is word segmentation plus an inverted index.

If we know which keywords each row of data contains, and then create a mapping table that records, for each keyword, the rows of data it appears in, searching becomes easy: given a keyword, we just look it up in the mapping table, and the recorded mapping leads us straight to every row of data containing it.

The process of working out which keywords each row of data contains is word segmentation. That raises a question: what exactly is a keyword?

A keyword is simply a word or phrase. For the need above, "shareholder" can be a search keyword, but so can "shares", and so can "people". What counts as a keyword depends on the user's needs. Therefore, to accurately work out which keywords a row of data contains, you need a dictionary covering all the words and phrases against which to analyze the data.

Creating the mapping table that records which rows of data each keyword appears in is the process of building an inverted index.

Let's walk through a concrete example of how the word segmentation and the inverted index are built.

Going back to the three rows of data above, the left side is the data ID and the right side is the content:

(1) Nearly 80% of shareholders lost over 10% in the past two weeks.

(2) All-in on the Chinese dream.

(3) Shareholders got stuck with losses three times in two days.

First, analyze which keywords each row of data contains (to simplify the word segmentation here, not every character or number is treated as a keyword; for example, "people" ought to be a keyword but is skipped for simplicity). The results are shown in Table 1.

Table 1

From the results in Table 1, a mapping table (Table 2) is then built, recording the rows of data in which each keyword appears.

Table 2

With Table 2 in hand, it is easy to see that the keyword "shareholder" appears in data rows 1 and 3. Likewise, if you need to know where the keyword "China" appears, a quick lookup in Table 2 shows that it appears in row 2.

With only a few rows of data, the efficiency of an inverted index is hard to appreciate. But when the data grows to millions, tens of millions, or even hundreds of millions of rows, the speedup from the inverted index is dramatic. In the end, this data structure is built for exactly one purpose: fast search.
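The two steps, word segmentation (Table 1) and inverted-index construction (Table 2), can be sketched minimally in Python; a toy whitespace tokenizer stands in for a real dictionary-based segmenter:

```python
from collections import defaultdict

def tokenize(text):
    """A toy word breaker: lower-case and split on whitespace.
    A real engine uses a dictionary-based segmenter, especially for Chinese."""
    return text.lower().split()

def build_inverted_index(rows):
    """Map every keyword to the set of row IDs it appears in (Table 2)."""
    index = defaultdict(set)
    for row_id, text in rows:
        for word in tokenize(text):  # Table 1: keywords per row
            index[word].add(row_id)
    return index

rows = [
    (1, "Nearly 80% of shareholders lost over 10% in the past two weeks"),
    (2, "All-in on the Chinese dream"),
    (3, "Shareholders got stuck with losses three times in two days"),
]

index = build_inverted_index(rows)
print(sorted(index["shareholders"]))  # [1, 3] -- one lookup, no scanning
print(sorted(index["chinese"]))       # [2]
```

The index is built once, up front; each subsequent query is a single dictionary lookup instead of a full scan of every row.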

Going further, the right side of Table 2 can record not only which rows of data each keyword appears in, but also how often it occurs within a row, at what positions, and other information. If you want a deeper understanding of search-engine technology, read "This Is How Search Engines Work: Core Technologies in Detail" by Zhang Junlin; this article is only a brief introduction to the fundamentals.

3. Introduction to common open source search software

Search technology is not simple. If we tried to build it from scratch, who knows when our app's search feature would ship. Fortunately, experts have open-sourced a great deal of search software; as long as we learn the APIs these packages provide, we can integrate search into the app's back end. Below is a brief introduction to common search software.

(1) Lucene

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but a full-text search engine architecture providing a complete query engine and indexing engine, along with partial text-analysis engines (for English and German).

Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. It offers a simple yet powerful API for full-text indexing and searching, is a mature, free, open-source tool in the Java ecosystem, and is currently the most popular free Java information-retrieval library.

(2) Solr

Solr is a high-performance full-text search server developed in Java and built on Lucene. It extends Lucene with a richer query language; it is configurable, extensible, and optimized for query performance, and it provides a complete management interface, making it a very good full-text search engine. Solr exposes a web-service-like API: users submit XML files in a prescribed format to the search server over HTTP to build the index, or issue lookup requests via HTTP GET and receive results in XML.
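As a rough illustration of that HTTP interface, here is how a lookup request URL might be assembled; the host, core name (users), and field name (name) are assumptions for the sketch, not part of this article, and the wt=json parameter asks Solr for JSON instead of the default XML:

```python
from urllib.parse import urlencode

# Hypothetical deployment details: host, port, core name ("users"),
# and indexed field ("name") are assumptions, not from the article.
SOLR_SELECT = "http://localhost:8983/solr/users/select"

def solr_query_url(keyword, rows=10):
    """Build the URL of an HTTP GET lookup request against Solr."""
    params = {"q": "name:" + keyword, "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)

url = solr_query_url("shareholder")
# Fetching this URL (e.g. with urllib.request.urlopen) would return
# the matching documents from the server.
```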

(3) Elasticsearch

Elasticsearch is a search server based on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is developed in Java, released as open source under the Apache License, and is the second most popular enterprise search engine.
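For comparison with Solr's XML interface, here is a sketch of the JSON body such a RESTful full-text request carries, as POSTed to an index's _search endpoint; the field name nickname is a made-up example:

```python
import json

def es_match_query(field, keyword):
    """JSON body of a full-text 'match' query, as sent to the
    RESTful /<index>/_search endpoint."""
    return json.dumps({"query": {"match": {field: keyword}}})

# The field name "nickname" is a hypothetical example.
body = es_match_query("nickname", "shareholder")
print(body)
```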

(4) Sphinx

Sphinx is an SQL-based full-text search engine. It pairs with MySQL or PostgreSQL to provide full-text search far more specialized than what the database itself offers, making it easier for applications to implement professional full-text search. Sphinx provides search APIs tailored to scripting languages such as PHP, Python, Perl, and Ruby, and also ships a storage-engine plugin for MySQL.

(5) Coreseek

Coreseek is a Chinese full-text search/retrieval package, released as open source under the GPLv2 license. Built on Sphinx and released independently, it specializes in Chinese search and information processing, targeting industry/vertical search, forum/site search, database search, document/literature retrieval, information retrieval, data mining, and other scenarios. It is free to download and use.

I have used Coreseek in depth in the back ends of two apps: configuration is simple, performance is high, and because it bundles Sphinx with Chinese word segmentation, the search module was finished quickly. Its biggest drawback is that the stable release does not support real-time indexing; the beta release does, but should not be used in production.

Coreseek's architecture works as follows [3]:

Figure 2

Coreseek has two core modules: the indexer and the search service.

Indexer: pulls the data source from MySQL, segments the data into words, and builds the index.

Search service: handles search requests.

The overall flow is as follows:

1. The indexer module pulls data from MySQL.

2. The indexer runs the data through a Chinese word breaker and builds the index.

3. The client sends a search request to the search module.

4. The search module looks up matches in the index.

5. The search module retrieves from the index the IDs (and similar fields) of the matching data.

6. The data is returned to the client.
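The six steps above can be simulated end to end in a short Python sketch, with an in-memory dict standing in for MySQL and whitespace splitting standing in for the Chinese word breaker:

```python
from collections import defaultdict

# Step 1: the indexer pulls rows from the data source
# (an in-memory dict stands in for MySQL here).
mysql_rows = {
    1: "shareholders lost over ten percent",
    2: "all-in on the chinese dream",
    3: "shareholders got stuck three times",
}

# Step 2: segment each row (whitespace splitting stands in for the
# Chinese word breaker) and build the index.
index = defaultdict(set)
for row_id, text in mysql_rows.items():
    for word in text.split():
        index[word].add(row_id)

def search(keyword):
    # Steps 3-5: the search module looks the keyword up in the index
    # and collects the IDs of the matching rows.
    return sorted(index.get(keyword, set()))

# Step 6: the IDs are used to fetch the rows returned to the client.
results = [mysql_rows[i] for i in search("shareholders")]
print(results)
```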

In addition, a small tip from experience: when searching, some users type pinyin in place of Chinese characters, for example:

Figure 3

To handle this, when recording a keyword, record the keyword's pinyin as well and index the pinyin too; users can then search by pinyin.
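A sketch of the idea: index each keyword under both its characters and its pinyin. The two-entry pinyin table here is a stand-in; a real system would use a full pinyin dictionary (e.g. the pypinyin library):

```python
from collections import defaultdict

# A tiny hand-made hanzi -> pinyin table. This is a stand-in: a real
# system would use a complete pinyin dictionary.
PINYIN = {"股东": "gudong", "中国": "zhongguo"}

def index_with_pinyin(rows):
    """Index every keyword under both the characters and their pinyin,
    so typing pinyin finds the Chinese-character records."""
    index = defaultdict(set)
    for row_id, words in rows:
        for word in words:
            index[word].add(row_id)
            if word in PINYIN:
                index[PINYIN[word]].add(row_id)
    return index

rows = [(1, ["股东", "亏损"]), (2, ["中国", "梦"])]
index = index_with_pinyin(rows)
print(sorted(index["gudong"]))  # [1] -- the pinyin query finds record 1
```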

Resources:

1. http://baike.baidu.com/link?url=Rnbw3tzh-ojyebopsuvwzpgz-stike5zfqsjatv234hffpjkyeyr3djjjrbzkrscbg2ngzv-la7dfqhf5xbeoq

2. http://baike.baidu.com/link?url=C92bKEtkJtap8FfRjpSX4m5-yGE1Dn6O-00FRV5RwLe-EOkJ6FIvfl7amUuYceb-5jOD3Zn0Oy1_1vh7lg0rxk

3. http://baike.baidu.com/link?url=Xh1aiphlriiq3jdugb8j8at7qpyxs1rvduvuqe76z0wldzvupfui8y7pbthyyiuzyyab5wuxfzjqs5oanrh5phpo7xyvdfsvuv5jlnvud33

4. http://www.coreseek.cn/
