ES search engine

Source: Internet
Author: User

Basic concepts:

Index

ES stores its data in one or more indexes. Compared with a relational database model, an index is roughly equivalent to a database instance (DB). The basic unit of storage and retrieval in an index is the document. Internally, ES uses Apache Lucene to read and write the data in an index. (What ES treats as a single index may be more than one index in Lucene, because in distributed mode ES uses shards and replicas to split an index and store multiple copies of it.)

Document

In ES, the document is the primary storage entity. All ES use cases come down to one search model: retrieving relevant documents.

A document consists of one or more fields, each of which has a name and one or more values (a field with several values is called multi-valued).

In ES, each document may have a different set of fields, which means that documents have no fixed schema or enforced structure. In practice, documents still tend to keep a similar structure.

From a client perspective, a document is a JSON object.

Mapping

All documents must be analyzed before they are stored. The user can configure how the input text is broken into tokens, which tokens are filtered out, and what other processing is applied, such as removing HTML tags.
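The analysis step described above can be illustrated with a minimal sketch in plain Python. It is not how ES or Lucene implement analysis; the function names and the stop-word list are assumptions made up for this example. It only shows the idea of a chain: strip HTML tags, split the text into tokens, then filter some tokens out.

```python
import re

# Hypothetical stop-word list for illustration; real analyzers are configurable.
STOP_WORDS = {"a", "an", "the", "is", "of"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Token filter: discard tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Run the whole chain: character filter, tokenizer, token filter."""
    return remove_stop_words(tokenize(strip_html(text)))


tokens = analyze("<p>The <b>Quick</b> fox is a friend of ES</p>")
```

Only the surviving tokens would be indexed, which is why the choice of filters directly affects what a search can match.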

Document types (type)

Each document in ES must have a type. Document types allow documents with different structures to be stored in the same index: ES looks up the mapping that corresponds to a document's type, which makes the documents easier to access.

Node

A single ES server instance is called a node.

Cluster

A cluster can store more information than fits on a single machine. Since a single node currently meets our needs, clusters are not described in detail here.

Index shard and replica

With index sharding, an ES cluster can hold more data than a single machine can store, and a client can read and write cluster data through the interface of any node. (Not explained in detail.)

Gateway

While running, ES collects information such as the cluster state and index settings. This information is stored in the gateway.

    

The core concepts behind ES:

ES was built around only a few core ideas.

Works out of the box.

Distributed (clustered) by nature.

Automatic fault tolerance.

Strong extensibility.

   

How ES works:

Startup process:

When an ES node starts, it uses multicast or unicast (the difference does not matter here) to find the other nodes in the same cluster and establish connections with them.

In the cluster, one node is elected as the master node. This node manages the cluster state and reassigns index shards to the appropriate nodes when the cluster topology changes.

From the user's point of view, the master node has no special standing in ES, which differs from some other systems (such as database systems). In fact, the user does not need to know which node is the master: any request can be sent to any node, and ES handles the rest internally. When necessary, any node can send a query to other nodes in parallel, merge the results each node returns, and hand the user a complete result set. None of this work is forwarded through the master node (nodes communicate with each other peer to peer).

When necessary, recovery is started. The master node checks which shards are available and decides which of them will act as primary shards. After this, the cluster enters the yellow state.

This means that the cluster can serve search requests, but not yet at full capacity (all primary shards have been allocated, but the replicas have not). The next step is to find the duplicated shards and treat them as replicas. When a primary shard has too few replicas, the master node decides which nodes should hold the missing ones and creates the replicas from the primary shard. Once all this work is done, the cluster turns green (meaning that all primary shards and their replicas have been allocated).
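The yellow and green states described above can be summarized in a small sketch. The data layout (one dict per shard, with the keys shown) is an assumption made up for this example and is not an ES data structure. The red state is not mentioned in the text above; it is included because ES reports it when some primary shard is not allocated.

```python
def cluster_health(shards):
    """Derive a health color from per-shard allocation info.

    `shards` is a list of dicts with hypothetical keys:
      primary_assigned  (bool) whether the primary shard is allocated
      replicas_wanted   (int)  configured number of replicas
      replicas_assigned (int)  replicas that are currently allocated
    """
    if not all(s["primary_assigned"] for s in shards):
        return "red"      # at least one primary shard is missing
    if any(s["replicas_assigned"] < s["replicas_wanted"] for s in shards):
        return "yellow"   # all primaries allocated, some replicas are not
    return "green"        # every primary and every replica is allocated


starting_up = [
    {"primary_assigned": True, "replicas_wanted": 1, "replicas_assigned": 0},
    {"primary_assigned": True, "replicas_wanted": 1, "replicas_assigned": 1},
]
state = cluster_health(starting_up)
```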

Failure detection

During normal operation, the master node monitors all nodes to check that each one is working properly. If a node is unreachable for a specified time, it is considered to have failed, and failure handling starts. The cluster must then be rebalanced, because the index shards assigned to the failed node are lost. The corresponding shards on other nodes take over the work, and this process can be configured to meet user needs.

Because this only illustrates how ES works, take a cluster of three nodes as an example: one master node and two data nodes. The master node sends a ping to each of the other nodes and waits for a response. If no response arrives (how many pings may go unanswered depends on the user configuration), the node is removed from the cluster.

  

Communicating with ES

Ultimately, what matters most is how to add data to ES and how to query that data. ES provides a RESTful API, which makes it easy to integrate with any other system that can speak HTTP.

Data sent to ES is carried either in the URL or in the request body, and it is sent to the server as a JSON document.

Inside ES, the nodes communicate with each other through the Java API.

Here's the point.

Index data

ES provides four ways to index data. The simplest is the index API, which adds a document to a specified index. For example, a new document can be created with a tool such as curl.
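As a rough illustration of what such an index request looks like, the sketch below builds the HTTP request in Python without sending it. The host, index name, type name, document id, and document fields are all assumptions made up for this example, and the `/index/type/id` URL layout is the typed layout used by the ES versions this article describes; newer versions use a different path.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values, for illustration only.
request = build_index_request(
    "http://localhost:9200", "library", "book", "1",
    {"title": "Crime and Punishment", "year": 1866},
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running node; that step is left out so the sketch works without a server.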

The second and third methods add documents in bulk, through the bulk API and the UDP bulk API. The ordinary bulk API uses HTTP, while the UDP bulk API uses a connectionless datagram protocol. UDP is faster but less reliable. The last method is the river plugin: a river runs on the nodes of the ES cluster and fetches data from an external system.
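The HTTP bulk API takes a newline-delimited body in which each document is preceded by an action line. The sketch below only builds such a body; the index name, type name, and documents are assumptions made up for this example, and the `_type` key applies to the typed ES versions this article describes.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build a bulk request body: one action line plus one source line per document.

    `docs` is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("library", "book", [
    ("1", {"title": "Crime and Punishment"}),
    ("2", {"title": "War and Peace"}),
])
```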

It is important to note that the index operation only happens on the primary shard, not on its replicas. If an index request is sent to a node that does not hold the appropriate shard, or holds only a replica of it, the request is forwarded to the node that holds the primary shard.
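The forwarding described above can be sketched as follows. The node names and the shard-to-node table are assumptions made up for this example, and `zlib.crc32` is only a stand-in: ES uses its own hash function on the routing value, but the general shape (hash modulo the number of primary shards) is the same idea.

```python
import zlib


def pick_primary_shard(doc_id, number_of_primary_shards):
    """Map a document id to a primary shard number (stand-in hash, not the one ES uses)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


def handle_index_request(receiving_node, doc_id, shard_to_primary_node):
    """Return (shard, owning node, whether the request had to be forwarded)."""
    shard = pick_primary_shard(doc_id, len(shard_to_primary_node))
    owner = shard_to_primary_node[shard]
    forwarded = owner != receiving_node
    return shard, owner, forwarded


# Hypothetical layout: three primary shards spread over three nodes.
layout = ["node-a", "node-b", "node-c"]
shard, owner, forwarded = handle_index_request("node-a", "book-42", layout)
```

Because the shard is derived from the document id alone, every node computes the same answer and can forward the request without asking the master.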

Data query

The query API makes up a large part of ES. Using the query DSL (a JSON-based language for building complex queries), you can:

Use various query types: simple term, phrase, range, Boolean, fuzzy, span, wildcard, geospatial, and other queries.

Build complex queries by combining simple ones.

Filter documents, discarding those that do not meet the criteria without affecting scoring.

Find documents similar to a given document.

Find search suggestions and corrections for a given phrase.

Build dynamic navigation and compute statistics with faceting.

Use prospective search to find the queries that match a given document (prospective search is a push model: the users' queries are stored in the index, and when a new document is added, it is matched against the stored queries. This suits scenarios with regular updates, such as news and blogs).
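As an illustration of combining simple queries into a complex one, the sketch below assembles a query DSL body as plain Python dictionaries. The helper functions and the field names (`author`, `year`, `language`) are assumptions made up for this example; `term`, `range`, and `bool` are query types of the DSL.

```python
import json


def term(field, value):
    """A simple term query on one field."""
    return {"term": {field: value}}


def range_query(field, gte, lte):
    """A range query with inclusive bounds."""
    return {"range": {field: {"gte": gte, "lte": lte}}}


def bool_query(must=(), should=(), must_not=()):
    """Combine simple queries into one Boolean query, skipping empty clauses."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if should:
        clauses["should"] = list(should)
    if must_not:
        clauses["must_not"] = list(must_not)
    return {"bool": clauses}


# Hypothetical search: books by one author from a given period, excluding one language.
search_body = {
    "query": bool_query(
        must=[term("author", "tolstoy"), range_query("year", 1860, 1880)],
        must_not=[term("language", "fr")],
    )
}
payload = json.dumps(search_body)
```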

The key point about querying is that it is not a simple, single-step process. It is divided into two phases: the scatter (query) phase and the gather (result merging) phase. In the scatter phase, the query is executed on every shard; in the gather phase, the results from the shards are merged, sorted, and then returned to the user.

Users can control how queries are scattered and gathered by specifying the search type.
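The two phases can be sketched with in-memory lists standing in for shards. This is a toy model, not the ES implementation: the score is simply how often the term occurs in the text, and the documents are made up for this example.

```python
import heapq


def query_shard(shard_docs, term, size):
    """Scatter phase: one shard scores its own documents and returns its local top hits."""
    hits = [(doc["text"].split().count(term), doc["id"]) for doc in shard_docs]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(size, hits)


def search(shards, term, size):
    """Gather phase: merge the per-shard hits and keep the global top `size`."""
    partial_results = [query_shard(shard, term, size) for shard in shards]
    merged = heapq.nlargest(size, (hit for hits in partial_results for hit in hits))
    return [doc_id for _score, doc_id in merged]


# Hypothetical data: two shards with two documents each.
shards = [
    [{"id": "a", "text": "es es search"}, {"id": "b", "text": "lucene"}],
    [{"id": "c", "text": "es cluster"}, {"id": "d", "text": "es es es"}],
]
top = search(shards, "es", 2)
```

Each shard only has to return its own best `size` hits, because no document outside a shard's local top can appear in the global top.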

Index parameter settings

ES index parameters are configured automatically.

The document structure and the field types are detected automatically. ES also lets users override the default configuration.

For example, you can configure many parameters yourself: define the document structure of an index through mappings, set the number of shards and replicas, configure the text analysis components, and so on.
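A request body that sets such parameters explicitly might look like the sketch below. The type name and the field definitions are assumptions made up for this example. The type-keyed `mappings` layout and the `string` field type match the older, typed ES versions this article describes; newer versions differ, so treat the details as version-dependent.

```python
import json


def build_create_index_body(shards, replicas, doc_type, properties):
    """Assemble the settings and mappings for creating an index."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        # Typed mapping layout (older ES versions); an assumption for this sketch.
        "mappings": {
            doc_type: {"properties": properties},
        },
    }


# Hypothetical index definition.
create_body = build_create_index_body(
    shards=2,
    replicas=1,
    doc_type="book",
    properties={"title": {"type": "string"}, "year": {"type": "integer"}},
)
create_payload = json.dumps(create_body)
```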

Cluster management and monitoring

Through the management and monitoring part of the API, users can change cluster settings, such as adjusting the node discovery mechanism or changing the shard allocation policy of an index. Users can also view the cluster state, as well as information and statistics for each node and index. The cluster monitoring API is very extensive.

Powerful User Query Language DSL

TF/IDF scoring formula

This is the scoring formula that Lucene actually applies. If you only want to tune the relevance of query results, you do not need to understand how it works, but anyone working with search benefits from knowing it.

The Lucene conceptual scoring formula

It combines the Boolean model and the vector space model of information retrieval. (This is skipped for now.)
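To give a feel for the TF/IDF part, the sketch below uses the term frequency and inverse document frequency defaults of Lucene's classic TF/IDF similarity: tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)). The `simplified_score` function is an assumption made for this example: it sums tf * idf^2 per query term and deliberately leaves out the other factors of the full formula (coordination factor, query normalization, boosts, and field norms).

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a higher value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def simplified_score(query_terms, doc_tokens, doc_freqs, num_docs):
    """Sum tf * idf^2 over the query terms found in the document (simplified)."""
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq:
            score += tf(freq) * idf(doc_freqs[term], num_docs) ** 2
    return score


# Hypothetical statistics: "es" appears in 9 of 10 documents, "gateway" in 1 of 10.
doc_freqs = {"es": 9, "gateway": 1}
common = simplified_score(["es"], ["es", "node"], doc_freqs, 10)
rare = simplified_score(["gateway"], ["gateway", "node"], doc_freqs, 10)
```

With the same term count, the rare term contributes more to the score than the common one, which is the core intuition behind TF/IDF.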


Scoring and ranking from the ES perspective

Most importantly, ES, which is built on Lucene, lets users change the default scoring algorithm. But ES is more than a thin wrapper around Lucene, because document ranking in ES does not depend solely on Apache Lucene's scoring algorithm. ES implements several query types that give full control over how documents are scored, and it lets you customize scoring with scripts.

Query rewriting mechanism

If you have used different query types, such as prefix queries and wildcard queries, you should know that in essence any of them can be viewed as a set of term queries. Through query rewriting, ES rewrites the user's query to ensure performance. From the Lucene perspective, the rewrite process transforms the original, expensive query object into a series of cheaper query objects.

Prefix query:

For example:

Suppose we want to find every document containing a term that starts with the letter j. The requirement is very simple, and the query is run against the clients index.
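The idea of rewriting a prefix query can be sketched as expanding the prefix against the sorted list of indexed terms and emitting one term query per match. This is a simplified illustration and an assumption about one possible rewrite; Lucene offers several rewrite methods and does not necessarily produce exactly this structure. The field name `name` and the term list are made up for this example.

```python
from bisect import bisect_left


def rewrite_prefix_query(field, prefix, sorted_terms):
    """Expand a prefix into a Boolean query made of individual term queries."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for indexed_term in sorted_terms[start:]:
        if not indexed_term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matches.append(indexed_term)
    return {"bool": {"should": [{"term": {field: t}} for t in matches]}}


# Hypothetical term dictionary of the index, already sorted.
indexed_terms = ["adam", "jack", "jane", "joe", "mike", "zoe"]
rewritten = rewrite_prefix_query("name", "j", indexed_terms)
```

The cost of the rewritten query grows with the number of terms that share the prefix, which is why short prefixes on large indexes can be expensive.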

Re-scoring of query results

Some scenarios require re-scoring the documents returned by a query. The reasons for re-scoring vary.

One reason is performance: re-scoring the entire ordered result set is expensive, so usually only a subset of the result set is re-scored.

Understanding re-scoring

In ES, re-scoring is the process of scoring a limited number of query results again. This means that ES reorders the top N documents of the result according to the new scoring rules.

Example

Structure of a query with rescore:

Parameters for re-scoring

In the Rescore object of the query statement, the user can add the following parameters

window_size: the N mentioned above, that is, the number of documents on each shard that are re-scored.

query_weight: default 1. The score of the original query is multiplied by query_weight before it is added to the rescore score.

rescore_query_weight: default 1. The rescore score is multiplied by this value before it is added to the score of the original query.

rescore_mode: default total. Introduced in ES 0.90.0 to specify how the score of a re-scored document is computed. Possible values: total, max, min, avg, and multiply.

total: the final score is the sum of the original query score and the rescore score;

max: the final score is the maximum of the original query score and the rescore score;

min: the final score is the minimum of the original query score and the rescore score;

avg: the final score is the average of the original query score and the rescore score;

multiply: the scores of the two queries are multiplied.

For example, with rescore_mode set to total, the final document score is original_query_score * query_weight + rescore_query_score * rescore_query_weight.
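The parameters above can be tied together in a small sketch. It is a toy model rather than the ES implementation: the function names and the sample hits are made up for this example, and applying both weights before combining in every mode is an assumption of this sketch.

```python
def combine_scores(query_score, rescore_score,
                   query_weight=1.0, rescore_query_weight=1.0, mode="total"):
    """Combine the original score and the rescore score according to `mode`."""
    q = query_score * query_weight
    r = rescore_score * rescore_query_weight
    if mode == "total":
        return q + r
    if mode == "max":
        return max(q, r)
    if mode == "min":
        return min(q, r)
    if mode == "avg":
        return (q + r) / 2
    if mode == "multiply":
        return q * r
    raise ValueError(f"unknown mode: {mode}")


def rescore(hits, rescore_fn, window_size, **options):
    """Re-score only the first `window_size` hits; the rest keep their order.

    `hits` is a list of (doc_id, score) pairs sorted by descending score.
    """
    head = [(doc_id, combine_scores(score, rescore_fn(doc_id), **options))
            for doc_id, score in hits[:window_size]]
    head.sort(key=lambda hit: hit[1], reverse=True)
    return head + hits[window_size:]


# Hypothetical first-pass hits and second-pass scores.
hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
second_pass = {"a": 0.0, "b": 5.0, "c": 9.0}
reordered = rescore(hits, second_pass.get, window_size=2)
```

Document `c` keeps its position even though its second-pass score is the highest, because it falls outside the window.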

Sorting of query results

When you send a query to ES, the returned documents are sorted by default by their computed score. This is usually what the user wants: the first document in the result set is the one that best matches the query. However, sometimes we want to change this ordering.

  

  

  

  

Update API

When a new document is added to the index, the underlying Lucene library analyzes each field and generates a token stream. The token stream is filtered and used to build the inverted index. In this process, some information from the input text is discarded as unnecessary. This may be the positions of certain words, stop words, words replaced with synonyms, or word endings. This is why documents in Lucene cannot be modified in place: each time a document is modified, all of its fields must be added to the index again. ES stores and retrieves the original document data through the _source pseudo-field.

When we want to update a document, ES reads the data from the _source field, applies the changes, and finally adds the updated document to the index. This assumes that the _source field is enabled. An update command can update only one document; updating documents by query is not yet available.

Update:

  

Create or delete a document using the Update API

The update API can not only modify a single field but also operate on the entire document.

The upsert feature makes it possible to address a nonexistent document, which is then created:

  

If the document exists, the command sets the value of the year field; otherwise the document is created. The new document contains the title field defined in upsert. The command above can also be written using a script:

  

The update API also lets users conditionally delete the entire document.
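The update and upsert behavior described in this section can be modeled with a plain dictionary standing in for the index. This is a sketch of the semantics only, not the ES implementation and not the commands from the original article; the function name, document ids, and field values are assumptions made up for this example.

```python
def update(index, doc_id, doc=None, upsert=None):
    """Apply a partial update, or create the document from `upsert` if it is missing.

    `index` maps a document id to its stored source (a stand-in for _source).
    """
    source = index.get(doc_id)
    if source is None:
        if upsert is None:
            raise KeyError(f"document {doc_id!r} does not exist")
        index[doc_id] = dict(upsert)   # create the document from the upsert body
        return "created"
    updated = dict(source)             # read the stored source
    updated.update(doc or {})          # apply the changes
    index[doc_id] = updated            # add the whole document to the index again
    return "updated"


# Hypothetical index with one stored document.
index = {"1": {"title": "Crime and Punishment", "year": 1800}}
first = update(index, "1", doc={"year": 1866}, upsert={"title": "Unknown"})
second = update(index, "2", doc={"year": 1866}, upsert={"title": "Unknown"})
```

Note that the existing document is never edited in place: the stored source is copied, changed, and written back as a whole, which mirrors the read, modify, and reindex cycle described earlier.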

          

Optimizing queries with filters:

ES supports many types of queries, but deciding which documents match and which should be presented to the user is not the whole story. In the ES query DSL, most of the available queries have their own identifier.

  

Filters (filters) and caches

    

 
