Repost: Under the hood of Elasticsearch (the story of search, illustrated)

Source: Internet
Author: User

ElasticSearch 2 (9): a summary of the story under Elasticsearch (search, illustrated)

First top-down, then bottom-up, this article introduces the underlying working principles of Elasticsearch and tries to answer the following questions:

    • Why doesn't my search for *foo-bar* match foo-bar?

    • Why can adding more documents make the index smaller (more compressed)?

    • Why does Elasticsearch occupy a lot of memory?

Version

Elasticsearch version: elasticsearch-2.2.0

An Elasticsearch cluster in the cloud

The box in the cluster

Each white square box in the cloud represents a node.

Between nodes

Inside one or more nodes, multiple green squares grouped together form an Elasticsearch index.

Small squares in the index

Under an index, the small green squares distributed across multiple nodes are called shards.

Shard = Lucene index

An Elasticsearch shard is essentially a Lucene index.

Lucene is a full-text search library (there are many other kinds of search libraries), and Elasticsearch is built on top of Lucene. The rest of the story is really about how Elasticsearch works on top of Lucene.

Lucene's mini-indexes, illustrated: segments

A Lucene index is made up of many small segments, which we can think of as mini-indexes inside Lucene.

Inside a segment

A segment contains many data structures:

    • Inverted Index
    • Stored fields
    • Document Values
    • Cache

The most important one: the inverted index

An inverted index mainly consists of two parts:

    1. An ordered dictionary of terms (each term together with the frequency at which it appears).
    2. A postings list for each term, that is, the list of documents that contain that term.

When we search, the search input is first tokenized, and then each term is looked up in the dictionary to find the documents related to the search.
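
To make these two parts concrete, here is a minimal Python sketch (not Lucene's actual data structures): a dictionary mapping each term to its postings list, plus a lookup that tokenizes the query and intersects the postings.

    from collections import defaultdict

    docs = {
        0: "the fury of the storm",
        1: "the quiet sea",
        2: "fury and calm",
    }

    inverted_index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(text.split())):    # tokenize, deduplicate terms
            inverted_index[term].append(doc_id)   # postings list for this term

    def search(query):
        # Tokenize the query, look each term up, intersect the postings lists.
        postings = [set(inverted_index.get(term, [])) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("the fury"))   # {0}: the only document containing both terms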

Query "The Fury"

Auto-completion (prefix search)

If you want to find all terms that start with the letter "c", such as "choice" and "coming", you can simply locate them in the sorted term dictionary of the inverted index with a binary search.
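
As a rough illustration, with a sorted term dictionary a prefix lookup is just a binary search followed by a short scan; the term list below is made up.

    import bisect

    terms = sorted(["calm", "choice", "coming", "fury", "sea", "storm"])

    def prefix_search(prefix):
        # Binary-search to the first term >= prefix, then scan while the prefix matches.
        start = bisect.bisect_left(terms, prefix)
        matches = []
        for term in terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        return matches

    print(prefix_search("c"))   # ['calm', 'choice', 'coming']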

Expensive lookups

If you want to find all terms that contain the letters "our", the system has to scan the entire inverted index, which is very expensive.

If we want to optimize this case, the question becomes how to generate suitable terms at index time.

Transforming the problem

For the problem above, there are several possible solutions:

    • suffix → xiffus

      If we want to search by suffix, we can index the reversed term as well (see the sketch after this list).

    • (60.6384, 6.5017) → U4U8GYYKK

      For geo location data, you can convert the coordinates to a geohash.

    • 123 → {1-hundreds, 12-tens, 123}

      For simple numbers, you can generate terms at several precisions.
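
Here is a small sketch of such index-time term generation; the helper functions are invented for illustration and are not part of Lucene or Elasticsearch.

    def suffix_terms(term):
        # Also index the reversed term, so a suffix query becomes a cheap prefix query.
        return [term, term[::-1]]                       # "suffix" -> ["suffix", "xiffus"]

    def number_terms(n):
        # Index a number at several precisions, so range queries can match fewer terms.
        digits = str(n)
        return [digits[:i] for i in range(1, len(digits) + 1)]   # 123 -> ["1", "12", "123"]

    print(suffix_terms("suffix"))   # ['suffix', 'xiffus']
    print(number_terms(123))        # ['1', '12', '123']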

Handling spelling errors

A Python library builds a tree-shaped state machine that encodes the misspelling information for a word, and this is used to handle spelling errors.
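
The sketch below is not that library (nor Lucene's automata); it only illustrates the underlying idea of matching terms within a small edit distance.

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def fuzzy_lookup(word, terms, max_edits=1):
        return [t for t in terms if edit_distance(word, t) <= max_edits]

    print(fuzzy_lookup("furry", ["fury", "ferry", "query"]))   # ['fury', 'ferry']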

Stored fields: field lookup

The inverted index does not help when we want to retrieve the actual contents of a document, for example a document with a particular title, so Lucene provides another data structure, stored fields, for this purpose. Essentially, a stored field is a simple key-value pair. By default, Elasticsearch stores the JSON source of the entire document as a stored field.
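
Conceptually, stored fields behave like a per-document key-value lookup, separate from the inverted index, as in this toy sketch (field names invented):

    stored_fields = {
        0: {"_source": {"title": "The Fury", "author": "anonymous"}},
        1: {"_source": {"title": "The Quiet Sea", "author": "anonymous"}},
    }

    def fetch(doc_id):
        # Given a doc id found via the inverted index, return its stored JSON source.
        return stored_fields[doc_id]["_source"]

    print(fetch(0)["title"])   # 'The Fury'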

Document values: for sorting and aggregation

Even so, the structures above still do not handle operations such as sorting, aggregation, and faceting well, because they would force us to read a lot of information we do not need.

So another data structure, Document Values, solves this problem. It is essentially columnar storage: values of the same field and type are stored together in a highly optimized layout.

To improve efficiency, Elasticsearch can load all of the document values under an index into memory, which greatly increases access speed but also consumes a lot of memory.
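
A toy sketch of why a columnar layout helps: an aggregation touches only the one column it needs, and only for the matching documents (field names and values are made up).

    doc_values = {
        "year":   [1998, 2004, 2004, 2011],   # list position == doc id
        "rating": [4.2, 3.9, 4.7, 4.0],
    }

    def average(field, doc_ids):
        # An aggregation reads only the one column it needs, for the matching docs.
        column = doc_values[field]
        return sum(column[d] for d in doc_ids) / len(doc_ids)

    print(average("rating", [1, 2, 3]))   # 4.2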

In summary, these data structures (the inverted index, stored fields, document values) and their caches all live inside a segment.

When a search occurs

When a search happens, Lucene searches every segment, merges the results from each segment, and finally returns the merged result to the client.

Several features of Lucene play an important role in this process:

    • Segments are immutable

      • Delete? When a deletion occurs, Lucene only marks the document as deleted; the document itself stays in place and does not change.

      • Update? An update is therefore essentially a delete followed by re-indexing (re-index) the document.

    • Compression everywhere

      Lucene is very good at compressing data; basically every compression technique found in textbooks can be found somewhere in Lucene.

    • Caching everything

      Lucene also caches all of the information, which greatly improves its query efficiency.

The caching story

When Elasticsearch indexes a document, the corresponding caches are created for it, and the data is refreshed periodically (every second by default), after which the document becomes searchable.

As time goes on, we accumulate many segments,

so Elasticsearch merges these segments; in the merge process, old segments are eventually removed.

This is why adding more documents can actually make the index take up less space: it triggers a merge, which can result in more compression.

Here is an example.

Two segments are about to be merged.

These two segments are merged into a new segment, and the old ones are then deleted.

At this point the new segment is cold in the cache, while most of the other segments remain unchanged and stay warm.

The scenario above happens constantly inside a Lucene index.
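
A toy sketch of the merge step, under the simplifications above: live documents are copied into the new segment, documents that were only marked as deleted are finally dropped, and the old segments can then be removed.

    segment_a = {"docs": {0: "fury", 1: "storm"}, "deleted": {1}}
    segment_b = {"docs": {2: "sea"}, "deleted": set()}

    def merge(*segments):
        merged = {"docs": {}, "deleted": set()}
        for seg in segments:
            for doc_id, body in seg["docs"].items():
                if doc_id not in seg["deleted"]:   # docs marked deleted are purged here
                    merged["docs"][doc_id] = body
        return merged

    new_segment = merge(segment_a, segment_b)
    print(new_segment["docs"])   # {0: 'fury', 2: 'sea'} -- the deleted doc is gone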

Search in Shard

The way Elasticsearch searches across shards is similar to the way Lucene searches across segments.

Unlike searching within Lucene segments, shards may be distributed on different nodes, so when searching and returning results, all of the information has to be transmitted over the network.

It is important to note that:

1 search across 2 shards = 2 separate shard searches

Handling of log files

When we want to search logs generated on a particular date, splitting the log data into time-based indices (chunked by timestamp) greatly improves search efficiency.

It is also handy when we want to delete old data: we just delete the old index.

In the example above, each index has two shards.
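
A hedged sketch of this pattern with the official elasticsearch-py client; the index naming scheme, fields, and retention period are just example choices.

    from datetime import datetime, timedelta
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    def index_log(event):
        # Route each log event into a daily index derived from its timestamp.
        index_name = event["@timestamp"].strftime("logs-%Y.%m.%d")
        es.index(index=index_name, doc_type="log", body=event)

    def drop_expired_index(retention_days=30):
        # Deleting old data is just deleting a whole old index (run once per day).
        expired = datetime.utcnow() - timedelta(days=retention_days)
        es.indices.delete(index=expired.strftime("logs-%Y.%m.%d"), ignore=[404])

    index_log({"@timestamp": datetime.utcnow(), "message": "user logged in"})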

How to scale

A shard is never split further, but a shard can be moved to a different node.

So, if the pressure on the cluster nodes grows to a certain point, we may consider adding new nodes; that may force us to re-index all of the data, which we would rather avoid. This is why we need to think at planning time about how to strike a balance between having too few and too many nodes.

Node assignment and shard optimization
    • Assign better-performing machines to the nodes that hold more important indices
    • Make sure every shard has a replica copy

Routing

Every node keeps a copy of the routing table, so a request sent to any node can be forwarded by Elasticsearch to the node holding the shard it needs for further processing.
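
Elasticsearch picks the shard as hash(routing) % number_of_primary_shards, where the routing value defaults to the document _id; the sketch below uses an ordinary CRC32 hash purely for illustration.

    import zlib

    def pick_shard(routing_value, number_of_primary_shards=2):
        # Illustrative only: Elasticsearch uses its own hash of the routing value,
        # which defaults to the document _id.
        return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

    print(pick_shard("doc-42"))   # the same id always maps to the same primary shard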

A real request

Query

The query uses the filtered type and contains a multi_match query.

Aggregation

Aggregate on the author field to get the top 10 authors and the top 10 hits per author.
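
A hedged sketch of what such a request body could look like in Elasticsearch 2.x Query DSL; the field names and the filter value are invented for illustration.

    # Hypothetical fields; Elasticsearch 2.x still supports the "filtered" query.
    body = {
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {"query": "the fury", "fields": ["title", "body"]}
                },
                "filter": {"term": {"lang": "en"}},
            }
        },
        "aggs": {
            "top_authors": {
                "terms": {"field": "author", "size": 10},               # top 10 authors
                "aggs": {"author_top_hits": {"top_hits": {"size": 10}}} # top 10 hits each
            }
        },
        "size": 10,
    }
    # es.search(index="articles", body=body)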

Request distribution

This request may be sent to any node in the cluster.

God node

That node then becomes the coordinator (coordinator) of the current request, and it decides:

    • Which nodes the request should be routed to, based on the index information
    • Which replica copies are available
    • And so on
Routing

Before the real search

Elasticsearch converts the query into a Lucene query,

and then executes it on every segment.

The filter conditions themselves are also cached.

Queries, however, are not cached, so if the same query is executed repeatedly, the application needs to do its own caching.

So

    • Filters can be used at any time
    • Queries should be used only when a relevance score is needed (a small sketch of this rule follows)
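
A small sketch of that rule of thumb in 2.x syntax, with made-up field names: the exact condition goes into the (cacheable) filter, and only the full-text part is scored.

    request_body = {
        "query": {
            "filtered": {
                "query":  {"match": {"body": "fury"}},        # scored, not cached
                "filter": {"term": {"status": "published"}},  # no scoring, cacheable
            }
        }
    }
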
Return

After the search finishes, the results are returned back up along the same path the request travelled down.

Reference

Reference Source:

Slideshare: Elasticsearch from the bottom up

Youtube: Elasticsearch from the bottom up

Wiki: Document-term matrix

Wiki: Search engine indexing

Skip list

Stanford Edu: Faster postings list intersection via skip pointers

Stackoverflow: How a search index works when querying many words?

Stackoverflow: How does Lucene calculate intersection of documents so fast?

Lucene and its magical indexes

misspellings 2.0c: a tool to detect misspellings
