Elasticsearch 2 (10) - Under the Hood of Elasticsearch (In-Depth Understanding of the Shard and the Lucene Index)

Summary

This post introduces the internal principles of an Elasticsearch shard from the bottom up and answers the question: why do you need to understand how Lucene works internally in order to use Elasticsearch well?

    • Understand the cost of the Elasticsearch APIs

      • Build a fast search application
      • Don't commit on every operation
      • Know when to use stored fields and when to use doc values
      • Recognize when Lucene may not be the right tool
    • Understand how the index is stored

      • Why term vectors can take up half of the index size
      • Why deleting 20% of the documents does not immediately shrink the index footprint
Version

Elasticsearch version: elasticsearch-2.2.0

The index

It is no exaggeration to say that if you do not understand how the Lucene index works, you cannot claim to understand Lucene at all, and that is especially true for Elasticsearch. Knowing the index structure means you:

    • can make search faster
      • at the cost of storing redundant information
      • by indexing according to your queries
    • can balance the trade-off between update speed and query speed

      It is important to match how you index to the search scenario:

      • Grep vs. full-text search (full-text indexing)
      • Prefix queries vs. edge n-grams
      • Phrase queries vs. shingles

      If you need prefix queries (right-hand fuzzy matching) or phrase queries, the out-of-the-box behavior may not be appropriate and special optimizations such as edge n-grams or shingles are needed. (In 2.x, Elasticsearch does support these scenarios, depending on how you use it; see "Search in Depth" in the official guide.) A hedged sketch of both query types follows this list.

    • The speed of Lucene indexing
      • http://people.apache.org/~mikemccand/lucenebench/indexing.html
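
      As a rough illustration of these two query types, here is a minimal Lucene sketch (assuming Lucene 5.x, the version underlying Elasticsearch 2.2); the field name "body" and the search terms are made up for the example, and the IndexSearcher is assumed to be opened elsewhere:

        import java.io.IOException;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.PhraseQuery;
        import org.apache.lucene.search.PrefixQuery;
        import org.apache.lucene.search.TopDocs;

        public class QueryShapes {

            // Prefix query (right-hand fuzzy match): matches every term starting with "luc".
            // Without an edge n-gram analyzer this expands to all matching terms at query
            // time, which is why edge n-grams are the usual optimization.
            static TopDocs prefixSearch(IndexSearcher searcher) throws IOException {
                PrefixQuery query = new PrefixQuery(new Term("body", "luc"));
                return searcher.search(query, 10);
            }

            // Phrase query: matches "lucene" immediately followed by "action".
            // Indexing shingles (adjacent word pairs) makes such queries cheaper
            // at the cost of a larger index.
            static TopDocs phraseSearch(IndexSearcher searcher) throws IOException {
                PhraseQuery query = new PhraseQuery.Builder()
                        .add(new Term("body", "lucene"))
                        .add(new Term("body", "action"))
                        .build();
                return searcher.search(query, 10);
            }
        }
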
Create an index

Take two simple documents as an example: "Lucene in Action" and "Databases".

Assume "Lucene in Action" contains the terms

{index, term, data, lucene}

and "Databases" contains the terms

{index, data}
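
To make this concrete, here is a minimal sketch (not from the original article, assuming Lucene 5.x) of indexing these two documents with the Lucene Java API; the index path and the field name "body" are illustrative only:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BuildExampleIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/example-index"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                // Document 0: "Lucene in Action" -> terms {index, term, data, lucene}
                Document doc0 = new Document();
                doc0.add(new TextField("body", "index term data lucene", Field.Store.YES));
                writer.addDocument(doc0);

                // Document 1: "Databases" -> terms {index, data}
                Document doc1 = new Document();
                doc1.add(new TextField("body", "index data", Field.Store.YES));
                writer.addDocument(doc1);

                // One commit for the whole batch; committing after every document is
                // expensive because each commit fsyncs and can create a new segment.
                writer.commit();
            }
        }
    }

The resulting inverted index maps each term to the documents that contain it: index -> {0, 1}, data -> {0, 1}, term -> {0}, lucene -> {0}.
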
  • Tree structure

    Good for range queries and ordered traversal.
    Query time complexity is O(log(n)).

    A typical relational database index is roughly a B+ tree of this kind, but Lucene uses a different storage structure.

  • Inverted index

    Lucene's main storage structure is the inverted index: internally, a sorted term dictionary (an array) that maps every term to the documents containing it.

    In Lucene, this structure lives inside a segment.

    • Term ordinal: the position of a term in the sorted dictionary
    • Term dict: the text of the term itself
    • Postings list: the sequence of IDs of the documents that contain the term
    • Doc ID: a unique identifier for each document
    • Document: the stored content of each document

    An important difference between the two structures is what happens when documents are added or deleted: a tree structure is modified in place and is constantly changing, whereas an inverted index segment can remain unchanged (immutable).

  • What about inserts?
    • An insert creates a new segment
    • When there are too many segments, the system merges them
      This process is essentially a merge sort; it has to:

      • concatenate the stored documents
      • merge the term dictionaries
      • merge the postings lists

  • What about deletes?
    • A delete only sets a flag bit on the document
    • Deleted documents are ignored during search and during merges
    • When many deletions accumulate, a merge is triggered automatically
    • The space occupied by documents marked as deleted is reclaimed only after the merge completes

  • What are the pros and cons?
    • Updating a document actually means writing it into a new segment, so

      • updating documents one at a time is expensive; use bulk updates
      • all write operations are sequential (append-only)
    • Segments are never modified in place

      • filesystem-cache friendly
      • no locking issues
    • Terms are deduplicated

      • saves a lot of space for high-frequency terms
    • Each document is identified by a unique ordinal (doc ID)

      • convenient for cross-API communication
      • Lucene can use several indexes in a single query
    • Each term is identified by a unique ordinal as well

      • important for sorting: compare numbers instead of strings
      • important for faceting (faceted search)
  • A strength of the Lucene index: index intersection

    Many databases cannot use more than one index for a single query, but Lucene can.

    • Lucene maintains a skip list for each postings list. To search for "red shoe", the postings lists for "red" and "shoe" are intersected by leapfrogging through them with the help of the skip lists; a minimal sketch of this idea follows this list.

    • Many databases instead pick the single most selective index and ignore the others.

    For the detailed intersection algorithm and how the skip list is used, see "Faster postings list intersection via skip pointers" (nlp.stanford.edu).
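
    To illustrate the leapfrog idea, here is a simplified model in plain Java (a sketch, not Lucene's actual conjunction implementation): it intersects two sorted postings lists by always advancing whichever list is behind, which is what Lucene does through DocIdSetIterator.advance(target), where the skip list lets a single advance jump over many entries:

      import java.util.ArrayList;
      import java.util.List;

      public class LeapFrogIntersection {

          // Intersect two sorted postings lists (doc IDs) by "leapfrogging":
          // always advance the list that is behind up to the other's current doc ID.
          static List<Integer> intersect(int[] red, int[] shoe) {
              List<Integer> result = new ArrayList<>();
              int i = 0, j = 0;
              while (i < red.length && j < shoe.length) {
                  if (red[i] == shoe[j]) {        // both lists contain this doc: a match
                      result.add(red[i]);
                      i++;
                      j++;
                  } else if (red[i] < shoe[j]) {  // "red" is behind: advance it
                      i++;                        // (a skip list would jump several entries here)
                  } else {                        // "shoe" is behind: advance it
                      j++;
                  }
              }
              return result;
          }

          public static void main(String[] args) {
              int[] redDocs  = {1, 3, 4, 8, 20, 21, 45};
              int[] shoeDocs = {2, 4, 9, 21, 30};
              System.out.println(intersect(redDocs, shoeDocs)); // prints [4, 21]
          }
      }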

More indexes
  • Term vectors
    • A miniature inverted index built per document
    • Use case: "more like this" (finding similar documents)
    • Also used for highlighting search results

  • Doc values (document values)
    • Column-oriented storage of document fields
    • Use cases: sorting and scoring

  • Sorted (set) doc values
    • Per-document, per-field sorted values

      • Single-valued fields: sorting
      • Multi-valued fields: faceted search

  • Faceted search (faceting)

    A facet is one of the multidimensional attributes of a thing. A book, for example, has a topic, an author, and an era. Faceted search filters and narrows search results by these attributes; it can be seen as a combination of searching and browsing, and is widely used in e-commerce, music, travel, and many other domains.

    For example, Google Music's song-picking page divides songs into facets such as rhythm, tone, era, and genre.

    • Count the documents that match the search, per facet value

      • For example, an e-commerce site counts matching clothes by style, length, size, and color.
    • Simple (naive) solution

      • A hash table mapping value to count
      • O(#docs) ordinal lookups
      • O(#docs) value lookups
    • Lucene's solution

      • A hash table mapping ordinal to count
      • Values are resolved only at the end
      • O(#docs) ordinal lookups
      • O(#values) value lookups

    Because ordinals are dense, the counts can simply be kept in a plain array indexed by ordinal, as the sketch below illustrates.
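
    A rough model of this approach in plain Java (a sketch, not the real Lucene implementation): because ordinals are dense integers, the per-value counts live in a plain array, and the ordinal-to-string lookup happens only once at the end, for the values that were actually seen:

      public class FacetCountingByOrdinal {
          public static void main(String[] args) {
              // Sorted term dictionary for a "color" field; the index of a value is its ordinal.
              String[] ordToValue = {"black", "blue", "red", "white"};

              // Ordinal of the "color" value for each document (what sorted doc values provide).
              int[] docToOrd = {2, 2, 0, 3, 2, 1, 0, 2};

              // Doc IDs that matched the search.
              int[] matchingDocs = {0, 1, 3, 4, 7};

              // O(#docs) ordinal lookups: one array read and one increment per matching doc.
              int[] counts = new int[ordToValue.length];
              for (int doc : matchingDocs) {
                  counts[docToOrd[doc]]++;
              }

              // O(#values) value lookups: resolve ordinals to strings only at the end.
              for (int ord = 0; ord < counts.length; ord++) {
                  if (counts[ord] > 0) {
                      System.out.println(ordToValue[ord] + ": " + counts[ord]);
                  }
              }
          }
      }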

  • How do I use the API?

    The high-level Elasticsearch APIs are built on the Lucene API, which includes the following basic calls:

    API                | Use                                 | Method
    -------------------|-------------------------------------|----------------------------
    Inverted index     | Term -> doc IDs, positions, offsets | AtomicReader.fields
    Stored fields      | Summaries of search results         | IndexReader.document
    Live docs          | Ignoring deleted docs               | AtomicReader.getLiveDocs
    Term vectors       | More like this                      | IndexReader.getTermVectors
    Doc values / norms | Sorting / faceting / scoring        | AtomicReader.get*Values
  • Summary
    • The same data is stored up to four times, each time in a different structure

      • This is not a waste of space
      • Immutability makes the duplicated data easy to manage
    • Stored fields vs. doc values

      • Two structures optimized for different access patterns (a sketch contrasting them follows this list)

        1. Fetching many fields from a few documents: stored fields
        2. Fetching a few fields from many documents: doc values
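
    A minimal Lucene 5.x sketch contrasting the two structures (the index path, field names, and values are made up): a stored "title" field for returning a handful of hits, and a numeric doc values "price" field for sorting across many documents:

      import java.nio.file.Paths;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.document.NumericDocValuesField;
      import org.apache.lucene.document.StoredField;
      import org.apache.lucene.document.TextField;
      import org.apache.lucene.index.DirectoryReader;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.IndexWriterConfig;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.MatchAllDocsQuery;
      import org.apache.lucene.search.ScoreDoc;
      import org.apache.lucene.search.Sort;
      import org.apache.lucene.search.SortField;
      import org.apache.lucene.search.TopDocs;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;

      public class StoredFieldsVsDocValues {
          public static void main(String[] args) throws Exception {
              Directory dir = FSDirectory.open(Paths.get("/tmp/books-index"));
              try (IndexWriter writer =
                       new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                  Document doc = new Document();
                  doc.add(new TextField("body", "lucene in action", Field.Store.NO)); // inverted index
                  doc.add(new StoredField("title", "Lucene in Action")); // row-oriented: few docs, many fields
                  doc.add(new NumericDocValuesField("price", 42L));      // column-oriented: many docs, one field
                  writer.addDocument(doc);
              }

              try (DirectoryReader reader = DirectoryReader.open(dir)) {
                  IndexSearcher searcher = new IndexSearcher(reader);
                  // Sorting reads the "price" column from doc values for every candidate document.
                  Sort byPrice = new Sort(new SortField("price", SortField.Type.LONG));
                  TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10, byPrice);
                  // Stored fields are fetched only for the few documents actually returned.
                  for (ScoreDoc hit : hits.scoreDocs) {
                      System.out.println(searcher.doc(hit.doc).get("title"));
                  }
              }
          }
      }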

Secrets of the file format
  • Rules you must not forget
    • Conserve file handles

      Do not use a separate file for every field of every document

    • Avoid disk seeks

      A disk seek costs roughly ~10 ms

    • Do not ignore the filesystem cache

      Random access to small files is still acceptable, because they stay cached

    • Use lightweight compression

      • Less I/O
      • Smaller indexes
      • Filesystem-cache friendly
  • Codecs (encoding and decoding)
    • The file format depends on the codec used
    • The default codec already balances memory use against speed

      Do not use RAMDirectory, MemoryPostingsFormat, or MemoryDocValuesFormat

    • More information:

      http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html

  • Appropriate compression techniques
    • Bit packing / VInt encoding

      • Postings lists
      • Numeric doc values
    • LZ4

      • code.google.com/p/lz4
      • A lightweight compression algorithm
      • Used for stored fields and term vectors
    • FSTs (finite state transducers)

      • Essentially a Map<String, ?>
      • Keys share prefixes and suffixes
      • Used for the terms index
  • Behind a TermQuery
    1. Terms index

      Locate the term in the index

      • An FST holds the term prefixes in memory
      • It gives the term's offset into the terms dictionary
      • It can fail fast when the term does not exist

    2. Terms dictionary

      • Jump to the offset in the dictionary

        The block tree is compressed using shared prefixes and reads like a "dict"

      • Scan sequentially until the exact term is found

    3. Postings list

      • Jump to the offset of the postings list
      • Delta encoding with Frame Of Reference (FOR); a sketch of the idea appears at the end of this section

        1. Delta-encode the doc IDs
        2. Split them into blocks of N=128 values
        3. Bit-pack each block
        4. Encode any leftover documents with VInt

    4. Stored fields

      • An in-memory index over a subset of the doc IDs

        Memory-efficient (monotonic) compression

        Located with binary search

      • Fields

        Stored sequentially

        Compressed in 16 KB blocks

  • Summary of the query process
    • 2 disk seeks per queried field
    • 1 disk seek per returned document (stored fields)

    • In practice the terms dict and postings lists are usually in the filesystem cache

      In that case no disk seek happens at all

    • The "pulsing" optimization

      • For terms that occur in a single document, the postings list is inlined in the terms dict
      • That saves 1 disk seek
      • It always applies to primary-key-style unique fields
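
    The sketch below (plain Java, a simplified model rather than Lucene's actual codec) walks through the Frame Of Reference idea from step 3 above: delta-encode a block of doc IDs and see how few bits per value the bit-packed block would need:

      public class FrameOfReferenceSketch {

          static final int BLOCK_SIZE = 128; // Lucene packs postings in blocks of 128 deltas

          // Delta-encode a sorted block of doc IDs: store the gap to the previous ID
          // instead of the absolute ID, which keeps the numbers small and cheap to bit-pack.
          static int[] deltaEncode(int[] docIds) {
              int[] deltas = new int[docIds.length];
              int previous = 0;
              for (int i = 0; i < docIds.length; i++) {
                  deltas[i] = docIds[i] - previous;
                  previous = docIds[i];
              }
              return deltas;
          }

          // Bit packing uses just enough bits to hold the largest delta in the block.
          static int bitsPerValue(int[] deltas) {
              int max = 0;
              for (int delta : deltas) {
                  max = Math.max(max, delta);
              }
              return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
          }

          public static void main(String[] args) {
              int[] docIds = {1, 3, 4, 8, 20, 21, 45, 46, 58, 100};
              int[] deltas = deltaEncode(docIds);  // {1, 2, 1, 4, 12, 1, 24, 1, 12, 42}
              int bits = bitsPerValue(deltas);     // 6 bits per value instead of 32
              System.out.println(bits + " bits per value for this block");
              // A full block of BLOCK_SIZE deltas would be bit-packed like this; a final
              // partial block such as this one is encoded with variable-length ints (VInt).
          }
      }
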
Performance

As the index grows, system performance typically drops twice, probably because

    1. The index outgrows the size of the filesystem cache

      Stored fields are no longer served from the cache

    2. Eventually the terms dict and postings lists no longer fit in the cache either

Reference

Reference sources:

SlideShare: What is in a Lucene index?

YouTube: What is in a Lucene index? (Adrien Grand, Software Engineer, Elasticsearch)

SlideShare: Elasticsearch from the bottom up

YouTube: Elasticsearch from the bottom up

Wikipedia: Document-term matrix

Wikipedia: Search engine indexing

Skip list

Stanford NLP (nlp.stanford.edu): Faster postings list intersection via skip pointers

Faceted search

Stack Overflow: How does a search index work when querying many words?

Stack Overflow: How does Lucene calculate intersection of documents so fast?

Lucene and its magical indexes

End
