Summary
This article introduces the internals of an Elasticsearch shard from the bottom up, and answers the question: why do you need to understand how Lucene works internally in order to use Elasticsearch?
Version
Elasticsearch version: elasticsearch-2.2.0
Content index
It is no exaggeration to say that if you do not understand how the Lucene index works, you do not really understand Lucene, and that applies even more to Elasticsearch.
- It makes search faster
- It may store redundant information
- How to index is decided by the queries to be served
- It trades update speed for query speed
It is important to consider the search scenario:
- Grep vs. Full-text Search (Full-text indexing)
- Prefix queries vs. Edge N-grams
- Phrase Queries vs Shingles
For prefix queries (right-truncated fuzzy matches) or phrase queries, a plain Elasticsearch setup may not be appropriate, and special optimizations such as edge n-grams or shingles are needed. (Elasticsearch 2.x supports these scenarios, depending on how you use it; see "Search in Depth".)
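As a rough illustration of the index-time alternatives listed above, the sketch below generates edge n-grams (so that a prefix query becomes an exact term lookup) and two-word shingles (so that a phrase query becomes a single term lookup). The function names `edge_ngrams` and `shingles` are made up for this example; they are not Elasticsearch or Lucene APIs.

```python
def edge_ngrams(token, min_len=1, max_len=5):
    """Return the prefixes of `token` between min_len and max_len characters."""
    upper = min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]


def shingles(tokens, size=2):
    """Return every run of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Indexing these extra terms costs space but turns prefix and phrase
# matching into plain term lookups.
print(edge_ngrams("lucene", max_len=3))    # ['l', 'lu', 'luc']
print(shingles(["red", "shoe", "box"]))    # ['red shoe', 'shoe box']
```

This is the trade-off mentioned earlier: more redundant information in the index in exchange for faster queries.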
- Lucene indexing speed
- http://people.apache.org/~mikemccand/lucenebench/indexing.html
Building an index
Take two simple documents as an example: "Lucene in Action" and "Databases".
Assume that "Lucene in Action" contains the words
{index, term, data, Lucene}
and that "Databases" contains the words
{index, data}
- Tree structure
Keeps keys ordered, which suits range queries
The time complexity of a lookup is O(log(n))
A typical relational database uses roughly this kind of structure, for example a B+ tree, but Lucene uses a different storage structure.
- Inverted index
Lucene's main storage structure is the inverted index: essentially an array whose entries are reached through a sorted term dictionary.
In Lucene, this structure lives inside a segment.
- Term ordinal: the ordinal number of a term
- Term dict: the text of the term
- Postings list: the sequence of IDs of the documents that contain the term
- Doc ID: the unique identifier of each document
- Document: the stored content of each document
An important difference between the two structures is that when documents are added or deleted, a tree has to be modified in place and is therefore constantly changing, whereas an inverted index can remain unchanged (immutable).
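To make the terms above concrete, here is a minimal sketch that builds an inverted index for the two example documents. The function name `build_inverted_index` and the choice of doc IDs 0 and 1 are assumptions for illustration; Lucene's real on-disk layout is far more compact.

```python
def build_inverted_index(docs):
    """Build a sorted term dictionary and one postings list per term.

    `docs` maps a doc ID to the list of words in that document.
    """
    postings = {}
    for doc_id in sorted(docs):
        # Deduplicate terms within a document, ignoring case.
        for term in {word.lower() for word in docs[doc_id]}:
            postings.setdefault(term, []).append(doc_id)
    # The position of a term in the sorted dictionary is its ordinal.
    term_dict = sorted(postings)
    postings_lists = [postings[term] for term in term_dict]
    return term_dict, postings_lists


docs = {
    0: ["index", "term", "data", "Lucene"],  # "Lucene in Action"
    1: ["index", "data"],                    # "Databases"
}
term_dict, postings_lists = build_inverted_index(docs)
for ordinal, term in enumerate(term_dict):
    print(ordinal, term, postings_lists[ordinal])
```

Each printed row shows a term ordinal, the term itself, and the doc IDs in its postings list.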
- Insert?
- Delete?
- Deleting only sets a flag bit
- The system ignores deleted documents when searching and merging
- When many deletions accumulate, the system runs a merge automatically
- The storage occupied by documents marked as deleted is reclaimed once the merge completes
- Pros and cons?
When we update a document, we actually create a new segment, so
- Updating documents one at a time is expensive; use bulk updates instead
- All writes are sequential
Segments are never modified
- File system cache friendly
- No locking issues
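The delete-by-flag and merge behaviour described above can be sketched as follows. This is a toy model under stated assumptions: the `Segment` class, `search` and `merge` are hypothetical names, and the search scans documents directly instead of using an inverted index, purely to keep the example short.

```python
class Segment:
    """An immutable batch of documents plus a 'live docs' flag per document."""

    def __init__(self, docs):
        self.docs = tuple(docs)               # never modified after creation
        self.live = [True] * len(self.docs)   # a delete only flips a flag

    def delete(self, local_id):
        self.live[local_id] = False

    def live_docs(self):
        return [doc for doc, alive in zip(self.docs, self.live) if alive]


def search(segments, word):
    """Search every segment, skipping documents flagged as deleted."""
    return [doc for seg in segments for doc in seg.live_docs()
            if word in doc.split()]


def merge(segments):
    """Copy the surviving documents into one new segment, reclaiming space."""
    return Segment(doc for seg in segments for doc in seg.live_docs())


seg1 = Segment(["red shoe", "blue shoe"])
seg2 = Segment(["red hat"])       # new documents go into a new segment
seg1.delete(0)                    # "red shoe" is only marked as deleted
print(search([seg1, seg2], "red"))   # ['red hat']
merged = merge([seg1, seg2])
print(merged.docs)                   # ('blue shoe', 'red hat')
```

Because the document data in a segment is never rewritten, readers need no locks and the file system cache stays valid until a merge replaces the segment.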
Terms are heavily deduplicated
- Saves much of the space that high-frequency terms would otherwise occupy
Documents are identified by a unique ordinal
- Convenient when passing references across APIs
- Lucene can use several indexes in a single query
Terms are identified by a unique ordinal
- Important for sorting: only numbers need to be compared, not strings
- Very important for faceting (faceted search)
- A strength of Lucene: index intersection
Many databases cannot use several indexes at the same time, but Lucene can.
Lucene maintains a skip list (Wiki) for each postings list. To search for "red shoe", the system consults the skip lists and jumps back and forth ("leap-frog") between the postings lists of the two terms, skipping entries that cannot match.
Many databases instead pick the most selective index and ignore the others.
For the detailed index intersection algorithm and how the skip list is used, see nlp.stanford.edu.
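The leap-frog idea can be sketched as below. This is a simplified model, not Lucene's implementation: the skip pointers are simulated by jumping a fixed number of entries (`skip=4` is an arbitrary assumption), and the names `advance` and `intersect` are chosen for this example.

```python
def advance(postings, pos, target, skip=4):
    """Return the first index >= pos whose doc ID is >= target.

    Jumps `skip` entries at a time while the entry at the jump target is
    still smaller than `target`, then finishes with a short linear scan.
    """
    n = len(postings)
    while pos + skip < n and postings[pos + skip] < target:
        pos += skip
    while pos < n and postings[pos] < target:
        pos += 1
    return pos


def intersect(a, b):
    """Leap-frog intersection of two sorted postings lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])   # leap over entries that cannot match
        else:
            j = advance(b, j, a[i])
    return result


red = [1, 4, 7, 9, 12, 15, 20, 31]
shoe = [2, 7, 15, 16, 31, 40]
print(intersect(red, shoe))   # [7, 15, 31]
```

Both lists are consumed together, so every term in the query helps to skip work, instead of one index being chosen and the others ignored.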
More indexes
- Term vectors
- A per-document inverted index
- Use case: finding more similar content (more like this)
- Can also be used to highlight search results
- Doc values (document values)
- Column-oriented storage of document fields
- Use cases: sorting, scoring
- Sorted (set) doc values
- Faceted search (faceting)
A facet is one of the multidimensional attributes of a thing. For example, a book has a topic, an author, and an era. Faceted search filters and narrows search results through these attributes, and can be seen as a combination of searching and browsing. As an effective search method, it is widely used in e-commerce, music, travel and other fields.
For example, Google Music's song picker page divides songs by rhythm, tone, era, genre and other facets.
Because ordinals are dense, the per-facet counts can simply be kept in an array.
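A minimal sketch of why dense ordinals make faceting cheap: with one ordinal per document stored column-wise, counting facet values is a single pass that increments array slots, with no string comparison or hashing. The data and the name `facet_counts` are invented for this example.

```python
def facet_counts(doc_ordinals, matching_docs, num_terms):
    """Count how many matching documents carry each term ordinal."""
    counts = [0] * num_terms          # one slot per ordinal, no gaps
    for doc_id in matching_docs:
        counts[doc_ordinals[doc_id]] += 1
    return counts


genres = ["jazz", "pop", "rock"]      # sorted term dictionary; index = ordinal
doc_ordinals = [2, 0, 2, 1, 2]        # column of ordinals, indexed by doc ID
hits = [0, 2, 3, 4]                   # doc IDs matched by some query

counts = facet_counts(doc_ordinals, hits, len(genres))
print(dict(zip(genres, counts)))      # {'jazz': 0, 'pop': 1, 'rock': 3}
```

The ordinals are only translated back to term text once, at the end, when the counts are reported.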
- How do I use the API?
Elasticsearch's high-level APIs are built on top of the Lucene API, which includes the following basic APIs:
--------------------------------------------------------------------------------------
API                | Use                                  | Method
--------------------------------------------------------------------------------------
Inverted index     | Term -> doc IDs, positions, offsets  | AtomicReader.fields
Stored fields      | Summaries of search results          | IndexReader.document
Live docs          | Ignoring deleted docs                | AtomicReader.liveDocs
Term vectors       | More like this                       | IndexReader.termVectors
Doc values / norms | Sorting / faceting / scoring         | AtomicReader.get*Values
--------------------------------------------------------------------------------------
- Summary
The data is stored four times, each copy in a different structure.
- This is not a waste of space
- Immutability makes the data easy to manage
Stored fields and doc values
Secrets of the file formats
- Rules that must not be forgotten
Save file handles
Do not use one file per field per document
Avoid disk seeks
A disk seek takes about 10 ms
Do not ignore the file system cache
Random access within small files is fine
Use light compression
- Less I/O
- Smaller index
- File system Cache Friendly
- Codecs
- File formats depend on the codec
The default codec is tuned for a good trade-off between memory use and speed
Do not use RAMDirectory, MemoryPostingsFormat or MemoryDocValuesFormat.
For more information, see
http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
- Appropriate compression techniques
- Behind a TermQuery
- Terms index
Look up the term in the terms index
- An FST holds the term prefixes in memory
- It gives the offset of the term in the terms dictionary
- It can fail fast when the term does not exist
- Terms dictionary
Jump to the offset given by the terms index
The dictionary is compressed on shared prefixes (the "BlockTree" terms dict)
Read sequentially until the specific term is found
- Postings list
- Stored fields
An in-memory index covers a subset of the doc IDs
It is memory-efficient thanks to monotonic compression
It is searched by binary search
Fields are stored in order
They are compressed in 16 KB blocks
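The lookup path above can be approximated with ordinary Python containers. This is only a sketch under loud assumptions: a dict keyed by first letter stands in for the in-memory FST, a sorted list stands in for the BlockTree terms dictionary, and `block_first_doc` (hypothetical values) stands in for the in-memory stored-fields index. None of these names are Lucene APIs.

```python
import bisect

terms = ["data", "database", "index", "lucene", "term"]   # sorted terms dict
postings = [[0, 1], [1], [0, 1], [0], [0]]                # one list per term

# Stand-in for the FST: first letter -> offset of the first term with it.
terms_index = {}
for offset, term in enumerate(terms):
    terms_index.setdefault(term[0], offset)


def term_query(term):
    """Terms index -> terms dictionary -> postings list."""
    offset = terms_index.get(term[:1])
    if offset is None:                 # fail fast: no term has this prefix
        return []
    # Read the dictionary sequentially from the offset until found or passed.
    while offset < len(terms) and terms[offset] <= term:
        if terms[offset] == term:
            return postings[offset]
        offset += 1
    return []


# Stored fields: first doc ID of each compressed block, kept in memory.
block_first_doc = [0, 120, 250, 400]


def find_block(doc_id):
    """Binary-search the in-memory index for the block that holds doc_id."""
    return bisect.bisect_right(block_first_doc, doc_id) - 1


print(term_query("database"))   # [1]
print(term_query("zebra"))      # []
print(find_block(130))          # 1
```

Only the small in-memory structures are consulted before each jump, so the larger on-disk structures are touched at a single, known location.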
- Summary of the query process
- 2 disk seeks per field
1 disk seek per document (stored fields)
The terms dict and postings lists are often in the file system cache
In that case no disk seek occurs
"Pulsing" optimization
- For unique terms, the postings list is stored in the terms dict
- 1 disk seek
- Always used for primary keys
Performance
System performance drops twice, probably because
The index has grown beyond the size of the file system cache
Stored fields no longer fit in the cache
The terms dict and postings lists no longer fit entirely in the cache
References
Sources:
SlideShare: What is in a Lucene index?
YouTube: What is in a Lucene index? Adrien Grand, Software Engineer, Elasticsearch
SlideShare: Elasticsearch from the Bottom Up
YouTube: Elasticsearch from the Bottom Up
Wiki: Document-term matrix
Wiki: Search engine indexing
Skip list
Stanford: Faster postings list intersection via skip pointers
Faceted search
Stack Overflow: How a search index works when querying many words?
Stack Overflow: How does Lucene calculate intersection of documents so fast?
Lucene and its magical indexes
End
ElasticSearch 2 (10)-under ElasticSearch (in-depth understanding of Shard and Lucene Index)