Full Text Search
Now that we've discussed some simple use cases for searching structured data, it's time to explore full-text search: how to find the most relevant documents by searching full-text fields.
The two most important aspects of a full-text search are:
Relevance
The ability to rank results by how relevant they are to the given query. Relevance can be calculated by TF/IDF (see What Is Relevance?), proximity to a geo-location, fuzzy similarity, or some other algorithm.
Analysis
The process of converting a block of text into distinct, normalized tokens (see Analysis and Analyzers) in order to (a) create an inverted index and (b) query the inverted index.
As soon as we talk about relevance or analysis, we are in the territory of queries rather than filters.
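To make the analysis step concrete, here is a minimal Python sketch of an analyzer and an inverted index. It is a toy model, not Elasticsearch's implementation: the `analyze` and `build_inverted_index` names are hypothetical, and the analyzer only lowercases the text and splits it on non-alphanumeric characters, which roughly approximates what a standard analyzer does to simple English text.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "The quick brown fox", 2: "Brown fox, brown dog!"}
index = build_inverted_index(docs)

# The same analyzer is applied to the query string before the lookup.
query_terms = analyze("QUICK!")
matches = set().union(*(index.get(term, set()) for term in query_terms))
print(query_terms, sorted(matches))
```

The point of the sketch is that analysis happens twice: once when a document is indexed and once when the query string is processed, so that both sides produce comparable terms.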
Term-Based Versus Full-Text
While all queries perform some sort of relevance calculation, not all queries have an analysis phase. Apart from specialized queries such as bool or function_score, which don't operate on text at all, textual queries can be divided into two families:
Term-based queries
Queries like term and fuzzy are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.
It is important to remember that the term query looks in the inverted index for the exact term only; it won't match variants such as foo or FOO. It doesn't matter how the term came to be in the index, only that it is there. If you index ["Foo", "Bar"] into a not_analyzed field, or index Foo Bar into an analyzed field that uses the whitespace analyzer, both result in the same two terms in the inverted index: "Foo" and "Bar".
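The exact-match behavior can be illustrated with a small Python sketch. The inverted index is modeled as a plain dictionary, and `index_terms` and `term_lookup` are hypothetical helpers, not Elasticsearch APIs. The whitespace analyzer is approximated with `str.split`, which keeps the original case.

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, keep case."""
    return text.split()

def index_terms(index, doc_id, terms):
    """Add each term of a document to a dictionary-based inverted index."""
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

def term_lookup(index, term):
    """Return the IDs of documents containing exactly this term (no analysis)."""
    return index.get(term, set())

index = {}
# Document 1: an array of exact values, indexed as-is (like not_analyzed).
index_terms(index, 1, ["Foo", "Bar"])
# Document 2: a string passed through the whitespace analyzer.
index_terms(index, 2, whitespace_analyze("Foo Bar"))

print(sorted(index))                      # the terms stored in the index
print(sorted(term_lookup(index, "Foo")))  # the exact term
print(sorted(term_lookup(index, "foo")))  # a case variant
```

Both documents contribute the same two terms, and only the exact term Foo finds them; the lowercase variant finds nothing because the lookup does no normalization of its own.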
Full-text queries
Queries like match and query_string are high-level queries that understand the mapping of a field:
- If you use them to query a date or integer field, they treat the query string as a date or an integer, respectively.
- If you query an exact-value (not_analyzed) string field, they treat the whole query string as a single term.
- But if you query a full-text (analyzed) field, they first pass the query string through the appropriate analyzer to produce the list of terms to be queried.
Once the query has assembled a list of terms, it executes the appropriate low-level query for each term and then combines the results to produce the final relevance score for each document.
We will discuss this process in detail in the following sections.
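The way a full-text query adapts to the field mapping can be sketched in Python. This is an illustrative model only: the `MAPPING` dictionary, the field names in it, the `terms_for_query` function, and the toy analyzer are all assumptions made for the example, and real mappings and analyzers are considerably richer.

```python
import re
from datetime import date

# Hypothetical, simplified field mappings for the example.
MAPPING = {
    "title":  {"type": "string", "index": "analyzed"},
    "tag":    {"type": "string", "index": "not_analyzed"},
    "views":  {"type": "integer"},
    "posted": {"type": "date"},
}

def toy_analyze(text):
    """Stand-in for an analyzer: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def terms_for_query(field, query_string):
    """Return the list of values a full-text query would look up for a field."""
    mapping = MAPPING[field]
    if mapping["type"] == "integer":
        return [int(query_string)]                 # treated as a number
    if mapping["type"] == "date":
        return [date.fromisoformat(query_string)]  # treated as a date
    if mapping.get("index") == "not_analyzed":
        return [query_string]                      # whole string, one term
    return toy_analyze(query_string)               # analyzed into terms

print(terms_for_query("title", "Quick Brown Fox!"))
print(terms_for_query("tag", "Quick Brown Fox!"))
print(terms_for_query("views", "42"))
print(terms_for_query("posted", "2014-12-03"))
```

Only the analyzed field yields several terms; each of those would then be handed to a low-level term lookup and the per-term results combined into one score per document.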
You seldom need to use the term-based queries directly. Usually you want to query full text rather than individual terms, and this is easier to do with the high-level full-text queries (which end up using term-based queries internally).
If you do find yourself wanting to query an exact value in a not_analyzed field, consider whether you really need a query rather than a filter.
Single-term queries usually represent a binary yes|no question, which is almost always better expressed as a filter, so that it can benefit from filter caching:
GET /_search
{
    "query": {
        "filtered": {
            "filter": {
                "term": { "gender": "female" }
            }
        }
    }
}
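Why a yes|no question caches well can be seen in a toy Python model. Here a filter is reduced to the set of matching document IDs, and that set is stored under a (field, value) key so that a repeated filter skips the scan. The `TermFilterCache` class, its methods, and the sample documents are hypothetical and only illustrate the idea; how Elasticsearch actually stores and evicts cached filters is not modeled here.

```python
class TermFilterCache:
    """Toy model of a cached term filter over an in-memory document store."""

    def __init__(self, docs):
        self.docs = docs   # {doc_id: {field: value}}
        self.cache = {}    # {(field, value): frozenset of doc IDs}
        self.scans = 0     # number of full scans performed

    def term_filter(self, field, value):
        """Return the IDs of documents whose field equals value exactly."""
        key = (field, value)
        if key not in self.cache:
            self.scans += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.docs.items()
                if doc.get(field) == value
            )
        return self.cache[key]

people = {
    1: {"gender": "female"},
    2: {"gender": "male"},
    3: {"gender": "female"},
}
filters = TermFilterCache(people)
first = filters.term_filter("gender", "female")
second = filters.term_filter("gender", "female")  # served from the cache
print(sorted(first), filters.scans)
```

Because a filter produces no score, its result is the same for every request that uses it, which is what makes reusing the cached set safe.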
The match Query
The match query should be the first query you reach for whenever you need to query any field. It is a high-level full-text query, meaning that it knows how to deal with both full-text (analyzed) fields and exact-value (not_analyzed) fields.
That said, the main use case for the match query is full-text search. Let's look at a simple example to see how full-text search works.
Index some data
First, we'll create a new index and index some documents using the bulk API:
DELETE /my_index

PUT /my_index
{ "settings": { "number_of_shards": 1 }}

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "The quick brown fox" }
{ "index": { "_id": 2 }}
{ "title": "The quick brown fox jumps over the lazy dog" }
{ "index": { "_id": 3 }}
{ "title": "The quick brown fox jumps over the quick dog" }
{ "index": { "_id": 4 }}
{ "title": "Brown fox brown dog" }
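When the same bulk request is sent from a program, the body has to be newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. A minimal Python sketch that builds such a body with the standard library follows; `build_bulk_body` is a hypothetical helper name, and actually sending the request is left out.

```python
import json

def build_bulk_body(docs):
    """Build a newline-delimited bulk body that indexes each (id, source) pair."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline

docs = [
    (1, {"title": "The quick brown fox"}),
    (2, {"title": "The quick brown fox jumps over the lazy dog"}),
    (3, {"title": "The quick brown fox jumps over the quick dog"}),
    (4, {"title": "Brown fox brown dog"}),
]
body = build_bulk_body(docs)
print(body)
```

Each JSON object must stay on a single line, which is why the body is assembled line by line rather than serialized as one pretty-printed document.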
Note that when creating the index above, we set number_of_shards to 1. Later, in the Relevance Is Broken section, we will explain why we created this index with only one primary shard.
A Single-Word Query
Our first example explains what happens when we use the match query to search a full-text field for a single word:
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}
Elasticsearch executes the preceding match query as follows:
Check the field type
The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.
Analyze the query string
The query string QUICK! is passed through the standard analyzer, which produces the single term quick. Because there is only one term, the match query can be executed as a single low-level term query.
Find matching documents
The term query looks up quick in the inverted index and retrieves the list of documents that contain that term: in this case, documents 1, 2, and 3.
Score each document
The term query calculates the relevance _score for each matching document by combining the term frequency (how often quick appears in the title field of each matching document), the inverse document frequency (how often quick appears in the title field across all documents in the index), and the length of each field (shorter fields are considered more relevant). See What Is Relevance?
This process gives us the following (abbreviated) results:
"hits": [
    {
        "_id": "1",
        "_score": 0.5,
        "_source": {
            "title": "The quick brown fox"
        }
    },
    {
        "_id": "3",
        "_score": 0.44194174,
        "_source": {
            "title": "The quick brown fox jumps over the quick dog"
        }
    },
    {
        "_id": "2",
        "_score": 0.3125,
        "_source": {
            "title": "The quick brown fox jumps over the lazy dog"
        }
    }
]
Document 1 is the most relevant because its title field is short, which means that quick represents a large portion of its content. Document 3 is more relevant than document 2 because quick appears twice.
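The ranking can be approximated with a short Python sketch. It assumes the classic TF/IDF formulas associated with Lucene for a single-term query: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), and field norm = 1 / sqrt(number of terms). It also imitates lossy storage of the field norm by truncating it to a 3-bit mantissa. The formulas as written here, the `quantize_norm` helper, and the toy tokenizer are assumptions of this sketch rather than something stated in the text, and query normalization and boosts are ignored.

```python
import math
import re

def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def quantize_norm(x):
    """Truncate a positive float to a 3-bit mantissa (rough model of norm storage)."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    return math.floor(mantissa * 16) / 16 * 2 ** exponent

def score_term(term, docs):
    """Score every document containing the term, highest score first."""
    analyzed = {doc_id: analyze(text) for doc_id, text in docs.items()}
    doc_freq = sum(1 for terms in analyzed.values() if term in terms)
    idf = 1 + math.log(len(docs) / (doc_freq + 1))
    scores = {}
    for doc_id, terms in analyzed.items():
        freq = terms.count(term)
        if freq == 0:
            continue  # the document does not contain the term
        tf = math.sqrt(freq)
        norm = quantize_norm(1 / math.sqrt(len(terms)))
        scores[doc_id] = tf * idf * norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

titles = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}
for doc_id, score in score_term("quick", titles):
    print(doc_id, score)
```

Under these assumptions the sketch ranks the documents 1, 3, 2 with scores of 0.5, about 0.4419, and 0.3125. The short title of document 1 gives it the largest field norm, and the second occurrence of quick raises the term frequency of document 3 above that of document 2.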
Reference: http://blog.csdn.net/dm_vincent/article/details/41693125
[Elasticsearch] Full Text Search (I): Basic Concepts and match Queries