[Elasticsearch] Full Text Search (i)-Basic concepts and match queries

Source: Internet
Author: User

Full Text Search

Now that we have covered the simple case of searching structured data, it is time to explore full-text search: how to search within full-text fields in order to find the most relevant documents.

The two most important aspects of a full-text search are:

Relevance

Relevance is the ability to rank results by how relevant they are to the given query. It can be calculated with TF/IDF (see What Is Relevance?), proximity to a geo-location, fuzzy similarity, or some other algorithm.

Analysis

Analysis is the process of converting a block of text into distinct, normalized tokens (see Analysis and Analyzers) in order to (a) create an inverted index and (b) query the inverted index.
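
A minimal Python sketch of the two roles just described: turning text into normalized tokens, then building an inverted index from them. The `analyze` function here is a toy stand-in (split on non-alphanumeric characters, then lowercase), not the actual Elasticsearch standard analyzer, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters and lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The quick brown fox", 2: "Brown fox, brown dog!"}
index = build_inverted_index(docs)

print(index["brown"])     # [1, 2]
print(index["quick"])     # [1]
print(analyze("QUICK!"))  # ['quick']
```

The same `analyze` function is applied at index time and at query time, which is why a query string such as QUICK! can find a document that contains quick.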

As soon as we talk about either relevance or analysis, we are in the territory of queries rather than filters.

Term-based versus full-text

While all queries perform some sort of relevance calculation, not all queries have an analysis phase. Apart from specialized queries such as bool or function_score, which do not operate on text at all, textual queries fall into two families:

Term-based queries

Queries like term and fuzzy are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains it.

It is important to remember that the term query looks in the inverted index for the exact term only; it will not match variants such as foo or FOO. It does not matter how the term came to be in the index, only that it is there. If you index ["Foo", "Bar"] into a not_analyzed field, or index Foo Bar into an analyzed field that uses the whitespace analyzer, both result in the same two terms in the inverted index: Foo and Bar.
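
The exact-match behavior can be illustrated with a small Python sketch. It assumes a toy whitespace analyzer and a plain dictionary as the inverted index; `whitespace_analyze` and `term_lookup` are hypothetical helper names, not Elasticsearch APIs.

```python
def whitespace_analyze(text):
    """Toy whitespace analyzer: split on whitespace, keep case as-is."""
    return text.split()

def term_lookup(inverted_index, term):
    """Exact lookup: the term is not lowercased or analyzed in any way."""
    return inverted_index.get(term, [])

# Index the text "Foo Bar" as document 1.
inverted_index = {}
for doc_id, text in {1: "Foo Bar"}.items():
    for term in whitespace_analyze(text):
        inverted_index.setdefault(term, []).append(doc_id)

print(term_lookup(inverted_index, "Foo"))  # [1]
print(term_lookup(inverted_index, "foo"))  # []
print(term_lookup(inverted_index, "FOO"))  # []
```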

Full-text queries

Queries like match and query_string are high-level queries that understand the mapping of a field:

    • If you use them to query a date or integer field, they treat the query string as a date or an integer, respectively.
    • If you query an exact-value ( not_analyzed ) string field, they treat the whole query string as a single term.
    • But if you query a full-text ( analyzed ) field, they first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each term and then combines the results to produce the final relevance score for each document.
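
As a rough sketch of that flow (check the mapping, analyze if needed, run one low-level lookup per term, combine), consider the Python below. The mapping, the index contents, and the function names are all hypothetical, and the "score" is simply the number of matching terms, standing in for a real relevance calculation.

```python
import re

# Hypothetical mini-mapping and inverted index, for illustration only.
MAPPING = {"title": "analyzed", "tag": "not_analyzed"}
INDEX = {
    "title": {"quick": [1, 2, 3], "brown": [1, 2, 3, 4], "dog": [2, 3, 4]},
    "tag": {"Quick Fox": [7]},
}

def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def term_query(field, term):
    """Low-level query: exact lookup of a single term."""
    return INDEX[field].get(term, [])

def match_query(field, query_string):
    """High-level query: consult the mapping, then delegate to term_query."""
    if MAPPING[field] == "analyzed":
        terms = analyze(query_string)
    else:
        terms = [query_string]  # whole string is treated as one term
    hits = {}
    for term in terms:
        for doc_id in term_query(field, term):
            hits[doc_id] = hits.get(doc_id, 0) + 1  # stand-in for a score
    return hits

print(match_query("title", "QUICK dog!"))  # {1: 1, 2: 2, 3: 2, 4: 1}
print(match_query("tag", "Quick Fox"))     # {7: 1}
```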

We will discuss this process in detail in the following sections.

You seldom need to use the term-based queries directly. Usually you want to query full text rather than individual terms, and this is easier to do with the high-level full-text queries (which end up using term-based queries internally).

If you do find yourself wanting to query an exact value in a not_analyzed field, consider whether you really need a query rather than a filter.

Single-term queries usually represent a binary yes/no question, which is almost always better expressed as a filter so that it can benefit from filter caching:

    GET /_search
    {
        "query": {
            "filtered": {
                "filter": {
                    "term": { "gender": "female" }
                }
            }
        }
    }


The match query should be your first choice whenever you need to query any field. It is a high-level full-text query, meaning that it knows how to handle both full-text ( analyzed ) fields and exact-value ( not_analyzed ) fields.

That said, the main use case for the match query is full-text search. Let's look at a simple example to see how full-text search works.

Index some data

First, we'll create a new index and index some documents using the bulk API:

    DELETE /my_index

    PUT /my_index
    { "settings": { "number_of_shards": 1 }}

    POST /my_index/my_type/_bulk
    { "index": { "_id": 1 }}
    { "title": "The quick brown fox" }
    { "index": { "_id": 2 }}
    { "title": "The quick brown fox jumps over the lazy dog" }
    { "index": { "_id": 3 }}
    { "title": "The quick brown fox jumps over the quick dog" }
    { "index": { "_id": 4 }}
    { "title": "Brown fox brown dog" }
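
If the documents live in application code, the bulk request body can be assembled programmatically. The Python sketch below only builds the newline-delimited body (one action line followed by one source line per document, ending with a newline); `build_bulk_body` is a hypothetical helper name, and sending the body to the cluster is left out.

```python
import json

def build_bulk_body(docs):
    """Build a newline-delimited bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"

docs = [
    (1, {"title": "The quick brown fox"}),
    (2, {"title": "The quick brown fox jumps over the lazy dog"}),
    (3, {"title": "The quick brown fox jumps over the quick dog"}),
    (4, {"title": "Brown fox brown dog"}),
]

body = build_bulk_body(docs)
print(body, end="")
```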

Note that when creating the index above, we set number_of_shards to 1. Later, in the section Relevance Is Broken, we explain why we created this index with only one primary shard.

Single-word query

Our first example explains what happens when we use the match query to search a full-text field for a single word:

    GET /my_index/my_type/_search
    {
        "query": {
            "match": {
                "title": "QUICK!"
            }
        }
    }

Elasticsearch executes the preceding match query as follows:

    1. Check field type

      The title field is a full-text ( analyzed ) string field, which means that the query string must be analyzed too.

    2. Parse query string

      The query string QUICK! is passed through the standard analyzer, which produces the single term quick. Because there is only one term, the match query is executed as a single low-level term query.

    3. Find a matching document

      The term query looks up quick in the inverted index and retrieves the list of documents that contain that term; in this case, documents 1, 2, and 3.

    4. Score each document

      The term query calculates the relevance _score for each matching document by combining the term frequency (how often quick appears in the title field of each matching document), the inverse document frequency (how often quick appears in the title field across all documents in the index), and the length of each field (shorter fields are considered more relevant). See What Is Relevance?

This process gives us the following (abbreviated) results:

    "hits": [
     {
        "_id":      "1",
        "_score":   0.5,
        "_source": {
           "title": "The quick brown fox"
        }
     },
     {
        "_id":      "3",
        "_score":   0.44194174,
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.3125,
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
    ]

Document 1 is the most relevant because its title field is short, which means that quick represents a large portion of its content. Document 3 is more relevant than document 2 because quick appears twice.
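
The scores above can be approximated with a short Python sketch. It assumes Lucene's classic TF/IDF similarity (tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), field norm = 1 / sqrt(number of terms)) and assumes that, for a single-term query, the score reduces to tf × idf × norm. The field norm is stored with very low precision; here that is approximated by truncating the mantissa to a few bits, which is a simplification of the real encoding. Function names are illustrative only.

```python
import math

def quantize_norm(value):
    """Approximate the lossy norm storage: keep 1 implicit + 3 mantissa bits."""
    if value <= 0:
        return 0.0
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    mantissa = math.floor(mantissa * 16) / 16
    return math.ldexp(mantissa, exponent)

def score(term_freq, field_length, num_docs, doc_freq):
    """Simplified single-term TF/IDF score (assumed formula, see lead-in)."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    norm = quantize_norm(1.0 / math.sqrt(field_length))
    return tf * idf * norm

titles = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}

tokens = {doc_id: title.lower().split() for doc_id, title in titles.items()}
doc_freq = sum(1 for toks in tokens.values() if "quick" in toks)

scores = {}
for doc_id, toks in tokens.items():
    freq = toks.count("quick")
    if freq > 0:  # only documents containing the term are scored
        scores[doc_id] = score(freq, len(toks), len(titles), doc_freq)

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for doc_id, s in ranked:
    print(doc_id, round(s, 4))  # 1 0.5, then 3 0.4419, then 2 0.3125
```

With four documents and three of them containing quick, the idf term works out to exactly 1, so the ranking here is driven entirely by term frequency and field length.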

Reference: http://blog.csdn.net/dm_vincent/article/details/41693125
