Elasticsearch term and match query mechanisms, and a hidden query pitfall

Source: Internet
Author: User
Tags: bitset
2. Problems with term queries under the default analyzer
As noted earlier, the default Elasticsearch analyzer splits Chinese text into single characters. The two-character search term "internal medicine" is therefore analyzed into the two tokens "inside" and "section", and the search runs on those tokens. The commonly used match query resembles a fuzzy query in a database, while the term query is meant for exact matching. In practice the two can behave in a surprising way, as the following scenario shows.
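Before turning to the scenario, the single-character splitting described above can be illustrated with a minimal Python sketch. It is only a rough stand-in for the default analyzer, not its implementation: the function name `standard_like_tokens` is hypothetical, and the Chinese string 内科 is an assumed original of the "internal medicine" example.

```python
def standard_like_tokens(text):
    """Rough stand-in for the default analyzer: one token per CJK character,
    lowercased whole words for everything else."""
    tokens, word = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":       # CJK Unified Ideographs
            if word:                          # flush any pending Latin word
                tokens.append("".join(word).lower())
                word = []
            tokens.append(ch)                 # each ideograph is its own token
        elif ch.isalnum():
            word.append(ch)
        else:                                 # whitespace or punctuation ends a word
            if word:
                tokens.append("".join(word).lower())
                word = []
    if word:
        tokens.append("".join(word).lower())
    return tokens


print(standard_like_tokens("内科"))               # ['内', '科']
print(standard_like_tokens("Internal Medicine"))  # ['internal', 'medicine']
```

The point to take away is that the index never holds the two-character word as one term, only its individual characters.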
2.1 Scenario

By default, when no mapping has been defined for a field in an index, the field uses the default analyzer. Assume the index holds the following data:

Internal Medicine
Internal Medicine Department One
Internal Medicine Department Two
General Internal Medicine
Oncology

A match query for the two-character word "tumor" clearly returns "oncology". A match query for the four-character name "oncology" returns not only "oncology" but also "internal medicine" and every other entry that contains the character "inside" or "section". A term query for the two-character word "tumor" returns nothing, which is also expected: term is an exact query, and "tumor" is not equal to "oncology". So a term query for "oncology" should, in theory, return exactly the "oncology" document. But:

{
    "query": {
        "term": {"name": "oncology"}
    }
}
<-----------    result    ----------- >
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

The result is still empty. Changing the search term to "internal medicine" also returns nothing. However, a term query for a single character, "swollen" or "tumor", does return "oncology", and a term query for "inside" or "section" returns every entry that contains that character.

2.2 Analysis reasons

The reason was mentioned in the previous section: the default analyzer splits Chinese text into single characters and performs no word segmentation. The four-character name "oncology" is therefore stored in the inverted index as the four terms "swollen", "tumor", "inside", and "section", and a term query matches only if its value is exactly one of those four terms. This seems to contradict the earlier statement that the search text is also analyzed and then looked up in the inverted index token by token. To resolve the contradiction, we need to understand how the match query and the term query each work.

2.2.1 Match query principle

Check the field type.
The name field is an analyzed full-text string field, which means that the query string itself should also be analyzed.

Analyze the query string.
The query string "oncology" is passed through the standard analyzer, which here produces four terms: "swollen", "tumor", "inside", and "section". The match query then runs an underlying term query for each of these four terms.

Find matching documents.
The term queries look up "swollen", "tumor", "inside", and "section" in the inverted index and retrieve the set of documents that contain any of them. In this example the result is: "internal medicine", "internal medicine department one", "internal medicine department two", "general internal medicine", and "oncology".

Score each document.
The relevance _score of each document is calculated by combining three factors: term frequency (how often "swollen", "tumor", "inside", and "section" appear in the name field of that document), inverse document frequency (how often those terms appear in the name field across all documents in the index), and field length (the shorter the field, the higher the relevance).

2.2.2 Term query principle
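As a reference point for the comparison with term, the match steps from 2.2.1 (analyze, look up each token, collect, score) can be sketched in Python. This is an illustration under stated assumptions, not Elasticsearch's implementation: the Chinese names are assumed originals of the sample data, `analyze` and `match` are hypothetical helpers, and the score is a simplified classic TF/IDF with length normalization (recent Elasticsearch versions default to BM25 instead).

```python
import math
from collections import defaultdict

# Assumed sample data: the article shows only English renderings of these names.
DOCS = ["内科", "内一科", "内二科", "普通内科", "肿瘤内科"]


def analyze(text):
    # Stand-in for the default analyzer on Chinese text: one token per character.
    return list(text)


# Inverted index: token -> {doc_id: term frequency in that document}
index = defaultdict(dict)
for doc_id, name in enumerate(DOCS):
    for token in analyze(name):
        index[token][doc_id] = index[token].get(doc_id, 0) + 1


def match(query):
    """Analyze the query, run one lookup per token, OR the hits, then score."""
    scores = defaultdict(float)
    for token in analyze(query):
        postings = index.get(token, {})
        # Rarer terms weigh more (simplified classic inverse document frequency).
        idf = 1 + math.log(len(DOCS) / (len(postings) + 1))
        for doc_id, freq in postings.items():
            tf = math.sqrt(freq)                    # term frequency
            norm = 1 / math.sqrt(len(DOCS[doc_id]))  # shorter field scores higher
            scores[doc_id] += tf * idf * norm
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [DOCS[d] for d in ranked]


results = match("肿瘤内科")
print(results[0])    # 肿瘤内科
print(len(results))  # 5
```

Every document shares the characters for "inside" and "section", so all five are returned, but only one also contains the two rarer characters, which is why it ranks first.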

The first thing to note is that a term query is normally run as a non-scoring (filter) query, whereas match is a scoring query. The second is that a term query does not analyze its input, whereas a match query analyzes the query string with the analyzer configured for the field. The internal steps of the query are:

Find matching documents.
The term query looks up "oncology" in the inverted index and retrieves all documents that contain that term. In this case there is obviously no match. Remember that "oncology" does not exist in the index as a single term, only as the four terms "swollen", "tumor", "inside", and "section".

Build a bitset.
The filter builds a bitset (an array of 0s and 1s) that describes which documents contain the term. The bit for each matching document is set to 1. In this example the bitset is [0,0,0,0,0]. Internally it is represented as a "roaring bitmap", which can efficiently encode both sparse and dense sets.

Iterate over the bitset(s).
Once a bitset has been generated for each query, Elasticsearch iterates over the bitsets to find the set of documents that satisfy all of the filter criteria. The order of execution is decided heuristically, but in general the sparsest bitset is iterated first, because it excludes the largest number of documents.

Increment the usage count.
Elasticsearch can cache non-scoring queries for faster access, but caching something that is rarely used would be wasteful. Non-scoring queries are already fast thanks to the inverted index, so only queries that are known to be reused are cached, to avoid wasting resources.
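The lookup and bitset steps above can be sketched in Python. This is a simplified illustration, not the real data structure: a plain list stands in for the roaring bitmap, `term_bitset` and `intersect` are hypothetical helpers, and the Chinese names are assumed originals of the sample data.

```python
from collections import defaultdict

# Assumed sample data: the article shows only English renderings of these names.
DOCS = ["内科", "内一科", "内二科", "普通内科", "肿瘤内科"]

# Index time: the default analyzer stores one term per character.
index = defaultdict(set)
for doc_id, name in enumerate(DOCS):
    for token in name:          # iterating a str yields single characters
        index[token].add(doc_id)


def term_bitset(value):
    """Term lookup: the value is NOT analyzed; it must equal an indexed term."""
    postings = index.get(value, set())
    return [1 if doc_id in postings else 0 for doc_id in range(len(DOCS))]


def intersect(*bitsets):
    """Combine several filters: a document survives only if every bit is 1."""
    return [int(all(bits)) for bits in zip(*bitsets)]


print(term_bitset("肿瘤内科"))  # [0, 0, 0, 0, 0]
print(term_bitset("肿"))        # [0, 0, 0, 0, 1]
print(term_bitset("内"))        # [1, 1, 1, 1, 1]
print(intersect(term_bitset("肿"), term_bitset("内")))  # [0, 0, 0, 0, 1]
```

The whole name produces an all-zero bitset because it was never stored as a term, while each single character produces hits.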
This explains why a term query for "oncology" fails to return the "oncology" document, while a term query for a single character does return results.

2.3 Solution

As the discussion of the default analyzer shows, there is no fully reliable way to run term queries against analyzed Chinese text, whether the field uses the default analyzer or a custom one such as shingle. With a shingle configuration, a term query for "oncology" can return the "oncology" document, but a shorter term such as "internal medicine" may then match that document as well. What matches depends on the minimum and maximum shingle sizes set in the mapping when the index is created. For text search, then, a term query does not behave like an exact-match comparison in MySQL. For non-text types such as boolean and numeric fields, or for greater-than and less-than comparisons, term-level queries are precise.
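How the shingle sizes change what a term query can hit can be sketched in Python. The `shingles` function is a hypothetical helper, not the Elasticsearch filter itself. It joins tokens with an empty separator, which is an assumption made for this example (the Elasticsearch shingle filter joins with a space unless `token_separator` is changed), and the Chinese string is an assumed original of the "oncology" example.

```python
def shingles(tokens, min_size=2, max_size=4, separator=""):
    """Emit the single tokens plus every run of min_size..max_size adjacent tokens."""
    out = list(tokens)  # keep the unigrams as well
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(separator.join(tokens[start:start + size]))
    return out


# Single-character tokens, as produced by the default analyzer on Chinese text.
tokens = list("肿瘤内科")

terms = shingles(tokens)                      # sizes 2 to 4
short_terms = shingles(tokens, max_size=2)    # sizes 2 only

print("肿瘤内科" in terms)        # True: the whole name is now an indexed term
print("内科" in terms)            # True: the shorter word hits this document too
print("肿瘤内科" in short_terms)  # False: the maximum size is too small
```

With a maximum size of 4 the whole name becomes an exact term, but every shorter run is indexed as well, which is why the result of a term query depends on the configured sizes.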
