Full-Text Search technology

Source: Internet
Author: User

========================== "Guide" [Start]==========================
This article summarizes search engine technology, drawing on the use of ES (Elasticsearch) at work, as a way to broaden knowledge.
========================== "Guide" [End]==========================
1 Full-text search technology

    Full-text search means that a search program scans every word in an article and builds an index entry for each word,
recording how many times and where the word appears in the article. When a user submits a query, the search program
looks it up in the pre-built index and returns the results to the user.
2 Inverted Index
    What is an inverted index?
    Determining the value of a property from the record (such as a piece of text) is the normal way to index. Determining
the position of the record from the value of the property is called an inverted index. An entry in an inverted index
generally consists of a keyword, its frequency (the number of occurrences), and its positions.
    Building an inverted index is one of the core steps of a search engine.
    The keywords in an inverted index are sorted in character order, so Lucene can quickly locate a keyword using a binary
search algorithm or a binary search tree.
    Example: there are two articles.
    Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
    Article 2: He once lived in Shanghai.
    An inverted index is then built from the keywords of these two articles.
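
    A minimal sketch of how such an index could be built for the two example articles. It assumes the simplest possible
analysis (lowercasing and splitting on non-letters); a real analyzer would typically also reduce "lives" and "lived" to
"live" and drop common words such as "in". The function names are illustrative only and are not Lucene APIs.

```python
import bisect
import re
from collections import defaultdict

ARTICLES = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(articles):
    """Map each keyword to {article_id: [word positions]}."""
    index = defaultdict(dict)
    for article_id, text in articles.items():
        for position, word in enumerate(tokenize(text), start=1):
            index[word].setdefault(article_id, []).append(position)
    return dict(index)


INDEX = build_inverted_index(ARTICLES)
SORTED_KEYWORDS = sorted(INDEX)  # keywords kept in character order


def lookup(keyword):
    """Locate a keyword in the sorted keyword list with binary search."""
    i = bisect.bisect_left(SORTED_KEYWORDS, keyword)
    if i < len(SORTED_KEYWORDS) and SORTED_KEYWORDS[i] == keyword:
        return INDEX[keyword]
    return {}


# Print keyword, article number, frequency, and positions.
for keyword in SORTED_KEYWORDS:
    for article_id, positions in INDEX[keyword].items():
        print(f"{keyword:<10} article {article_id}  "
              f"frequency {len(positions)}  positions {positions}")
```

    With this sketch, lookup("guangzhou") finds article 1 with positions 4 and 8, and lookup("in") finds both articles.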

3 ES and its benefits

    ES is an open source, distributed full-text search engine with a RESTful interface, built on Lucene.
Features:
    It is a distributed document database with high availability and high scalability.
    It can scale out to hundreds of servers and store and search petabytes of data.
    It provides a replication mechanism: if one server in the cluster goes down, the cluster keeps working, and the data of the failed server can be restored on other available nodes.
    It can store, search, and analyze large amounts of data in a short time. Its performance in real-time search scenarios is excellent.
4 ES Terminology and its concepts
(1) Term (index word): an exact value that can be indexed and can be searched exactly with a term query.
(2) Text: unstructured text that is split into terms and stored in the index, so that the search engine can find the original text by keyword.
(3) Analysis: the process of converting text into terms. The result of analysis depends on the analyzer (word breaker).
(4) Cluster: a cluster consists of one or more nodes and provides indexing and search services externally. A cluster has a unique name, which defaults to "elasticsearch". When multiple nodes are configured with the same cluster name, they automatically join the same cluster. Different clusters must not share a name, and a node can join only one cluster.
(5) Node: a logically independent service that can store data and take part in the cluster's indexing and search functions.
(6) Routing: when a document is stored, it goes into exactly one primary shard, which is selected by computing a hash value.
(7) Shard: a shard is a single Lucene instance, and an index is a logical space that points to its primary and replica shards. For example, if 1 billion documents cannot fit on one physical machine, ES can split the index into multiple shards for storage. When you create an index, you can define the number of shards you want. Each shard is a fully functional, stand-alone unit that can be hosted on any node in the cluster.
(8) Primary shard: a document is first stored in a primary shard and then copied to that shard's replicas. By default an index has 5 primary shards with 1 replica each (this is the default before ES 7.0), and the number of shards can be configured.
(9) Replica shard: each primary shard has 0 or more replicas. A replica is a copy of a primary shard. The aim is to increase availability, improve performance, and allow the data to be scaled out horizontally.
(10) Replication: replication makes failover possible and so keeps the system highly available. A replica shard is never stored on the same node as its primary shard. Replication also increases concurrency, because a search can be executed in parallel on all shards.
(11) Index: an index is a collection of documents that have the same structure. For example, there may be one index for customer information, one for the product catalog, one for order data, and so on. Index names are lowercase, and index, search, update, and delete operations are addressed by index name.
(12) Type: one or more types can be defined in an index. A type is a logical partition of the index.
(13) Document: a document is a JSON-formatted string stored in ES, much like a row in a relational database table. Each document stored in an index has a type and an ID. The original JSON document is stored in a field called _source, and this field is returned by default when searching for a document.
(14) Mapping: a mapping is like a table structure in a relational database. Each index has a mapping that defines the type of each field in the index and index-wide settings.
(15) Field: a document contains 0 or more fields. A field can be a simple value, an array, or a nested object. Fields are similar to the columns of a table in a relational database, and each field has a field type.
(16) Source field: by default, the original document is stored in the _source field, which is the field that a query returns.
(17) ID (primary key): the ID is the unique identifier of a document. If a document is stored without an ID, the system generates one automatically. The combination of a document's index, type, and ID must be unique.
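
    The routing idea in (6) can be sketched as "hash the routing value, then take the remainder by the number of primary
shards". This is only an illustration: zlib.crc32 stands in for the hash function that ES uses internally, the function
name is hypothetical, and the document ID is assumed to be the routing value.

```python
import zlib


def pick_primary_shard(routing_value, number_of_primary_shards):
    """Return a shard number as hash(routing value) % number of primary shards."""
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


NUMBER_OF_PRIMARY_SHARDS = 5  # the default quoted in (8)

for doc_id in ["order-1", "order-2", "order-3"]:
    shard = pick_primary_shard(doc_id, NUMBER_OF_PRIMARY_SHARDS)
    print(doc_id, "-> shard", shard)
```

    Because the result depends on the number of primary shards, the same ID would map to a different shard if that number
changed, which is one reason to choose the shard count when the index is created.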
5 Interfaces that ES provides externally
1.  An HTTP interface, exposed to clients using JSON and REST conventions.
2.  A friendly, object-oriented API for the Java language.
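
    As an illustration of the HTTP form, the sketch below builds (but does not send) a request that would store one JSON
document. The host and port (localhost:9200), the index name "customer", the document fields, and the helper name are
assumptions for the example; the "_doc" path segment follows recent ES versions, while older versions put a type name
in that position.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, document):
    """Build (but do not send) an HTTP request that stores a JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://localhost:9200", "customer", "1", {"name": "Tom", "city": "Guangzhou"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running node: urllib.request.urlopen(request)
```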
6 Index
    An index is a collection of documents with the same structure, and most ES operations are based on an index. Index
management covers mappings, index settings, monitoring, index status, and document operations.
    Index analysis: the analysis process is carried out by an analyzer, which is a combination of the following 3 components.
<1> Character filter: filters or converts characters in the text, such as removing HTML tags or converting "&" to "and".
<2> Tokenizer (word breaker): splits a text string into individual words based on spaces, commas, or even meaning.
<3> Token filter: every word passes through it. It can modify a word (such as converting "Hello" to lowercase), remove
words (such as "and", "the", etc.), or add words (such as synonyms like "jump" and "leap").
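
    The three stages can be sketched as three small functions applied in order. This is a simplified illustration rather
than how ES implements analysis; the stop-word list, the synonym table, and all function names are made up for the example.

```python
import re

STOP_WORDS = {"and", "the"}    # hypothetical stop-word list
SYNONYMS = {"jump": ["leap"]}  # hypothetical synonym table


def char_filter(text):
    """Character filter: strip HTML tags and turn '&' into 'and'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Tokenizer: split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def token_filters(tokens):
    """Token filters: lowercase, drop stop words, add synonyms."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(token)
        result.extend(SYNONYMS.get(token, []))
    return result


def analyze(text):
    """Run the character filter, then the tokenizer, then the token filters."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<p>Hello Tom & Jerry, jump over the fence</p>"))
# ['hello', 'tom', 'jerry', 'jump', 'leap', 'over', 'fence']
```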
7 Mapping
    Mapping is the process of defining how documents and their fields are stored and indexed. Each document in the index
has a type, and each type has its own mapping. A mapping defines the data type of each field in the document structure,
and through configuration it relates the field types to the metadata associated with the type. A mapping is the external
representation of ES's internal structure.
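
    For illustration, a mapping is written as JSON. The sketch below builds one as a Python dictionary and prints it. The
field names are hypothetical, and the layout is the typeless form used by recent ES versions; older versions nest
"properties" under a type name.

```python
import json

# Hypothetical mapping for a customer index: one entry per field, each with a field type.
customer_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},        # analyzed for full-text search
            "city": {"type": "keyword"},     # stored as one exact term
            "age": {"type": "integer"},
            "registered_at": {"type": "date"},
        }
    }
}

print(json.dumps(customer_mapping, indent=2))
```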
8 Search
    Indexes and mappings only solve the storage problem; search is the core function of ES.
(1) ES supports a rich set of search scenarios:
    <1> Sorting (including various aggregation calculations)
    <2> Filtering by various conditions
    <3> Scripting support (calculation expressions can be applied to search results)
(2) The rescoring mechanism
    ES searches for a single word quickly, but searching for a phrase is less efficient. ES provides a rescoring mechanism to improve search efficiency.
(Scores can be based on strategies such as data popularity, decay functions, and weights.)
(3) Scroll query
    ES provides a scroll API to handle query requests that resemble paging.
(4) Support for a feature-rich DSL (domain-specific language),
    such as field queries, compound queries, joining queries, geographic queries, span queries, highlighting, etc.
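
    A small example of what a DSL request body can look like, written as a Python dictionary and printed as JSON. It
combines a compound (bool) query, a range filter, sorting, and highlighting. The field names "content" and "age" are
assumptions for the example; the body would be sent to an index's _search endpoint.

```python
import json

search_body = {
    "query": {
        "bool": {  # compound query
            "must": [{"match": {"content": "guangzhou"}}],           # full-text condition, scored
            "filter": [{"range": {"age": {"gte": 18, "lte": 60}}}],  # conditional filter, not scored
        }
    },
    "sort": [{"age": {"order": "desc"}}],
    "highlight": {"fields": {"content": {}}},
    "from": 0,   # offset of the first hit
    "size": 10,  # number of hits to return
}

print(json.dumps(search_body, indent=2))
```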
9 Aggregation
    Aggregation is a summary computed over the searched data. Aggregations fall into three main categories.
1. Metric aggregation: computes an indicator value from a numeric field over a set of documents.
2. Grouping (bucket) aggregation: divides documents into groups according to some criterion.
3. Pipeline aggregation: its data source is the output of other aggregations, from which further indicators are calculated. This makes complex nested aggregation operations possible.
(1) Metric aggregation
        average aggregation (avg)
        maximum aggregation (max)
        minimum aggregation (min)
        sum aggregation (sum)
        statistical aggregation (stats)
        percentile aggregation (percentiles)
        percentile rank aggregation (percentile_ranks)
        top hits aggregation (top_hits)
        geographic bounds aggregation (geo_bounds)
        geographic centroid aggregation (geo_centroid)

(2) Grouping aggregation
        sub-aggregation
        histogram aggregation (histogram)
        date histogram aggregation (date_histogram)
        date range aggregation (date_range)
        range aggregation (range)
        filter aggregation (filter)
        missing value aggregation (missing)
        nested aggregation (nested)
        terms aggregation (terms)
        geographic distance aggregation (geo_distance)
        geohash grid aggregation (geohash_grid)

(3) Pipeline aggregation
        average bucket aggregation (avg_bucket)
        sum bucket aggregation (sum_bucket)
        maximum bucket aggregation (max_bucket)
        minimum bucket aggregation (min_bucket)
        statistical bucket aggregation (stats_bucket)
        percentiles bucket aggregation (percentiles_bucket)
        difference aggregation
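
    To make the three categories concrete, the sketch below imitates them in plain Python over a few in-memory documents.
It does not call ES; the data and function names are hypothetical. It shows a metric (an average), a grouping by field
value, and a pipeline step that works on the output of the grouped averages.

```python
from collections import defaultdict
from statistics import mean

ORDERS = [  # hypothetical documents
    {"city": "Guangzhou", "price": 10.0},
    {"city": "Guangzhou", "price": 30.0},
    {"city": "Shanghai", "price": 50.0},
]


def avg_metric(docs, field):
    """Metric aggregation: one number computed from a numeric field."""
    return mean(doc[field] for doc in docs)


def terms_buckets(docs, field):
    """Grouping aggregation: group documents by the value of a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def max_bucket(bucket_metrics):
    """Pipeline aggregation: pick the group with the largest metric value."""
    key = max(bucket_metrics, key=bucket_metrics.get)
    return key, bucket_metrics[key]


overall_avg = avg_metric(ORDERS, "price")
avg_per_city = {
    city: avg_metric(docs, "price")
    for city, docs in terms_buckets(ORDERS, "city").items()
}
best_city = max_bucket(avg_per_city)

print(overall_avg)   # 30.0
print(avg_per_city)  # {'Guangzhou': 20.0, 'Shanghai': 50.0}
print(best_city)     # ('Shanghai', 50.0)
```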
10 ES cluster management
    This includes monitoring the ES cluster nodes, migrating cluster shards, configuring cluster nodes, node discovery, and
configuring cluster balancing. ES scales out horizontally by adding nodes to the cluster, which gives it the ability to process massive amounts of data.
11 Index analyzers (word breakers)
    In ES, the index analysis module can be configured by registering analyzers. When a document is indexed, the analyzer
extracts terms from the document to support the storage and search of the index. An analyzer consists of one tokenizer
and 0 or more token filters. Commonly used analyzers are: StandardAnalyzer (single-character segmentation), CJKAnalyzer
(two-character segmentation), and SmartChineseAnalyzer (dictionary-based word segmentation).
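
    The difference between single-character and two-character segmentation of Chinese text can be sketched as follows.
These functions only imitate the idea and are not the Lucene analyzers themselves; dictionary-based segmentation needs a
word dictionary and is not sketched here.

```python
def unigram_tokens(text):
    """Single-character segmentation: every character is its own token."""
    return list(text)


def bigram_tokens(text):
    """Two-character segmentation: overlapping pairs of adjacent characters."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


sample = "全文检索"  # "full-text search"
print(unigram_tokens(sample))  # ['全', '文', '检', '索']
print(bigram_tokens(sample))   # ['全文', '文检', '检索']
```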
12 ELK
(1) E refers to Elasticsearch.
(2) L refers to Logstash, a flexible open source tool for collecting, processing, and transmitting data. Logstash can
collect log events and unstructured data and output them, for example into ES.
(3) K refers to Kibana, an open source data visualization platform that presents data in a powerful, graphical manner.
    In the industry, Elasticsearch+Logstash+Kibana is abbreviated as ELK. The combination is used specifically for log data:
storing, searching, and analyzing logs, and displaying them.
[1] Combination one (Log4j+Filebeat+Elasticsearch+Kibana)
    The Java application writes its logs to a file via Log4j. Filebeat runs on the same server, monitors the log file for
changes, and sends the new log entries over the network directly to Elasticsearch for storage. This setup does not require Logstash.
[2] Combination two (Log4j+Filebeat+Logstash+Elasticsearch+Kibana)
    The Java application writes its logs to a file via Log4j. Filebeat runs on the same server, monitors the log file for
changes, and sends the new log entries over the network to Logstash, which then sends them over the network to Elasticsearch for storage.
    Filebeat is a product that Elastic obtained through an acquisition; before the acquisition it competed with Logstash.
Filebeat is lighter and uses fewer resources, but Logstash has filter functionality and can filter and parse logs.
Therefore, if you do not need to filter the logs, use combination one; otherwise use combination two.
13 Can ES replace a relational database?
1. ES has no transactions and lacks access control.
2. ES is near real-time; changes are not available immediately.
3. Some more complex data operations that are easy to implement with SQL in a relational database such as MySQL are quite complex in ES.
4. The cost is higher than that of a database, because ES gains performance largely by consuming memory.
5. ES is just a search engine and is suitable for storing some (limited) static data. In a distributed system, ES is used as the front-end store for static data, while the
final data store is MySQL. The data kept in ES should be updated rarely, because updating data lowers the performance of ES as a whole.
14 Choosing among common full-text search technologies by usage scenario
(1) Lucene
    Lucene is a full-text search engine toolkit that provides a complete query engine and indexing engine, and a partial text analysis engine.
It is the foundation underneath Solr, ES, and others. It provides the basic functions of saving and retrieving indexed data, but does not provide concurrent writing or a network interface.
Therefore, applications are generally not developed directly on Lucene.
(2) Comparison of Solr and ES
    ES is the newcomer; Solr is more mature.
    Solr performs poorly for real-time search, that is, indexing and searching at the same time.
    ES performs better for real-time search, and setting up an ES cluster is simpler.
    Unless there is a special reason not to, choose ES.
Appendix: discussions about performance (including StackOverflow):
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch
https://stackoverflow.com/questions/10213009/solr-vs-elasticsearch
