Full-Text Search Engine Elasticsearch: A Getting-Started Tutorial


Full-text search is one of the most common requirements. The open-source Elasticsearch (hereinafter referred to as Elastic) is currently the first choice among full-text search engines.


It can quickly store, search, and analyze massive volumes of data. Wikipedia, Stack Overflow, and GitHub all use it.

Under the hood, Elastic is built on the open-source library Lucene. Lucene cannot be used directly, however: you have to write your own code to call its interfaces. Elastic wraps Lucene and provides a REST API that works out of the box.


This article explains how to use Elastic to build your own full-text search engine from scratch. Every step is described in detail, so you can learn by following along.


I. Installation


Elastic requires a Java 8 environment. If Java is not yet installed on your machine, install it first and make sure that the JAVA_HOME environment variable is set correctly.


After installing Java, you can follow the official documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/zip-targz.html) to install Elastic. Downloading the compressed package directly is the simplest approach.


$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.zip
$ unzip elasticsearch-5.5.1.zip
$ cd elasticsearch-5.5.1/


Next, go to the decompressed directory and run the following command to start Elastic.


$ ./bin/elasticsearch


If the error (https://github.com/spujadas/elk-docker/issues/92) "max virtual memory areas vm.max_map_count [65530] is too low" is reported at this point, run the following command.


$ sudo sysctl -w vm.max_map_count=262144


If everything works properly, Elastic runs on the default port 9200. Open another command line window and request the port.


$ curl localhost:9200

{
  "name": "atntrTf",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "tf9250xhq6ee4policyi11ana",
  "version": {
    "number": "5.5.1",
    "build_hash": "19c13d0",
    "build_date": "2017-07-18T20:44:24.823Z",
    "build_snapshot": false,
    "lucene_version": "6.6.0"
  },
  "tagline": "You Know, for Search"
}


In the code above, when port 9200 is requested, Elastic returns a JSON object containing information about the current node, cluster, and version.
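
The same check can be scripted. The following is a minimal Python sketch, assuming a node on the default port 9200; the helper names fetch_node_info and summarize_node_info are made up for this illustration and are not part of Elastic.

```python
import json
from urllib.request import urlopen


def fetch_node_info(base_url="http://localhost:9200"):
    """GET the root endpoint and decode the JSON body (needs a running node)."""
    with urlopen(base_url, timeout=5) as resp:
        return json.load(resp)


def summarize_node_info(info):
    """Pick out the node name, cluster name, and version numbers."""
    version = info.get("version", {})
    return {
        "node": info.get("name"),
        "cluster": info.get("cluster_name"),
        "elastic": version.get("number"),
        "lucene": version.get("lucene_version"),
    }


# With Elastic running locally:
#   print(summarize_node_info(fetch_node_info()))
```

Keeping the parsing separate from the network call makes it possible to try summarize_node_info on a saved response without a running node.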


Press Ctrl + C to stop running Elastic.


By default, Elastic only allows local access. If you need remote access, edit the config/elasticsearch.yml file in the Elastic installation directory, uncomment the network.host line, change its value to 0.0.0.0, and then restart Elastic.


network.host: 0.0.0.0


In the setting above, 0.0.0.0 lets anyone access the node. Do not use this for production services; set it to a specific IP address instead.


II. Basic Concepts
2.1 Node and Cluster


Elastic is essentially a distributed database that allows multiple servers to work together. Each server can run multiple Elastic instances.


A single Elastic instance is called a node. A group of nodes forms a cluster.


2.2 Index


Elastic indexes every field and, after processing, writes the result to an inverted index. A search then looks the data up in that index directly.


For this reason, the top-level unit of data management in Elastic is called an Index. It is the equivalent of a single database. The name of each Index (that is, each database) must be lowercase.
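
To make the idea of an inverted index concrete, here is a toy Python sketch. It is only an illustration of the data structure, not how Lucene or Elastic implement it; the tokenizer is a deliberately naive assumption that splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Look the word up in the index instead of scanning every document."""
    return sorted(index.get(word.lower(), set()))


docs = {
    "1": "database management, software development",
    "2": "System Management",
}
index = build_inverted_index(docs)
print(search(index, "management"))  # ['1', '2']
print(search(index, "software"))    # ['1']
```

The lookup cost depends on the number of distinct words rather than on the number of documents, which is why searching the index is fast.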


The following command lists all Indexes on the current node.


$ curl -X GET 'http://localhost:9200/_cat/indices?v'


2.3 Document


A single record in an Index is called a Document. Many Documents make up an Index.

A Document is expressed in JSON format. The following is an example.


{
  "user": "Zhang San",
  "title": "engineer",
  "desc": "Database Management"
}


Documents in the same Index are not required to have the same structure (schema), but it is best to keep them the same, because this improves search efficiency.


2.4 Type


Documents can be grouped. For example, in the Index weather they can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). Such a grouping is called a Type. It is a virtual logical grouping used to filter Documents.


Different Types should have similar structures (schemas). For example, the id field cannot be a string in one group and a number in another. This is one difference from tables in a relational database. Data of entirely different natures (such as products and logs) should be stored as two Indexes rather than as two Types in one Index (although the latter is possible).


The following command lists the types contained in each Index.


$ curl 'localhost:9200/_mapping?pretty=true'


According to the plan (https://www.elastic.co/blog/index-type-parent-child-join-now-future-in-elasticsearch), Elastic 6.x allows only one Type per Index, and 7.x will remove Types completely.


III. Create and Delete Indexes


To create an Index, send a PUT request directly to the Elastic server. The following example creates an Index named weather.


$ curl -X PUT 'localhost:9200/weather'


The server returns a JSON object in which the acknowledged field indicates that the operation succeeded.


{
  "acknowledged": true,
  "shards_acknowledged": true
}


Then, we send a DELETE request to delete the Index.


$ curl -X DELETE 'localhost:9200/weather'


IV. Chinese Word Segmentation Settings


First, install a Chinese word segmentation plug-in. Here, ik is used. You can also consider other plug-ins (such as smartcn).


$ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip


The command above installs version 5.5.1 of the plug-in, which matches Elastic 5.5.1.


Then, restart Elastic to automatically load the newly installed plug-in.


Next, create an Index and specify which fields need word segmentation. This step depends on your data structure; the command below applies only to this article. Basically, every Chinese field that needs to be searched must be configured individually.


$ curl -X PUT 'localhost:9200/accounts' -d '
{
  "mappings": {
    "person": {
      "properties": {
        "user": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}'


The code above first creates an Index named accounts, which contains a Type named person. The person Type has three fields.


  • user

  • title

  • desc


All three fields hold Chinese text and are of type text, so a Chinese analyzer must be specified for them; the default English analyzer cannot be used.


Elastic calls its word segmenter an analyzer. We specify an analyzer for each field.


"user": {
  "type": "text",
  "analyzer": "ik_max_word",
  "search_analyzer": "ik_max_word"
}


In the code above, analyzer is the analyzer applied to the text of the field, and search_analyzer is the analyzer applied to the search terms. The ik_max_word analyzer is provided by the ik plug-in and splits the text into as many words as possible.
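
Because every field repeats the same three settings, the mapping body can also be generated. The following Python sketch assumes the same analyzer is wanted for indexing and for searching; build_mapping is a hypothetical helper written for this illustration.

```python
import json


def build_mapping(type_name, fields, analyzer="ik_max_word"):
    """Build an index-creation body that applies one analyzer to every field."""
    properties = {
        field: {
            "type": "text",
            "analyzer": analyzer,
            "search_analyzer": analyzer,
        }
        for field in fields
    }
    return {"mappings": {type_name: {"properties": properties}}}


body = build_mapping("person", ["user", "title", "desc"])
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request body used to create the accounts Index.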


V. Data Operations
5.1 Add Records


Sending a PUT request to a given /Index/Type adds a record to that Index. For example, a request to /accounts/person adds a new person record.


$ curl -X PUT 'localhost:9200/accounts/person/1' -d '
{
  "user": "Zhang San",
  "title": "engineer",
  "desc": "Database Management"
}'


The server returns the Index, Type, Id, Version, and other information.


{
  "_index": "accounts",
  "_type": "person",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {"total": 2, "successful": 1, "failed": 0},
  "created": true
}


If you look closely, you will notice that the request path is /accounts/person/1; the trailing 1 is the Id of the record. It does not have to be a number: any string (such as abc) will do.


When adding a record, you can also omit the Id. In that case, use a POST request instead.


$ curl -X POST 'localhost:9200/accounts/person' -d '
{
  "user": "Li Si",
  "title": "engineer",
  "desc": "System Management"
}'


In the code above, a POST request is sent to /accounts/person to add a record. In this case, the _id field in the JSON object returned by the server is a random string.


{
  "_index": "accounts",
  "_type": "person",
  "_id": "AV3qGfrC6jMbsbXb6k1p",
  "_version": 1,
  "result": "created",
  "_shards": {"total": 2, "successful": 1, "failed": 0},
  "created": true
}


Note: if you run the command above without first creating the Index (accounts in this example), Elastic does not report an error; it simply creates the specified Index. So be careful when typing, and do not misspell the Index name.
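
The choice between PUT and POST can be captured in a few lines. This Python sketch only builds the request and does not send it; index_request is a hypothetical helper, and the Content-Type header is an assumption added here (the curl commands in this article do not set it, but later Elastic versions require it).

```python
import json
from urllib.parse import quote
from urllib.request import Request


def index_request(index, doc_type, doc, doc_id=None,
                  base_url="http://localhost:9200"):
    """Build (but do not send) the request that adds one record."""
    path = "/{}/{}".format(quote(index, safe=""), quote(doc_type, safe=""))
    if doc_id is None:
        method = "POST"  # no Id given: Elastic generates one
    else:
        method = "PUT"   # the caller chooses the Id
        path += "/" + quote(str(doc_id), safe="")
    return Request(
        base_url + path,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


req = index_request("accounts", "person", {"user": "Zhang San"}, doc_id=1)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it to a running node.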


5.2 View Records


Send a GET request to /Index/Type/Id to view the record.


$ curl 'localhost:9200/accounts/person/1?pretty=true'


The command above requests the record /accounts/person/1. The URL parameter pretty=true asks for the response in a readable format.


In the returned data, the found field indicates that the query succeeded, and the _source field holds the original record.


{
  "_index": "accounts",
  "_type": "person",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "user": "Zhang San",
    "title": "engineer",
    "desc": "Database Management"
  }
}


If the Id is incorrect, no data is found and the found field is false.


$ curl 'localhost:9200/accounts/person/abc?pretty=true'

{
  "_index": "accounts",
  "_type": "person",
  "_id": "abc",
  "found": false
}
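
A client usually wants either the record or a clear "not found". The Python sketch below shows one way to handle both responses; extract_source and get_record are hypothetical helpers, and the error-handling branch rests on the assumption that a missing Id is answered with an HTTP error status whose body is still the JSON object shown above.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def extract_source(response):
    """Return the stored record, or None when found is false."""
    if response.get("found"):
        return response.get("_source")
    return None


def get_record(index, doc_type, doc_id, base_url="http://localhost:9200"):
    """Fetch one record from a running node."""
    url = "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    try:
        with urlopen(url, timeout=5) as resp:
            body = json.load(resp)
    except HTTPError as err:
        # Assumption: the error response still carries the JSON body
        # with "found": false, so it can be parsed the same way.
        body = json.load(err)
    return extract_source(body)


# With Elastic running locally:
#   print(get_record("accounts", "person", "1"))
```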


5.3 Delete Records


To delete a record, send a DELETE request.


$ curl -X DELETE 'localhost:9200/accounts/person/1'


Do not delete this record for now; it will be used later.


5.4 Update Records


To update a record, use a PUT request to send the data again.


$ curl -X PUT 'localhost:9200/accounts/person/1' -d '
{
  "user": "Zhang San",
  "title": "engineer",
  "desc": "database management, software development"
}'

{
  "_index": "accounts",
  "_type": "person",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {"total": 2, "successful": 1, "failed": 0},
  "created": false
}


In the code above, we changed the original data from "Database Management" to "database management, software development". Several fields in the response have changed.


"_version": 2,
"result": "updated",
"created": false


As you can see, the record Id is unchanged, but the version has gone from 1 to 2, the operation type (result) has changed from created to updated, and the created field is now false, because this time no new record was created.
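
The relationship between these three fields can be modeled with a toy in-memory store. This Python sketch is purely illustrative: ToyIndex is a made-up class that mimics only the response fields discussed here and says nothing about how Elastic stores data.

```python
class ToyIndex:
    """In-memory stand-in that mimics only _version, result, and created."""

    def __init__(self):
        self._docs = {}  # id -> (version, document)

    def put(self, doc_id, doc):
        """Store the document and report whether it was created or updated."""
        old_version = self._docs.get(doc_id, (0, None))[0]
        version = old_version + 1
        self._docs[doc_id] = (version, doc)
        created = old_version == 0
        return {
            "_id": doc_id,
            "_version": version,
            "result": "created" if created else "updated",
            "created": created,
        }


toy = ToyIndex()
first = toy.put("1", {"desc": "Database Management"})
second = toy.put("1", {"desc": "database management, software development"})
print(first["result"], second["result"])  # created updated
```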


VI. Data Query
6.1 Return All Records


Sending a GET request directly to /Index/Type/_search returns all records.


$ curl 'localhost:9200/accounts/person/_search'

{
  "took": 2,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "accounts",
        "_type": "person",
        "_id": "AV3qGfrC6jMbsbXb6k1p",
        "_score": 1.0,
        "_source": {
          "user": "Li Si",
          "title": "engineer",
          "desc": "System Management"
        }
      },
      {
        "_index": "accounts",
        "_type": "person",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "user": "Zhang San",
          "title": "engineer",
          "desc": "database management, software development"
        }
      }
    ]
  }
}


In the result above, the took field is the time the operation took (in milliseconds), the timed_out field indicates whether the operation timed out, and the hits field holds the matching records. Its subfields have the following meanings.


  • total: the number of records returned. In this example, two records are returned.

  • max_score: the highest degree of match. In this example, it is 1.0.

  • hits: an array of the returned records.


Each returned record has a _score field that indicates its degree of match. By default, records are sorted by this field in descending order.
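
Reading these fields in a program is straightforward. The Python sketch below assumes the response shape shown in this article, where hits.total is a plain integer; summarize_hits is a hypothetical helper.

```python
def summarize_hits(response):
    """Return the total and one (score, id, source) tuple per hit, best first."""
    hits = response["hits"]
    rows = [(h["_score"], h["_id"], h["_source"]) for h in hits["hits"]]
    # Elastic already sorts by score; sorting again keeps the sketch
    # correct for hand-built input as well.
    rows.sort(key=lambda row: row[0], reverse=True)
    return hits["total"], rows


response = {
    "took": 2,
    "timed_out": False,
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_id": "a", "_score": 0.5, "_source": {"user": "Li Si"}},
            {"_id": "1", "_score": 1.0, "_source": {"user": "Zhang San"}},
        ],
    },
}
total, rows = summarize_hits(response)
for score, doc_id, source in rows:
    print(score, doc_id, source["user"])
```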


6.2 Full-Text Search


Elastic's queries are unusual: they use their own query syntax (https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl.html) and require GET requests to carry a request body.


$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {"match": {"desc": "software"}}
}'


The code above uses a Match query (https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-match-query.html) whose condition is that the desc field contains the word "software". The returned results are as follows.


{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.28582606,
    "hits": [
      {
        "_index": "accounts",
        "_type": "person",
        "_id": "1",
        "_score": 0.28582606,
        "_source": {
          "user": "Zhang San",
          "title": "engineer",
          "desc": "database management, software development"
        }
      }
    ]
  }
}


By default, Elastic returns 10 results at a time. You can change this with the size field.


$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {"match": {"desc": "management"}},
  "size": 1
}'


The code above specifies that only one result is returned at a time.


You can also use the from field to specify the offset.


$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {"match": {"desc": "management"}},
  "from": 1,
  "size": 1
}'


The code above specifies that results start from position 1 (the default is position 0) and that only one result is returned.
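
The from and size fields are enough to implement simple paging. The Python sketch below converts a 1-based page number into a search body; page_body is a hypothetical helper.

```python
import json


def page_body(query, page, size=10):
    """Build a search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"query": query, "from": (page - 1) * size, "size": size}


query = {"match": {"desc": "management"}}
# Page 2 with one result per page is the same request as from=1, size=1.
print(json.dumps(page_body(query, page=2, size=1)))
```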


6.3 Logical Operations


If there are multiple search keywords, Elastic treats them as an or relationship.


$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {"match": {"desc": "software system"}}
}'


The code above searches for software or system.


To search for multiple keywords with an and relationship, you must use a bool query (https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-bool-query.html).


$ curl 'localhost:9200/accounts/person/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {"match": {"desc": "software"}},
        {"match": {"desc": "system"}}
      ]
    }
  }
}'
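
Both request bodies in this section follow a fixed pattern, so they can be generated from a list of keywords. In the Python sketch below, or_query and and_query are hypothetical helper names; the bodies they produce are the ones shown in this section.

```python
import json


def or_query(field, terms):
    """One match clause over all terms: any term may match."""
    return {"query": {"match": {field: " ".join(terms)}}}


def and_query(field, terms):
    """One match clause per term under bool/must: every term must match."""
    clauses = [{"match": {field: term}} for term in terms]
    return {"query": {"bool": {"must": clauses}}}


print(json.dumps(and_query("desc", ["software", "system"]), indent=2))
```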


VII. Reference Links


  • ElasticSearch official manual (https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html)

  • A Practical Introduction to Elasticsearch (https://www.elastic.co/blog/a-practical-introduction-to-elasticsearch)


Source: http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

