Elk Big Data Query series: Elasticsearch and Logstash Basics

Recently, some internal projects have involved massive-data queries with Elasticsearch. Since there is little material about it online, I wrote this article to record the various problems I ran into during setup and their solutions, for your reference.

0x02: Installation

First, install Elasticsearch (ES). Installing it, its common plug-ins, and its configuration is straightforward; here is a brief walkthrough.

I am using Ubuntu 12.04.5, and the only prerequisite for ES is a reasonably recent Java. Installation is simple: download the .deb package from the official site (www.elastic.co) and install it. Then start the service with sudo service elasticsearch start and visit localhost:9200.

The ES configuration file is /etc/elasticsearch/elasticsearch.yml:

cluster.name: elasticsearch        # cluster name
node.name: "Franz Kafka"           # node name
#node.data: true                   # whether this node stores index data
#path.conf: /path/to/conf          # configuration file path
#path.data: /path/to/data          # index data path; multiple paths may be separated by commas
#path.logs: /path/to/logs          # log file path
To manage Elasticsearch visually, install the elasticsearch-head plug-in. From the bin directory under the installation directory:

sudo ./plugin -install mobz/elasticsearch-head

Then visit http://localhost:9200/_plugin/head/

0x03: Introduction

The basic installation of Elasticsearch is now complete. The following describes its basic structure to help you build a working mental model.
The terminology maps onto MySQL like this:

MySQL:          databases   tables   rows        columns
Elasticsearch:  indices     types    documents   fields

The word "index" in Elasticsearch is confusing because it has three meanings:

1. An index (noun) is like a database in MySQL: the place where data is stored.

2. To index (verb) means to store a document into an index (noun).

3. An inverted index is like adding an index to a specific column in MySQL.
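As a sketch of meaning (2): sending the following body with a PUT to localhost:9200/mydb/user/1 indexes one document (the index name mydb, type user, and field names are invented for illustration):

```json
{
  "name": "Franz",
  "signup_date": "2015-01-01"
}
```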

Similar to MySQL, where every database has tables and every table has a schema, each ES index has one or more types, and each type has a mapping. A mapping is like a table schema in MySQL: it maps each field to a data type (such as string or date).

You can add a field to an existing mapping, but you cannot modify an existing field. If a field already exists in the mapping, data for it has probably already been indexed; changing the field's mapping would make the already-indexed data wrong and impossible to search correctly.
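A minimal mapping sketch: sending this body with a PUT to localhost:9200/mydb creates an index whose user type assigns each field a data type (names invented for illustration):

```json
{
  "mappings": {
    "user": {
      "properties": {
        "name":        { "type": "string" },
        "signup_date": { "type": "date" }
      }
    }
  }
}
```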

To make full-text search efficient, Elasticsearch uses inverted indexes. An analyzer tokenizes the text to be analyzed into appropriate terms and then normalizes those terms so they are easy to search (for example, by lowercasing and handling whitespace).
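You can watch an analyzer work through the _analyze API. Analyzing the text "Hello, World" with the standard analyzer yields lowercased terms, roughly:

```json
{
  "tokens": [
    { "token": "hello", "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "world", "start_offset": 7, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }
  ]
}
```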

For Chinese word segmentation, the ik plug-in is undoubtedly a good choice.

Download ik: https://github.com/medcl/elasticsearch-analysis-ik

Unzip it, enter the directory, and build with Maven:

sudo mvn compile
sudo mvn package

After the build finishes, a target/releases directory is generated in the current directory.

Copy elasticsearch-analysis-ik-1.4.0.zip into a new plugins/analysis-ik directory under the ES installation directory (create the directory if it does not exist) and unzip it there. Then copy config/ik from the elasticsearch-analysis-ik directory into the config directory of ES.

Open config/elasticsearch.yml and add:

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:          # split the text at the finest granularity, exhausting the possible combinations
        type: ik
        use_smart: false
      ik_smart:             # coarser, "smart" segmentation
        type: ik
        use_smart: true


Restart ES and the ik analyzer is available.
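Once ik is available, you can assign it to a field in a mapping so that field is segmented with ik instead of the default analyzer (the type and field names here are invented for illustration):

```json
{
  "mappings": {
    "article": {
      "properties": {
        "content": { "type": "string", "analyzer": "ik" }
      }
    }
  }
}
```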

A good analyzer improves query efficiency, accuracy, and coverage.

Elasticsearch ships with a default analyzer. If you do not want a field analyzed, declare it when creating the mapping:

"index": "not_analyzed"
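That setting lives inside a field's mapping; a sketch (field name invented for illustration):

```json
{
  "mappings": {
    "user": {
      "properties": {
        "EMAIL": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

A not_analyzed field is stored as a single exact term, which suits identifiers such as email addresses.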

0x04: Importing data

To import the data, I use logstash:


Logstash processing is divided into three stages, each with rich functionality:

input -> filter -> output

It can import data in JSON, CSV, XML, and other formats, and can also pull data in from other databases.
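For instance, a CSV import would pair a file input with the csv filter; a minimal sketch (the column names are assumptions for illustration):

```
filter {
  csv {
    separator => ","
    columns => ["USERID", "TITLE", "TEXT"]   # names to assign to the parsed columns
  }
}
```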

Logstash's filters are comprehensive as well. The geoip filter, for example, resolves an IP address directly to its longitude/latitude, geographic location, and other details.
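A minimal geoip sketch (the source field name clientip is an assumption; the filter adds the resolved location data under a geoip key on the event):

```
filter {
  geoip {
    source => "clientip"   # field that holds the IP address to resolve
  }
}
```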


The following is a configuration file example for the cases that tend to cause problems:

input {
  file {
    path => "/home/freebuf/Desktop/free.json"   # file location
    start_position => "beginning"
  }
}
filter {
  json {
    source => "message"   # required parameter
  }
  mutate {
    remove_field => ["path", "message", "host", "@timestamp", "@version"]   # drop fields you do not want to keep
    add_field => { "field" => "value" }   # add a field
  }
}
output {
  elasticsearch {
    host => "localhost"
    index => "database"
    index_type => "data"
    protocol => "http"
    cluster => "yourclustername"   # if you changed the cluster name, it must be specified here
  }
}


1. The JSON must be newline-delimited, one complete object per line, with all string values quoted:

{"USERID": "KAKAO", "TITLE": "BUF", "TEXT": "HI"}
{"USERID": "NACY", "TITLE": "FREE", "TEXT": "HELLO"}

2. If an error occurs during import and you re-import, no data may be indexed because of the .sincedb* file: each time logstash reads a file, it records the last byte offset it read in sincedb, keyed by inode. If a file has the same inode number, logstash treats it as the same, already-read file. Delete the sincedb cache file and import again.
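The sincedb location varies by logstash version; a common default is a .sincedb_* file in the home directory of the user running logstash (treat the exact path as an assumption). The reset is just deleting that file, sketched here against a throwaway path:

```shell
# Logstash tracks read progress per inode in a sincedb file; deleting it
# forces the next run to re-read the input file from the beginning.
SINCEDB="/tmp/.sincedb_demo"   # stand-in path for the real sincedb file
touch "$SINCEDB"               # pretend a previous import left one behind
rm -f "$SINCEDB"               # remove it so logstash re-imports everything
```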


You can use logstash to flexibly import data to elasticsearch.

0x05: Querying

Querying is a database's most important function. The head plug-in's web interface imposes many restrictions on queries and is inflexible to use; instead, we can write the query statements we want ourselves.

The data-browsing tab of the head plug-in issues wildcard queries. Here is one such query:

{
  "fields": ["_parent", "_source"],        # return the _source field from the metadata
  "query": {
    "bool": {                              # bool query
      "must": [
        {"wildcard": {"EMAIL": "nicef*"}}  # wildcard match; "must" clauses are required to match
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 50,                              # return at most 50 hits
  "sort": [],
  "facets": {},
  "version": true
}

If the field was indexed with the default analyzer, special characters such as @ and # cannot be matched by the wildcard statement above. The following query does match them:

{
  "query": {
    "bool": {
      "must": [
        {"query_string": {"default_field": "data.EMAIL", "query": "@163.com"}}
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "facets": {}
}

A string query (query_string) can match special characters as well as Chinese text.

Various query conditions allow you to flexibly match the desired content.

Pyes lets you use Elasticsearch from Python and provides a rich API:


For example, pyes.es.ES establishes a connection with ES, pyes.managers manages the whole ES cluster, and queries can be composed flexibly, such as BoolQuery().add_should(TermQuery("field", string)), and so on.

0x06: Conclusion

As a popular search engine, Elasticsearch offers distributed real-time document storage with powerful real-time analytics and search. It scales easily to hundreds of servers and petabyte-scale data without requiring complex theory, and it is well worth trying.
