Introduction
If you use Elasticsearch to store your logs, this article offers some useful practices and suggestions.
If you want to collect logs from multiple hosts into Elasticsearch, you have several options:
- Graylog2: install it on a central machine, let it insert your logs into Elasticsearch, and use its nice search interface.
- Logstash: it has many features covering which logs to take as input, how to transform and filter them, and where to output them. It can output to Elasticsearch in two ways: directly, or in river mode through RabbitMQ.
- Apache Flume: it can also take logs from a large variety of sources, modify them with "decorators", and write them out with various "sinks". The one relevant here is the elasticflume sink.
- omelasticsearch, rsyslog's output module: you can output data to Elasticsearch directly through rsyslog on your application servers, or have rsyslog forward data to a central server that inserts the logs, or combine the two. For setup details, see the rsyslog wiki.
- A custom solution. For example, you can write a script that ships your logs to Elasticsearch from wherever they happen to live.
The best configuration depends on your particular setup, but there are a few generally useful guidelines to recommend:
Memory and number of open files
If Elasticsearch runs on a dedicated server, the rule of thumb is to allocate half of the memory to Elasticsearch; the other half is used by the system cache, which is also important.
You can change this by setting the ES_HEAP_SIZE environment variable to the desired value before starting Elasticsearch. Another option is the ES_JAVA_OPTS variable, which is picked up by the startup script (elasticsearch.in.sh or elasticsearch.bat): look for the -Xms and -Xmx parameters there, which are the minimum and maximum memory allocated to the process. We recommend setting both to the same value, which is exactly what ES_HEAP_SIZE does for you.
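For example, on Linux you might start a dedicated instance like this (a minimal sketch; the 8g value is only an illustration, pick roughly half of your server's RAM):
$ export ES_HEAP_SIZE=8g   # e.g. half of a 16 GB machine
$ bin/elasticsearch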
Also make sure the limit on open file descriptors is large enough for Elasticsearch; the recommended value is between 32000 and 64000. For more on how to set this limit, see the link.
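For example, on Linux you can raise the limit in the shell that starts Elasticsearch (a sketch; for a permanent setting, use /etc/security/limits.conf instead):
$ ulimit -n 64000
$ bin/elasticsearch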
Number of indices
One option is to store all logs in a single index and rely on the TTL field to get old logs deleted. Once your log volume is large enough, though, this becomes a problem: using TTL adds overhead, and optimizing one huge index takes too long; both operations are resource-intensive.
The recommended approach is to create time-based indices. For example, the index name can follow the YYYY-MM-DD date format. The interval depends entirely on how long you want to keep your logs: if you keep them for a week, one index per day works well; if you keep them for a year, one index per month is probably better. There should not be too many indices, because full-text searches across many of them add overhead.
If your indices are time-based, you can also narrow searches down to the relevant indices. For example, if most searches are about recent logs, you can offer a "quick search" option in your own interface that only looks in the most recent indices.
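For example, with one index per day you can search only the last two days by listing both index names in the URL (the names below are hypothetical):
$ curl 'http://localhost:9200/logs-2012-05-20,logs-2012-05-21/_search?q=error'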
Rotation and Optimization
With time-based indices, removing old logs becomes extremely simple:
$ curl -XDELETE 'http://localhost:9200/old-index-name/'
This operation is very fast, close to the cost of simply deleting a few files of equivalent size. You can run it from a cron job in the middle of the night.
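A sketch of such a crontab entry, assuming daily indices named logs-YYYY-MM-DD, one-week retention, and GNU date (note that % must be escaped as \% inside crontab):
0 3 * * * curl -XDELETE "http://localhost:9200/logs-$(date -d '7 days ago' +\%Y-\%m-\%d)/"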
Optimizing indices during off-peak hours is also a good idea, because it improves search speed. It is especially worthwhile with time-based indices: apart from the current one, the older indices never change again, so you only need to optimize each old index once, and it stays optimized for good:
$ curl -XPOST 'http://localhost:9200/old-index-name/_optimize'
Sharding and Replication
You can configure per-index settings through elasticsearch.yml or the REST API. For details, see the link.
The interesting settings here are the number of shards and replicas. By default, each index is divided into five shards, and if the cluster has more than one node, each shard gets one replica, for a total of ten shards per index. When a new node joins the cluster, shards are rebalanced automatically. So if you have a default index and eleven servers in the cluster, one of them will not store any data.
Each shard is a Lucene index, so the smaller a shard is, the less data Elasticsearch has to work with when putting new documents into it. Splitting an index into more shards therefore makes inserts faster. Note that with time-based indices, you only insert logs into the current index; the older ones are never touched.
Too many shards has a cost, though, in both space usage and search time, so you need to find a balance between your insert volume, search frequency, and hardware. An example of setting these values is sketched below.
Replicas, on the other hand, keep your cluster running when some nodes go down: the more replicas, the fewer nodes need to stay online. Replicas also help searches: more replicas mean faster searches, at the cost of slower indexing, because every change to a shard has to be propagated to more replicas.
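Here is a sketch of creating an index with custom values at creation time (the numbers are placeholders to tune against your own hardware):
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/' -d '{
    "settings": {
        "index": { "number_of_shards": 2, "number_of_replicas": 1 }
    }
}'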
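Unlike the number of shards, which is fixed when the index is created, the number of replicas can be changed later through the settings API:
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/_settings' -d '{
    "index": { "number_of_replicas": 2 }
}'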
Mapping, _source and _all
Mappings define how your documents are indexed and stored. You can, for example, define the type of each field: in a syslog entry, the message would be a string while the severity could be an integer. For how to define mappings, see the link.
Mappings have reasonable defaults: field types are detected automatically when the first document of a new index is inserted. But you may want to take control yourself. For example, the message field of the first record in a new index may contain nothing but a number, so it gets detected as a long. When 99% of the following logs are strings, Elasticsearch cannot index them; it will only log an error saying the field type is wrong. In that case, you need to map the field explicitly, e.g. "message": {"type": "string"}. For details on how to register a specific mapping, refer to the link.
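For illustration, such an explicit mapping could be registered with the put mapping API like this (the type name syslog is just an example):
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/syslog/_mapping' -d '{
    "syslog": {
        "properties": {
            "message": { "type": "string" }
        }
    }
}'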
With time-based index names, it may be more appropriate to create an index template in the configuration file; for more information, see the link. Besides the mapping, the template can define other index properties, such as the number of shards.
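A sketch of such a template, applying to every index whose name starts with logs- and defining both a mapping and the number of shards (all names and values are examples):
$ curl -XPUT 'http://localhost:9200/_template/logs_template' -d '{
    "template": "logs-*",
    "settings": { "index": { "number_of_shards": 2 } },
    "mappings": {
        "syslog": {
            "properties": {
                "message": { "type": "string" }
            }
        }
    }
}'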
In the mapping, you can also choose to compress the _source of your documents, which in this case is the whole log line. Enabling compression shrinks the index and, depending on your setup, can improve performance: the rule of thumb is that compressing the source helps noticeably when you are limited by memory size and disk speed, and not at all when you are limited by CPU. For more about the _source field, see the link.
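A sketch of turning compression on in the mapping, assuming an Elasticsearch version from the era of this article where the _source compress option is still available (later versions removed it):
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/syslog/_mapping' -d '{
    "syslog": {
        "_source": { "compress": true }
    }
}'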
By default, in addition to indexing each of your fields, Elasticsearch also indexes them together in a field named _all. The upside is that you can search _all for something without caring which field it is in; the downside is the extra CPU spent at index time and the larger index. So if you don't need this feature, turn it off; and even if you do use it, consider defining exactly which fields go into _all. For more information, see the link.
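Disabling _all is likewise done in the mapping; a minimal sketch:
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/syslog/_mapping' -d '{
    "syslog": {
        "_all": { "enabled": false }
    }
}'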
Refresh Interval
Elasticsearch is near-real-time, in the sense that after a document is indexed, the index must be refreshed before the document can be found by a search. By default, indices refresh automatically and asynchronously every second.
Refreshing is a fairly expensive operation, so raising this value even slightly can give a very noticeable boost to the insert rate. How far you can raise it depends on how long your users are willing to wait for new logs to become searchable.
You can put the desired interval in your index template, or in the elasticsearch.yml configuration file.
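The setting can also be changed on a live index through the settings API; for example, to refresh every 5 seconds instead of every second:
$ curl -XPUT 'http://localhost:9200/logs-2012-05-20/_settings' -d '{
    "index": { "refresh_interval": "5s" }
}'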
Another option is to disable automatic refreshing by setting the interval to -1 and then refresh manually through the REST API. This works very well when you need to insert a huge batch of logs in one go. In normal operation, though, you would then refresh either after every bulk insert or before every search, which delays those operations instead.
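A manual refresh is a simple call to the refresh endpoint:
$ curl -XPOST 'http://localhost:9200/logs-2012-05-20/_refresh'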
Thrift
Normally the REST interface runs over HTTP, but you can swap in Thrift, which is faster. You need to install the transport-thrift plugin and make sure your client supports it. For example, with the pyes Python client you only need to change the connection port from 9200 (the HTTP default) to 9500 (the Thrift default).
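Installing the plugin is a one-liner with the plugin script; the coordinates and version below are only an assumed example, so check the plugin's README for the exact ones matching your Elasticsearch release:
$ bin/plugin -install elasticsearch/elasticsearch-transport-thrift/0.20.0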
Asynchronous replication
Normally, an index operation only returns after the document has been indexed on all shards, replicas included. With the index API you can set replication to asynchronous, so that replication runs in the background. You can use the API directly, or through a client that supports this (such as pyes or rsyslog's omelasticsearch).
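With the raw API, this is a query-string parameter on the index operation; a sketch (index, type, and document are made up):
$ curl -XPOST 'http://localhost:9200/logs-2012-05-20/syslog?replication=async' -d '{
    "message": "something went wrong"
}'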
Use filters instead of queries
Usually, when you search logs, you care about ordering them by time, not about relevance scoring; scoring is quite irrelevant in this scenario. That makes filters a better fit than queries for finding logs: filters skip scoring and can be cached automatically. For more details, see the link.
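For example, to find recent error logs sorted by time, a filtered query with a range filter on the timestamp skips scoring for the time constraint entirely (field names and values are examples):
$ curl 'http://localhost:9200/logs-2012-05-20/_search' -d '{
    "query": {
        "filtered": {
            "query": { "query_string": { "query": "error" } },
            "filter": {
                "range": { "@timestamp": { "from": "2012-05-20T00:00:00" } }
            }
        }
    },
    "sort": [ { "@timestamp": "desc" } ]
}'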
Bulk Indexing
We recommend using the bulk API for indexing; it is much faster than indexing one log at a time (a minimal example follows the list below).
There are two main considerations:
- Optimal batch size. It depends heavily on your setup; as a starting point, you can use the pyes default of 400.
- A timer for flushing batches. If you buffer logs and trigger a bulk insert when the buffer reaches a size limit, you should also add a timeout as a complement to that size limit; otherwise, when log volume is low, there can be a huge delay between a log being emitted and it appearing in Elasticsearch.
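A minimal sketch of a bulk request over curl: the body alternates action lines and document lines, one JSON object per line, ending with a newline (index and type names are examples; --data-binary is needed so curl keeps the newlines intact):
$ cat > bulk_request <<'EOF'
{ "index": { "_index": "logs-2012-05-20", "_type": "syslog" } }
{ "message": "first log line" }
{ "index": { "_index": "logs-2012-05-20", "_type": "syslog" } }
{ "message": "second log line" }
EOF
$ curl -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk_request
In practice your client library (pyes, for example) builds this body for you; the size limit and the timeout from the list above decide when the buffered batch gets sent.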