Hunk/Hadoop: Performance Best Practices
Whether or not you use Hunk, there are many ways of running Hadoop that lead to occasional performance problems. Most of the time people simply add more hardware, but sometimes the problem can be solved just by changing how files are named.
Run Map-Reduce Jobs [Hunk]
Hunk runs on Hadoop, but that does not necessarily mean it uses Hadoop effectively. If Hunk runs in verbose mode instead of smart mode, it will not actually use Map-Reduce. Instead, it pulls all of the Hadoop data directly into the Splunk engine and processes it there.
HDFS Storage [Hadoop]
How you lay out your files in Hadoop matters a great deal when Hunk searches them. Include the timestamp in the file path. Hunk can use the directory structure as a filter, which can greatly reduce the volume of data pulled into Splunk.
A timestamp in the file name also works, but less effectively, because Hunk still has to read every file name.
For even better performance, you can include a key-value pair in the file path, for example "/2015/3/2/app=webserver/...". While traversing the directories, the search can filter on the key-value pair, further reducing the volume of data pulled into Splunk.
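As a rough illustration of why this layout helps, the sketch below builds paths in the /year/month/day/app=value/ style and then selects directories using only their names, without opening any files. The function names (partition_path, matches) and the sample dates and app values are hypothetical; this only mimics the idea and is not how Hunk itself is implemented.

```python
from datetime import date


def partition_path(day: date, app: str) -> str:
    """Build a directory path such as /2015/3/2/app=webserver/."""
    return f"/{day.year}/{day.month}/{day.day}/app={app}/"


def matches(path: str, start: date, end: date, app: str) -> bool:
    """Decide from the path alone whether a directory is worth reading."""
    parts = path.strip("/").split("/")
    year, month, day = (int(p) for p in parts[:3])
    # Key-value segments look like "app=webserver".
    fields = dict(p.split("=", 1) for p in parts[3:] if "=" in p)
    return start <= date(year, month, day) <= end and fields.get("app") == app


# Hypothetical directory listing.
directories = [
    partition_path(date(2015, 3, 1), "webserver"),
    partition_path(date(2015, 3, 2), "webserver"),
    partition_path(date(2015, 3, 2), "database"),
    partition_path(date(2015, 3, 3), "webserver"),
]

# Only directories for 2 March 2015 with app=webserver survive the filter.
selected = [
    d for d in directories
    if matches(d, date(2015, 3, 2), date(2015, 3, 2), "webserver")
]
print(selected)
```

In this example three of the four directories are skipped before any data is read, which is the saving the path-based layout is meant to provide.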
Timestamp-Based VIX / indexes.conf [Hunk]
Although this storage layout benefits any Hadoop Map-Reduce job, you need to edit indexes.conf so that Hunk can recognize the directory structure.
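The idea behind that configuration can be sketched in Python: a regular expression extracts the date from the directory path, and the result is turned into the time range the directory covers. The pattern, the function names, and the one-day granularity are assumptions made for illustration; this is not indexes.conf syntax, so consult the Hunk documentation for the actual settings.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for a /year/month/day/ directory layout.
PATH_TIME = re.compile(r"^/(\d{4})/(\d{1,2})/(\d{1,2})/")


def time_range(path: str):
    """Return (earliest, latest) covered by a path, or None if it does not match."""
    m = PATH_TIME.match(path)
    if m is None:
        return None
    year, month, day = map(int, m.groups())
    earliest = datetime(year, month, day)
    # Assumption: each directory holds exactly one day of data.
    latest = earliest + timedelta(days=1)
    return earliest, latest


def overlaps(path: str, search_start: datetime, search_end: datetime) -> bool:
    """True if the path may contain events inside the search window."""
    rng = time_range(path)
    if rng is None:
        # Unrecognized layout: it cannot be ruled out, so it has to be read.
        return True
    earliest, latest = rng
    return earliest < search_end and search_start < latest


window = (datetime(2015, 3, 2, 8), datetime(2015, 3, 2, 9))
print(overlaps("/2015/3/2/app=webserver/events.json", *window))
print(overlaps("/2015/3/1/app=webserver/events.json", *window))
```

Note that a path that does not match the pattern is always read, which is why a directory structure the tool does not recognize gives no filtering benefit.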
File Format [Hunk]
Self-describing files such as JSON and CSV are easy for Hunk to read. They are more verbose, but they eliminate costly mapping operations.
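To show what "self-describing" means in practice, the sketch below parses a small JSON-lines sample and an equivalent CSV sample using only the Python standard library. The field names and values are made up for illustration. In both cases the field names come from the data itself, so no separate field mapping has to be defined.

```python
import csv
import io
import json

# Hypothetical sample data in two self-describing formats.
json_lines = '{"host": "web01", "status": 200}\n{"host": "web02", "status": 404}\n'
csv_text = "host,status\nweb01,200\nweb02,404\n"

# JSON: every record carries its own field names (and value types).
json_events = [json.loads(line) for line in json_lines.splitlines()]

# CSV: the header row names the fields; values are read as strings.
csv_events = list(csv.DictReader(io.StringIO(csv_text)))

json_fields = sorted(json_events[0])
csv_fields = sorted(csv_events[0])
print(json_fields, csv_fields)
```

The cost of this convenience is size: JSON repeats the field names in every record, which is the verbosity mentioned above.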
Compression Type / File Size [Hadoop]
Avoid very large files that cannot be split, such as large gz-compressed files. (Splittable files, such as LZO-compressed files, are fine.) For a file that cannot be split, there is a one-to-one mapping between a core and the file, which means that only one core can work on a large file while the other cores sit idle. As a result, unsplittable files take a long time to process, and the Map-Reduce job cannot finish quickly.
Similarly, avoid large numbers of tiny files in the range of tens to hundreds of KB. If files are too small, more time is spent starting and managing tasks than actually processing data.
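The effect of file size and splittability on parallelism can be estimated with a small back-of-the-envelope sketch. It assumes one task per file for unsplittable files, one task per block for splittable files, and a 128 MB block size; the function name and the sample sizes are hypothetical, and real task counts depend on the cluster configuration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size, in bytes


def parallel_tasks(file_size: int, splittable: bool, block_size: int = BLOCK_SIZE) -> int:
    """Estimate how many tasks can work on one file at the same time."""
    if not splittable:
        # The whole file must be read by a single task, whatever its size.
        return 1
    return max(1, math.ceil(file_size / block_size))


one_gib = 1024 ** 3
print(parallel_tasks(one_gib, splittable=False))  # a large unsplittable file
print(parallel_tasks(one_gib, splittable=True))   # the same data, splittable

# Many tiny files: every file costs at least one task.
small_file = 50 * 1024
tasks_for_small_files = sum(parallel_tasks(small_file, True) for _ in range(10_000))
tasks_if_combined = parallel_tasks(small_file * 10_000, True)
print(tasks_for_small_files, tasks_if_combined)
```

Under these assumptions the unsplittable 1 GiB file keeps a single core busy, while the 10,000 small files launch thousands of tasks for an amount of data that a handful of tasks could handle.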
Report Acceleration [Hunk]
Hunk can now use Splunk's report acceleration feature to cache search results in HDFS, reducing or eliminating the need to read the data from the main Hadoop cluster.
Before you enable this feature, make sure that your Hadoop cluster has enough space to store the cache.
Hardware [Hadoop]
Make sure you have the right hardware. Although Hadoop can run even on a dual-core laptop, for real use each node should have at least four CPUs, at least 12 GB of memory, and two local disks (10K RPM or solid state) to ensure sufficient space for temporary storage.
Search Head Clustering [Hunk]
Search Head Clustering is a relatively new feature, introduced in Splunk 6.2. As of Splunk 6.3, it is a viable option for Hunk-based searches.