Speed up the getting-started experience for a variety of machine data
Characteristics of machine data
In Part 1 of this series, Speed up machine data analysis, you learned that machine data is made up of records. In many cases a record occupies a single line; in others, several lines together form one record. Typical examples of multi-line records are machine logs that contain exception stack traces, XML content, or output from an application that writes records spanning multiple lines. A record boundary is usually recognized by the presence of a master timestamp. In some records, a few characters appear before the master timestamp.
See the sections on initial preparation, known log types, and unknown log types in Part 1, Speed up machine data analysis, for some of these examples.
Correctly identifying and defining record boundaries is an important first step in machine data analysis. Whether the records in your machine data span one line or several, the following process helps you determine the master timestamp, which is the key to the rest of the analysis.
Because data is so diverse, the rules that describe record boundaries or master timestamps may differ slightly from one log type to another, or may need to be redefined. Tools can simplify this preparation for multiple types of data.
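The idea of a master timestamp marking a record boundary can be illustrated with a short Java sketch. It is not the accelerator's implementation: the class name RecordGrouper, the sample log lines, and the timestamp shape (yyyy-MM-dd HH:mm:ss at the start of a line) are all assumptions chosen for illustration. A line that begins with the timestamp opens a new record, and any other line is treated as a continuation of the current record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RecordGrouper {
    // Assumed timestamp shape for this illustration, e.g. "2013-01-15 10:32:07".
    static final Pattern RECORD_START =
            Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    /** Groups raw lines into records; a line starting with a timestamp opens a new record. */
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (RECORD_START.matcher(line).find()) {
                // A new master timestamp closes the previous record, if any.
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                // Continuation line, such as one frame of a stack trace.
                current.append('\n').append(line);
            }
            // Lines seen before the first timestamp are skipped in this sketch.
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2013-01-15 10:32:07 ERROR Request failed",
                "java.lang.IllegalStateException: no session",
                "    at com.example.Shop.checkout(Shop.java:42)",
                "2013-01-15 10:32:09 INFO Request completed");
        for (String record : group(lines)) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

In this sample, the three lines of the stack trace end up in a single record, and the final line forms a record of its own.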
Before the start of this series
One of the main strengths of IBM Accelerator for Machine Data Analytics is that the tool is easy to configure and customize. The articles and tutorials in this series are intended for readers who want to get a feel for the accelerator, speed up their machine data analysis further, and gain customized insights.
About this tutorial
This tutorial is a step-by-step example that demonstrates how to use the IBM InfoSphere BigInsights tools (web or Eclipse) to speed up the getting-started experience with IBM Accelerator for Machine Data Analytics. You'll learn how to prepare the data easily and test its extraction iteratively. This lays the groundwork for the rest of the analysis. Along the way, you will be introduced to some additional tools that speed up the process.
Goal
In this tutorial:
You will learn how to configure machine data for analysis.
You will be introduced to the BigInsights Eclipse tools, which are optional.
If you prefer to configure and test data locally and then move to the BigInsights cluster, you will learn how to use the Eclipse tools to do so.
If you prefer to configure and test directly on the BigInsights cluster, you will learn how to do that as well.
Because the data to be analyzed varies widely, use the following steps on a small amount of data to prepare it for analysis. Once the configuration is tested, the analysis can be run on large data with a similar configuration.
Prerequisite
Read Part 1 of this series, Speed up machine data analysis, for an overview of IBM Accelerator for Machine Data Analytics. Optionally, read Part 2, Accelerate the analysis of new log types, to learn how to use the Eclipse tools to support new log types, and Part 3, Speed up machine data search, to learn how to search known and custom log types from a consolidated searchable repository.
System Requirements
To run the examples in this tutorial, you need:
InfoSphere BigInsights 2.0 installed
IBM Accelerator for Machine Data Analytics installed
BigInsights 2.0 tools for Eclipse installed (optional)
A dataset for machine data analysis. For links to download the data, see the Downloads section.
The case of the fictitious Sample Outdoors company
The data scientists at Sample Outdoors have taken on the task of rolling out IBM Accelerator for Machine Data Analytics to a large number of new organizations, each with its own log formats. They expect to prepare a wide variety of logs for analysis, and they have decided to use the BigInsights tools to speed up preparing and testing the data. Once the configurations are ready, they will use them for regular, ongoing analysis.
Speed up the start-up and running experience of machine data analysis
The previous tutorials and articles in this series used prepared batches of data that were available for download. In this tutorial, you will prepare a batch of data yourself. Preparing a batch involves identifying record boundaries and master timestamps, and creating rules to define them. This information is then used to create the metadata for the batch. Finally, you will test the prepared batch.
Here are the steps to follow in this article:
View the process and identify the record boundaries.
If necessary, use the BigInsights Eclipse tools to provide the first rule, which represents the string before the master timestamp. If you do not need help building a regular expression, or if you are not interested in using the Eclipse tools, continue to the next step to provide the second rule.
Provide the second rule, which represents the master timestamp.
Put the rules together to form the metadata for this type of log.
If you choose to test small amounts of data locally in Eclipse and then move the data to the BigInsights cluster, use Eclipse to test the rules locally on small data.
See tips for using the Eclipse tools for iterative testing and troubleshooting.
If you choose to test small data on the BigInsights cluster, use the BigInsights console to test the rules on small data.
See tips for using the BigInsights console for iterative testing and troubleshooting.
Understand what happens under the hood.
Run on large data.
At Sample Outdoors Co.
The data scientists at Sample Outdoors obtained machine data from a front-end application, by way of the web tools group, as an exercise in using the tools. Next, they want to prepare the data for analysis.
Identify record boundaries
The record boundary contains two parts:
The master timestamp, which should be provided as a Java SimpleDateFormat pattern.
The string before the master timestamp, which should be provided as a regular expression.
We use an Apache web access log as an example to walk through the process.
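As a rough illustration of the two parts of a record boundary, the following Java sketch applies them to a single Apache web access log line. The class name AccessLogBoundary, the regular expression, and the sample line are assumptions made for this example and are not taken from the accelerator; the date pattern dd/MMM/yyyy:HH:mm:ss Z matches the timestamp layout of the Apache common log format. The regular expression consumes everything up to and including the opening bracket, and SimpleDateFormat then parses the master timestamp from that position.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogBoundary {
    // Assumed rule 1: the string before the master timestamp, up to and including "[".
    static final Pattern PRE_TIMESTAMP = Pattern.compile("^[^\\[]*\\[");
    // Assumed rule 2: the master timestamp as a SimpleDateFormat pattern.
    static final String TIMESTAMP_FORMAT = "dd/MMM/yyyy:HH:mm:ss Z";

    /** Returns the master timestamp if the line starts a record, or null otherwise. */
    static Date masterTimestamp(String line) {
        Matcher m = PRE_TIMESTAMP.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Locale.ENGLISH so that month names such as "Oct" parse on any machine.
        SimpleDateFormat format = new SimpleDateFormat(TIMESTAMP_FORMAT, Locale.ENGLISH);
        format.setLenient(false);
        // parse(String, ParsePosition) starts at the given index and returns null on failure.
        return format.parse(line, new ParsePosition(m.end()));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        Date timestamp = masterTimestamp(line);
        System.out.println(timestamp == null
                ? "not a record start"
                : "master timestamp (epoch millis): " + timestamp.getTime());
    }
}
```

Keeping the two rules separate mirrors the split described above: the regular expression only has to describe what precedes the timestamp, and the date pattern only has to describe the timestamp itself.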