Machine data comes in many different formats and volumes. Weather sensors, health trackers, and even air-conditioning units generate large amounts of data that call for a big data solution. But how do you determine which data is important, and how much of that information is valid, worth including in a report, or useful for detecting alert conditions? This article introduces some of the challenges of working with large volumes of machine data, along with solutions that use big data technologies and Hadoop. Before exploring the basic mechanics of storing and providing the data, you need to consider what information you want to store, how to store it, and how long you intend to keep it.
A significant, but not always acknowledged, characteristic of Hadoop is that it provides an append-only data store for large amounts of information. Although that sounds like a good fit for machine data, it can tempt you into storing ever-growing volumes of information over time. This becomes a problem, not because Hadoop cannot store the data, but because it adds an unnecessary burden to an environment that needs to deliver information in real time.
Careful management is therefore required when using Hadoop to store machine data. You cannot simply store the data and assume that you can retrieve whatever you need later; you need a plan. For example, to use this data for real-time alerts, you do not want to sift through years of data points to pick out the latest details. Beyond helping to build trends and baseline information, data from two years ago, or even from six months ago, is unlikely to be useful when identifying a problem and raising an alert.
Select data to store
Make a deliberate selection of the data you want to store. Ask yourself the following questions and plan what data will be stored and for how long.
How much data do you intend to store?
To determine this, estimate the size of your records and the interval at which the data arrives. From this information you get a good idea of how much data is created, how many data points are stored, and how much information accumulates daily or weekly. For example, a three-field record (such as date, time, and data point) is small on its own, but recorded every 15 seconds it adds up to a meaningful volume per machine per day. Multiply that by 2,000 machines and a substantial amount of data is generated daily. If you record multiple data points, the amount of data grows further.
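As a rough illustration, assume each record occupies about 30 bytes (a figure chosen purely to make the arithmetic concrete, not taken from the scenario above):

86,400 seconds per day / 15 seconds = 5,760 records per machine per day
5,760 records x 30 bytes = roughly 170 KB per machine per day
170 KB x 2,000 machines = roughly 340 MB per day

Even at that modest rate, a year of raw readings approaches 125 GB, before accounting for multiple data points per machine.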
How long do you want to keep historical data?
Ask yourself two questions: how far back do you want to be able to view every individual data point, and how far back do you want to be able to determine trends? A single raw data point from a week ago is rarely useful on its own; once the problem it relates to has been resolved, the need for that level of detail is small.
With that in mind, you need to establish a baseline. A baseline is a data point or metric that represents normal operation. With a baseline, it is easier to identify a trend or an anomalous spike.
How do you want to determine the baseline?
Related to the preceding question, the baseline is the comparison value you store to determine when new data exceeds normal levels. For example, you might record disk space and set the threshold to a specific percentage, or to a hard value that indicates the amount of free disk space is too small.
A baseline can be one of three basic types:
Existing baseline: If you are monitoring machine data, a baseline for comparison may already exist. For example, a fan designed to run at a fixed RPM has a fixed baseline that does not need constant adjustment.
Controlled baseline: For components and systems that are controlled and then monitored, the baseline is determined by comparing the controlled value with the reported value. For example, a temperature control system has a set temperature and a reported temperature. An HVAC system might be set to 5°C while the reported temperature is 4.8°C; in this example, the set (required) temperature becomes the baseline against which the actual data is compared.
Historical baseline: This type of baseline applies to systems where the baseline is calculated from existing values. For example, monitoring the height or speed of a river is based not on a known fixed value, but on what you have previously observed and experienced.
Historical baselines clearly change over time and, except in rare cases, should not be set as hard values. Instead, a historical baseline should be a variable, recalculated from the sensor information and values you have already collected.
For example, over time you might add disk space, or the time and the reason the application needs the information may change. A baseline expressed as a percentage may stop working. When you have ten terabytes of storage, 5% free is genuinely low, and a percentage threshold is fine. But when you have 50 TB of storage, 5% is 2.5 TB, so a figure that large may no longer be appropriate as a warning threshold.
Baselines of this kind need to be computed from past values. You therefore need to decide how much data you want to compare against, and how far back it should go. Do you compare new data with data from last week, last month, or the last six months?
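As a minimal sketch, a Hive query of the kind shown later in this article could limit the baseline to a recent window by filtering on the logged timestamp. The temp table and column names match those used in the listings below; the 30-day window is an arbitrary illustration:

SELECT host, machine, MIN(reportval), MAX(reportval), AVG(reportval)
FROM temp
WHERE timelogged > (unix_timestamp() - (30 * 24 * 60 * 60))
GROUP BY host, machine;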
Do you want to graph and report historical data?
You may want to store information and generate graphical representations of it, but as with basic storage, you may not need to go back to a specific minute or second. Instead, you might record the minimum, maximum, and average for each 15-minute period of the day in order to generate chart information.
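A minimal sketch of such a summary in Hive, again assuming the temp table and columns used in the listings later in this article, groups the readings into 15-minute (900-second) buckets:

SELECT floor(timelogged / 900) * 900 AS bucket_start,
       host, machine,
       MIN(reportval), MAX(reportval), AVG(reportval)
FROM temp
GROUP BY floor(timelogged / 900) * 900, host, machine;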
After considering these questions, look at how to address the underlying problems of storing the data, how to determine the exceptions that trigger alerts and notifications, and how to handle the volume of information generated along the way.
Storing data in Hadoop
The first question is how to store data in Hadoop. Because this topic is beyond the scope of this article, see the links in resources for more information.
In general, Hadoop is not a good fit as a live, real-time database for this kind of information. Although Hadoop works well for append-based recording of the data, a near-line SQL database may be a better choice for the live values.
But don't let this restriction put you off Hadoop. A useful way to load the data is to write it permanently into the Hadoop Distributed File System (HDFS) by appending to an existing open file, or to exploit the append-only nature of CSV files to record batches of data, with Hadoop acting as a concentrator.
One approach is to record all of the different information in a single file for a period of time and then copy that file into HDFS for processing. Alternatively, you can write directly into a final HDFS file that is accessible from Hive or HBase.
In Hadoop, many small files are less efficient and less useful than a small number of large files. Larger files can be distributed more efficiently across the cluster, so the data is spread better across the different nodes. This distribution makes processing the information through MapReduce more efficient, because more nodes hold different blocks of the data.
It is more efficient to organize or group the information from multiple data points into a smaller number of files that span longer periods, such as all of the data for a single day, or larger files that cover multiple hosts, machines, or other major data collectors.
This method produces files named for the collector, the machine, and possibly the date, with each record on a separate row. For example: SiteID, CollectorID, DateTime, SetTemp, ReportedTemp.
It is important to ensure that the data is distributed widely across the system. For example, on a 30-node Hadoop cluster you want the data spread effectively across the cluster so that no CPU or disk I/O is wasted during processing. This distribution gives the fastest processing and response times, which matters when you want to use the data for monitoring and alerting.
These files can be appended to by a single concentrator, which collects data from multiple hosts and writes the information into these larger files.
Breaking the data up this way also means that you can effectively partition it by host and/or date. This makes analyzing and processing the information more efficient, because tools such as Hive can restrict processing to specific partitions, as sketched below.
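As a minimal sketch of how such CSV files might be exposed to Hive, the table below uses the column names that appear in the later listings (timelogged, host, machine, setval, reportval). The partition key, data types, and HDFS locations are assumptions for illustration; the later listings assume a simpler, unpartitioned temp table.

CREATE EXTERNAL TABLE temp (
  timelogged BIGINT,
  host       INT,
  machine    INT,
  setval     DOUBLE,
  reportval  DOUBLE
)
PARTITIONED BY (logdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/machine/temp';

-- Register a day's worth of appended CSV data as a new partition
ALTER TABLE temp ADD PARTITION (logdate = '2014-03-05')
  LOCATION '/data/machine/temp/2014-03-05';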
Generating baselines using Hive
If there are no hard or controlled baselines, the first task is to create baseline statistics that define the normal state.
This information may change over time, so you need to be able to determine the baseline by analyzing the existing information. In Hive, you can do this with a suitable query that generates minimum, maximum, and average statistics over a period of time.
As shown in Listing 1, these statistics can be computed from the reported values of temperature sensor data coming from multiple sensors and hosts, summarized by date.
Listing 1. Generate minimum/maximum/average statistics for a period of time using Hive
hive> SELECT to_date(from_unixtime(timelogged)), host, machine,
             MIN(reportval), MAX(reportval), AVG(reportval)
      FROM temp
      GROUP BY to_date(from_unixtime(timelogged)), host, machine;
...
2014-03-05  1  1  2  14  7.995752916509865
2014-03-05  1  2  2  14  7.992366001827858
2014-03-05  1  3  2  14  8.003709477985055
2014-03-05  1  4  2  14  8.00448004587567
2014-03-05  2  1  2  14  8.006505026611473
2014-03-05  2  2  2  14  7.999910399082487
2014-03-05  2  3  2  14  7.994068419260613
2014-03-05  2  4  2  14  7.993674175223554
2014-03-06  1  1  2  14  8.00225504934963
2014-03-06  1  2  2  14  7.988397934505524
2014-03-06  1  3  2  14  7.99003202823714
2014-03-06  1  4  2  14  7.992385123210667
2014-03-06  2  1  2  14  7.999705863128309
2014-03-06  2  2  2  14  7.975227139028695
2014-03-06  2  3  2  14  8.016471664814693
2014-03-06  2  4  2  14  7.990849075102948
Time taken: 22.655 seconds
This output provides some hard data for comparison. As you can see, all of the sensors report values in the range 2°C to 14°C, but the averages cluster around 8°C. This gives you a solid basis for deciding when to raise a warning or notification during normal operation.
In this example, the data is summarized by day for each data point recorded, but if you record data more frequently, you could summarize by hour or by minute. You can also group by different elements. This example assumes that each sensor and host is separate, but in some cases sensors taking different readings of the same static element may need to be grouped together.
To avoid recalculating this every time, the data can be written to a new table and used as the baseline for comparison when identifying problems in the incoming data stream. Use an INSERT INTO with the output of the query, as shown in Listing 2.
Listing 2. Using INSERT into
INSERT INTO TABLE baselines
SELECT to_date(from_unixtime(timelogged)), host, machine,
       MIN(reportval), MAX(reportval), AVG(reportval)
FROM temp
GROUP BY to_date(from_unixtime(timelogged)), host, machine;
For continuous checking, you may instead want a single current value that you treat as authoritative: examine the entire table and compute one figure per sensor and host across the whole dataset. You can do this by dropping the date from the grouping, as shown in Listing 3.
Listing 3. Baseline with the date dropped
SELECT host, machine, MIN(reportval), MAX(reportval), AVG(reportval)
FROM temp
GROUP BY host, machine;
...
Job 0: Map: 1  Reduce: 1  Cumulative CPU: 2.76 sec  HDFS Read: 13579346  HDFS Write: 215 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
1  1  2  14  7.998055578060439
1  2  2  14  7.990960752769065
1  3  2  14  7.998865753868589
1  4  2  14  8.00019675698198
2  1  2  14  8.004097174801217
2  2  2  14  7.991169083691161
2  3  2  14  8.002002291640142
2  4  2  14  7.992673695906297
You can calculate other values, such as standard deviations or percentiles, or derive a specific cut-off above these values, to create exactly the baseline you want; see the sketch below.
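As a minimal sketch, Hive's built-in aggregates can produce a standard deviation and an approximate 95th percentile alongside the average. The table and column names are those used above; the aliases and the 0.95 cut-off are illustrative choices:

SELECT host, machine,
       AVG(reportval)                      AS avgval,
       STDDEV_POP(reportval)               AS sdval,
       PERCENTILE_APPROX(reportval, 0.95)  AS p95val
FROM temp
GROUP BY host, machine;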
Now you have a baseline to work with. Next, we will identify when a problem occurs.
Identifying real-time problems
If your baseline is embedded in the data, determining whether an entry is outside the configured parameters is a simple comparison of the two values: SELECT * FROM temp WHERE temp.setval != temp.reportval;.
If there is an interval or tolerance, perform the comparison with a slightly more complex query, as shown in Listing 4.
Listing 4. Complex queries
SELECT * FROM temp WHERE (temp.reportval <= (temp.setval - 2)) OR (temp.reportval >= (temp.setval + 2));
If you have baseline data, you can use it in a join, comparing against the average for each host and machine on each day, as shown in Listing 5.
Listing 5. Performing a join against the baseline
SELECT * FROM temp JOIN baselines
  ON (to_date(from_unixtime(temp.timelogged)) = baselines.basedate
      AND temp.machine = baselines.machine
      AND temp.host = baselines.host)
WHERE reportval > baselines.logavg;

2  1394089343  4  7   9   2014-03-06  2  4  2  14  7.990849
2  1394082857  4  10  11  2014-03-06  2  4  2  14  7.990849
2  1394085272  4  9   12  2014-03-06  2  4  2  14  7.990849
2  1394091433  4  10  12  2014-03-06  2  4  2  14  7.990849
2  1394082209  4  11  9   2014-03-06  2  4  2  14  7.990849
2  1394087662  4  9   10  2014-03-06  2  4  2  14  7.990849
2  1394083754  4  10  9   2014-03-06  2  4  2  14  7.990849
For each machine, temperature sensor, and date, this output returns every row from the original source data that exceeds the calculated baseline. That may well be more rows than you expect. For a more restrictive version, compare against a margin over the calculated value, for example 10% above it, as sketched below.
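A minimal sketch of that tighter comparison, reusing the baselines table from Listing 5, simply scales the stored average:

SELECT * FROM temp JOIN baselines
  ON (to_date(from_unixtime(temp.timelogged)) = baselines.basedate
      AND temp.machine = baselines.machine
      AND temp.host = baselines.host)
WHERE temp.reportval > (baselines.logavg * 1.1);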
Listing 6 performs the same comparison against the all-time baselines, with no date restriction.
Listing 6. Compare to all baselines
SELECT * FROM temp JOIN baselines_all
  ON (baselines_all.machine = temp.machine AND baselines_all.host = temp.host)
WHERE reportval > baselines_all.logavg;
To produce alert data, perform a check that outputs a specific flag when an error condition occurs, as shown in Listing 7.
Listing 7. Alert data
CREATE TABLE errors AS
SELECT temp.host, temp.machine, temp.reportval,
       IF(reportval > baselines_all.logavg, 'overtemp', 'normal') AS status
FROM temp JOIN baselines_all
  ON (baselines_all.machine = temp.machine AND baselines_all.host = temp.host);
You can now query this table to pick out the overtemp entries and drill into the data.
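For example, assuming the IF() expression in Listing 7 is given the column alias status as shown above, a minimal query for the out-of-range rows might be:

SELECT host, machine, reportval FROM errors WHERE status = 'overtemp';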
You can now query this information in real time. Alerts can be triggered based on the limits and outliers determined earlier. Because the information is exposed through plain Hive queries, it can be consumed by a wide range of environments and applications, as reports or as traceable elements.
Archive old information and create summary information
To archive the information (which you still need in order to perform comparisons and build baselines), you can collapse the data so that it retains only the critical statistical information (the average of the current state), in contrast to the entries that fall outside the average, which can then be flagged accordingly.
This method requires an adjusted form of the baseline query, one that uses the minimum and maximum values to summarize the information while also retaining the exceptional values that fall outside it.
How much data you store, and how often, is entirely up to you. Just make sure that you keep both the rolled-up summary data and the values that fall outside it.
The sample graphic in Figure 1 is generated based on sensor data similar to the previous example.
Figure 1. Sample diagram
The information between the two lines (red at the top and green at the bottom) is the unremarkable data. It is useful only in that it establishes the average between the two extremes, so for that time period you can store just the average and show that it stayed roughly the same.
To perform the rollup, extract the summarized results, dropping the individual in-range data points, and then create a table that also captures the data falling outside the range.
Complete this operation with a query that summarizes only the data that falls inside the baseline range, as shown in Listing 8.
Listing 8. Summarizing the in-range data
CREATE TABLE temparchive AS
SELECT to_date(from_unixtime(temp.timelogged)), temp.host, temp.machine, temp.reportval, 0
FROM temp JOIN baselines_all
  ON (baselines_all.machine = temp.machine AND baselines_all.host = temp.host)
WHERE (reportval < (baselines_all.logavg + 1)) AND (reportval > (baselines_all.logavg - 1))
GROUP BY to_date(from_unixtime(temp.timelogged)), temp.host, temp.machine, temp.reportval;
This code creates a table containing data summarized first by date and then by machine and sensor. The table holds the reported readings, and the final column is always 0. That flag indicates the data is safe to use for baselines, because each value fell within the previously computed baseline: only readings within 1°C of the baseline average are included.
Then perform the reverse operation to capture the values that fall outside the baseline previously calculated for that date/machine/sensor combination. In this case, extract all of the individual values rather than summarizing them, as shown in Listing 9.
Listing 9. Extract all Values
INSERT INTO TABLE temparchive
SELECT temp.timelogged, temp.host, temp.machine, temp.reportval, 1
FROM temp JOIN baselines_all
  ON (baselines_all.machine = temp.machine AND baselines_all.host = temp.host)
WHERE (reportval > (baselines_all.logavg + 1)) OR (reportval < (baselines_all.logavg - 1));
These entries are the raw data points that fall outside the baseline average, tagged with a 1 in the final column. With this tag, you can selectively include or exclude these values when calculating long-term baselines from the summarized, archived log information, as in the sketch below.
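A minimal sketch of such a recalculation, assuming the 0/1 column in temparchive has been given a name such as outlier (the listings above leave it unnamed, so the alias is an assumption):

SELECT host, machine, AVG(reportval)
FROM temparchive
WHERE outlier = 0   -- exclude the raw out-of-range entries
GROUP BY host, machine;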
Conclusion
When you work with raw machine data, getting the information and storing it in Hadoop is the easy part. The harder task is deciding what the information represents and how you want to summarize and report on it.
Once you have the raw data and can run queries against it in Hive, you also need to calculate your baselines (if they are not already fixed). Then run queries that first establish the baseline and then find the data outside those baseline limits. This article has described some of the basic techniques for processing and identifying these real-time exceptions, so that errors can be reported, alerted on, or charted, and then surfaced to a management application.