Hive-based Log Data Statistics

Translated from: http://blog.csdn.net/suine/article/details/5653137

I. Hive Introduction

Hive is an open-source, Hadoop-based data warehouse tool for storing and processing massive amounts of structured data. It keeps the data in the Hadoop file system rather than in a database, but provides a database-like mechanism for storing and processing it, using HQL (an SQL-like language) to manage and process the data automatically. We can treat a volume of structured data in Hive as a single table, while it is actually stored on HDFS in a distributed manner. Hive parses and converts HQL statements into a series of Hadoop-based Map/Reduce tasks, and data processing is completed by executing these tasks.

Hive was born out of Facebook's log analysis needs. Facing massive structured data, Hive accomplishes at low cost work that previously required large-scale databases, and its relatively low learning curve makes application development flexible and efficient.

Hive released its first official stable version, 0.3.0, on April 29, 2009, so it has had only a little over a year of gradual improvement, and there is not much material about it available online, especially in Chinese. This article explores the application of Hive in a real business setting and summarizes the experience gained along the way, in the hope that readers can avoid some detours.

Refer to the official Hive wiki here:

http://wiki.apache.org/hadoop/Hive

The official homepage is here:

http://hadoop.apache.org/hive/

Hive 0.5.0 source code and binary release packages:

http://labs.renren.com/apache-mirror/hadoop/hive/hive-0.5.0/

II. Deployment

Since Hive is a Hadoop-based tool, its deployment requires a working Hadoop environment. The following describes a simple deployment and application of Hive.

Deployment environment:

Operating System: Red Hat Enterprise Linux AS release 4 (Nahant Update 7)

Hadoop: hadoop-0.20.2, running normally

The deployment steps are as follows:

1. Download the latest release package, hive-0.5.0-dev.tar.gz, and upload it to the NameNode of the Hadoop cluster. Decompress the package to obtain the Hive directory; assume the path is /opt/hadoop/hive-0.5.0-bin.

2. Set the environment variable HIVE_HOME to point to the Hive root directory /opt/hadoop/hive-0.5.0-bin. Since Hadoop is already running, check that the environment variables JAVA_HOME and HADOOP_HOME are correct and valid.

3. Switch to the $HIVE_HOME directory; Hive's default configuration is sufficient. Run bin/hive to start Hive. If Hive starts normally, the "hive>" prompt appears.

4. Enter "show tables;" in the command prompt. If it runs properly, it indicates that the deployment is successful and can be used.

FAQs:

1. Run the "show tables;" command to prompt "failed: Error in metadata: Java. lang. illegalargumentexception: URI: does not have a scheme ", which is because hive cannot find the database that stores the metadatabase, modifying hive in the conf/hive-default.xml configuration file. metaStore. local is true. Hive places the metadata of structured data in a third-party database and sets this parameter to true. Hive creates a local derby database to store metadata. Of course, if necessary, you can use a third-party database such as MySQL to store metadata. However, the configuration value of hive. MetaStore. Local should be false.

2. If you already have a running system and do not want to deploy a separate Hadoop environment, you can directly use the Hadoop environment bundled with Nutch 1.0. However, with such a deployment Hive fails to run normally, complaining that certain methods cannot be found. This is because Nutch 1.0 uses commons-lang-2.1.jar while Hive needs commons-lang-2.4.jar; download the 2.4 version of the jar and replace the 2.1 version, and both Nutch and Hive will run normally.

III. Application Scenarios

This article focuses on how Hive is used; the business itself is not the point, so the scenario is described only briefly: the task is statistical analysis of search log data.

The group's search service had launched only recently, so the log volume was not yet large. The logs are spread across five front-end machines and saved hourly, and every hour the data generated in the previous hour is synchronized to the log analysis machine. The statistics must be updated hourly and include items such as keyword search PV, category visit counts, and accesses per second (TPS).

With Hive, we create one table per day based on the timestamp. Every hour a background script merges the log data synchronized from the five front-end machines into a single log file and imports it into Hive, appending the hourly logs to the current day's table. After each import, the day's statistics are recomputed and the results are output.

If the above requirements were developed directly on Hadoop, we would have to manage the data ourselves, develop different Map/Reduce jobs for the various statistics, implement operations such as merging and sorting, and monitor job status; the workload would not be small. With Hive, however, everything from import to analysis, sorting, deduplication, and result output can be handled with HQL statements: each statement is parsed into several tasks that are then run. Even complex requirements that need data from several days at once, such as the increment in keyword traffic, can be met automatically with table joins and similar constructs, saving a great deal of work.

IV. Hive Practices

Getting started with Hive for the first time is, it must be said, quite quick: the statements Hive provides are very similar to MySQL's SQL, with much shared syntax, which makes it very easy to pick up. But to write these statements with ease, and to write elegant statements that take advantage of Hive's features, you still need a good understanding of Hive itself.

For detailed Hive syntax, refer to the official wiki language manual:

http://wiki.apache.org/hadoop/Hive/LanguageManual

Although the familiar syntax is convenient, we still ran into quite a few problems on first use. The following describes the problems we encountered in this business scenario and the Hive customizations we made.

1. Delimiter Problems

The first problem is the log data delimiter. The general format of our log data is as follows:

00:00:02@$_$@qq2010@$_$@all@$_$@nokia_1681c@$_$@1@$_$@10@$_$@@$_$@-1@$_$@10@$_$@application@$_$@1

The delimiter "@ $ _ $ @" is visible from the format to prevent data confusion caused by the same characters in the log body as much as possible. Hive supports specifying custom delimiters when creating tables. However, after multiple tests, it is found that only single-character custom delimiters are supported, separators such as "@ $ _ $ @" cannot be supported, but we can solve this problem by customizing the separators. The internal delimiter of hive is "/001 ", replace the separator with "/001.

After exploration, we found that there are two ways to solve this problem.

A) Custom OutputFormat and InputFormat.

Hive's OutputFormat/InputFormat is similar to Hadoop's OutputFormat/InputFormat: the InputFormat formats the input data before handing it to Hive, and the OutputFormat re-formats Hive's output into the target format before writing it to file. This kind of format customization is relatively low-level, but the customization itself is simple: override the next method of the RecordReader class in the InputFormat. Sample code follows:

public boolean next(LongWritable key, BytesWritable value) throws IOException {
    while (reader.next(key, text)) {
        // Replace the log delimiter "@$_$@" with Hive's internal delimiter '\001'.
        String strReplace = text.toString().toLowerCase().replace("@$_$@", "\001");
        Text txtReplace = new Text();
        txtReplace.set(strReplace);
        value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
        return true;
    }
    return false;
}

Then override the write method of the RecordWriter in HiveIgnoreKeyTextOutputFormat. Sample code follows:

public void write(Writable w) throws IOException {
    // Replace Hive's internal delimiter '\001' with the log delimiter "@$_$@" on output.
    String strReplace = ((Text) w).toString().replace("\001", "@$_$@");
    Text txtReplace = new Text();
    txtReplace.set(strReplace);
    byte[] output = txtReplace.getBytes();
    bytesWritable.set(output, 0, output.length);
    writer.write(bytesWritable);
}

After customizing the OutputFormat/InputFormat, you must specify them when creating the table. The following example shows how the InputFormat and OutputFormat are specified:

STORED AS INPUTFORMAT 'com.aspire.search.loganalysis.hive.SearchLogInputFormat' OUTPUTFORMAT 'com.aspire.search.loganalysis.hive.SearchLogOutputFormat'

B) Use a SerDe (serializer/deserializer) to format the data during serialization and deserialization.

This approach is a little more complex and offers weaker control over the data: it uses regular expressions to match and process the data, which also affects performance. On the other hand, SERDEPROPERTIES lets you attach custom attributes to a table, and within the SerDe you can customize behavior based on more attributes.

2. Import and export data

A) Compatibility with multiple log format versions

Because Hive is mainly used to process cold data (read but not written), it only supports batch import and export of data and cannot write or update individual records. Therefore, if the data to be imported contains some nonstandard rows, we need to build custom extensions to handle them.

The log data to be processed comes in multiple versions, and the content of each field differs between versions. The second column of a version A log line may be the search keyword, while the second column in version B is the terminal type used for the search. If logs of both versions were imported into Hive directly, the data would obviously be mixed up and the statistics would be wrong. Our task is to let multiple versions of log data coexist in the Hive data warehouse, with the table's input and output mapped to the correct fields of the correct log version.

We will not dwell on the tedious details of that work here, only on the key technical point: where should this functionality be implemented so that Hive can recognize data in the different formats? After many attempts we found that adapting the versions at any other stage gets complicated, and the InputFormat/OutputFormat remains the most elegant place to do it; after all, the InputFormat is the source of the data and the OutputFormat is its final destination. Specifically, the adaptation is done in the InputFormat's next method mentioned above and in the OutputFormat's write method.
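The original post does not show the adaptation code itself, so the following is only a minimal, hypothetical Java sketch of the idea: guess the log version from the number of fields, remap the fields into one canonical column order, and re-emit the line with Hive's internal delimiter. The class name, the field positions, and the field count of 11 are assumptions for illustration, not the author's actual code.

public final class LogVersionAdapter {

    // Regex form of the raw log delimiter "@$_$@" ('$' must be escaped in a regex).
    private static final String RAW_DELIMITER = "@\\$_\\$@";
    private static final char HIVE_DELIMITER = '\001';

    // Would be called from the custom RecordReader's next() before the line reaches Hive.
    public static String normalize(String rawLine) {
        String[] fields = rawLine.split(RAW_DELIMITER, -1);
        if (fields.length == 11) {
            // Assume version B: column 2 holds the terminal type and column 3 the
            // keyword, so swap them into the canonical order used by the Hive table.
            String tmp = fields[1];
            fields[1] = fields[2];
            fields[2] = tmp;
        }
        // Re-join the canonical fields with Hive's internal delimiter '\001'.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(HIVE_DELIMITER);
            }
            sb.append(fields[i]);
        }
        return sb.toString();
    }
}

In a real implementation, logic of this kind would sit inside the next() method shown earlier, alongside the delimiter replacement.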

B) Operating on local data through Hive

At first we always uploaded local data to HDFS, had Hive operate on it there, and then downloaded the results from HDFS back to the local machine. Later we found this unnecessary: Hive statements provide a LOCAL keyword that supports importing data directly from the local machine into Hive and exporting Hive data directly back to the local machine. Internally the computation still works on data in HDFS, but the import and export are done for us automatically.

3. Data Processing

There is nothing special about the statistical processing of the log data itself; it is just some SQL statements, with no advanced tricks. Still, a few example statements are listed below to show how conveniently Hive handles the data and to illustrate some Hive usage.

A) Add Hive's bundled user-defined functions, which are all in the hive_contrib.jar package.

ADD JAR /opt/hadoop/hive-0.5.0-bin/lib/hive_contrib.jar;

B) Calculate the search volume of each keyword, sort the search volume in descending order, and save the result to the keyword_20100603 table.

CREATE TABLE keyword_20100603 AS SELECT keyword, COUNT(keyword) AS count FROM searchlog_20100603 GROUP BY keyword ORDER BY count DESC;

C) Calculate the search volume of each type of user terminal, sort the search volume in descending order, and save the result to the device_20100603 table.

CREATE TABLE device_20100603 AS SELECT device, COUNT(device) AS count FROM searchlog_20100603 GROUP BY device ORDER BY count DESC;

D) Create the table time_20100603 using the custom InputFormat and OutputFormat, and specify that the table data is actually stored at '/loganalysis/results/time_20100603' (an HDFS path) rather than in Hive's own data directory.

CREATE EXTERNAL TABLE IF NOT EXISTS time_20100603 (time string, count int) STORED AS INPUTFORMAT 'com.aspire.search.loganalysis.hive.XmlResultInputFormat' OUTPUTFORMAT 'com.aspire.search.loganalysis.hive.XmlResultOutputFormat' LOCATION '/loganalysis/results/time_20100603';

E) Calculate the traffic per second (TPS), sort by traffic in descending order, and output the result to the time_20100603 table we just defined. Its actual location is '/loganalysis/results/time_20100603', and thanks to XmlResultOutputFormat the file content will be in XML format.

INSERT OVERWRITE TABLE time_20100603 SELECT time, COUNT(time) AS count FROM searchlog_20100603 GROUP BY time ORDER BY count DESC;

F) Calculate the maximum, minimum, and average response time of the search requests.

INSERT OVERWRITE TABLE response_20100603 SELECT MAX(responsetime) AS max, MIN(responsetime) AS min, AVG(responsetime) AS avg FROM searchlog_20100603;

G) Create a table to store today's and yesterday's keyword search volume, increment, and increment rate. The table data resides at '/loganalysis/results/keyword_20100604_20100603' and the content will be in XML format.

CREATE EXTERNAL TABLE IF NOT EXISTS keyword_20100604_20100603 (keyword string, count int, increment int, incrementrate double) STORED AS INPUTFORMAT 'com.aspire.search.loganalysis.hive.XmlResultInputFormat' OUTPUTFORMAT 'com.aspire.search.loganalysis.hive.XmlResultOutputFormat' LOCATION '/loganalysis/results/keyword_20100604_20100603';

H) Set a table property so that XmlResultInputFormat and XmlResultOutputFormat can produce differently formatted XML files depending on the value of output.resulttype.

ALTER TABLE keyword_20100604_20100603 SET TBLPROPERTIES ('output.resulttype' = 'keyword');

I) Join today's keyword statistics table (keyword_20100604) with yesterday's (keyword_20100603) to compute, for keywords that appear on both days, today's count, the increment, and the increment rate, sorted by increment rate in descending order. The results are output to the keyword_20100604_20100603 table just defined, so the data file content will be in XML format.

INSERT OVERWRITE TABLE keyword_20100604_20100603 SELECT cur.keyword, cur.count, cur.count - yes.count AS increment, (cur.count - yes.count) / yes.count AS incrementrate FROM keyword_20100604 cur JOIN keyword_20100603 yes ON (cur.keyword = yes.keyword) ORDER BY incrementrate DESC;


4. User-Defined Function UDF

Some statistical results must be output in CSV format. For a file whose body consists entirely of valid content, without the header carrying version, encoding, and other information that XML requires, a user-defined function (UDF) is the most suitable approach.

A UDF can be applied directly in a SELECT statement to format the query result before the content is output. A custom UDF must extend org.apache.hadoop.hive.ql.exec.UDF and implement an evaluate function; evaluate supports overloading and variable-length parameters. We implemented a UDF that accepts a variable number of string parameters, so it can convert any number of fields of different types selected by the statement into a single string and output them in CSV format. The code is fairly simple; here is the core of the source:

public String evaluate(String... strs) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < strs.length; i++) {
        // convertCsvField escapes a single field for CSV output.
        sb.append(convertCsvField(strs[i])).append(',');
    }
    // Drop the trailing comma.
    sb.deleteCharAt(sb.length() - 1);
    return sb.toString();
}
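The convertCsvField helper referenced above is not shown in the original post. Here is a minimal sketch of what such a helper might do, assuming standard CSV escaping rules (quote fields that contain commas, quotes, or line breaks, and double any embedded quotes); this is an illustration, not the author's implementation.

// Hypothetical CSV escaping helper; the author's actual implementation is not shown.
private static String convertCsvField(String field) {
    if (field == null) {
        return "";
    }
    boolean needsQuoting = field.contains(",") || field.contains("\"")
            || field.contains("\n") || field.contains("\r");
    if (!needsQuoting) {
        return field;
    }
    // Double embedded quotes and wrap the whole field in quotes.
    return "\"" + field.replace("\"", "\"\"") + "\"";
}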

Note that to use the UDF, besides implementing the custom UDF itself, you also need to add the jar package that contains it. For example:

ADD JAR /opt/hadoop/hive-0.5.0-bin/lib/hive_contrib.jar;

Then create a temporary function, for example:

CREATE TEMPORARY FUNCTION result2csv AS 'com.aspire.search.loganalysis.hive.Result2CSV';

And drop the function after use, for example:

DROP TEMPORARY FUNCTION result2csv;

5. Output statistics in XML format

As described above, some log statistics are output to tables that use XmlResultInputFormat and XmlResultOutputFormat to format the data into XML files. To create and write such a table we only need to implement XmlResultOutputFormat; if SELECT queries against it must also be supported, we need XmlResultInputFormat as well. Here we only cover XmlResultOutputFormat.

As mentioned above, customizing XmlResultOutputFormat only requires overriding the write method. This method formats the multi-field data, separated by '\001' in Hive, into the XML format we need. Simplified sample code follows:

public void write(Writable w) throws IOException {
    // Split the '\001'-separated record into fields and wrap them in XML tags.
    String[] strFields = ((Text) w).toString().split("\001");
    StringBuffer sbXml = new StringBuffer();
    if (strResultType.equals("keyword")) {
        sbXml.append("<record><keyword>").append(strFields[0])
             .append("</keyword><count>").append(strFields[1])
             .append("</count><increment>").append(strFields[2])
             .append("</increment><rate>").append(strFields[3])
             .append("</rate></record>");
    }
    Text txtXml = new Text();
    byte[] strBytes = sbXml.toString().getBytes("UTF-8");
    txtXml.set(strBytes, 0, strBytes.length);
    byte[] output = txtXml.getBytes();
    bytesWritable.set(output, 0, output.length);
    writer.write(bytesWritable);
}

The strResultType.equals("keyword") check identifies the keyword statistics result; the result type is specified by the statement below, which sets a table property. With this property, the same OutputFormat can be used to output several different types of results.

ALTER TABLE keyword_20100604_20100603 SET TBLPROPERTIES ('output.resulttype' = 'keyword');

Looking more closely at the write implementation, you will notice that only the body of the XML file is output here. Where are the XML file header and the closing tag written? Fortunately, since the implementation is based on an OutputFormat, we can output the header information, such as the XML declaration with version and encoding, in the constructor, and output the closing tag in the close() method.
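To make that concrete, here is a rough sketch of the idea; it is not the author's code, and the element names and the wiring to the output stream are assumptions. The point is simply where the XML declaration and the closing root tag are written.

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper that the custom record writer could delegate to: the XML prologue
// is written once when the writer is constructed, and the root element is closed exactly
// once when the writer is closed.
public final class XmlFramingWriter {

    private final OutputStream out;

    public XmlFramingWriter(OutputStream out, String resultType) throws IOException {
        this.out = out;
        // File header: XML declaration plus the opening root tag.
        writeRaw("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<results type=\"" + resultType + "\">\n");
    }

    // Each <record>...</record> fragment produced by write() would pass through here.
    public void writeRecord(String xmlFragment) throws IOException {
        writeRaw(xmlFragment);
        writeRaw("\n");
    }

    public void close() throws IOException {
        // Closing root tag goes out last, before the underlying stream is closed.
        writeRaw("</results>\n");
        out.close();
    }

    private void writeRaw(String s) throws IOException {
        out.write(s.getBytes("UTF-8"));
    }
}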

This is also why we do not use a UDF to output these results: a custom UDF cannot output the file header and footer, so it cannot produce XML data in its complete form; it can only produce files that consist entirely of valid data records, such as CSV.

V. Summary

Hive is a highly scalable data warehouse tool. With the Hadoop distributed storage and computing platform underneath and Hive's ability to understand SQL statements, most of our work comes down to adapting the input and output data, and it is these two I/O formats that keep changing. We only need to customize our own input and output adapters; Hive stores and processes the data transparently, which greatly simplifies our work. That is the focus of this article; it is something everyone doing data analysis will face, and I hope it is helpful to you.

This article has introduced a simple log statistics practice based on Hive. The use of Hive here is still fairly basic, but it meets the requirements. For more complex data analysis tasks the experience described above may not be enough, and some things Hive may not be able to do at all. Hive also has many advanced features that are not covered in this article; we will elaborate on them in the context of specific tasks in the future.

If you have any comments or suggestions about this article, please leave a comment. Thank you.

Also, please indicate the source when reprinting.
