Hive Streaming: Appending to ORC Files


1. Overview

As a business grows, the data that its Hive tables store on HDFS keeps increasing, and keeping that data in plain text format consumes a huge amount of storage. We therefore need a way to reduce capacity costs. Hive's ORC file format can significantly reduce storage costs, so today I will share how to stream data into a Hive ORC table.

2. Content

2.1 ORC

First, we need to know what ORC is in Hive. Before ORC, Hive had the RC file format; ORC is an optimization of the RC file that provides an efficient way to store Hive data and improves Hive read and write performance. Its advantages are as follows:

    • Reduced load on NameNode
    • Support for complex data types (such as list, map, struct, etc.)
    • The file contains an index
    • Block compression
    • ...

The file structure diagram (from the Apache ORC official website) is not reproduced here; for more details, see the official introduction: [Entry Address]
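To make the stripe and footer metadata a bit more concrete, here is a minimal sketch (not from the original post) that prints some of an ORC file's metadata; it assumes the hive-exec jar on the classpath and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class OrcMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to an ORC file on HDFS.
        Path path = new Path("/tmp/example.orc");
        Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(new Configuration()));

        // The file footer records row counts and the stripe layout,
        // which is what allows Hive to skip data it does not need to read.
        System.out.println("rows    : " + reader.getNumberOfRows());
        System.out.println("stripes : " + reader.getStripes().size());
    }
}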

2.2 Usage

Now that we know the structure of an ORC file, how do we go about using an ORC table? Let's create an example table for streaming records, as shown below:

create table alerts (id int, msg string)
    partitioned by (continent string, country string)
    clustered by (id) into 5 buckets
    stored as orc tblproperties ("transactional"="true"); -- bucketing, ORC storage and the transactional property are required for streaming

It is important to note that a table used for streaming must be stored as ORC, must be bucketed (clustered by a column into buckets), and must have the transactional property set; the Metastore also has to be configured for ACID transactions (for example, hive.txn.manager set to DbTxnManager and hive.support.concurrency set to true).

Next, we insert some data to simulate the streaming process, as shown in the code below:

String dbName = "testing";
String tblName = "alerts";
ArrayList<String> partitionVals = new ArrayList<String>(2);
partitionVals.add("Asia");
partitionVals.add("India");
String serdeClass = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";
// Column names of the alerts table, used by the DelimitedInputWriter below.
String[] fieldNames = new String[] { "id", "msg" };

HiveEndPoint hiveEP = new HiveEndPoint("thrift://x.y.com:9083", dbName, tblName, partitionVals);

If the table has multiple partition columns, we add a value for each of them to the partition-value collection and the endpoint loads that partition. Note that the Metastore service must be running (for example, started with hive --service metastore) so that Hive's Thrift service is available.
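Each HiveEndPoint addresses one concrete partition, so writing to several partitions means creating one endpoint (and connection) per partition. Here is a minimal sketch of that, not from the original post, assuming the testing.alerts table and Metastore URI used above:

import java.util.Arrays;
import java.util.List;

import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;

public class MultiPartitionSketch {
    public static void main(String[] args) throws Exception {
        // One endpoint per concrete (continent, country) partition of the alerts table.
        List<String> asiaIndia = Arrays.asList("Asia", "India");
        List<String> asiaChina = Arrays.asList("Asia", "China");

        HiveEndPoint indiaEP = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaIndia);
        HiveEndPoint chinaEP = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaChina);

        // newConnection(true) asks the Metastore to create the partition if it does not exist yet.
        StreamingConnection indiaConn = indiaEP.newConnection(true);
        StreamingConnection chinaConn = chinaEP.newConnection(true);

        // ... fetch transaction batches and write records against each connection ...

        indiaConn.close();
        chinaConn.close();
    }
}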

//-------   Thread 1   -------//
StreamingConnection connection = hiveEP.newConnection(true);
DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 1 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("1,Hello streaming".getBytes());
txnBatch.write("2,Welcome to streaming".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 1 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("3,Roshan Naik".getBytes());
    txnBatch.write("4,Alan Gates".getBytes());
    txnBatch.write("5,Owen O'Malley".getBytes());
    txnBatch.commit();
}
txnBatch.close();

txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 2 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("6,David Schorow".getBytes());
txnBatch.write("7,Sushant Sowmyan".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 2 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("8,Ashutosh Chauhan".getBytes());
    txnBatch.write("9,Thejas Nair".getBytes());
    txnBatch.commit();
}
txnBatch.close();

connection.close();

Running the code above writes the streaming data into the ORC table; we can then query the table to confirm that the records were stored.
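The original post shows the query results as a screenshot. As a minimal sketch of how such a check could be done programmatically (not from the original post, and assuming a HiveServer2 instance on master:10000 serving the testing database), the rows can be counted over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyAlerts {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 address; adjust host, port and credentials to your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/testing", "hive", "");
        Statement stmt = conn.createStatement();

        // Count the rows streamed into the partition written above.
        ResultSet rs = stmt.executeQuery(
                "SELECT COUNT(*) FROM alerts WHERE continent = 'Asia' AND country = 'India'");
        while (rs.next()) {
            System.out.println("rows in partition: " + rs.getLong(1));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}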

3. Case study

Below, we will complete a case study. The scenario: every day, a large amount of business data is reported to a designated server; a staging service splits it by business line and forwards each stream to its own log node; an ETL service then loads the data into Hive tables. Here we only cover the Hive loading step: getting the data, processing it, and writing it into the Hive ORC table. The concrete implementation code looks like this:

/**
 * @Date
 * @Author smartloli
 * @Email [Email protected]
 * @Note TODO
 */
public class IpLoginStreaming extends Thread {

    private static final Logger LOG = LoggerFactory.getLogger(IpLoginStreaming.class);

    private String path = "";

    public static void main(String[] args) throws Exception {
        // One thread per configured input path.
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        for (String str : paths) {
            IpLoginStreaming ipLogin = new IpLoginStreaming();
            ipLogin.path = str;
            ipLogin.start();
        }
    }

    @Override
    public void run() {
        List<String> list = FileUtils.read(this.path);
        long start = System.currentTimeMillis();
        try {
            write(list);
        } catch (Exception e) {
            LOG.error("Write path[" + this.path + "] to ORC has error, msg is " + e.getMessage());
        }
        System.out.println("Path[" + this.path + "] spent [" + (System.currentTimeMillis() - start) / 1000.0 + "s]");
    }

    public static void write(List<String> list)
            throws ConnectionError, InvalidPartition, InvalidTable, PartitionCreationFailed,
            ImpersonationFailed, InterruptedException, ClassNotFoundException,
            SerializationError, InvalidColumn, StreamingException {
        String dbName = "default";
        String tblName = "ip_login_orc";
        // One partition value per day.
        ArrayList<String> partitionVals = new ArrayList<String>(1);
        partitionVals.add(CalendarUtils.getDay());
        String[] fieldNames = new String[] { "_bpid", "_gid", "_plat", "_tm", "_uid",
                "ip", "latitude", "longitude", "reg", "tname" };

        StreamingConnection connection = null;
        TransactionBatch txnBatch = null;
        try {
            HiveEndPoint hiveEP = new HiveEndPoint("thrift://master:9083", dbName, tblName, partitionVals);
            HiveConf hiveConf = new HiveConf();
            hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_HADOOP_SUPPORTS_SUBDIRECTORIES, true);
            hiveConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            connection = hiveEP.newConnection(true, hiveConf);
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
            txnBatch = connection.fetchTransactionBatch(10, writer);

            // Batch 1
            txnBatch.beginNextTransaction();
            for (String json : list) {
                // Build a comma-delimited record in fieldNames order from each JSON line.
                String ret = "";
                JSONObject object = JSON.parseObject(json);
                for (int i = 0; i < fieldNames.length; i++) {
                    if (i == (fieldNames.length - 1)) {
                        ret += object.getString(fieldNames[i]);
                    } else {
                        ret += object.getString(fieldNames[i]) + ",";
                    }
                }
                txnBatch.write(ret.getBytes());
            }
            txnBatch.commit();
        } finally {
            if (txnBatch != null) {
                txnBatch.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}

PS: It is recommended to process the data with multiple threads; a thread-pool sketch follows below.
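The class above starts one raw Thread per input path. As an alternative, minimal sketch (not from the original post), a fixed-size thread pool bounds the number of concurrent streaming connections to the Metastore; it assumes the write() method of the IpLoginStreaming class above and the author's FileUtils.read() helper:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IpLoginStreamingPool {
    public static void main(String[] args) throws Exception {
        // Hypothetical input paths; the original code reads them from
        // SystemConfigUtils.getPropertyArray("hive.orc.path", ",").
        String[] paths = { "/data/log/ip_login/part-0", "/data/log/ip_login/part-1" };

        // Limit the number of simultaneous Hive streaming connections.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String path : paths) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        List<String> lines = FileUtils.read(path); // author's file helper
                        IpLoginStreaming.write(lines);             // streaming write from the class above
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}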

4. Preview

The results of the run are illustrated by two screenshots in the original post:

    • Partition details

    • Number of records under this partition

5. Summary

When using Hive Streaming to append to ORC, besides the table itself having to be bucketed (and transactional), the project's dependencies are fairly involved: it pulls in Hadoop, Hive, and related projects. I recommend building it as a Maven project (typically with the hive-hcatalog-streaming and hive-hcatalog-core artifacts plus the matching Hadoop client jars) and letting Maven resolve the JAR dependencies between the packages for us.

6. Concluding remarks

That is all I want to share in this blog post. If you run into any problems while studying this, you can join the discussion group or send me an email, and I will do my best to answer you. Let's encourage each other!
