Hive Streaming: Appending to ORC Files


1. Overview

As a business grows, the data that its Hive tables store on HDFS keeps increasing, and keeping that data in plain text format consumes a huge amount of storage. We therefore need a way to reduce capacity costs. Hive's ORC file format can significantly reduce storage costs, so today I will share how to stream data into a Hive ORC table.

2. Content

2.1 ORC

First, we need to know what ORC is in Hive. Before ORC, Hive had the RC file format; ORC is an optimization of the RC file that provides an efficient way to store Hive data and improves Hive read and write performance. Its advantages are as follows:

    • Reduced load on NameNode
    • Support for complex data types (such as list, map, struct, etc.)
    • The file contains an index
    • Block compression
    • ...

The file structure diagram (from the Apache ORC official website) is not reproduced here; for more details, see the official introduction: [Entry Address]
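To make the stripe and footer metadata a bit more concrete, here is a minimal sketch (not from the original post) that prints some of an ORC file's metadata; it assumes the hive-exec jar on the classpath and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;

public class OrcMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to an ORC file on HDFS.
        Path path = new Path("/tmp/example.orc");
        Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(new Configuration()));

        // The file footer records row counts and the stripe layout,
        // which is what allows Hive to skip data it does not need to read.
        System.out.println("rows    : " + reader.getNumberOfRows());
        System.out.println("stripes : " + reader.getStripes().size());
    }
}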

2.2 Usage

Now that we know the structure of an ORC file, how do we go about using an ORC table? Let's create an example table for streaming records, as shown below:

create table alerts (id int, msg string)
    partitioned by (continent string, country string)
    clustered by (id) into 5 buckets
    stored as orc tblproperties ("transactional"="true"); -- bucketing, ORC storage and the transactional property are required for streaming

It is important to note that a table used for streaming must be stored as ORC, must be bucketed (clustered by a column into buckets), and must have the transactional property set; the Metastore also has to be configured for ACID transactions (for example, hive.txn.manager set to DbTxnManager and hive.support.concurrency set to true).

Next, we insert some data to simulate the streaming process, as shown in the code below:

String dbName = "testing";
String tblName = "alerts";
ArrayList<String> partitionVals = new ArrayList<String>(2);
partitionVals.add("Asia");
partitionVals.add("India");
String serdeClass = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";
// Column names of the alerts table, used by the DelimitedInputWriter below.
String[] fieldNames = new String[] { "id", "msg" };

HiveEndPoint hiveEP = new HiveEndPoint("thrift://x.y.com:9083", dbName, tblName, partitionVals);

If the table has multiple partition columns, we add a value for each of them to the partition-value collection and the endpoint loads that partition. Note that the Metastore service must be running (for example, started with hive --service metastore) so that Hive's Thrift service is available.
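Each HiveEndPoint addresses one concrete partition, so writing to several partitions means creating one endpoint (and connection) per partition. Here is a minimal sketch of that, not from the original post, assuming the testing.alerts table and Metastore URI used above:

import java.util.Arrays;
import java.util.List;

import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;

public class MultiPartitionSketch {
    public static void main(String[] args) throws Exception {
        // One endpoint per concrete (continent, country) partition of the alerts table.
        List<String> asiaIndia = Arrays.asList("Asia", "India");
        List<String> asiaChina = Arrays.asList("Asia", "China");

        HiveEndPoint indiaEP = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaIndia);
        HiveEndPoint chinaEP = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaChina);

        // newConnection(true) asks the Metastore to create the partition if it does not exist yet.
        StreamingConnection indiaConn = indiaEP.newConnection(true);
        StreamingConnection chinaConn = chinaEP.newConnection(true);

        // ... fetch transaction batches and write records against each connection ...

        indiaConn.close();
        chinaConn.close();
    }
}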

//-------   Thread 1   -------//
StreamingConnection connection = hiveEP.newConnection(true);
DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 1 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("1,Hello streaming".getBytes());
txnBatch.write("2,Welcome to streaming".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 1 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("3,Roshan Naik".getBytes());
    txnBatch.write("4,Alan Gates".getBytes());
    txnBatch.write("5,Owen O'Malley".getBytes());
    txnBatch.commit();
}
txnBatch.close();

txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 2 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("6,David Schorow".getBytes());
txnBatch.write("7,Sushant Sowmyan".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 2 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("8,Ashutosh Chauhan".getBytes());
    txnBatch.write("9,Thejas Nair".getBytes());
    txnBatch.commit();
}
txnBatch.close();

connection.close();

Running the code above writes the streaming data into the ORC table; we can then query the table to confirm that the records were stored.
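The original post shows the query results as a screenshot. As a minimal sketch of how such a check could be done programmatically (not from the original post, and assuming a HiveServer2 instance on master:10000 serving the testing database), the rows can be counted over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyAlerts {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 address; adjust host, port and credentials to your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/testing", "hive", "");
        Statement stmt = conn.createStatement();

        // Count the rows streamed into the partition written above.
        ResultSet rs = stmt.executeQuery(
                "SELECT COUNT(*) FROM alerts WHERE continent = 'Asia' AND country = 'India'");
        while (rs.next()) {
            System.out.println("rows in partition: " + rs.getLong(1));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}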

3. Case study

Below, we will complete a case study. The scenario: every day, a large amount of business data is reported to a designated server; a staging service splits it by business line and forwards each stream to its own log node; an ETL service then loads the data into Hive tables. Here we only cover the Hive loading step: getting the data, processing it, and writing it into the Hive ORC table. The concrete implementation code looks like this:

/**
 * @Date
 * @Author smartloli
 * @Email [Email protected]
 * @Note TODO
 */
public class IpLoginStreaming extends Thread {

    private static final Logger LOG = LoggerFactory.getLogger(IpLoginStreaming.class);

    private String path = "";

    public static void main(String[] args) throws Exception {
        // One thread per configured input path.
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        for (String str : paths) {
            IpLoginStreaming ipLogin = new IpLoginStreaming();
            ipLogin.path = str;
            ipLogin.start();
        }
    }

    @Override
    public void run() {
        List<String> list = FileUtils.read(this.path);
        long start = System.currentTimeMillis();
        try {
            write(list);
        } catch (Exception e) {
            LOG.error("Write path[" + this.path + "] to ORC has error, msg is " + e.getMessage());
        }
        System.out.println("Path[" + this.path + "] spent [" + (System.currentTimeMillis() - start) / 1000.0 + "s]");
    }

    public static void write(List<String> list)
            throws ConnectionError, InvalidPartition, InvalidTable, PartitionCreationFailed,
            ImpersonationFailed, InterruptedException, ClassNotFoundException,
            SerializationError, InvalidColumn, StreamingException {
        String dbName = "default";
        String tblName = "ip_login_orc";
        // One partition value per day.
        ArrayList<String> partitionVals = new ArrayList<String>(1);
        partitionVals.add(CalendarUtils.getDay());
        String[] fieldNames = new String[] { "_bpid", "_gid", "_plat", "_tm", "_uid",
                "ip", "latitude", "longitude", "reg", "tname" };

        StreamingConnection connection = null;
        TransactionBatch txnBatch = null;
        try {
            HiveEndPoint hiveEP = new HiveEndPoint("thrift://master:9083", dbName, tblName, partitionVals);
            HiveConf hiveConf = new HiveConf();
            hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_HADOOP_SUPPORTS_SUBDIRECTORIES, true);
            hiveConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            connection = hiveEP.newConnection(true, hiveConf);
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
            txnBatch = connection.fetchTransactionBatch(10, writer);

            // Batch 1
            txnBatch.beginNextTransaction();
            for (String json : list) {
                // Build a comma-delimited record in fieldNames order from each JSON line.
                String ret = "";
                JSONObject object = JSON.parseObject(json);
                for (int i = 0; i < fieldNames.length; i++) {
                    if (i == (fieldNames.length - 1)) {
                        ret += object.getString(fieldNames[i]);
                    } else {
                        ret += object.getString(fieldNames[i]) + ",";
                    }
                }
                txnBatch.write(ret.getBytes());
            }
            txnBatch.commit();
        } finally {
            if (txnBatch != null) {
                txnBatch.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}

PS: It is recommended to process the data with multiple threads; a thread-pool sketch follows below.
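The class above starts one raw Thread per input path. As an alternative, minimal sketch (not from the original post), a fixed-size thread pool bounds the number of concurrent streaming connections to the Metastore; it assumes the write() method of the IpLoginStreaming class above and the author's FileUtils.read() helper:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IpLoginStreamingPool {
    public static void main(String[] args) throws Exception {
        // Hypothetical input paths; the original code reads them from
        // SystemConfigUtils.getPropertyArray("hive.orc.path", ",").
        String[] paths = { "/data/log/ip_login/part-0", "/data/log/ip_login/part-1" };

        // Limit the number of simultaneous Hive streaming connections.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String path : paths) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        List<String> lines = FileUtils.read(path); // author's file helper
                        IpLoginStreaming.write(lines);             // streaming write from the class above
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}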

4. Preview

The results of the run are illustrated by two screenshots in the original post:

    • Partition details

    • Number of records under this partition

5. Summary

When using Hive Streaming to append to ORC, besides the table itself having to be bucketed (and transactional), the project's dependencies are fairly involved: it pulls in Hadoop, Hive, and related projects. I recommend building it as a Maven project (typically with the hive-hcatalog-streaming and hive-hcatalog-core artifacts plus the matching Hadoop client jars) and letting Maven resolve the JAR dependencies between the packages for us.

6. Concluding remarks

That is all I want to share in this blog post. If you run into any problems while studying this, you can join the discussion group or send me an email, and I will do my best to answer you. Let's encourage each other!
