1. Overview
As a business grows, the data its Hive tables keep on HDFS grows with it, and storing that data in plain text format consumes a huge amount of space. So we need a way to reduce the capacity cost. Hive's ORC file format can significantly reduce storage costs. Today I will share how to stream data into a Hive ORC table.
2. Content
2.1 ORC
First, we need to know what ORC is in Hive. Hive previously had the RC file format; ORC is an optimization of the RC file that provides an efficient way to store Hive data and improves Hive's read and write performance. Its advantages include:
- Reduced load on NameNode
- Support for complex data types (such as list,map,struct, etc.)
- The file contains an index
- Block compression
- ...
The structure chart (from Apache ORC official website) is as follows:
I will not go into more detail here; for further information, you can read the introduction on the official website: [Entry Address]
2.2 Use
Now that we know the structure of an ORC file, how do we use an ORC table? Let's create an example table that will receive streamed records, as shown below:
create table alerts (id int, msg string)
    partitioned by (continent string, country string)
    clustered by (id) into 5 buckets
    stored as orc tblproperties("transactional"="true"); -- transactional ORC tables are required for streaming
It is important to note that when using streaming, the ORC table must be created with buckets (the clustered by ... into n buckets clause above).
Below, we insert some data to simulate the streaming process, as shown in the code that follows:
String DbName = "testing"= "Alerts"; ArrayListNew arraylist<string> (2);p Artitionvals.add ("Asia");p Artitionvals.add ( "India"= "Org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; New Hiveendpoint ("thrift://x.y.com:9083", DbName, Tblname, partitionvals);
If the table has multiple partition columns, we add each partition value to the partition collection in the same order and they are loaded together, as the sketch below illustrates. Also note that the Hive Metastore service must be running, so that Hive's Thrift service is available.
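For illustration only (this sketch is not from the original post): the values added to partitionVals must follow the order of the partition columns in the DDL (continent first, then country), and each HiveEndPoint writes to exactly one partition, so a second partition (say Asia/China) needs its own endpoint.

ArrayList<String> asiaIndia = new ArrayList<String>(2);
asiaIndia.add("Asia");   // continent
asiaIndia.add("India");  // country
HiveEndPoint indiaEndPt = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaIndia);

ArrayList<String> asiaChina = new ArrayList<String>(2);
asiaChina.add("Asia");   // continent
asiaChina.add("China");  // country
HiveEndPoint chinaEndPt = new HiveEndPoint("thrift://x.y.com:9083", "testing", "alerts", asiaChina);

The write path itself, which groups writes into transaction batches, looks like this: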
//------- Thread 1 -------//
StreamingConnection connection = hiveEP.newConnection(true);
DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 1 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("1,Hello streaming".getBytes());
txnBatch.write("2,Welcome to streaming".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 1 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("3,Roshan Naik".getBytes());
    txnBatch.write("4,Alan Gates".getBytes());
    txnBatch.write("5,Owen O'Malley".getBytes());
    txnBatch.commit();
}
txnBatch.close(); // done with the first batch of transactions

txnBatch = connection.fetchTransactionBatch(10, writer);

///// Batch 2 - first TXN
txnBatch.beginNextTransaction();
txnBatch.write("6,David Schorow".getBytes());
txnBatch.write("7,Sushant Sowmyan".getBytes());
txnBatch.commit();

if (txnBatch.remainingTransactions() > 0) {
    ///// Batch 2 - second TXN
    txnBatch.beginNextTransaction();
    txnBatch.write("8,Ashutosh Chauhan".getBytes());
    txnBatch.write("9,Thejas Nair".getBytes());
    txnBatch.commit();
}
txnBatch.close();
connection.close();
After running the above, the streaming data is written into the ORC table for storage. The result is as follows:
3. Case studies
Below, we work through a case study. The scenario: every day, a large volume of business data is reported to a designated server; a staging service splits the data by business and forwards each business's data to its own log node; an ETL service then loads the data into Hive tables. Here we only cover the loading step: taking the data, processing it, and writing it into a Hive ORC table. The implementation code is as follows:
/**
 * @Date
 * @Author smartloli
 * @Email [Email protected]
 * @Note TODO
 */
public class IPLoginStreaming extends Thread {

    private static final Logger LOG = LoggerFactory.getLogger(IPLoginStreaming.class);

    private String path = "";

    public static void main(String[] args) throws Exception {
        // SystemConfigUtils, FileUtils and CalendarUtils are the project's own utility classes.
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        for (String str : paths) {
            IPLoginStreaming ipLogin = new IPLoginStreaming();
            ipLogin.path = str;
            ipLogin.start();
        }
    }

    @Override
    public void run() {
        List<String> list = FileUtils.read(this.path);
        long start = System.currentTimeMillis();
        try {
            write(list);
        } catch (Exception e) {
            LOG.error("Write path[" + this.path + "] ORC have error, msg is " + e.getMessage());
        }
        System.out.println("Path[" + this.path + "] spent [" + (System.currentTimeMillis() - start) / 1000.0 + "s]");
    }

    public static void write(List<String> list)
            throws ConnectionError, InvalidPartition, InvalidTable, PartitionCreationFailed,
            ImpersonationFailed, InterruptedException, ClassNotFoundException,
            SerializationError, InvalidColumn, StreamingException {
        String dbName = "default";
        String tblName = "ip_login_orc";
        ArrayList<String> partitionVals = new ArrayList<String>(1);
        partitionVals.add(CalendarUtils.getDay()); // one partition per day
        String[] fieldNames = new String[] { "_bpid", "_gid", "_plat", "_tm", "_uid",
                "ip", "latitude", "longitude", "reg", "tname" };

        StreamingConnection connection = null;
        TransactionBatch txnBatch = null;
        try {
            HiveEndPoint hiveEP = new HiveEndPoint("thrift://master:9083", dbName, tblName, partitionVals);
            HiveConf hiveConf = new HiveConf();
            hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_HADOOP_SUPPORTS_SUBDIRECTORIES, true);
            hiveConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            connection = hiveEP.newConnection(true, hiveConf);
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
            txnBatch = connection.fetchTransactionBatch(10, writer);

            // Batch 1: turn each JSON record into a comma-delimited row and write it.
            txnBatch.beginNextTransaction();
            for (String json : list) {
                String ret = "";
                JSONObject object = JSON.parseObject(json);
                for (int i = 0; i < fieldNames.length; i++) {
                    if (i == (fieldNames.length - 1)) {
                        ret += object.getString(fieldNames[i]);
                    } else {
                        ret += object.getString(fieldNames[i]) + ",";
                    }
                }
                txnBatch.write(ret.getBytes());
            }
            txnBatch.commit();
        } finally {
            if (txnBatch != null) {
                txnBatch.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}
PS: It is recommended to use multithreading to process data.
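The case study above starts one raw Thread per path. A bounded thread pool is a common alternative; below is a minimal sketch, not part of the original project, that reuses the write() and FileUtils.read() helpers from the case study (the class name and the pool size of 4 are arbitrary).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IPLoginStreamingPool {

    public static void main(String[] args) throws Exception {
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        // One task per log path; each call to write() opens its own StreamingConnection,
        // so nothing is shared between threads.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String path : paths) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        IPLoginStreaming.write(FileUtils.read(path));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES); // wait for all paths to finish
    }
}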
4. Preview
The implementation results are as follows:
- Number of records under this partition
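For reference, a count like the one above can also be obtained with a plain Hive query. Below is a minimal sketch over Hive JDBC; it assumes HiveServer2 is listening on master:10000, that the connection user is simply "hive", and that ip_login_orc is partitioned by a column named dt (the actual partition column name is not shown in this post). CalendarUtils.getDay() is the same project helper the case study uses to build the partition value.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountPartitionRows {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: HiveServer2 on master:10000; adjust the URL and credentials for your cluster.
        Connection con = DriverManager.getConnection("jdbc:hive2://master:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Assumption: the partition column is named "dt".
        ResultSet rs = stmt.executeQuery(
                "select count(*) from ip_login_orc where dt = '" + CalendarUtils.getDay() + "'");
        if (rs.next()) {
            System.out.println("Records in this partition: " + rs.getLong(1));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}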
5. Summary
When using Hive Streaming to append to an ORC table, not only must the table itself be bucketed, but the project also has fairly complex dependencies involving Hadoop, Hive, and other projects. I recommend building it as a Maven project and letting Maven resolve the JAR dependencies between the various packages for us.
6. Concluding remarks
That is all I have to share in this blog post. If you run into any problems while studying, you can join the discussion group or send me an e-mail, and I will do my best to answer you. Let us encourage each other!