HCatalog is an Apache open-source project that provides a unified table and storage management service on top of Hadoop data. The latest release at the time of writing is 0.5, but it requires Hive 0.10; since our Hive cluster runs version 0.9.0, we had to fall back to HCatalog 0.4. Because all of HCatalog's underlying metadata is stored in the Hive Metastore, any schema or API change introduced by a Hive upgrade affects HCatalog as well. Hive 0.11 has already integrated HCatalog, which will eventually become part of Hive rather than remaining a stand-alone project.
HCatalog is fundamentally dependent on the Hive Metastore: during execution it creates a HiveMetaStoreClient and obtains table metadata through that client's API. In local metastore mode the client directly returns a HiveMetaStore.HMSHandler; in remote mode (hive.metastore.local set to false) it takes the list of URIs configured in hive.metastore.uris (for example thrift://10.1.8.42:9083,thrift://10.1.8.51:9083) and tries to connect to them in order, stopping at the first one that succeeds. Since every client would otherwise connect to the first URI and overload it, I added a small trick: randomly shuffle the URI list first to balance the load.
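The shuffle itself is only a few lines. Below is a minimal sketch of the idea (not the actual patch; ShuffledMetastoreClient and its newClient helper are hypothetical names), assuming the standard HiveConf/HiveMetaStoreClient API:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class ShuffledMetastoreClient {
    // Hypothetical helper: shuffle hive.metastore.uris before the client walks
    // the list, so connections are spread across the metastore instances.
    public static HiveMetaStoreClient newClient(HiveConf conf) throws Exception {
        String uris = conf.getVar(HiveConf.ConfVars.METASTOREURIS);
        if (uris != null && uris.contains(",")) {
            List<String> uriList = new ArrayList<String>(Arrays.asList(uris.split(",")));
            Collections.shuffle(uriList); // randomize the connection order
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < uriList.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append(uriList.get(i).trim());
            }
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, sb.toString());
        }
        return new HiveMetaStoreClient(conf);
    }
}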
Because Kerberos security is enabled on our cluster, we need to obtain a delegation token; delegation tokens are not supported in local metastore mode, so only remote mode can be used:
HiveMetaStoreClient.java:

public String getDelegationToken(String owner, String renewerKerberosPrincipalName)
    throws MetaException, TException {
  if (localMetaStore) {
    throw new UnsupportedOperationException("getDelegationToken() can be " +
        "called only in thrift (non local) mode");
  }
  return client.get_delegation_token(owner, renewerKerberosPrincipalName);
}
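Once a client has been created in remote mode, the string returned by getDelegationToken can be decoded into a Hadoop token and attached to the job's credentials so tasks can authenticate to the metastore. A hedged sketch (the job variable, renewer principal and token alias below are assumptions, not values from the original post):

// Fetch a metastore delegation token and add it to the job credentials.
HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
String tokenStr = client.getDelegationToken(
        UserGroupInformation.getCurrentUser().getUserName(), // token owner
        "hive/_HOST@EXAMPLE.COM");                           // renewer principal (assumed)
Token<DelegationTokenIdentifier> token = new Token<DelegationTokenIdentifier>();
token.decodeFromUrlString(tokenStr);
job.getCredentials().addToken(new Text("hive.metastore.delegation.token"), token); // alias is arbitrary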
HCatInputFormat and HCatOutputFormat provide MapReduce APIs for reading from and writing to tables.
HCatInputFormat API:
public static void setInput(Job job, InputJobInfo inputJobInfo) throws IOException;
Instantiate an InputJobInfo object with its three parameters (dbName, tableName, filter) and pass it to setInput to read the corresponding data, as in the sketch below.
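A minimal sketch of the read side (the database, table and filter values here are just examples, not from the original post):

Job job = new Job(new Configuration(), "read-via-hcatalog");
// database "default", table "access_log", filter restricts the read to one partition
InputJobInfo inputInfo = InputJobInfo.create("default", "access_log", "dt=\"2013-06-13\"");
HCatInputFormat.setInput(job, inputInfo);
job.setInputFormatClass(HCatInputFormat.class);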
public static HCatSchema getTableSchema(JobContext context) throws IOException;
At run time (for example in the mapper's setup method), you can pass in the JobContext and call the static getTableSchema to obtain the table schema that was set by the earlier setInput call.
HCatOutputFormat API:
public static void setOutput(Job job, OutputJobInfo outputJobInfo) throws IOException;
OutputJobInfo accepts three parameters: databaseName, tableName and partitionValues, where the third parameter is of type Map<String, String>; the partition keys go into the map keys and the partition values into the corresponding map values. It may also be null or an empty map. If the specified partition already exists, an org.apache.hcatalog.common.HCatException is thrown: 2002 : Partition already present with given partition key values.
For example, to write to the partition (dt='2013-06-13', country='china') you can write:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("dt", "2013-06-13");
partitionValues.put("country", "china");
OutputJobInfo info = OutputJobInfo.create(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
public static HCatSchema getTableSchema(JobContext context) throws IOException;
Gets the schema of the table that was specified in the earlier HCatOutputFormat.setOutput call.
public static void setSchema(final Job job, final HCatSchema schema) throws IOException;
Sets the schema for the data that will be written; if this method is not called, the table schema is used by default.
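For instance, to write only a subset of columns rather than the full table schema, a custom schema could be built and set explicitly; a hedged sketch (field names are illustrative):

List<HCatFieldSchema> fields = new ArrayList<HCatFieldSchema>();
fields.add(new HCatFieldSchema("guid", HCatFieldSchema.Type.STRING, null));
fields.add(new HCatFieldSchema("count", HCatFieldSchema.Type.INT, null));
HCatOutputFormat.setSchema(job, new HCatSchema(fields));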
The following is a complete MapReduce example that counts how many times each guid accesses the page per day: the map phase reads the guid field from the input table, the reduce phase sums the pageviews for each guid, and the result is written back to another table with guid and count fields.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByGuid extends Configured implements Tool {

    @SuppressWarnings("rawtypes")
    public static class Map extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        HCatSchema schema;
        Text guid;
        IntWritable one;

        @Override
        protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
                throws IOException, InterruptedException {
            guid = new Text();
            one = new IntWritable(1);
            // schema of the input table, as set by HCatInputFormat.setInput
            schema = HCatInputFormat.getTableSchema(context);
        }

        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            guid.set(value.getString("guid", schema));
            context.write(guid, one);
        }
    }

    @SuppressWarnings("rawtypes")
    public static class Reduce extends Reducer<Text, IntWritable, WritableComparable, HCatRecord> {
        HCatSchema schema;

        @Override
        protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
                throws IOException, InterruptedException {
            // schema of the output table, as set by HCatOutputFormat.setOutput
            schema = HCatOutputFormat.getTableSchema(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            Iterator<IntWritable> iter = values.iterator();
            while (iter.hasNext()) {
                sum++;
                iter.next();
            }
            HCatRecord record = new DefaultHCatRecord(2);
            record.setString("guid", schema, key.toString());
            record.setInteger("count", schema, sum);
            context.write(null, record);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        String dbName = args[0];
        String inputTable = args[1];
        String filter = args[2];
        String outputTable = args[3];
        int reduceNum = Integer.parseInt(args[4]);

        Job job = new Job(conf, "GroupByGuid, calculating every guid's pageview");
        HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTable, filter));
        job.setJarByClass(GroupByGuid.class);
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setNumReduceTasks(reduceNum);
        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTable, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, s);
        job.setOutputFormatClass(HCatOutputFormat.class);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new GroupByGuid(), args);
        System.exit(exitCode);
    }
}
In fact, HCatalog also supports dynamic partitioning: we can specify only some (or none) of the partition key/value pairs in OutputJobInfo, and at run time a single job can write to multiple partitions at once by setting the remaining partition key/value pairs on each HCatRecord according to the incoming data.
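A hedged sketch of that pattern, assuming the output table is partitioned by dt and that the schema passed to setSchema contains the dt column (names and the dateOfRecord variable are illustrative):

// Driver: no static partition values, so partitions are resolved per record.
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTable, null));
HCatSchema schema = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, schema);

// Reducer: the partition column "dt" is set on each record, so one job can
// write to several partitions at the same time.
HCatRecord record = new DefaultHCatRecord(schema.size());
record.setString("guid", schema, key.toString());
record.setInteger("count", schema, sum);
record.setString("dt", schema, dateOfRecord); // partition value derived from the input row
context.write(null, record);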
Author: CSDN blog lalaguozhe