HCatalog Introduction and Use


HCatalog is an Apache open-source unified service platform for managing tables and the underlying data. The latest release is 0.5, but it requires Hive 0.10; since our Hive cluster runs version 0.9.0, we had to fall back to HCatalog 0.4. Because all of HCatalog's underlying metadata is stored in the Hive Metastore, schema or API changes introduced by a Hive version upgrade will affect HCatalog. Hive 0.11 has already integrated HCatalog, which will become part of Hive rather than remain a stand-alone project.

HCatalog is built on the Hive Metastore: at execution time it creates a HiveMetaStoreClient and obtains table structure information through the APIs that instance provides. In local metastore mode it returns a HiveMetaStore.HMSHandler directly; in remote mode (hive.metastore.local set to false), it establishes a connection according to the sequence of URIs configured in hive.metastore.uris (for example thrift://10.1.8.42:9083,thrift://10.1.8.51:9083), trying them in order and stopping at the first one that connects. To avoid every client connecting to the first URI and overloading it, I added a small trick: randomly shuffling the URI list to get load balancing.
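A minimal sketch of that shuffle trick (the helper name and surrounding plumbing are my assumptions, not the actual patch):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: shuffle the comma-separated value of hive.metastore.uris so
// each client tries the metastores in a random order instead of every
// client hammering the first URI.
public static List<String> shuffledMetastoreUris(String urisProperty) {
    List<String> uris = new ArrayList<String>(Arrays.asList(urisProperty.split(",")));
    Collections.shuffle(uris);  // randomize per client for load balancing
    return uris;                // caller tries each URI until one connects
}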

Because our cluster has Kerberos security enabled, we need to obtain a delegation token. Local mode does not support this, so only remote mode can be used:

HiveMetaStoreClient.java

public String getDelegationToken(String owner, String renewerKerberosPrincipalName) throws
    MetaException, TException {
  if (localMetaStore) {
    // Delegation tokens are only issued over Thrift; local mode has no server
    throw new UnsupportedOperationException("getDelegationToken() can be " +
        "called only in thrift (non local) mode");
  }
  return client.get_delegation_token(owner, renewerKerberosPrincipalName);
}

HCatInputFormat and HCatOutputFormat provide MapReduce APIs for reading from and writing to tables.

HCatInputFormat API:

public static void setInput(Job job,
    InputJobInfo inputJobInfo) throws IOException;

Instantiate an InputJobInfo object with three parameters (dbName, tableName, filter) and pass it to setInput to read the corresponding data.
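A minimal sketch, assuming a database default, a table pageview_log, and a partition filter (all hypothetical):

Job job = new Job(getConf(), "hcat-read-example");
// Database, table and filter values are placeholders for illustration
HCatInputFormat.setInput(job,
    InputJobInfo.create("default", "pageview_log", "dt='2013-06-13'"));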

public static HCatSchema getTableSchema(JobContext context)
    throws IOException;

At runtime (for example, in the Mapper's setup function), you can pass in the JobContext and invoke the static getTableSchema to obtain the table schema that was set by the earlier setInput call.
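A minimal sketch of that pattern (the complete example further below does the same):

HCatSchema schema;  // saved for use in map()

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Recovers the schema of the table configured via HCatInputFormat.setInput
    schema = HCatInputFormat.getTableSchema(context);
}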

HCatOutputFormat API:

public static void setOutput(Job job, OutputJobInfo outputJobInfo) throws IOException;

OutputJobInfo accepts three parameters: databaseName, tableName, and partitionValues. The third parameter has type Map&lt;String, String&gt;; the partition key goes in the map key and the partition value in the corresponding map value. You can pass null or an empty map. If the specified partition already exists, it will throw org.apache.hcatalog.common.HCatException: 2002 : Partition already present with given partition key values.

For example, to write to the partition (dt='2013-06-13', country='china'), you can write:

Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("dt", "2013-06-13");
partitionValues.put("country", "china");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);

public static HCatSchema getTableSchema(JobContext context) throws IOException;

Gets the schema of the table that was specified in the preceding HCatOutputFormat.setOutput call.

public static void setSchema(final Job job, final HCatSchema schema) throws IOException;

Sets the schema for the data that will be written; if this method is not invoked, the table's own schema is used by default.
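In the common case the calls are simply chained, as the complete example below also does; a minimal sketch:

// Write with the output table's own schema (the default pattern)
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTable, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);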


The following is a complete MapReduce example that counts how many times each GUID accesses a page per day: the map phase reads the guid field from the input table, and the reduce phase sums the total pageviews for each GUID, then writes guid and count fields back to another table.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByGuid extends Configured implements Tool {

    @SuppressWarnings("rawtypes")
    public static class Map extends
            Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        HCatSchema schema;
        Text guid;
        IntWritable one;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            guid = new Text();
            one = new IntWritable(1);
            // Schema of the input table, as configured by HCatInputFormat.setInput
            schema = HCatInputFormat.getTableSchema(context);
        }

        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            // Emit (guid, 1) for every row read from the table
            guid.set(value.getString("guid", schema));
            context.write(guid, one);
        }
    }

    @SuppressWarnings("rawtypes")
    public static class Reduce extends
            Reducer<Text, IntWritable, WritableComparable, HCatRecord> {
        HCatSchema schema;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Schema of the output table, as configured by HCatOutputFormat.setOutput
            schema = HCatOutputFormat.getTableSchema(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            Iterator<IntWritable> iter = values.iterator();
            while (iter.hasNext()) {
                sum++;
                iter.next();
            }
            // Write one (guid, count) record back through HCatalog
            HCatRecord record = new DefaultHCatRecord(2);
            record.setString("guid", schema, key.toString());
            record.setInteger("count", schema, sum);
            context.write(null, record);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        String dbName = args[0];
        String inputTable = args[1];
        String filter = args[2];
        String outputTable = args[3];
        int reduceNum = Integer.parseInt(args[4]);

        Job job = new Job(conf, "GroupByGuid, calculating every guid's pageview");
        HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTable, filter));
        job.setJarByClass(GroupByGuid.class);
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setNumReduceTasks(reduceNum);

        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTable, null));
        // Use the output table's own schema for the records we write
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, s);
        job.setOutputFormatClass(HCatOutputFormat.class);

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new GroupByGuid(), args);
        System.exit(exitCode);
    }
}

In fact, HCatalog also supports dynamic partitioning: you can specify only part of the partition key/value pairs in OutputJobInfo and, at runtime, write to multiple partitions in a single job by setting the remaining partition key/value pairs on each HCatRecord according to the incoming values.
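A minimal sketch under assumed names (a table partitioned by dt and country, where country is resolved per record; countryValue is hypothetical):

// Static part of the partition spec: only dt is fixed up front
Map<String, String> partialValues = new HashMap<String, String>();
partialValues.put("dt", "2013-06-13");
HCatOutputFormat.setOutput(job,
    OutputJobInfo.create(dbName, outputTable, partialValues));

// In the reducer, set the unspecified partition key (country) as an ordinary
// field; records with different values land in different partitions
HCatRecord record = new DefaultHCatRecord(3);
record.setString("guid", schema, key.toString());
record.setInteger("count", schema, sum);
record.setString("country", schema, countryValue);  // dynamic partition value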

Author: CSDN blog lalaguozhe
