HBase-TDG Client API Advanced Features

Document directory
  • Filters
  • Counters
  • Coprocessors
  • HTablePool
  • Connection handling
  • Schema Definition
  • HBaseAdmin
  • Introduction to REST, Thrift, and Avro
  • Interactive clients
  • Batch clients
  • Shell
  • Web-based UI
Advanced Features

Filters

HBase filters are a powerful feature that can greatly enhance your effectiveness when working with data stored in tables. You will find predefined filters, already provided by HBase for your use, but also a framework you can use to implement your own. You will now be introduced to both.

As the name implies, some filtering needs to be done during get and scan operations, so filters are required. HBase implements many filters that you can use directly, and of course you can also define custom filters.

Let's look at a simple example: CompareFilter. You need to provide a comparison operator (<, >, =, and so on) and the comparison logic in the form of a comparator.

CompareFilter(CompareOp valueCompareOp, WritableByteArrayComparable valueComparator)

There are also many filters based on CompareFilter, such as RowFilter, FamilyFilter, and QualifierFilter.

RowFilter gives you the ability to filter data based on row keys.

Filter filter1 = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes("row-22")));
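The following is a minimal sketch of applying such a filter to a scan; the table name "testtable" and the HTable setup are assumptions for illustration only.

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");   // assumed table name
Scan scan = new Scan();
scan.setFilter(filter1);                        // the RowFilter created above
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
  System.out.println(res);                      // only rows with keys <= "row-22"
}
scanner.close();
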
Finally, there is a table listing the filters provided by HBase, which can be used for reference.
 
Counters

Next to the already discussed functionality, HBase offers another advanced feature: counters. Applications that collect statistics, such as clicks or views in online advertising, used to collect the data in log files that would subsequently be analyzed. Using counters has the potential of switching to live accounting, forgoing the delayed batch-processing step completely.

You could implement a counter operation yourself: you would have to lock a row, read the value, increment it, write it back, and eventually unlock the row so that other writers can access it. HBase, however, provides this as an atomic operation in a single client-side call; the steps are performed directly on the server.

For example:
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount) throws IOException
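As a hedged sketch of how this call might be used, incrementing a daily click counter could look like the following; the table name "counters", the family "daily", the qualifier "hits", and the row key are made-up names for illustration.

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "counters");    // assumed counter table
// atomically adds 1 to the cell and returns the new value in one server-side call
long current = table.incrementColumnValue(Bytes.toBytes("20110101"),
    Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
System.out.println("Counter is now: " + current);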

 

Coprocessors

So far you have seen how you can use, for example, filters to reduce the amount of data being sent over the network from the servers to the client. Another feature in HBase allows you to move part of the computation to where the data lives: coprocessors.

Introduction to coprocessors

Using the client API, combined with specific selector mechanisms such as filters or column family scoping, it is possible to limit what data is transferred to the client. It would be good, though, to take this further and, for example, perform certain operations directly on the server side while only returning a small result set. Think of this as a small MapReduce framework that distributes work across the entire cluster.

Coprocessors enable you to run arbitrary code directly on each region server. More precisely, the code executes on a per-region basis, giving you trigger-like functionality, similar to stored procedures in the RDBMS world. From the client side you do not have to take specific actions, as the framework handles the distributed nature transparently.

In other words, to improve efficiency you can define code that is executed on the server, much like stored procedures.

 

Use cases for coprocessors are, for instance, using hooks into row mutation operations to maintain secondary indexes, or to implement some kind of referential integrity. Filters could be enhanced to become stateful and therefore make decisions across row boundaries. Aggregate functions, such as sum() or avg(), known from RDBMSs and SQL, could be moved to the servers to scan the data locally and only return the single-number result across the network.
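As a rough, non-authoritative sketch of the observer style of coprocessor (based on the 0.92-era RegionObserver API), the following class hooks into every put on a region; the class name and the logging behavior are illustrative assumptions only.

public class AuditObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> env,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    // runs on the region server before the put is applied; a real implementation
    // could maintain a secondary index or enforce referential integrity here
    System.out.println("About to store row: " + Bytes.toString(put.getRow()));
  }
}

Such a class would then be registered with the region servers (for example, through the table schema or the server configuration) rather than being invoked directly by the client.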

 

HTablePool

Instead of creating an HTable instance for every request from your client application, it makes more sense to create one initially and then subsequently reuse it.

The primary reason for doing so is that creating an HTable instance is a fairly expensive operation that can take a few seconds to complete. In a highly contended environment with thousands of requests per second you would not be able to use this approach at all; creating the HTable instances would be too slow. You need to create the instances at startup and use them for the duration of your client's life cycle.

Configuration conf = HBaseConfiguration.create();
HTablePool pool = new HTablePool(conf, 5);           // the pool holds at most 5 references
HTableInterface[] tables = new HTableInterface[10];
for (int n = 0; n < 10; n++) {
  // although the pool size is 5, you can still get 10 HTable instances without a problem
  tables[n] = pool.getTable("testtable");
  System.out.println(Bytes.toString(tables[n].getTableName()));
}
for (int n = 0; n < 5; n++) {
  // but the pool keeps at most five of them; further returned instances are discarded
  pool.putTable(tables[n]);
}
pool.closeTablePool("testtable");

 

Connection handling

Every instance of htable requires a connection to the remote servers.

This is internally represented by the HConnection class, and more importantly managed process-wide by the shared HConnectionManager class. From a user perspective there is usually no immediate need to deal with either of these two classes; instead, you simply create a new Configuration instance and use that with your client API calls.


Internally, the connections are keyed in a map, where the key is the Configuration instance you are using.

In other words, if you create a number of HTable instances while providing the same Configuration reference, they all share the same underlying HConnection instance.
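A minimal sketch of this sharing behavior, assuming a table called "testtable" exists:

Configuration conf = HBaseConfiguration.create();
// both instances are handed the very same Configuration reference ...
HTable table1 = new HTable(conf, "testtable");
HTable table2 = new HTable(conf, "testtable");
// ... so they share one underlying HConnection (and its ZooKeeper session)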

To summarize: for communication between HTable and the remote servers, connections need to be created. These connections can be shared, as long as you use the same Configuration instance to create the HTable instances. Sharing connections has some advantages, such as sharing ZooKeeper connections and caching common resources. HTable instances in an HTablePool all share the same Configuration by default, so they automatically share connections.

Administrative features

Apart from the client API used to deal with data manipulation features, HBase also exposes a data definition-like API. This is similar to the separation into DDL and DML found in RDBMSs.

 

Schema Definition

Creating a table in HBase implicitly involves the definition of a table schema, as well as the schemas for all contained column families.

They define the pertinent characteristics of how, and when, the data inside the table and columns is ultimately stored.

HBase provides classes to define the various attributes of tables and column families:

HTableDescriptor(HTableDescriptor desc);
HColumnDescriptor(HColumnDescriptor desc);
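As a brief sketch of how these descriptor classes are typically used together (the table and column family names are illustrative assumptions):

HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("testtable"));
HColumnDescriptor coldef = new HColumnDescriptor(Bytes.toBytes("colfam1"));
coldef.setMaxVersions(3);   // keep up to three versions of every cell
desc.addFamily(coldef);     // attach the column family schema to the table schema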

 

HBaseAdmin

Just as with the client API, you also have an API for administrative tasks at your disposal. Compare this to the Data Definition Language (DDL) found in RDBMSs, while the client API is more analogous to the Data Manipulation Language (DML).

It provides operations to create tables with specific column families, check for table existence, alter table and column family definitions, drop tables, and much more. The provided functions can be grouped into related operations, discussed separately below.

In short, HBaseAdmin provides the interface to create, alter, and drop tables, and so on.
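A hedged sketch of typical administrative calls follows; desc is the table descriptor built in the previous section, conf is a Configuration instance as created earlier, and the table name is an assumption:

HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.tableExists("testtable")) {
  admin.createTable(desc);          // create the table with its column families
}
admin.disableTable("testtable");    // a table must be disabled before altering or dropping it
admin.deleteTable("testtable");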

 

Available clients

HBase comes with a variety of clients that can be used from various programming languages. This chapter gives you an overview of what is available.

 

Introduction to REST, Thrift, and Avro

Access to HBase is possible from virtually every popular programming language and environment. You either use the client API directly, or access it through some sort of proxy that translates your request into an API call. These proxies wrap the native Java API into other protocol APIs, so that clients can be written in any language the external API provides. Typically the external API is implemented in a dedicated Java-based server that can internally use the provided HTable client API. This simplifies the implementation and maintenance of these gateway servers.

First, all access to HBase ultimately goes through HTable; the other access methods are encapsulations of HTable. To access HBase from other languages, the Java objects are wrapped and corresponding interfaces are exposed.

Where should the HTable instances be created? There are two options: directly on the client, or on a gateway. Because creating an HTable instance is resource-consuming, reuse (for example, via HTablePool) is usually desired, so a gateway is often chosen (the gateway can run on the same server as the database).

The remaining question is how the client communicates with the gateway: the client sends a request to the gateway, and the gateway converts the request into HTable calls that access the real data.

The first choice is REST.

The protocol between the gateways and the clients is then driven by the available choices and requirements of the remote client. An obvious choice is Representational State Transfer (abbreviated REST) [67], which is based on existing web-based technologies. The actual transport is typically HTTP, the standard protocol for web applications. This makes REST ideal for communication between heterogeneous systems: the protocol layer takes care of transporting the data in an interoperable format.

REST defines the semantics so that the protocol can be used in a generic way to address remote resources. By not changing the protocol, REST is compatible with existing technologies, such as web servers and proxies. Resources are uniquely specified as part of the request URI, which is the opposite of, for example, SOAP-based [68] services, which define a new protocol that conforms to a standard.

What are the differences between the REST style and SOAP? For details, refer to material on REST.

REST is based on the HTTP protocol and addresses different resources through URIs, without defining a new protocol.

SOAP requires a custom protocol.

The problem with both REST and SOAP is that they are text-based protocols, so the overhead is relatively high. For huge server farms this becomes inefficient, for example in terms of bandwidth.

Both REST and SOAP, though, suffer from the verbosity level of the protocol. Human-readable text, be it plain or XML-based, is used to communicate between client and server. Transparent compression of the data sent over the network can mitigate this problem to a certain extent.

Therefore, binary protocols are needed to reduce the overhead. Google developed Protocol Buffers, but initially did not publish it; Facebook therefore developed Thrift along similar lines, and the Hadoop project founders developed Avro.

Especially companies with very large server farms, extensive bandwidth usage, and many disjoint services felt the need to reduce the overhead and implemented their own RPC layers. One of them was Google, implementing Protocol Buffers. [69] Since the implementation was initially not published, Facebook developed their own version, named Thrift. [70] The Hadoop project founders started a third project, Apache Avro [71], providing an alternative implementation.

All of them have similar feature sets, vary in the number of languages they support, and have (arguably) slightly better or worse levels of encoding efficiency.

The key difference of Protocol Buffers, compared to Thrift and Avro, is that it has no RPC stack of its own; rather, it generates the RPC definitions, which have to be used with other RPC libraries subsequently.

The three differ mainly in the languages they support; there is no essential difference in encoding efficiency. Also, Protocol Buffers does not ship its own RPC stack and needs to be used with other RPC libraries.

 

HBase ships with auxiliary servers for REST, Thrift, and Avro. They are implemented as standalone gateway servers, which can run on shared or dedicated machines.

Since Thrift and Avro have their own RPC implementation, the gateway servers simply provide a wrapper around them.

For REST, HBase has its own implementation, offering access to the stored data.
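For example, the Java client classes that ship with the REST gateway could be used roughly as follows; the gateway address, the port 8080, the table name, and the row key are assumptions for illustration:

Cluster cluster = new Cluster();
cluster.add("localhost", 8080);                      // address of the running REST gateway
Client client = new Client(cluster);
RemoteHTable table = new RemoteHTable(client, "testtable");
Result result = table.get(new Get(Bytes.toBytes("row-1")));  // this call travels over HTTP/REST
System.out.println("Get result: " + result);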

In short, HBase ships with REST, Thrift, and Avro auxiliary gateway servers.

 

Interactive clients

The first group of clients are the interactive ones, those that send client API calls on demand, such as get, put, or delete, to the servers.

Based on the choice of the protocol you can use the supplied gateway servers to gain access from your applications.

In other words, interactive access is not tied to a particular interface: you can use the native Java, REST, Thrift, or Avro interfaces.

 

Batch clients

The opposite use case of interactive clients is batch access to the data. The difference is that these clients usually run asynchronously in the background, scanning large amounts of data to build, for example, search indexes, machine learning based mathematical models, or statistics needed for reporting.

Access is less user-driven, and therefore SLAs are geared more toward overall runtime, as opposed to per-request latencies. The majority of the batch frameworks reading from and writing to HBase are MapReduce based.

MapReduce

The Hadoop MapReduce framework is built to process petabytes of data in a reliable, deterministic, yet easy-to-program way.

There are a variety of ways to include HBase as a source and target for MapReduce jobs.


Native Java

The Java-based MapReduce API for HBase is discussed in Chapter 7, MapReduce Integration.
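Only as a preview, and only as a hedged sketch (the table name, the mapper class, and the counter are illustrative assumptions), a read-only row counting job over an HBase table is typically wired up roughly like this:

// mapper: receives one row (key plus Result) per map() call
static class RowCountMapper
    extends TableMapper<ImmutableBytesWritable, Result> {
  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) {
    context.getCounter("stats", "rows").increment(1);   // count every row seen
  }
}

// job setup: use the HBase table as the input of the MapReduce job
Job job = new Job(conf, "Count rows in testtable");
Scan scan = new Scan();                                  // full table scan
TableMapReduceUtil.initTableMapperJob("testtable", scan,
    RowCountMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setOutputFormatClass(NullOutputFormat.class);        // no job output needed
job.setNumReduceTasks(0);
job.waitForCompletion(true);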

Clojure

The HBase-Runner project offers support for HBase from the functional programming language Clojure. You can write MapReduce jobs in Clojure while accessing HBase tables.

Hive

The Apache Hive [75] project offers a data warehouse infrastructure atop Hadoop. It was initially developed at Facebook, but is now part of the open-source Hadoop ecosystem.

Hive offers an SQL-like query language, called HiveQL, which allows you to query the semi-structured data stored in Hadoop. The query is eventually turned into a MapReduce job, executed either locally or on a distributed MapReduce cluster. The data is parsed at job execution time, and Hive employs a storage handler [76] abstraction layer that allows data to reside not just in HDFS, but in other data sources as well. A storage handler transparently makes arbitrarily stored information available to the HiveQL-based user queries.

Since version 0.6.0, Hive also comes with a handler for HBase. [77] You can define Hive tables that are backed by HBase tables, mapping columns as required. The row key can be exposed as another column when needed. In other words, Hive uses the storage handler abstraction layer to process data sources other than HDFS, and HBase is supported as of Hive 0.6.0.

Pig

The Apache Pig [78] project provides a platform to analyze large amounts of data. It has its own high-level query language, called Pig Latin, which uses an imperative programming style to formulate the steps involved in transforming the input data to the final output. This is the opposite of Hive's declarative approach to emulate SQL.

The nature of Pig Latin, in comparison to HiveQL, appeals to developers with a procedural programming background, but it also lends itself to significant parallelization. Combined with the power of Hadoop and the MapReduce framework, you can process massive amounts of data in reasonable time frames.

Version 0.7.0 of Pig introduced the LoadFunc/StoreFunc classes and functionality, which allow loading and storing data from sources other than the usual HDFS. One of those sources is HBase, implemented in the HBaseStorage class.

Pig's support for HBase includes reading from and writing to existing tables. You can map table columns as Pig tuples, optionally including the row key as the first field for read operations. For writes, the first field is always used as the row key.

Cascading

Cascading is an alternative API to MapReduce. Under the covers it uses MapReduce during execution, but during development users don't have to think in MapReduce to create solutions for execution on Hadoop.

The model used is similar to a real-world pipe assembly, where data sources are taps and outputs are sinks. These are piped together to form the processing flow, where data passes through the pipe and is transformed in the process. Pipes can be connected to larger pipe assemblies to form more complex processing pipelines from existing pipes.

Data then streams through the pipeline and can be split, merged, grouped, or joined. The data is represented as tuples, forming a tuple stream through the assembly. This very visually oriented model makes building MapReduce jobs more like construction work, while abstracting away the complexity of the actual work involved.

Cascading (as of version 1.0.1) has support for reading and writing data to and from an HBase cluster. Detailed information and access to the source code can be found on the Cascading modules page.

In short, Cascading is similar to Pig and Hive, but it targets a different application scenario: workflows and pipes.

 

Shell

The HBase Shell is the command-line interface to your HBase cluster(s). You can use it to connect to local or remote servers and interact with them. The shell provides both client and administrative operations, mirroring the APIs discussed in the earlier chapters of this book.

 

Web-based UI

The HBase processes expose a web-based user interface (UI), which you can use to gain insight into the cluster's state, as well as the tables it hosts. The majority of the functionality is read-only, but there are a few selected operations you can trigger through the UI.
