Implementing an In-Memory Key-Value Cache for HBase

Source: Internet
Author: User

0x01 Background

The main motivations for implementing this cache are listed below. Because the work was not driven by an actual business requirement, they may not be entirely accurate, and the need itself may not exist in practice:
* Unstructured data is growing explosively
* Demands on processing speed keep increasing
* HBase is disk-oriented
* Memory capacity keeps growing
* Hot data can fit in memory

0x02 Design Solutions

A cache is usually implemented in one of two general places: on the client side or on the server side.
* Client-side implementation
- Modify the HBase client source code and add caching to key operations such as put and get.
- Design a caching service layer on the client side that wraps the HBase client and implements a distributed key-value cache.
* Server-side implementation
- Modify the HBase server source code and add caching to key operations such as put and get.
- Add a proxy service layer in front of the server that parses all client requests and applies caching to key operations such as put and get.

In both directions, the first option modifies the HBase source code directly. Modifying the source may give better performance, but it ties the cache closely to a specific HBase version: every HBase release may require reviewing and re-modifying the source. In addition, with the first client-side option the cache lives only in the local client and is not distributed, so the hit rate may be low and cached data may be inconsistent across clients.

The second client-side option wraps the client without modifying the source and uses a distributed cache. This improves the cache hit rate and removes the heavy dependence on a specific HBase version, though it may cost some performance.

The second server-side option, a caching proxy service, also removes the dependence on the HBase version and reduces coupling in the overall system. However, implementing a proxy service is not simple and requires a good understanding of HBase's communication mechanisms and protocols.

Based on this analysis, and considering feasibility and technical background, the second client-side option is the most suitable: its performance, maintainability, and extensibility are all reasonably good. We use Redis for the distributed cache and also implement a LocalCache inside the client to maximize the hit rate and reduce the latency caused by network communication. This rests on a locality assumption: a client is most likely to read the data that it put itself.

0x03 Implementation

The overall architecture of the system is shown in the following diagram.

Following the design chosen in 0x02, the client in the diagram wraps the native HBase client and adds Redis-based caching.

The implementation mainly consists of the Redis cluster, the wrapper around the Redis client, and LocalCache:
* Redis cluster deployment: a Redis cluster can be set up quickly to serve as the distributed cache of the HydraCache caching system.
* Redis cache: caches hot data in Redis to improve read speed.
* LocalCache: a cache local to the client, based on the locality assumption that a client is most likely to read the data it put itself. It implements the commonly used LRU and LFU eviction policies.

Building the HBase cluster

This system is still experimental and no truly distributed environment was available, so Docker is used to build a distributed HBase environment on a single machine. Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable container and run it on any popular Linux machine. Containers are sandboxed and isolated from one another. Compared with traditional virtual machines, Docker uses fewer resources, so an ordinary machine can run several containers with little strain. Building an HBase cluster from Docker containers on a single computer therefore simulates a real distributed cluster fairly realistically.
The Docker-based cluster has the following features:
* serf and dnsmasq are used for cluster node management and DNS resolution
* The Hadoop and HBase configuration of the cluster can be customized; after changing it, simply rebuild the image
* Cluster node containers can be logged into remotely over SSH

For detailed installation, deployment, and usage instructions, see the article "Use Docker stand-alone to build Hadoop fully distributed environment".

Building a Redis Cluster

Redis is a high-performance key-value database. Because its data lives in memory, it is often used as a cache to improve system response time. Unlike a pure in-memory cache such as memcached, Redis can also write its data to disk periodically.
Redis supports cluster deployment starting with version 3.0. The overall cluster architecture is as follows.

A Redis cluster has 16,384 built-in hash slots. When a key-value pair is stored in the cluster, Redis computes the CRC16 checksum of the key and takes the result modulo 16384, so every key maps to a hash slot numbered between 0 and 16383. Redis distributes the hash slots roughly evenly across the nodes.
The advantage of hash slots is that nodes can be added or removed easily.
* To add a node, move some hash slots from the existing nodes to the new node.
* To remove a node, move the hash slots it holds to the other nodes.
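
The key-to-slot mapping described above can be sketched in a few lines of Java. This is a minimal illustration, not the Redis source: it assumes the CRC16 variant is the XMODEM one (polynomial 0x1021, initial value 0), ignores hash tags, and the `nodeFor` helper with its even contiguous split is a hypothetical simplification of how slots are assigned to nodes.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of Redis Cluster key-to-slot mapping (hash tags omitted). */
class HashSlotSketch {
    static final int SLOT_COUNT = 16384;

    /** CRC16, XMODEM variant: polynomial 0x1021, initial value 0. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = ((crc & 0x8000) != 0) ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Slot number in the range 0..16383 for a key. */
    static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    /** Hypothetical even split: each node owns one contiguous range of slots. */
    static int nodeFor(int slot, int nodeCount) {
        return slot * nodeCount / SLOT_COUNT;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user_1_info_name", "user_2_info_name"}) {
            int slot = slotFor(key);
            System.out.println(key + " -> slot " + slot + " -> node " + nodeFor(slot, 3));
        }
    }
}
```

Because a key is bound to a slot rather than to a node, adding or removing a node only changes which node owns some slots; the slot of a key never changes.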

Detailed usage instructions can be found in my write-up of a quickly deployable distributed Redis setup, "Redis Cluster Construction".

Implementing the client

The client wraps the HBase client, adds a caching mechanism backed by Redis, and implements a LocalCache. With the Redis cache, multiple clients share cached data, which shortens response time. LocalCache speeds up a client's reads of data it has read recently and, in some scenarios, avoids network communication altogether, reducing response time further.
Class diagram of the Redis and HBase wrappers

As the figure shows, the client is roughly divided into three parts: HydraCacheClientImpl, Cache (LocalCache), and RedisCluster.
HydraCacheClientImpl provides the client's core external interface and calls Cache and RedisCluster to control the overall caching policy. There are four modes: no cache, Redis only, LocalCache only, and Redis together with LocalCache.
* No cache: internally this is just a plain call to the HBase client.
* Redis only: when reading a key, the client first checks whether Redis holds the data. If so, it is returned to the caller directly; if not, the data is read from HBase and added to the Redis cache.
* LocalCache only: at initialization the user chooses either LRU (least recently used) or LFU (least frequently used) as the eviction policy. When reading a key, the client first checks the local cache and returns the data directly if present; otherwise it reads from HBase and stores the result in the local cache.
* Redis and LocalCache: when reading a key, the client first checks the local cache and returns directly on a hit. Otherwise it checks Redis and returns directly on a hit. Otherwise it reads from HBase and stores the result in both the local cache and Redis.
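
The lookup order of the four modes can be illustrated with a small self-contained sketch. It is not the HydraCache code: plain `HashMap`s stand in for LocalCache and the Redis cluster, a `Function` stands in for the HBase read, and the names `TieredReadSketch`, `Mode`, and `hbaseReads` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of the four read modes; plain maps stand in for LocalCache and Redis. */
class TieredReadSketch {
    enum Mode { NO_CACHE, REDIS_ONLY, LOCAL_ONLY, REDIS_AND_LOCAL }

    final Map<String, String> localCache = new HashMap<>(); // stand-in for LocalCache
    final Map<String, String> redis = new HashMap<>();      // stand-in for the Redis cluster
    final Function<String, String> hbaseRead;                // stand-in for an HBase get
    final Mode mode;
    int hbaseReads = 0;                                      // counts reads that reach "HBase"

    TieredReadSketch(Mode mode, Function<String, String> hbaseRead) {
        this.mode = mode;
        this.hbaseRead = hbaseRead;
    }

    String get(String key) {
        boolean useLocal = mode == Mode.LOCAL_ONLY || mode == Mode.REDIS_AND_LOCAL;
        boolean useRedis = mode == Mode.REDIS_ONLY || mode == Mode.REDIS_AND_LOCAL;

        // 1. Local cache first, then Redis; a hit returns directly.
        if (useLocal && localCache.containsKey(key)) {
            return localCache.get(key);
        }
        if (useRedis && redis.containsKey(key)) {
            return redis.get(key);
        }

        // 2. Miss everywhere: read from the backing store and fill the enabled caches.
        hbaseReads++;
        String value = hbaseRead.apply(key);
        if (value != null) {
            if (useLocal) {
                localCache.put(key, value);
            }
            if (useRedis) {
                redis.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        TieredReadSketch client =
                new TieredReadSketch(Mode.REDIS_AND_LOCAL, key -> "value-of-" + key);
        client.get("t_row1_cf_col");
        client.get("t_row1_cf_col"); // second read is served from the local cache
        System.out.println("HBase reads: " + client.hbaseReads); // prints "HBase reads: 1"
    }
}
```

As in the description above, a Redis hit in the combined mode is returned directly and does not populate the local cache.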

The cache policy flowchart is as follows.

HydraCacheClientImpl populates the cache mainly in the get operation: on a cache miss it reads from HBase and caches the result. The core code (exception handling omitted) is as follows:

    public String get(String tableName, String rowKey, String family, String columnName, int expireTime) {
        // First check whether the value is already cached
        String key = tableName + "_" + rowKey + "_" + family + "_" + columnName;
        String valString = getDataFromCache(key);
        if (valString != null) {
            return valString;
        }

        // Cache miss: read the cell from HBase
        Connection connection = ConnectionFactory.createConnection(HydraCacheClientImpl.conf);
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get g = new Get(rowKey.getBytes());
        g.addColumn(family.getBytes(), columnName.getBytes());
        Result result = table.get(g);
        byte[] bytes = result.getValue(family.getBytes(), columnName.getBytes());
        // getValue returns null when the cell does not exist
        String valueStr = (bytes == null) ? null : new String(bytes);

        // Populate the cache
        if (valueStr != null) {
            this.setData2Cache(key, valueStr, expireTime);
        }
        return valueStr;
    }

    private String getDataFromCache(String key) {
        String val = null;
        // Local cache first
        if (this.localCacheOn && this.localCache != null) {
            val = this.localCache.get(key);
            if (val != null) {
                return val;
            }
        }
        // Then the Redis cluster
        if (this.cacheOn && this.redisCluster != null) {
            val = this.redisCluster.get(key);
            if (val != null) {
                return val;
            }
        }
        return val;
    }

    private void setData2Cache(String key, String val, int expire) {
        if (this.localCacheOn && this.localCache != null) {
            this.localCache.set(key, val, expire * 1000); // convert s to ms
        }
        if (this.cacheOn && this.redisCluster != null) {
            this.redisCluster.set(key, val, expire);
        }
    }

Another important issue in a cache system is the eviction policy: when expired entries are purged, and which entries are evicted once the cache reaches its size limit.

Redis offers the following policies for users to choose from:
* noeviction: never evicts data; when the memory limit is reached, commands that would use more memory return an error. This applies to most write commands (DEL and a few others are exceptions)
* allkeys-lru: evicts the least recently used keys from the whole keyspace (server.db[i].dict)
* volatile-lru: evicts the least recently used keys among those with an expiration set (server.db[i].expires)
* allkeys-random: evicts random keys from the whole keyspace (server.db[i].dict)
* volatile-random: evicts random keys among those with an expiration set (server.db[i].expires)
* volatile-ttl: evicts the keys closest to expiring among those with an expiration set (server.db[i].expires)

LocalCache implements two eviction policies, LRU and LFU, and users choose between them according to their business scenario.
What happens when a cached entry expires? There are two main approaches. The passive (lazy) approach deletes a key when it is accessed and found to have expired. The active approach periodically samples the keys that have an expiration set and deletes those that have expired. Redis implements both. LocalCache uses the passive approach: on a get request it checks whether the cached entry has expired and deletes it if so.
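
To make the combination of LRU eviction and passive expiration concrete, here is a minimal sketch built on `java.util.LinkedHashMap` in access order. It is an illustration under assumptions, not the LocalCache implementation: the class name, the injectable clock, and the `set(key, value, ttlMillis)` signature are hypothetical (the TTL is in milliseconds, matching the s to ms conversion in the core code).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Sketch of an LRU cache whose entries expire lazily, on access. */
class LruExpiringCacheSketch {
    private static final class CacheItem {
        final String value;
        final long expiresAtMillis;

        CacheItem(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final LinkedHashMap<String, CacheItem> items;
    private final LongSupplier clock; // injectable so expiry can be tested deterministically

    LruExpiringCacheSketch(int capacity) {
        this(capacity, System::currentTimeMillis);
    }

    LruExpiringCacheSketch(int capacity, LongSupplier clock) {
        this.clock = clock;
        // accessOrder = true: iteration order runs from least to most recently used
        this.items = new LinkedHashMap<String, CacheItem>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CacheItem> eldest) {
                return size() > capacity; // evict the least recently used entry when full
            }
        };
    }

    synchronized void set(String key, String value, long ttlMillis) {
        items.put(key, new CacheItem(value, clock.getAsLong() + ttlMillis));
    }

    synchronized String get(String key) {
        CacheItem item = items.get(key);
        if (item == null) {
            return null;
        }
        if (clock.getAsLong() >= item.expiresAtMillis) {
            items.remove(key); // passive expiration: delete only when accessed
            return null;
        }
        return item.value;
    }

    synchronized int size() {
        return items.size();
    }

    public static void main(String[] args) {
        LruExpiringCacheSketch cache = new LruExpiringCacheSketch(2);
        cache.set("a", "1", 60_000);
        cache.set("b", "2", 60_000);
        cache.get("a");              // "a" becomes the most recently used entry
        cache.set("c", "3", 60_000); // capacity exceeded, so "b" is evicted
        System.out.println(cache.get("b")); // prints "null"
    }
}
```

With passive expiration alone, an expired entry that is never read again stays in memory until LRU eviction pushes it out, which is one reason Redis pairs it with periodic active sampling.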

0x04 Test

Test environment

Laptop model: Lenovo Y470
Operating system: Ubuntu 14.04
Memory: 8 GB DDR3
Network card: 1000 Mbps
HBase distributed environment (Docker): three nodes, one master and two slaves
Redis cluster: three masters with one slave each

Test results

The HBase cluster is built with Docker and runs on the local machine with three nodes: one master and two slaves. The Redis cluster also runs on the local machine, with each Redis instance on a different port. Six instances are started: three masters and three slaves, one slave per master.

To compare the performance of the native HBase client and HydraCache, we developed HBench, a tool that generates datasets and load. The dataset written to HBase is a sequential series, and the read load is generated by randomly sampling from that dataset. The tests focus on the get operation and cover the four modes described in 0x03. Four dataset sizes were used: 300, 600, 900, and 1200 records. The test results are as follows.

In each group in the figure, the first column is the time to read the corresponding number of records with the native HBase client, the second is the time with Redis alone, the third is the time with LocalCache alone, and the fourth is the time with Redis and LocalCache together. Response time with any cache is clearly lower than without one. Among the cached modes, LocalCache alone is fastest, followed by Redis with LocalCache, and then Redis alone. This matches expectations: with caching, a key read for the second time is served from the cache, which is much faster than reading from HBase, and reading from LocalCache is faster still because it avoids network traffic. When Redis and LocalCache are used together, most reads are served from LocalCache, but populating the cache also writes to Redis, which adds time. The total response time is therefore higher than with LocalCache alone and slightly, though not markedly, better than with Redis alone.

The cache mode should be chosen by analyzing the actual business scenario. In some applications a client rarely reads the data it wrote itself but often reads data added by other clients; Redis alone serves this case well. In others, clients mostly read data they added earlier, and that scenario is well suited to LocalCache.

Related projects:
HydraCache: https://github.com/kdf5000/hydrahbasecache
Docker distributed HBase: https://github.com/kdf5000/hydra-hadoop
Distributed Redis cluster: https://github.com/KDF5000/redis-cluster
Performance test: https://github.com/KDF5000/HBench
