Using HBase on a Hadoop cluster to store, query, and optimize massive data more efficiently

Source: Internet
Author: User
Keywords: Hadoop, massive data, HBase

This article helps readers who run big data and cloud computing applications on a Hadoop cluster use HBase to store, query, and optimize massive amounts of data in a more efficient, intuitive, and convenient way.

In November 2006, Google published its "Bigtable" paper. In February 2007, Hadoop developers implemented the design and named the result HBase. HBase is a column-oriented data storage architecture built on Hadoop. It addresses the problem of storing very large data sets and serves as Hadoop's distributed database.

At the time of writing, HBase is quite mature, and the latest stable version is 0.94.x. HBase has been adopted by many large companies, such as Facebook, Twitter, Adobe, Cloudera, and IBM. HBase is not a traditional RDBMS. It is a column-oriented store with its own on-disk storage format, and its advantage is fast access by specific columns and by sequential ranges of row keys.
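To make the idea of access by column and by sequential key range concrete, here is a minimal in-memory sketch in Python. It is not HBase code and does not use the HBase API: the class `ToyColumnStore`, its methods, and the sample keys are all hypothetical names chosen for illustration. It only assumes that rows are kept sorted by key and that each cell is addressed by a "family:qualifier" column name.

```python
import bisect


class ToyColumnStore:
    """Hypothetical in-memory toy: rows sorted by key, cells named 'family:qualifier'."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        # Insert the key in sorted position the first time the row is seen.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key, column):
        """Yield (row_key, value) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            value = self._rows[key].get(column)
            if value is not None:
                yield key, value


store = ToyColumnStore()
store.put("user#0003", "info:name", "carol")
store.put("user#0001", "info:name", "alice")
store.put("user#0002", "info:name", "bob")
store.put("user#0002", "stats:logins", 7)

# Read one column over a contiguous key range; the stop key is exclusive.
names = list(store.scan("user#0001", "user#0003", "info:name"))
print(names)
```

Because the keys are sorted, a range scan only needs two binary searches and a slice, which is why the choice of row key matters so much for this kind of store.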

HBase has three important components: a client library, a master server (which can be configured with multiple standby masters, described later in this article), and multiple region servers. The master is responsible for assigning regions to the region servers, and the region servers are responsible for storing the actual data. HBase also relies on ZooKeeper, a reliable, highly available, and consistent distributed coordination service, to help these components complete their tasks. HBase cluster administrators can adjust capacity by adding and removing region server nodes while the system is running. HBase uses HFile as the basic format for storing data, and the underlying file system defaults to HDFS.

Figure 1. HBase basic architecture

Figure 1 shows how the different components, such as HDFS and ZooKeeper, work in coordination with HBase. The master server balances the load of regions across the region servers: it relieves busy region servers by transferring regions to less busy ones.
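The balancing behavior described here can be sketched as a toy Python function. This is only an illustration under a strong assumption: load is measured purely by the number of regions per server, and the function `rebalance` and the server and region names are hypothetical. The balancer in a real HBase master takes more factors into account.

```python
def rebalance(assignments):
    """Move regions from the busiest server to the least busy one until
    region counts differ by at most one. Returns the list of moves made."""
    moves = []
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return moves
        region = assignments[busiest].pop()
        assignments[idlest].append(region)
        moves.append((region, busiest, idlest))


# Hypothetical cluster state: region server name -> regions it currently hosts.
cluster = {
    "rs1": ["r1", "r2", "r3", "r4", "r5"],
    "rs2": ["r6"],
    "rs3": [],
}
moves = rebalance(cluster)
for region, source, target in moves:
    print(f"move {region}: {source} -> {target}")
```

Note that only the assignment changes. The master decides where regions live, but it does not read or write the region data itself.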

The HBase master is not responsible for actual data storage. It coordinates load balancing, maintains cluster state, and handles schema changes and other metadata operations, such as creating tables and column families, but it never serves data itself.

Region servers are responsible for loading and maintaining regions. This includes handling all read and write requests for the regions they manage and splitting a region when it grows beyond the configured size threshold.
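The split behavior can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the threshold `MAX_ROWS_PER_REGION` counts rows rather than bytes, the split point is simply the middle key, and `Region` and `ToyRegionServer` are hypothetical names, not HBase classes.

```python
from dataclasses import dataclass, field

MAX_ROWS_PER_REGION = 4  # hypothetical threshold; a real threshold would be a size in bytes


@dataclass
class Region:
    start_key: str  # inclusive; "" means unbounded
    end_key: str    # exclusive; "" means unbounded
    rows: dict = field(default_factory=dict)

    def split(self):
        """Split at the middle key into two regions covering the same key range."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left = Region(self.start_key, mid, {k: self.rows[k] for k in keys if k < mid})
        right = Region(mid, self.end_key, {k: self.rows[k] for k in keys if k >= mid})
        return left, right


class ToyRegionServer:
    def __init__(self):
        # Start with one region that covers the whole key space.
        self.regions = [Region("", "")]

    def _find(self, row_key):
        for i, r in enumerate(self.regions):
            if r.start_key <= row_key and (r.end_key == "" or row_key < r.end_key):
                return i
        raise KeyError(row_key)

    def put(self, row_key, value):
        i = self._find(row_key)
        region = self.regions[i]
        region.rows[row_key] = value
        # Replace the region with its two halves once it exceeds the threshold.
        if len(region.rows) > MAX_ROWS_PER_REGION:
            self.regions[i:i + 1] = region.split()


server = ToyRegionServer()
for n in range(6):
    server.put(f"row{n:02d}", n)

boundaries = [(r.start_key, r.end_key) for r in server.regions]
print(boundaries)
```

After the fifth write the single region exceeds the threshold and is replaced by two regions that together cover the same key range, so later writes are routed to whichever half contains their key.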

A client first communicates with ZooKeeper to find the region server that hosts the region it needs to read or write. It then communicates directly with that region server, which processes all related requests.
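The following Python sketch illustrates this two-step flow: look up the location, then talk to the region server directly. It is a toy model, not the HBase client API. `ToyRegionLocator` stands in for the location lookup, `ToyClient` and the server names are hypothetical, and plain dictionaries play the role of region servers. A real client's lookup involves more steps than this sketch models.

```python
import bisect


class ToyRegionLocator:
    """Hypothetical stand-in for the location lookup a client performs."""

    def __init__(self, region_map):
        # region_map: list of (region start key, server name); the first start key is "".
        self._entries = sorted(region_map)
        self._starts = [start for start, _ in self._entries]

    def locate(self, row_key):
        # The hosting region is the one with the greatest start key <= row_key.
        i = bisect.bisect_right(self._starts, row_key) - 1
        return self._entries[i]


class ToyClient:
    def __init__(self, locator, servers):
        self._locator = locator
        self._servers = servers  # server name -> dict acting as that server's data

    def put(self, row_key, value):
        _, server = self._locator.locate(row_key)
        self._servers[server][row_key] = value  # direct call to the region server
        return server

    def get(self, row_key):
        _, server = self._locator.locate(row_key)
        return self._servers[server].get(row_key)


servers = {"rs1": {}, "rs2": {}, "rs3": {}}
locator = ToyRegionLocator([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
client = ToyClient(locator, servers)

placed = [client.put(k, k.upper()) for k in ("apple", "grape", "melon", "zebra")]
print(placed)
print(client.get("melon"))
```

The master does not appear anywhere in the read or write path of this sketch, which matches the description above: once the client knows the location, data requests go straight to the region server.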

HBase in the IBM BigInsights architecture

InfoSphere BigInsights, IBM's big data product, is a platform for managing and analyzing large data sets. Its underlying architecture uses Hadoop and HBase to store and query both structured and unstructured data.

HBase in the BigInsights cluster software hierarchy

BigInsights integrates many existing Hadoop open source components, such as HDFS, MapReduce, HBase, and ZooKeeper. It incorporates them into the BigInsights software system so that they work together with the other BigInsights components on the same platform. HBase is used as the BigInsights storage database, and ZooKeeper is used as the BigInsights service coordination component. To use HBase, you also need to install Hadoop and ZooKeeper, because HBase uses Hadoop as its file system and ZooKeeper for service coordination.

When BigInsights is deployed to a cluster, the software hierarchy is structured as shown in Figure 2.

Figure 2. BigInsights Hadoop open source component list
