An In-Depth Look at Using Apache HBase to Process Massive Amounts of Data


Over the past few years, we have seen a real explosion in the ways data is stored and queried. Databases known collectively as NoSQL stand at the forefront of this shift and are emerging as a new option for persistent storage. The popularity of NoSQL is largely driven by large companies such as Google, Amazon, Twitter, and Facebook, which gather enormous amounts of data to store, query, and analyze. And more and more companies are collecting huge amounts of data and need to be able to use all of it to enrich their business. Social networks, for example, need to analyze users' social relationships and recommend new connections, and almost every major website has a recommendation engine suggesting items you might want to buy. As data accumulates, these companies need an easy way to scale the entire system rather than rebuild it.

Since the 1970s, relational databases (RDBMS) have all but ruled data management. But as a business expands and the amount of data it stores and processes grows, a relational database becomes increasingly difficult to scale. At first you might go from a single machine to a master-slave setup, then add a caching layer above the database to absorb read and write hotspots. When query performance degrades, indexes are often the first thing dropped, followed quickly by denormalization to avoid joins, because those operations are so expensive. Later you might evaluate your remaining most expensive queries and reshape them into simple primary-key lookups, or distribute the data of a large table across multiple database shards. In retrospect, you will find that many of the key advantages of relational databases have been discarded along the way: referential integrity, ACID transactions, indexes, and so on. Of course, the scenario described here applies when your business is so successful and growing so quickly that you must handle ever more data at a steep growth rate. In other words, you are the next Twitter.

Are you? Maybe you are working on an environmental monitoring project that needs to deploy a worldwide sensor network, where all the sensors produce huge amounts of data. Or maybe you are studying DNA sequences. If you understand, or believe, that you face a massive data storage requirement, with billions of rows and millions of columns, you should consider HBase. This new database design is intended to scale horizontally, from the ground up, across a cluster of commodity servers, without resorting to vertical scaling by buying ever more advanced machines (and ultimately, there may be no better machine to buy).

Getting Started with HBase

HBase is a database that provides real-time, random reads and writes, and can store billions of rows and millions of columns. It is designed to run on top of a cluster of commodity servers and to scale automatically as new servers are added, while maintaining the same performance. In addition, it is highly fault tolerant, because data is split across the server cluster and stored in a redundant file system such as HDFS. When some servers fail, your data remains safe, and it is automatically rebalanced onto the remaining active servers until replacement servers come online. HBase is a strongly consistent data store: changes you make become visible immediately to all other clients.

HBase was modeled after Google's BigTable, described in a 2006 paper as "a sparse, distributed, persistent multidimensional sorted map." So if you are used to relational databases, HBase will seem strange at first. There is the concept of a table, but it differs from a relational database table. Typical relational concepts such as joins, indexes, and ACID transactions are not supported. In exchange for giving up these features, you get scalability and fault tolerance. In essence, HBase is a key-value store with automatic data versioning.

You can put and get values by key, and you can also scan ranges of rows in an HBase table sequentially. When scanning, rows are always returned in the order of their row key. Each row consists of a unique, sorted row key (think of it as the primary key of a relational database) and any number of columns, each of which belongs to a column family and holds one or more versioned values. Values are simply byte arrays, and it is up to the application to convert them to and from whatever form it needs to display or store. HBase makes no attempt to hide this column-oriented data model from the developer, and its Java API is noticeably lower-level than other interfaces you may be used to; JPA, and even JDBC, are far more abstract than the HBase API. Operations in HBase work essentially on raw bytes.

Let's learn how to use HBase from the command line. HBase ships with a JRuby-based shell in which you can define and manage tables, put and get data, scan tables, and perform related maintenance. Once you have entered the shell, typing help gives you complete help information. You can also use help '<command>' for help on a specific command or command group; for example, help 'create' provides help on creating a new table. In production, HBase should be deployed on a server cluster, but it can also be downloaded and run in standalone mode in just a few minutes. The first thing to do is to start the HBase shell. The following session demonstrates creating a new blog table, listing the existing tables in HBase, adding a blog entry, retrieving that entry, and scanning the blog table.

$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.96.0-hadoop2, r1531434, Fri Oct 15:28:08 PDT 2013

hbase(main):001:0> create 'blog', 'info', 'content'
0 row(s) in 6.0670 seconds

=> Hbase::Table - blog

hbase(main):002:0> list
TABLE
blog
fakenames
my-table
3 row(s) in 0.0300 seconds

=> ["blog", "fakenames", "my-table"]

hbase(main):003:0> put 'blog', '20130320162535', 'info:title', 'Why use HBase?'
0 row(s) in 0.0650 seconds

hbase(main):004:0> put 'blog', '20130320162535', 'info:author', 'Jane Doe'
0 row(s) in 0.0230 seconds

hbase(main):005:0> put 'blog', '20130320162535', 'info:category', 'persistence'
0 row(s) in 0.0230 seconds

hbase(main):006:0> put 'blog', '20130320162535', 'content:', 'HBase is a column-oriented ...'
0 row(s) in 0.0220 seconds

hbase(main):007:0> get 'blog', '20130320162535'
COLUMN           CELL
 content:        timestamp=1386556660599, value=HBase is a column-oriented ...
 info:author     timestamp=1386556649116, value=Jane Doe
 info:category   timestamp=1386556655032, value=persistence
 info:title      timestamp=1386556643256, value=Why use HBase?
4 row(s) in 0.0380 seconds

hbase(main):008:0> scan 'blog', {STARTROW => '20130300', STOPROW => '20130400'}
ROW              COLUMN+CELL
 20130320162535  column=content:, timestamp=1386556660599, value=HBase is a column-oriented ...
 20130320162535  column=info:author, timestamp=1386556649116, value=Jane Doe
 20130320162535  column=info:category, timestamp=1386556655032, value=persistence
 20130320162535  column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0390 seconds



In the commands above, we first create a new blog table with column families info and content. After listing the tables and seeing our new blog table, we add some data. The put command takes the table name, the unique row key, and a column key made up of a column family name and a qualifier; for example, info is a column family name, while title and author are qualifiers. Thus info:title refers to the title column in the info column family, and info:title also serves as the column key. Next, we use the get command to retrieve a single row, and finally we scan the blog table over a bounded range of row keys. We specify the start row 20130300 (inclusive) and the stop row 20130400 (exclusive), and as you would expect, we get back all rows in that range. Because the row key is the publication time, that range in fact covers everything published in March 2013.

An important feature of HBase is that you define column families up front and can then add any number of columns to a family, distinguished by qualifier. HBase optimizes how columns are stored on disk, and columns that do not exist occupy no space, which makes storage more efficient. A relational database, by contrast, must store a NULL for every missing value. A row is made up of nothing but its columns, so if a row has no columns, it effectively does not exist. Continuing the example above, the following deletes a specified column from a row.

hbase(main):009:0> delete 'blog', '20130320162535', 'info:category'
0 row(s) in 0.0490 seconds

hbase(main):010:0> get 'blog', '20130320162535'
COLUMN           CELL
 content:        timestamp=1386556660599, value=HBase is a column-oriented ...
 info:author     timestamp=1386556649116, value=Jane Doe
 info:title      timestamp=1386556643256, value=Why use HBase?
3 row(s) in 0.0260 seconds



As shown above, you can delete a specified column, such as info:category, from a row. You can also delete all the columns in a row with the deleteall command, which removes the row entirely. To update data, you simply use the put command again. HBase keeps multiple versions of each cell, configured per column family (historically three by default), so if you put a new value into info:title, HBase retains both the old and the new value.
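
To see versioning in action, here is a minimal shell sketch. It is hedged in one respect: the default number of versions kept depends on the HBase release, so the sketch raises the info family's version count explicitly before updating.

alter 'blog', {NAME => 'info', VERSIONS => 3}
put 'blog', '20130320162535', 'info:title', 'Why use HBase? (updated)'
get 'blog', '20130320162535', {COLUMN => 'info:title', VERSIONS => 3}

The final get returns both the updated title and the original one, each with its own timestamp.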

The commands in the examples above show how to add, retrieve, update, and delete data in HBase. There are only two ways to query data: use the get command to retrieve a single row, or the scan command to retrieve multiple rows. When querying data in HBase, you should take care to request only the information you need. Because HBase reads data from each column family you request, if you need only one column, you can specify that only that part be fetched. In the following example, we query only the blog title column, restricting the row key range to March through April 2013.

hbase(main):011:0> scan 'blog', {STARTROW => '20130300', STOPROW => '20130500', COLUMNS => 'info:title'}
ROW              COLUMN+CELL
 20130320162535  column=info:title, timestamp=1386556643256, value=Why use HBase?
1 row(s) in 0.0290 seconds



You can optimize HBase data access by restricting the row key range, limiting the column names requested, and limiting the number of versions returned. Of course, the examples above were all done through the shell; you can do the same things, and more, through the HBase API.

HBase is a distributed database, designed to run on clusters of thousands of servers or more. As a result, installing HBase is naturally more involved than installing a standalone relational database on a single server. And all the classic problems of distributed computing apply to HBase as well, such as coordination and management of remote processes, locking, data distribution, network latency, and communication between servers. Fortunately, HBase builds on mature technologies such as Apache Hadoop and Apache ZooKeeper that solve many of these problems. Its main architectural components are described below.

HBase has a single master node and multiple region servers. (HBase can run with multiple master nodes in a cluster, but only one master is active at a time.) HBase tables are split into regions, each of which holds a certain range of the table's row keys, and the master assigns regions to region servers.

HBase is a column-oriented store, in which data is stored by column rather than by row. This makes some data access patterns more efficient than in traditional row-oriented storage systems. For example, in HBase, if a column family contains no data for a row, nothing at all is stored; a relational database would store null values. Also, when querying data in HBase, you should specify only the columns you want: a single row may contain millions of columns, so you need to make sure you request only the data you actually need.

HBase uses ZooKeeper (a distributed coordination service) to manage the assignment of regions to region servers and, when a region server crashes, to recover by reassigning the regions it served to other available region servers.

Regions hold data in an in-memory store (the MemStore) and in persistent store files (HFiles), and all regions on a region server share a write-ahead log (WAL), which records new data before it has been persisted and is used to recover data if the region server crashes. Each region holds a contiguous range of row keys, and when a region grows beyond a configured size, HBase splits it into two child regions; this is how HBase scales out.

As a table grows, more and more regions are created and split across the cluster. When a client queries a given row key or range of row keys, HBase determines which regions hold those keys, and the client then communicates directly with the region servers hosting them. This design minimizes the number of seeks needed to find any given row and optimizes disk transfer when returning data. A relational database, by contrast, may need to perform a large disk scan before returning data, even with an index.

HBase stores its data in HDFS, the Hadoop Distributed File System: a distributed, highly fault-tolerant, scalable file system that guards against data loss by splitting data into blocks and spreading them across the cluster. Strictly speaking, any file system that implements the Hadoop FileSystem API can serve as the persistence layer, but HBase is usually deployed on a Hadoop cluster running HDFS. In fact, when you first download and run HBase on a single computer without modifying the configuration, it uses the local file system.

Clients interact with HBase through one of several available APIs, including a native Java API, a REST-based interface, and RPC interfaces (Apache Thrift, Apache Avro). There are also domain-specific language (DSL) interfaces for Groovy, Jython, and Scala. The examples that follow use the Java API. The following code creates a people table:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people"));
tableDescriptor.addFamily(new HColumnDescriptor("personal"));
tableDescriptor.addFamily(new HColumnDescriptor("contactinfo"));
tableDescriptor.addFamily(new HColumnDescriptor("creditcard"));
admin.createTable(tableDescriptor);

The people table definition contains three column families: personal, contactinfo, and creditcard. You build a table by creating an HTableDescriptor and adding one or more column families with HColumnDescriptor, then calling the createTable method. With the table in place, it is time to add some data. The following code shows how to insert data for John Doe using the Put class, specifying a name and an email address (for the sake of brevity, the usual error handling is omitted):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
Put put = new Put(Bytes.toBytes("doe-john-m-12345"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("givenName"), Bytes.toBytes("John"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("mi"), Bytes.toBytes("M"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("surname"), Bytes.toBytes("Doe"));
put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("john.m.doe@gmail.com"));
table.put(put);
table.flushCommits();
table.close();



In the code above, the Put class is given the unique row key as its constructor argument. We then add values, each of which must specify a column family, a qualifier, and the value itself as a byte array. You may notice the frequent use of the Bytes class, a utility class in the HBase API that provides methods to convert between primitive types, strings, and byte arrays. (A static import of its toBytes() method would save a lot of typing.) We then put the data into the table, flush the commits so that locally buffered changes take effect, and finally close the table. Updating data uses the same code shown earlier. Unlike in a relational database, where an update rewrites the whole row even when only one column changed, in HBase you need only specify the changed columns in the Put, and only those columns are written. There is also a check-and-put operation, essentially an atomic compare-and-set, which applies the update only if the current stored value matches an expected value.
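
As a minimal sketch of that check-and-put operation, reusing the table and row from the example above (the new address here is hypothetical), the email is replaced only if it still holds the expected old value:

Put update = new Put(Bytes.toBytes("doe-john-m-12345"));
update.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"),
        Bytes.toBytes("john.doe@example.org"));  // hypothetical new address
boolean applied = table.checkAndPut(
        Bytes.toBytes("doe-john-m-12345"),        // row to check
        Bytes.toBytes("contactinfo"),             // column family to check
        Bytes.toBytes("email"),                   // qualifier to check
        Bytes.toBytes("john.m.doe@gmail.com"),    // expected current value
        update);                                  // applied only if the check passes

checkAndPut returns false if the stored value no longer matches, so the caller can re-read and retry.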

Use the Get class to retrieve the data we just created, as shown in the following code. (From here on, boilerplate such as building the configuration, instantiating the HTable, flushing commits, and closing the table is omitted.)

Get get = new Get(Bytes.toBytes("doe-john-m-12345"));
get.addFamily(Bytes.toBytes("personal"));
get.setMaxVersions(3);
Result result = table.get(get);

In the code above, the Get class is instantiated with the row key to look up. We then tell HBase, via the addFamily method, that we only need data from the personal column family, which reduces the amount of data HBase must read from disk. We also specify that up to three versions of each column should be returned, which lets us list historical values of each column. Finally, a Result instance is returned containing all the matching column values.
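
As a brief sketch of what you can do with the returned Result (assuming the row written earlier), getValue() returns the newest version of a named column, or null if the column is absent:

byte[] given = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
byte[] surname = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("surname"));
if (given != null && surname != null) {
    System.out.println(Bytes.toString(given) + " " + Bytes.toString(surname));  // John Doe
}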

In many cases you need to retrieve more than one row, and for that HBase provides scanning, which we used earlier with the scan command in the HBase shell; the Java equivalent is the Scan class. The Scan class supports various options, such as the row key range to query, the column families and columns to include, and the maximum number of versions to return. You can also add filters, which apply custom logic to restrict which rows and columns are returned. A common use for filters is paging; for example, we might want to fetch all the people with the surname Smith, one page of 25 at a time. The following code shows basic usage of Scan.

Scan scan = new Scan(Bytes.toBytes("smith-"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
scan.setFilter(new PageFilter(25));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    // ...
}



In the code above, we create a Scan that starts from row keys beginning with smith-, and use addColumn to restrict the columns returned (thereby reducing the amount HBase must read from disk) to personal:givenName and contactinfo:email. The PageFilter on the scan limits the number of rows scanned to 25. (An alternative to the page filter would be to specify a stop row key in the Scan constructor.) We then obtain a ResultScanner, iterate over the results, and carry out whatever business logic is needed. In HBase, a scan over sorted row keys is the only way to retrieve multiple rows, so designing the row key well is very important; more on that later.

You can also delete data in HBase using the Delete class, the counterpart of the Put class. It can delete all the columns in a row (removing the row itself), delete a column family, delete a single column, or some combination of these.
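
A minimal sketch of the Delete class, again reusing the people table from the earlier examples:

// Delete one column (all versions of contactinfo:email) from the row.
Delete deleteColumns = new Delete(Bytes.toBytes("doe-john-m-12345"));
deleteColumns.deleteColumns(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
table.delete(deleteColumns);

// A Delete with no family or column specified removes the entire row.
Delete deleteRow = new Delete(Bytes.toBytes("doe-john-m-12345"));
table.delete(deleteRow);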

Working with Connections

The examples above have not paid much attention to how connections and remote procedure calls (RPCs) are handled. HBase provides the HConnection class, which offers shared connections in a manner similar to a connection pool; for example, its getTable() method returns an HTable reference. There is likewise an HConnectionManager class that supplies HConnection instances. Just as avoiding chatty network exchanges matters in web applications, managing the number of RPCs and the amount of data returned is important when writing HBase applications, and worth considering from the start.
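
A minimal sketch of this pattern, assuming the 0.96-era connection API used throughout this article:

Configuration conf = HBaseConfiguration.create();
HConnection connection = HConnectionManager.createConnection(conf);
HTableInterface people = connection.getTable("people");
try {
    // table operations here share the connection's pooled resources
} finally {
    people.close();      // releases the table, not the shared connection
    connection.close();  // close the connection when the application shuts down
}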

HBase differs from relational databases in that it has no rich query language like SQL. Instead, it gives up that capability, along with relationships, joins, and the rest, to focus on high performance, scalability, and recovery from failure. So when using HBase, you need to design your table structures and row keys to match the application's data access patterns, in terms of rows and column families. This is completely different from a relational database, where you start with a normalized schema and separate tables, and use SQL joins to assemble the data you need. Designing a table in HBase means thinking ahead, specifically, about how the application will use it and how the data will be accessed. Using HBase rather than a relational database puts you much closer to the underlying implementation details and storage mechanisms. In short, for applications that must store enormous amounts of data with high scalability, high performance, and tolerance for server failure, the potential benefits outweigh the costs.

As mentioned in the earlier discussion of the Java API, the row key is critical when scanning data in HBase, since it is the primary means of restricting which rows a scan reads. There are no rich SQL-style queries as in relational databases. The general practice is to build a scan with start and stop row keys, optionally adding filters to further restrict the rows and columns returned. To allow flexible scanning, the row key should contain the parts of the data you will query by. In the people and blog examples we have been using, the row keys were designed for the most common access patterns. For the blog table, the original row key was the publication timestamp. That allows scanning posts in ascending chronological order, which is probably not the most common way to read a blog. A better row key uses a reverse-order timestamp, computed as (Long.MAX_VALUE - timestamp), so the most recently published posts are returned first. This also makes it easy to scan specific time ranges, for example all the posts from the past week or month, a common requirement in web applications.
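
A minimal sketch of building such a key; the zero-padding is an assumption added here so that lexicographic byte order matches numeric order:

long publishTime = System.currentTimeMillis();
// Pad to a fixed width: HBase compares row keys as bytes, lexicographically.
String rowKey = String.format("%019d", Long.MAX_VALUE - publishTime);
Put post = new Put(Bytes.toBytes(rowKey));
post.add(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Why use HBase?"));
table.put(post);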

For the people table, we used a composite row key that distinguishes people of the same name: surname, given name, middle initial, and a (unique) person identifier, separated by hyphens. For example, Brian M. Smith with identifier 12345 gets the row key smith-brian-m-12345. Choosing the right start and stop row keys for a scan of the people table lets you retrieve everyone with a given surname, everyone whose surname and given name begin with given letters, or everyone with the same surname, given name, and middle initial. For example, to find every person whose surname is Smith and whose given name begins with B, use smith-b as the start row key and smith-c as the stop row key (start keys are inclusive and stop keys exclusive, so the scan is guaranteed to include every Smith whose given name begins with B). As you can see, HBase supports the notion of a partial row key: you do not need to specify an exact key, which gives you considerable flexibility in constructing scan ranges. Combining partial-key scans with filters lets you retrieve exactly the data the application needs, optimizing data access and supporting the application's access patterns.
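
A sketch of that partial-key scan in the Java API:

// All Smiths whose given name begins with B (start inclusive, stop exclusive).
Scan smithScan = new Scan(Bytes.toBytes("smith-b"), Bytes.toBytes("smith-c"));
smithScan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
ResultScanner smiths = table.getScanner(smithScan);
try {
    for (Result r : smiths) {
        // process each matching person
    }
} finally {
    smiths.close();
}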

So far the examples have involved a single table holding one kind of information, with no related data. HBase has no foreign-key relationships like a relational database, but because it supports millions of columns in a single row, one approach to designing HBase tables is to keep all related information in the same row: the so-called wide table design. It is called a wide table because you keep all the related data in a single row, with potentially as many columns as there are data items. In the blog example, you might want to store the comments on each post. Under the wide table design, you can add a column family named comments, using the comment time as the column qualifier. Comment columns would then look like comments:20130704142510 and comments:20130707163045. When HBase returns data, columns come back sorted, just like row keys. So to display a blog post along with all of its comments, you can request the content and info families plus the comments family, fetching everything from a single row. You can also add a filter to return the comments a page at a time.
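
A minimal sketch of adding one comment under this design, assuming a comments column family has been added to the blog table:

Put comment = new Put(Bytes.toBytes("20130320162535"));  // the post's row key
comment.add(Bytes.toBytes("comments"), Bytes.toBytes("20130704142510"),  // comment time as qualifier
        Bytes.toBytes("Great article!"));
table.put(comment);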

The columns of the people table could likewise be reworked to store contact details, such as separate addresses, phone numbers, and email addresses, as individual columns, so that all of one person's information lives in a single row. That design adapts well to data whose columns grow over time, such as blog comments and personal contact information. But if you are designing something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you would instead spread a user's messages, transactions, or sensor readings across a great many rows (a "tall" table design) and design the row key for efficient scanning and paging. An inbox row key might take the form <userId>-<reversed message timestamp>, making it easy to scan and page through one user's inbox; a financial transaction row key might similarly be <userId>-<reversed transaction timestamp>. This is called a "tall" table design because entries of the same kind (readings from the same sensor, transactions on the same account) are spread over many rows, and it is worth considering for constantly growing collections, such as data gathered from a sprawling sensor network.
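
A hedged sketch of the tall design for an inbox; the table, column family, and user id here are invented for illustration:

String userId = "user42";  // hypothetical user id
String inboxKey = userId + "-"
        + String.format("%019d", Long.MAX_VALUE - System.currentTimeMillis());
Put message = new Put(Bytes.toBytes(inboxKey));
message.add(Bytes.toBytes("msg"), Bytes.toBytes("subject"), Bytes.toBytes("Hello"));
inboxTable.put(message);  // one row per message

// Paging one user's inbox is then a prefix scan plus a PageFilter.
// '.' is the next ASCII character after '-', so this stop key bounds
// exactly the keys beginning with userId + "-".
Scan inbox = new Scan(Bytes.toBytes(userId + "-"), Bytes.toBytes(userId + "."));
inbox.setFilter(new PageFilter(25));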

Designing the HBase row key and table structure is a vital step in using HBase, and will remain so given HBase's architecture. There are other ways to add alternative data access paths to HBase. For example, you can use Apache Lucene to implement full-text search over HBase data, either inside HBase or externally (see HBASE-3529). You can also build (and maintain) secondary indexes so that tables can be addressed by something other than the row key structure. For example, the row key in the people table is a composite of the name and a unique identifier, but if we want cheap access by a person's birthday, phone number, or email address, we could add secondary indexes to make those interactions possible. Be warned, however, that maintaining secondary indexes is not a lightweight operation: every time you write to the main table (such as the people table), all of the secondary indexes must be updated too! (Yes, this is one of the things relational databases do very well, but remember that HBase is designed to hold far more data than a traditional relational database.)
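
A heavily hedged sketch of hand-maintained secondary indexing; the index table and its family and qualifier names are invented for illustration:

// Main write to the people table.
Put person = new Put(Bytes.toBytes("doe-john-m-12345"));
person.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"),
        Bytes.toBytes("john.m.doe@gmail.com"));
peopleTable.put(person);

// Matching write to a hypothetical index table keyed by email address,
// whose single value points back at the main table's row key.
Put index = new Put(Bytes.toBytes("john.m.doe@gmail.com"));
index.add(Bytes.toBytes("ref"), Bytes.toBytes("rowkey"), Bytes.toBytes("doe-john-m-12345"));
emailIndexTable.put(index);  // must be kept in sync with every main-table write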


We have now covered structural design in HBase (without relationships or SQL). Although HBase gives up some features of traditional relational databases, such as foreign keys, referential integrity, multi-row transactions, and multi-level indexes, applications get what HBase does inherently well, such as scanning; like many complex things, it is a trade-off. With HBase, we give up rich schema design and query flexibility, but gain the capacity to hold enormous amounts of data and to grow simply by adding servers to the cluster.
