Cassandra and HBase are the best-known representatives of the many open source projects based on Google's BigTable design, each implementing highly scalable, flexible, distributed, wide-column data storage in its own way.
In this new era of big data, BigTable technology is well worth our attention: it was invented by Google, a company with deep expertise in managing massive amounts of data. Anyone familiar with BigTable will recognize its influence on the two Apache database projects Cassandra and HBase.
Google first described BigTable in a 2006 paper. Interestingly, the paper did not present BigTable as a database but as a "sparse, distributed, multidimensional" map for storing byte-level data on commodity hardware. Rows are indexed by a unique row key, and BigTable distributes data across the cluster by row key. Columns can be defined quickly within a row, which makes BigTable well suited to schemaless environments.
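The "sparse, distributed, multidimensional" map can be illustrated with a toy in-memory model (my own hypothetical sketch, not real BigTable code): rows are kept sorted by row key, each row stores only the columns actually written, and each cell can hold multiple timestamped versions.

```python
import bisect

class SparseTable:
    """Toy BigTable-style map: (row_key, column, timestamp) -> value.
    Rows are kept sorted by row key; columns are sparse, so each row
    stores only the columns that were actually written to it."""

    def __init__(self):
        self._rows = {}          # row_key -> {column: {timestamp: value}}
        self._sorted_keys = []   # row keys maintained in sorted order

    def put(self, row_key, column, timestamp, value):
        if row_key not in self._rows:
            bisect.insort(self._sorted_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, end_key):
        """Yield (row_key, columns) for row keys in [start_key, end_key)."""
        lo = bisect.bisect_left(self._sorted_keys, start_key)
        hi = bisect.bisect_left(self._sorted_keys, end_key)
        for key in self._sorted_keys[lo:hi]:
            yield key, self._rows[key]
```

Note that two rows need not share any columns, which is the "sparse" property the paper emphasizes.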
Cassandra and HBase borrow heavily from the original BigTable definition. Cassandra in fact combines ideas from BigTable and Amazon's Dynamo, while HBase positions itself as an "open source BigTable implementation." As a result, the two projects share many characteristics, but there are also significant differences.
Born for big data
Cassandra and HBase are both NoSQL databases. Broadly, this means users cannot manipulate them with SQL. Cassandra, however, provides CQL (Cassandra Query Language), whose syntax is a clear imitation of SQL.
Both are designed to manage very large datasets. The HBase documentation claims that an HBase table can hold hundreds of millions or even billions of rows, and advises users with less data to stick with a relational database.
Both are distributed databases, not just in how data is stored but in how it is accessed: clients can connect to any node in the cluster and reach any data.
Both claim near-linear scalability. Want to manage twice as much data? Simply double the number of nodes in the cluster.
Both use replication to prevent data loss when cluster nodes fail. A row written to the database is primarily owned by a single cluster node (the row-to-node mapping depends on the partitioning scheme in use), and the data is mirrored to other cluster members called replica nodes (the number of replicas is a user-configurable replication factor). If the primary node fails, the row can still be read from a replica.
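How a row key maps to a primary node plus replicas can be sketched with a simplified hash-based placement (an illustration only; the real partitioners in both systems, such as consistent hashing with virtual nodes, are considerably more elaborate):

```python
import hashlib

def replica_nodes(row_key, nodes, replication_factor):
    """Toy replica placement: hash the row key onto a sorted node list to
    pick a primary, then take the next (replication_factor - 1) nodes on
    the ring as replicas. Deterministic, so every client agrees on the
    mapping without coordination."""
    ring = sorted(nodes)
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    primary = h % len(ring)
    return [ring[(primary + i) % len(ring)] for i in range(replication_factor)]
```

With a replication factor of 3, each row lives on three distinct nodes, so losing the primary leaves two readable copies.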
Both are called column-family databases. Because that terminology sounds like the vocabulary of relational databases, users coming from relational systems need to adjust their mental model, and confusion is common. Most confusing of all: data does superficially appear to be arranged in rows, and a table's primary key is the row key. Unlike in a relational database, however, no two rows in a column-family database are required to have the same columns. As mentioned above, once a table is created users can add columns to a row on the fly, and a row can hold a very large number of columns; the exact upper limit is hard to pin down, but in practice users are unlikely ever to reach it, even when adding columns heavily.
Beyond these features inherited from the BigTable definition, Cassandra and HBase share some other similarities.
First, both use a similar write path: a write operation is first recorded in a log file to guarantee durability, so even if the write subsequently fails, the logged operation can be replayed. The data is then written to an in-memory cache. Finally, the data reaches disk in a large batched write (in effect, flushing a copy of the memory cache to disk). The memory and disk structures used by Cassandra and HBase are, in essence, forms of the log-structured merge tree. Cassandra's on-disk component is the SSTable; HBase's is the HFile.
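The three-step write path can be sketched as a toy log-structured store (my own simplified model; real memtables and SSTables are sorted, compacted, and persisted to actual files):

```python
class LSMStore:
    """Toy LSM-style write path: append to a commit log, update a memtable,
    and flush the memtable to an immutable segment (an SSTable/HFile
    analogue) once it grows past a threshold."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # durable log (a real system writes to disk)
        self.memtable = {}        # in-memory cache of recent writes
        self.sstables = []        # flushed, immutable on-disk segments
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log first, for durability
        self.memtable[key] = value             # 2. then update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                      # 3. flush in one big batch

    def _flush(self):
        self.sstables.append(dict(self.memtable))  # one sequential bulk write
        self.memtable.clear()

    def read(self, key):
        if key in self.memtable:               # freshest data first
            return self.memtable[key]
        for segment in reversed(self.sstables):  # then newest segments first
            if key in segment:
                return segment[key]
        return None
```

The design choice this illustrates is that every write becomes a cheap append or in-memory update, with expensive random disk I/O replaced by occasional sequential flushes.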
Both provide a JRuby-based command-line shell. Both are written largely in Java, which is also the primary programming language for accessing them, although client libraries exist for many other languages.
Finally, both Cassandra and HBase are open source projects managed by the Apache Software Foundation, and both are freely available under the Apache License, version 2.0.
Similarities and differences
Despite these many similarities, there are still important differences between the two.
Although the nodes in a Cassandra or HBase cluster are symmetric, in the sense that a client can connect to any node, the symmetry is incomplete. Cassandra requires some nodes to be designated seed nodes, which act as rendezvous points for communication within the cluster. In HBase, some nodes must act as master nodes, whose job is to monitor and coordinate the region servers. For high availability, Cassandra allows multiple seed nodes per cluster, and HBase allows backup master nodes: if the current master fails, a backup becomes the new master.
Cassandra uses the gossip protocol for communication between nodes, and the gossip service is integrated into the Cassandra software itself. HBase instead relies on a completely separate distributed application, ZooKeeper, to handle the corresponding tasks. Although a ZooKeeper instance is bundled with HBase, users often simply run the ZooKeeper that ships with it.
Neither Cassandra nor HBase supports true transactions, but both provide a degree of consistency control. HBase offers the user record-level (that is, row-level) consistency; in fact, HBase supports ACID-level semantics on a per-row basis. A user can lock a row in HBase, but this is discouraged: it hurts concurrency, and row locks can also block region-splitting operations. In addition, HBase can perform "check and write" operations, which provide read-modify-write semantics on a single data element.
The free DataStax Community Edition of Cassandra includes DataStax OpsCenter. OpsCenter provides cluster monitoring and management: it can inspect the database schema, and lets users edit keyspaces and add or remove column families.
Although Cassandra is described as "eventually" consistent, read and write consistency is tunable per operation: you can configure not only how many replica nodes must successfully complete an operation, but also whether the participating replicas may span data centers.
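Tunable consistency boils down to simple arithmetic: with replication factor N, if a write must be acknowledged by W replicas and a read must consult R replicas, then W + R > N guarantees that every read overlaps at least one replica holding the latest successful write. A sketch of that check (the function name is mine, not a Cassandra API):

```python
def is_strongly_consistent(replication_factor, write_acks, read_acks):
    """With N replicas, a read of R replicas is guaranteed to intersect a
    write acknowledged by W replicas whenever W + R > N (any R-subset and
    any W-subset of N nodes must share at least one member)."""
    return write_acks + read_acks > replication_factor

# For N = 3: quorum writes (2) + quorum reads (2) always overlap,
# while writing to one node and reading from one node may miss the update.
```

Lowering W and R trades consistency for latency and availability, which is exactly the dial Cassandra exposes.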
In addition, Cassandra has added lightweight transactions to its repertoire. Cassandra's lightweight transactions use a "compare and set" mechanism, roughly equivalent to HBase's "check and write," though Cassandra lacks a counterpart to HBase's read-modify-write operation. Finally, Cassandra 2.0 added atomic row-level writes: if one client updates several columns of a row, other clients will see either none of the updates or all of them.
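The "compare and set" / "check and write" idea can be sketched on a single cell (a toy model with hypothetical names; both databases implement this server-side, Cassandra via Paxos rather than a local lock):

```python
import threading

class Cell:
    """Toy single-cell store with a compare-and-set primitive: a write
    succeeds only if the current value still matches what the caller
    expects, so concurrent updaters cannot silently clobber each other."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def check_and_put(self, expected, new_value):
        with self._lock:
            if self._value != expected:
                return False        # value changed underneath us; caller retries
            self._value = new_value
            return True

    def read(self):
        return self._value

def increment(cell):
    """Read-modify-write built from compare-and-set: retry until our
    update lands on an unchanged value."""
    while True:
        current = cell.read()
        if cell.check_and_put(current, (current or 0) + 1):
            return
```

The `increment` helper shows how a client can compose read-modify-write semantics out of compare-and-set by looping until the write wins.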
In both Cassandra and HBase, the primary index is the row key, and data is stored on disk so that members of the same column family are kept close together. It is therefore important to plan column families carefully: to maintain high query performance, columns with similar access patterns should be placed in the same column family. Cassandra also allows users to create secondary indexes on column values, which helps when accessing columns whose values repeat frequently, such as a column storing the country portion of a customer's e-mail address. HBase lacks built-in support for secondary indexes, but several mechanisms provide secondary-index functionality; these are described in the HBase online reference guide and on HBase community blogs.
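The difference between a primary-key lookup and a secondary-index lookup can be sketched in a few lines (a toy model with hypothetical names, not either database's actual index implementation):

```python
from collections import defaultdict

class IndexedStore:
    """Toy store with a primary index on the row key and a secondary
    index on one column's values -- useful when many rows share the same
    value, e.g. a 'country' column."""

    def __init__(self, indexed_column):
        self.rows = {}                     # primary index: row_key -> columns
        self.indexed_column = indexed_column
        self.index = defaultdict(set)      # secondary: value -> {row keys}

    def put(self, row_key, columns):
        old = self.rows.get(row_key, {})
        if self.indexed_column in old:     # keep the index in sync on update
            self.index[old[self.indexed_column]].discard(row_key)
        self.rows[row_key] = columns
        if self.indexed_column in columns:
            self.index[columns[self.indexed_column]].add(row_key)

    def get(self, row_key):
        """Primary-index lookup by row key."""
        return self.rows.get(row_key)

    def find_by_value(self, value):
        """Secondary-index lookup: all row keys holding this column value."""
        return sorted(self.index.get(value, ()))
```

Without the secondary index, `find_by_value` would require a full scan of every row; with it, the query is a single map lookup.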
As mentioned earlier, both databases provide a command-line shell for issuing data-manipulation commands. Because both the HBase and Cassandra shells are based on the JRuby shell, users can write scripts that draw on the full resources of JRuby to interact with each database's specific APIs. In addition, Cassandra defines CQL, which mimics SQL; compared with the query commands available in the HBase shell, CQL is more expressive, and it can be executed directly in the Cassandra shell.
Although Cassandra still supports the Thrift API, the project has been pushing CQL to become the database's primary programming interface. Cassandra's documentation covers CQL version 3 drivers for Java, C#, and Python. There is also a JDBC driver for Cassandra, which substitutes CQL for SQL as the data-definition and data-manipulation language.
HBase likewise supports Thrift and RESTful Web service interfaces, but HBase's native Java API offers programmers the richest functionality (as shown in the figure). Although HBase's data-manipulation commands are not as rich as CQL, HBase has a "filter" feature that executes on the server side of a scan, dramatically boosting scan (search) throughput.
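Why a server-side filter boosts scan throughput can be shown with a toy comparison (hypothetical function names; each returns the matching rows plus the number of rows that had to cross the network):

```python
def server_side_scan(rows, predicate):
    """Evaluate the filter where the data lives: only matching rows are
    transferred to the client."""
    matches = [(key, cols) for key, cols in rows if predicate(cols)]
    return matches, len(matches)           # transferred == matches only

def client_side_scan(rows, predicate):
    """Ship every row to the client first, then filter locally."""
    transferred = list(rows)               # the whole scan crosses the wire
    matches = [(key, cols) for key, cols in transferred if predicate(cols)]
    return matches, len(transferred)
```

Both approaches return identical results, but the server-side version transfers only the rows that pass the predicate, which is exactly the saving HBase filters provide on large scans.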
HBase also introduces the concept of "coprocessors," which allow user code to execute inside HBase's own processes, much like triggers and stored procedures in a relational database. Cassandra currently has no feature comparable to HBase's coprocessors.
Cassandra's documentation is more readable than HBase's, and its learning curve is gentler. Setting up a Cassandra cluster for development is also simpler than setting up an HBase cluster, although this matters mainly for development and testing purposes.
Figure: The HBase master node hosts a web interface on port 60010. Users can browse information such as the node's execution history, the tables it manages, the region servers in its domain, and so on.
The tricky parts
Either database demands real work when the cluster must be tuned for a particular application. Given the size of the datasets involved, and the complexity of creating and managing a multi-node cluster that typically spans multiple data centers, tuning gets tricky: users need to understand the cluster's memory caching, disk storage, and inter-node communication, and must monitor the cluster's activity carefully.
HBase's reliance on ZooKeeper introduces an additional point of failure. Cassandra avoids that problem, but this does not mean that tuning a Cassandra cluster is significantly easier. The table compares the cluster-tuning challenges of the two databases.
To be clear, there is no definitive winner or loser here. Advocates of each database can find evidence that their system outperforms the other. Typically, users must test both databases against their target application before they can determine which performs better. Is there a better way, from a technical standpoint?