HBase vs Cassandra: why we moved (from: http://blog.csdn.net/wdwbw/article/details/5366739)


Address: http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved

 


 

The following describes why we chose Cassandra as our NoSQL solution.

 

Does Cassandra's lineage predict the future?

I have found that with software problems it pays to consider the high-level issues first instead of diving straight into the details, which can save a lot of time. I followed this principle when choosing between HBase and Cassandra. The two have completely different lineages and genes, and these determine how well each fits our application.

HBase and its supporting systems derive from Google's GFS and Bigtable designs. Cassandra, originally open-sourced by Facebook, adopts the Bigtable data model but uses a storage system similar to Amazon's Dynamo (in fact, Cassandra was initially developed by two of the original Dynamo engineers).

In my opinion, these roots make HBase more suitable for data warehousing and large-scale data processing and analysis (such as web indexing), while Cassandra is more suitable for real-time transaction processing and serving interactive data. (HBase committers joined Microsoft's Bing through an acquisition, while Cassandra's committers work for Rackspace, which aims to provide a general NoSQL solution independent of Google, Yahoo, and Amazon.)

 

Which NoSQL database has better momentum?

Another factor is that Cassandra is currently gaining traction in the community. Platforms with momentum tend to grow larger and easier to use, and people prefer systems with better support.

When I first looked at HBase, I felt it had good community support (mainly because of the reports from StumbleUpon and from Streamy's CTO, and "HBase vs Cassandra: NoSQL Battle!"), but now I believe Cassandra's community will become the stronger of the two.

To see this, compare developer activity on IRC: connect to freenode.org and compare the #hbase and #cassandra developer channels. You will find roughly twice as many developers in the Cassandra channel.

In addition, Twitter plans to use Cassandra on a large scale.

 

CAP: CP vs AP

According to Eric Brewer's CAP theorem, a large-scale distributed system cannot guarantee all three of C (consistency), A (availability), and P (partition tolerance, that is, the system remains available when the cluster is split into multiple isolated partitions) at the same time. It is generally believed that HBase chose CP and Cassandra chose AP.

However, I must point out that this classification rests on a flawed inference. Although the three CAP properties cannot all be satisfied at once, a system can let each operation specify which properties to favor and which to give up, and to what degree, so that you get the balance you need. This is what Cassandra does.

I want to stress this advantage of Cassandra: you can choose the trade-off for each operation. For example, when a read requires strong consistency, use the ALL consistency level (note: because a write may have been stored as a hint during a temporary failure, even with ALL I may not read the expected result). When consistency matters less than performance, use the ONE consistency level. You can also choose a level between the two, such as QUORUM (a vote, that is, a majority).

In addition, when some nodes fail or the network becomes unstable, Cassandra still keeps most operations available, except for requests that demand the highest consistency levels. HBase cannot offer this flexibility.
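
The per-operation trade-off described above can be illustrated with a little arithmetic. The following is a minimal sketch, not Cassandra's actual API: the function names are hypothetical, and it assumes the usual rule of thumb that a read is guaranteed to overlap the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N).

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must answer for a given consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_overlaps_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read replica set must intersect every write replica set (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


def is_available(level: str, replication_factor: int, replicas_down: int) -> bool:
    """True when enough replicas are still alive to satisfy the requested level."""
    alive = replication_factor - replicas_down
    return alive >= replicas_required(level, replication_factor)


if __name__ == "__main__":
    n = 3  # hypothetical replication factor
    for write_level, read_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
        print(write_level, read_level, read_overlaps_write(write_level, read_level, n))
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, "available with one replica down:", is_available(level, n, 1))
```

With a replication factor of 3 and one replica down, ONE and QUORUM operations can still be served while ALL cannot, which is the flexibility the text refers to.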

 

 

When is monolithic better than modular?

An important difference is that each Cassandra node is a single Java process. A complete HBase solution consists of several parts: database processes running in multiple modes, plus the Hadoop HDFS and ZooKeeper systems.

For a small company, configuring an HBase deployment is too complicated. If you have a database administrator who wants to learn a NoSQL system in depth, HBase is a good choice.

 

Gossip!

Cassandra is a completely symmetric system. It has no master nodes as HBase does; all nodes in the system play the same role. Coordination is achieved entirely by the nodes in the cluster following gossip, a pure P2P protocol. Cassandra relies on this protocol to detect node failures and to route requests to the appropriate nodes, which takes relatively little time.

This gossip-based architecture brings users the following benefits. First, system management is extremely simple. For example, to add a new node, the node communicates with a seed node to complete the bootstrapping process and obtain its data and routing information. In addition, the P2P architecture provides good performance and availability: load can be well balanced across the system, and network partitions or node failures are handled seamlessly. This complete symmetry also avoids the temporary performance instability that occurs when nodes are added to or removed from an HBase cluster.
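
To make the idea concrete, here is a toy simulation of gossip-style membership exchange. It is only a sketch under simplifying assumptions and not Cassandra's real protocol: every node starts out knowing just itself and one seed node, nodes act one after another within a round, and all names (`gossip_round`, `simulate`) are hypothetical.

```python
import random


def gossip_round(views, rng):
    """Each node contacts one random peer it already knows; both merge their views."""
    for node in sorted(views):
        peers = sorted(p for p in views[node] if p != node)
        if not peers:
            continue  # a node that knows nobody else waits to be contacted
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def simulate(num_nodes=8, seed_node=0, max_rounds=1000, rng_seed=42):
    """Return the number of rounds until every node knows every other node, or None."""
    rng = random.Random(rng_seed)
    # Bootstrapping assumption: a new node only knows itself and the seed node.
    views = {n: {n, seed_node} for n in range(num_nodes)}
    everyone = set(range(num_nodes))
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no
    return None


if __name__ == "__main__":
    print("rounds until full membership:", simulate())
```

No node plays a special coordinating role here; the seed is only a first point of contact, and membership information spreads through pairwise exchanges.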

 

Third-party reports

Yahoo has published a detailed comparison of NoSQL systems. The results favor Cassandra; HBase has the advantage only in range scans. But I think you should build your own index on top of Cassandra rather than relying on range scans directly. If you are interested in Cassandra's range queries and stored indexes, refer to my other article.

The following are related reports:

http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king

http://www.brianfrankcooper.net/pubs/ycsb.pdf

 

 

 

Locking and modularity

 

You may hear from the HBase camp that their more complex architecture provides benefits that Cassandra's P2P architecture cannot, such as row locks. But to me this comes back to modularity. Cassandra implements the Bigtable data model on top of a distributed design of symmetric nodes. This is a flexible and efficient model. If you need locks, transactions, or other functions, you can add your own modules. For example, we can use ZooKeeper alongside Cassandra to implement scalable locking.

Need locking? Use ZooKeeper. Need indexing? Use Lucandra. Cassandra does not impose complexity that you may never use, but gives you the flexibility to add the modules you need.

 

MapReduce

A major weakness of Cassandra is MapReduce. Because HBase stores its data in Hadoop HDFS, it is naturally suited to MapReduce-style analysis and processing. If you need this kind of data analysis, HBase is indeed the best choice.

Although I am praising Cassandra here, I must point out that HBase and Cassandra are not really competitors; each suits different scenarios. As far as I know, StumbleUpon uses HBase together with Hadoop MapReduce to process massive amounts of data. Our system is mostly made up of interactive applications, so we chose Cassandra.

Cassandra has supported Hadoop since version 0.6, and I believe its MapReduce support will keep improving.

 

Data loss?

Through the CAP debate, it is easy to get the impression that HBase is safer than Cassandra. In fact, when you write new data to Cassandra, it is immediately written to the commit log and replicated to other nodes. This means that even if your entire cluster loses power, only a small amount of data is lost. Cassandra also uses Merkle trees to detect data inconsistencies between replicas, further improving data safety.
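
As a rough illustration of how a Merkle tree helps two replicas find inconsistent data without comparing every row, consider the sketch below. It is not Cassandra's implementation: the row format, the function names, and the choice of SHA-256 are assumptions made for the example, and it recomputes hashes on every call where a real system would build each tree once and exchange it.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_hash(rows, lo, hi):
    """Hash of the subtree covering rows[lo:hi] (requires hi > lo)."""
    if hi - lo == 1:
        return _hash(rows[lo].encode("utf-8"))
    mid = (lo + hi) // 2
    return _hash(merkle_hash(rows, lo, mid) + merkle_hash(rows, mid, hi))


def differing_rows(replica_a, replica_b, lo=0, hi=None):
    """Indexes of rows that differ, descending only into subtrees whose hashes differ.

    Assumes both replicas hold the same number of rows in the same order.
    """
    if hi is None:
        hi = len(replica_a)
    if hi <= lo:
        return []
    if merkle_hash(replica_a, lo, hi) == merkle_hash(replica_b, lo, hi):
        return []  # identical subtree: nothing below needs to be compared
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return (differing_rows(replica_a, replica_b, lo, mid)
            + differing_rows(replica_a, replica_b, mid, hi))


if __name__ == "__main__":
    a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4", "k5=v5"]
    b = list(a)
    b[3] = "k4=stale"  # simulate one out-of-date row on the second replica
    print(differing_rows(a, b))
```

When the root hashes match, the replicas agree and no further work is needed; when they differ, only the mismatching branches are explored to locate the rows that need repair.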
