Debate over big data: Will HBase dominate NoSQL?

HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? Two NoSQL experts lay out the debate.


HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But does this lineage guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity, together with HBase's scalability and consistency, will ensure success. The growing HBase community will surpass other open source movements and will overcome the few technical problems that still need work.

Jonathan Ellis of DataStax, the support vendor behind the open source Cassandra project, argues that HBase's flaws are too numerous, and too deeply embedded in Hadoop's HDFS architecture, to overcome. He says these flaws will forever limit HBase's applicability to high-velocity workloads.


Read what our two NoSQL experts have to say, then join the debate in the comments section below.


For the Motion


Michael Hausenblas

Chief Data Engineer, EMEA, MapR Technologies


Integration with Hadoop will drive adoption


The answer to this question is a clear "Yes, but ..."


To appreciate this answer, we need to step back and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, made the polyglot persistence argument that "one size does not fit all."


I will therefore interpret "dominance" in the question not in the sense of the market-share measures applied to relational databases over the past decade, but along the lines of "Will Apache HBase be used in a wider range of contexts, and have a larger community behind it, than other NoSQL databases?" This is admittedly a bit of sophistry.


That is a bold claim, given that there are now more than 100 NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many others. But in the big data era, the trend is shifting away from specialized information silos toward large-scale processing of heterogeneous data, so even a popular solution such as MongoDB will be overtaken by HBase.


Why? MongoDB has well-known scalability issues, and with Hadoop adoption growing rapidly, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a large and diverse community in every respect: users, developers, multiple commercial vendors, and availability in the cloud, the last through Amazon Web Services (AWS), for example.


Historically, HBase and Cassandra have much in common. HBase was created at Powerset in 2007 (the company was acquired by Microsoft soon afterward); it was initially part of Hadoop and then became a top-level project. Cassandra originated at Facebook in 2007, was open-sourced, was then incubated at Apache, and is now also a top-level project. Both HBase and Cassandra are wide-column key-value data stores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic.


Their architectures differ in design philosophy. Cassandra borrows many design elements from Amazon's DynamoDB system, has an eventual consistency model and is optimized for writes, whereas HBase is a Google BigTable clone that is optimized for reads and offers strong consistency. An interesting proof point for the superiority of HBase is that Facebook, the creator of Cassandra, replaced Cassandra with HBase for internal use.


From an application developer's point of view, HBase is preferable because it offers strong consistency, which makes life easier. One misconception about eventual consistency is that it improves write speed: under sustained write traffic, latency suffers, and you end up paying the "eventual consistency tax" without getting its benefits.
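The difference between the two consistency models can be made concrete with the quorum rule commonly used to reason about Dynamo-style replicated stores: if a write is acknowledged by W of N replicas and a read consults R replicas, every read is guaranteed to see the latest acknowledged write only when R + W > N. The following is a minimal sketch of that arithmetic, not code from either database; the function names are hypothetical and chosen for illustration.

```python
from itertools import combinations


def quorums_overlap(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """Quorum rule: reads see the latest write only if R + W > N."""
    return read_acks + write_acks > n_replicas


def quorums_overlap_brute_force(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """Check the same property by enumerating every possible pair of quorums.

    Returns True only if every set of replicas that could acknowledge a write
    shares at least one replica with every set that could serve a read.
    """
    replicas = range(n_replicas)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, write_acks)
        for read_set in combinations(replicas, read_acks)
    )


if __name__ == "__main__":
    # With 3 replicas, writing to 2 and reading from 2 always overlaps.
    print(quorums_overlap(3, 2, 2))
    # Writing to 1 and reading from 1 may miss the latest write.
    print(quorums_overlap(3, 1, 1))
```

Under this rule, a store that acknowledges writes from a single replica is fast to write but must either read from every replica or accept stale reads, which is one way to picture the trade-off the paragraph above describes.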


Almost all NoSQL solutions have some technical limitations, such as compactions that disrupt consistently low latency, the inability to shard automatically, reliability issues, and long recovery times after node outages. Here at MapR, we have created a "next version" of enterprise-grade HBase that includes instant recovery, seamless sharding and high availability, and that does away with compactions. We brought it to GA under the label M7 in May 2013, and it is also available in the cloud via AWS Elastic MapReduce.


Last but not least, HBase has, through its legacy as a Hadoop subproject, a strong and solid integration with the entire Hadoop ecosystem, including Apache Hive and Apache Pig.


In summary, HBase will be the dominant NoSQL platform for use cases that require fast, small updates and lookups at scale. Recent innovations have also given HBase architectural advantages, including the elimination of compactions and truly decentralized coordination.


Michael Hausenblas is Chief Data Engineer for the EMEA region at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.


Against the Motion


Jonathan Ellis


Co-founder and CTO, DataStax


HBase is plagued by too many flaws


NoSQL includes several specialties, such as graph databases and document stores, in which HBase does not compete, and even within its own category of partitioned row stores, HBase lags behind the leaders. The technical shortcomings behind HBase's failed use cases fall into two main categories: engineering problems, which can be addressed given enough time and manpower, and architectural flaws, which are inherent to the design and therefore cannot be fixed.


Engineering Problems


-- Operations are complex and failure-prone. Deploying HBase involves configuring, at a minimum, a ZooKeeper ensemble, a primary HMaster, a secondary HMaster, RegionServers, an active NameNode, a standby NameNode, HDFS management services and DataNodes. Installation can be automated, but if it is too difficult to install without help, how will you troubleshoot it when something goes wrong, such as a RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise even to know what to monitor, and God help you if you need regular backups.


-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, and each region is managed by a RegionServer. The RegionServer is a single point of failure for the regions it manages. When it goes down, a new RegionServer must be selected, and the failed server's write-ahead log must be replayed before the new server can serve requests.


-- Developing against HBase is painful. The HBase API is clunky and Java-centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast this with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.


-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks and advanced users maintain their own patch trees on top of it. Leadership is divided, and there is no clear roadmap. By contrast, the open source Cassandra community includes contributors from DataStax, Netflix, Spotify, Blue Mountain Capital and other organizations, with no factions or forks.


Overall, the engineering gap between HBase and other NoSQL platforms has only grown in the time I have been following the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress; today that gap has widened to about two years.
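The failover point above rests on how a write-ahead log works: every write is appended to a durable log before it is applied in memory, so a replacement server must replay that log in order before it can answer requests. The following toy sketch illustrates the general technique only; it is not HBase code, and `RegionStore` and `recover` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RegionStore:
    """Toy store that logs each write before applying it in memory."""

    wal: list = field(default_factory=list)       # stands in for the durable log
    memstore: dict = field(default_factory=dict)  # volatile state, lost on a crash

    def put(self, key, value):
        self.wal.append((key, value))  # 1. record the write in the log first
        self.memstore[key] = value     # 2. then apply it to in-memory state


def recover(wal):
    """Build a replacement store by replaying the surviving log in order."""
    replacement = RegionStore(wal=list(wal))
    for key, value in wal:
        # Later entries overwrite earlier ones, reproducing the final state.
        replacement.memstore[key] = value
    return replacement


if __name__ == "__main__":
    failed = RegionStore()
    failed.put("row1", "a")
    failed.put("row1", "b")
    failed.put("row2", "c")
    # Only the log survives the crash; the in-memory state is rebuilt from it.
    print(recover(failed.wal).memstore)
```

Because replay is sequential, recovery time in a design like this grows with the amount of unflushed log, which is why the length of the replay step matters for availability.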


Architectural Flaws


-- The master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that HBase cannot do active/active asynchronous replication across multiple data centers, nor can you separate workloads across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL, while letting you opt in to lightweight transactions in the rare cases when you need linearizable consistency.


-- Failover means downtime. Many applications cannot accept even one minute of downtime, yet this is an intrinsic problem with HBase's design: each RegionServer is a single point of failure. In a fully distributed design, by contrast, when one replica goes down there is no need for special recovery action; the system keeps functioning normally with the other replicas, and the failed replica can catch up later.


-- HDFS is designed primarily for streaming access to large files. HBase is built on a distributed file system optimized for batch analytics. This is directly responsible for HBase's poor performance, especially for reads, and especially on solid-state drives. Just as relational databases cannot be optimized for big data workloads with B-tree engines designed 30 years ago, HDFS cannot reconcile the mismatch between its primary purpose and the following key features:


-- Mixing hard disks and solid-state drives in a single cluster, and pinning tables to the media appropriate to the workload.


-- Snapshots, incremental backups and point-in-time recovery.


-- Compaction throttling to avoid spikes in application response time.


-- Dynamically routing requests to the best-performing replicas.


The design choices that make HBase's HDFS foundation a good fit for batch analytics ensure that HBase will remain inherently unsuitable for the high-velocity, random-access workloads that characterize the NoSQL market.
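One of the features listed above, routing requests to the best-performing replica, can be illustrated with a small sketch. It assumes a client that tracks an exponentially weighted moving average of observed latency per replica and sends each request to the replica with the lowest average. This shows the general idea only and is not how any particular database implements it; `ReplicaRouter`, its methods, and the smoothing factor `alpha` are hypothetical.

```python
class ReplicaRouter:
    """Pick the replica with the lowest smoothed latency (illustrative only)."""

    def __init__(self, replicas, alpha=0.2):
        self.alpha = alpha  # weight given to the newest latency sample
        # Every replica starts at 0.0, so untested replicas look attractive.
        self.latency_ms = {name: 0.0 for name in replicas}

    def record(self, replica, observed_ms):
        """Fold a new latency observation into the replica's moving average."""
        previous = self.latency_ms[replica]
        self.latency_ms[replica] = (1 - self.alpha) * previous + self.alpha * observed_ms

    def pick(self):
        """Return the replica whose smoothed latency is currently lowest."""
        return min(self.latency_ms, key=self.latency_ms.get)


if __name__ == "__main__":
    router = ReplicaRouter(["replica-a", "replica-b", "replica-c"])
    router.record("replica-a", 10)
    router.record("replica-b", 5)
    router.record("replica-c", 20)
    print(router.pick())
    # A slow response from the current favorite shifts traffic elsewhere.
    router.record("replica-b", 100)
    print(router.pick())
```

A scheme like this only helps when any replica can serve a request, which is the property the argument above attributes to peer-to-peer designs.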


Jonathan Ellis is CTO and co-founder of DataStax, where he sets the technical direction and leads the Apache Cassandra project as project chair.