Discussion on the applicability of Hadoop, Spark, HBase and Redis (full text)
2014-06-15 11:22:03
url:http://datainsight.blog.51cto.com/8987355/1426538
Recently I came across an online discussion about the applicability of Hadoop [1]. Given that big data technologies, pioneered by the Internet giants, have this year spread to small and medium Internet companies and traditional industries, many people are presumably weighing the applicability of these various "complex" big data technologies. Here, drawing on my years of experience with big data platforms such as Hadoop, I will discuss the applicability of several major big data technologies: Hadoop, Spark, HBase, and Redis. (Note first that the "Hadoop" referred to in this article is Hadoop in the narrow sense, i.e., running MapReduce directly on HDFS; likewise hereinafter.)
Over the past few years I have studied and used, in real projects, big data (including NoSQL) technologies such as Hadoop, Spark, HBase, Redis, and MongoDB. A common trait of these technologies is that they are not suited to supporting transactional applications, especially those involving "money", such as subscription management or supermarket transactions; such scenarios remain, so far, the domain of Oracle and other traditional relational databases.
1. Hadoop vs. Spark
Hadoop/MapReduce and Spark are both best suited to offline (batch) data analysis, but Hadoop is particularly suitable when a single analysis involves a very large amount of data, while Spark fits scenarios where the data volume is not very large. "Very large" here is relative to the memory capacity of the whole cluster, because Spark needs to hold the data in memory. As a rough guide, data volumes below 1TB are not very large, and volumes above 10TB count as "very large". For example, a cluster of 20 nodes (small by big data standards) with 64GB of memory per node (not small, but not large either) has 1.28TB of memory in total. A cluster of this size can comfortably hold about 500GB of data in memory. In that case Spark will run faster than Hadoop; after all, the MapReduce pipeline includes operations such as spill that must write to disk.
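To make this concrete, here is a minimal PySpark sketch (using the RDD API of that era) of holding a dataset in memory and reusing it across actions; the HDFS path and the filter condition are hypothetical, for illustration only:

```python
# Minimal sketch: cache an RDD in memory and reuse it across actions.
# The HDFS path and the filter condition are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

data = sc.textFile("hdfs:///data/events")  # hypothetical input
data.cache()  # ask Spark to keep the dataset in executor memory

# Both actions below reuse the in-memory copy; an equivalent MapReduce
# pipeline would write intermediate results to disk between jobs.
print(data.count())
print(data.filter(lambda line: "ok" in line).count())
```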
Two points are worth noting here: (1) In general, for small and medium Internet companies and enterprise-class big data applications, the data volume of a single analysis will not be "very large", so Spark can be the first choice, especially as Spark matures (Hadoop is at version 2.5, while Spark has only just reached 1.0). For example, at a provincial subsidiary of China Mobile (and among enterprises, mobile operators have fairly large data volumes), a single analysis typically involves a few hundred GB, rarely reaching even 1TB, let alone more than 10TB, so they could well consider gradually replacing Hadoop with Spark. (2) It is commonly held that Spark is "more suitable" for iterative applications such as machine learning, but this is only a matter of degree. For moderately sized data, even applications outside that "more suitable" category typically run 2 to 5 times faster on Spark. In a comparison test I ran, with 80GB of compressed data (over 200GB after decompression) on a 10-node cluster, a "sum + group-by" style job took 5 minutes with MapReduce and only 2 minutes with Spark.
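For reference, a "sum + group-by" job of the kind used in that test can be sketched in PySpark as follows; the input path, record format, and field names are hypothetical:

```python
# Minimal "sum + group-by" sketch in PySpark (RDD API). The input
# path and the "region,amount" record format are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="sum-group-by")

lines = sc.textFile("hdfs:///data/records")          # hypothetical input
pairs = (lines.map(lambda line: line.split(","))
              .map(lambda f: (f[0], float(f[1]))))   # (region, amount)

totals = pairs.reduceByKey(lambda a, b: a + b)       # sum per group
totals.saveAsTextFile("hdfs:///out/totals")
```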
2. HBase
As for HBase, one often hears that it is only suitable for supporting offline analytical applications, especially as a backend data source for MapReduce tasks. Many people hold this view; even a well-known domestic telecom equipment vendor classifies HBase under its data analysis product line and explicitly advises against using HBase for online applications. But is that really the case? Consider some of its major use cases: Facebook's messaging platform, covering messages, chats, email, and SMS, is built on HBase; the web version of Taobao's Ali Wangwang runs on HBase; Xiaomi's MiTalk (Miliao) also uses HBase; and a mobile operator's phone call detail record query system was migrated last year from Oracle to a 32-node HBase cluster. These are key applications at well-known large companies, which speaks for itself.
In fact, judging from HBase's technical characteristics, it is particularly well suited to simple, high-volume data writes (e.g. "message-class" applications) and to queries over large volumes of simply structured data (e.g. "detail-record-class" applications). Of the four HBase applications mentioned above, Facebook messages, the web version of Ali Wangwang, and MiTalk are write-heavy message-class applications, while the mobile operator's call detail record query system is a read-heavy detail-record-class application.
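As a minimal sketch of both access patterns, the following uses the happybase Python client against HBase's Thrift gateway; the host, table name, column family, and row key scheme are all hypothetical assumptions:

```python
# Minimal happybase sketch of the two HBase access patterns above:
# a simple write (message-class) and a row-key lookup (detail-record-class).
# Host, table name, column family, and key scheme are hypothetical.
import happybase

conn = happybase.Connection("hbase-thrift-host")  # Thrift gateway
table = conn.table("messages")

# Write: a "userid|timestamp" row key keeps one user's messages contiguous.
table.put(b"user42|20140615T112203",
          {b"m:body": b"hello", b"m:from": b"user7"})

# Read: fetch one row directly by its key.
row = table.row(b"user42|20140615T112203")
print(row[b"m:body"])
```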
Another use of HBase is as a backend data source for MapReduce, to support offline analytical applications. That certainly works, but its performance is questionable. For example, the blogger superlxw1234 compared "Hive over HBase" with "Hive over HDFS" experimentally and found, to his surprise [2], that apart from the rowkey-filtering case, where the HBase-based scheme performs slightly better than the HDFS-based one, with full table scans and value-based filtering the HDFS-based scheme performs far better than HBase, contrary to expectations. On this question, my own view from first principles is that with rowkey filtering, the more selective the filter, the better the HBase scheme performs, while the performance of the HDFS-based scheme is unaffected by the degree of filtering.
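The contrast between the two scan patterns can be sketched with happybase as follows; the table name, column family, and filter threshold are hypothetical:

```python
# Sketch contrasting rowkey filtering with value filtering in HBase.
# Table name, column family, and threshold are hypothetical.
import happybase

conn = happybase.Connection("hbase-thrift-host")
table = conn.table("cdr")  # hypothetical call-detail-record table

# Rowkey filtering: only the key range is read, so higher selectivity
# means less work for HBase.
for key, data in table.scan(row_start=b"user42|201406",
                            row_stop=b"user42|201407"):
    print(key, data)

# Value filtering: every row must be scanned and the filter applied
# server-side, so selectivity does not reduce the I/O cost.
for key, data in table.scan(
        filter="SingleColumnValueFilter('c', 'duration', >, 'binary:600')"):
    print(key, data)
```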
3. HBase vs. Redis
HBase and Redis are functionally similar; for example, they both belong to the key-value (KV) family of NoSQL stores.
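To make the functional similarity concrete, here is a minimal sketch storing and reading back the same logical record in each system, using the redis-py and happybase clients; the hosts, key names, table, and column family are hypothetical:

```python
# Minimal sketch of the KV similarity: the same logical record written
# and read back in Redis and in HBase. Hosts and names are hypothetical.
import redis
import happybase

# Redis: flat key -> value.
r = redis.Redis(host="redis-host")
r.set("user:42:name", "alice")
print(r.get("user:42:name"))

# HBase: row key -> {column family:qualifier -> value}.
conn = happybase.Connection("hbase-thrift-host")
table = conn.table("users")  # hypothetical table with family 'p'
table.put(b"user42", {b"p:name": b"alice"})
print(table.row(b"user42")[b"p:name"])
```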