The Hadoop ecosystem has developed rapidly in recent years, accumulating more and more software and driving the growth of surrounding systems. This is especially true in distributed computing, where new systems appear all the time, each claiming to be tens or hundreds of times more efficient than MapReduce or Hive. Some people, without understanding the details, follow the hype and declare that Impala will replace Hive, or that Spark will replace Hadoop MapReduce. This article starts from the problem domain, explaining the unique role of each system in the Hadoop ecosystem and why none of them is simply replaceable.
As an ecosystem, Hadoop consists of systems that each solve only a specific (and perhaps very narrow) problem domain. This is the charm of Hadoop: rather than building one unified, general-purpose system, it is composed of many small, focused systems. This article concentrates on the problem domains addressed by several open-source systems in the distributed computing field.
(1) MapReduce: The veteran distributed computing framework. It is characterized by scalability, fault tolerance, and ease of programming, and is well suited to offline batch processing, but not to streaming, in-memory, or interactive computing.
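To make the programming model concrete, here is a minimal single-machine sketch of the classic MapReduce word count. This is illustrative only: a real Hadoop job defines Mapper and Reducer classes in Java and runs distributed across a cluster, with the framework handling the shuffle.

```python
# Single-machine sketch of the MapReduce programming model (word count).
from collections import defaultdict

def map_phase(line):
    # Mapper: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts for each word.
    return (key, sum(values))

lines = ["hadoop map reduce", "map reduce map"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 1, 'map': 3, 'reduce': 2}
```

The key property this sketch preserves is that map and reduce see only local data, which is what makes the model easy to scale out and to restart on failure.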
(2) Hive: MapReduce in a SQL jacket. Hive wraps a SQL layer around MapReduce to make it easier to use. Because Hive uses SQL, its problem domain is narrower than MapReduce's: many problems simply cannot be expressed in SQL, such as certain data mining, recommendation, and image recognition algorithms, which can still only be implemented by writing MapReduce programs directly.
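The idea of "SQL over MapReduce" can be sketched as follows. This is a hedged illustration of how a GROUP BY query conceptually compiles to one map/shuffle/reduce pass, not Hive's actual planner; the table name and columns are made up for the example.

```python
# Conceptual sketch: how a query like
#   SELECT dept, COUNT(*) FROM emp GROUP BY dept;
# maps onto a single MapReduce job. "emp" and its columns are hypothetical.
from collections import Counter

emp = [  # rows of a hypothetical "emp" table: (name, dept)
    ("ann", "eng"), ("bob", "eng"), ("carl", "sales"),
]

# Map: project out the GROUP BY key from each row.
mapped = [dept for _name, dept in emp]
# Shuffle + Reduce: group identical keys and count them.
result = dict(Counter(mapped))
print(result)  # {'eng': 2, 'sales': 1}
```

Projections, filters, and aggregations translate naturally this way; iterative or graph-structured algorithms do not, which is exactly the expressiveness gap the article describes.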
(3) Pig: MapReduce in a scripting-language jacket. To overcome the expressiveness limits of Hive's SQL, Pig uses a more expressive scripting language, Pig Latin. Thanks to the expressive power of the Pig language, Twitter has even implemented a large-scale machine learning platform on top of Pig (see Twitter's SIGMOD 2012 paper, "Large-scale Machine Learning at Twitter").
(4) Stinger Initiative (Tez-optimized Hive): Hortonworks open-sourced a DAG computing framework, Tez, which lets applications be designed as DAGs rather than chains of MapReduce jobs; note that Tez runs only on YARN. An important application of Tez is optimizing the typical DAG workloads of Hive and Pig: by reducing intermediate read/write I/O and streamlining DAG execution, it can make Hive many times faster.
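The I/O saving can be sketched like this. In a chain of MapReduce jobs, every stage boundary is an HDFS write followed by a read; in a DAG engine, a vertex can hand its output directly to the next vertex. The three-stage pipeline below is invented purely for illustration.

```python
# Hedged sketch of DAG-style execution: intermediate results are passed
# directly between vertices instead of being materialized to HDFS between
# chained MapReduce jobs, which is where a DAG engine like Tez saves I/O.
def run_dag(data):
    # Each arrow below would be an HDFS write + read if these three
    # vertices were three separate MapReduce jobs.
    filtered = [x for x in data if x % 2 == 0]   # vertex 1: filter
    squared = [x * x for x in filtered]          # vertex 2: transform
    return sum(squared)                         # vertex 3: aggregate

print(run_dag([1, 2, 3, 4]))  # 20  (2*2 + 4*4)
```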
(5) Spark: To improve the computational efficiency of MapReduce, Berkeley developed Spark, an in-memory implementation of the MapReduce model. Berkeley also wrapped a SQL layer on top of Spark, creating a new Hive-like system called Shark. At present, however, Spark and Shark are still laboratory products. The Spark website is: http://spark-project.org/
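Why in-memory computation helps can be shown with a small simulation. Here "disk" reads are counted explicitly: a MapReduce-style engine re-reads its input on every iteration, while a Spark-style engine caches the dataset once and iterates over the cached copy. The workload (3 iterations over 4 records) is made up for illustration; this is not the Spark API.

```python
# Simulated comparison of re-read-per-iteration vs. cache-once iteration.
reads_from_disk = 0

def load_input():
    # Stand-in for reading the input dataset from HDFS.
    global reads_from_disk
    reads_from_disk += 1
    return [1, 2, 3, 4]

# MapReduce style: each iteration of an iterative algorithm reloads input.
for _ in range(3):
    data = load_input()
mapreduce_reads = reads_from_disk

# Spark style: load once, keep the dataset in memory, iterate over it.
reads_from_disk = 0
cached = load_input()
for _ in range(3):
    data = cached
spark_reads = reads_from_disk

print(mapreduce_reads, spark_reads)  # 3 1
```

For iterative workloads such as machine learning, where the same dataset is scanned many times, this difference dominates the running time.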
(6) Storm/S4: Hadoop has been relatively weak in real-time/stream computing (MapReduce assumes the input data is static and cannot change during processing, while stream computing assumes the data source is continuous, with data flowing into the system as it runs). Fortunately, Twitter's open-source Storm and Yahoo!'s open-source S4 fill this gap; Storm is already widely used at companies such as Taobao and MediaV.
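The batch/stream distinction can be sketched as follows: a batch job computes one result over a finite, static input, while a stream processor updates its result as each record arrives from an unbounded source. This is a conceptual sketch, not the Storm API; the event stream is invented.

```python
# Streaming word count: running counts are updated per incoming record,
# the way a Storm-style topology keeps its state continuously up to date.
from collections import Counter

def stream_word_count(stream):
    counts = Counter()
    for word in stream:
        counts[word] += 1
        # Emit a snapshot of the current counts after every record.
        yield dict(counts)

snapshots = list(stream_word_count(["a", "b", "a"]))
print(snapshots[-1])  # {'a': 2, 'b': 1}
```

A batch engine would only ever produce the final snapshot; a stream engine makes every intermediate state available with low latency.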
(7) Cloudera Impala/Apache Drill: Open-source implementations of Google's Dremel. Perhaps because demand for interactive computing is so strong, they have developed rapidly; Impala reached its 1.0 GA release only about a year after launch. These systems suit interactive-processing scenarios where the final output is small. Despite the 1.0 release, Impala still has a long way to go in fault tolerance, scalability, support for user-defined functions, and so on.
Hortonworks divides application requirements into the following scenarios.
Mapping these scenarios to the systems above:
(1) Real-time scenarios (0~5s): Storm, S4, Cloudera Impala, Apache Drill, etc.
(2) Interactive scenarios (5s~1m): These scenarios usually require SQL support; feasible systems include Cloudera Impala, Apache Drill, Shark, etc.
(3) Non-interactive scenarios (1m~1h): Jobs usually run for a long time and process large volumes of data, with high requirements for fault tolerance and scalability; feasible systems include MapReduce, Hive, Pig, Stinger, etc.
(4) Batch processing scenarios (1h+): Jobs run for a very long time and process very large volumes of data, with very high requirements for fault tolerance and scalability; feasible systems include MapReduce, Hive, Pig, Stinger, etc.
Original article link: http://dongxicheng.org/mapreduce-nextgen/rethinking-hadoop-from-problems-solved