Before YARN, Hadoop was suited only to offline processing. To meet the demand for real-time results, various organizations developed their own real-time processing frameworks. This article looks at two SQL-on-Hadoop projects, each backed by a well-known Hadoop solution provider: Impala and Stinger.
Stinger: Stinger first appeared in Hive 0.11 (HDP 1.3). The initiative has three phases, of which phases 1 and 2 have been delivered. According to Hortonworks, phase 1 made various types of analysis 35 to 45 times faster, and phase 2 delivered a further 5- to 10-fold performance improvement.
Impala: Released at the end of 2012 and developed by Cloudera, a well-known Hadoop solution provider, Impala is an open source implementation of Google's Dremel and one of the most popular real-time query frameworks at the moment. Cloudera's goal in developing Impala was clear: to speed up Hive SQL queries. The 1.0 beta release was claimed to be 3 to 90 times faster than Hive, and after Impala's official release Cloudera said that its processing speed with concurrently executing clients surpassed even that of Hive serving a single client.
Cluster resource management tools such as Mesos and YARN have brought Stinger and Impala into direct competition, and Cloudera has published a benchmark based on TPC-DS.
Impala vs. Stinger
The versions compared were Impala 1.1.1 and Hive 0.12 (with Stinger integrated). Hive ran on an ORCFile dataset, while Impala stored the same data in Parquet. To give Hive its best performance, Cloudera rewrote the TPC-DS queries using SQL-92 style joins, manually optimized the join order, and specified the partition fields. The same optimizations were applied for Impala.
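The kind of query rewrite described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Hive and Impala, the tables are a tiny hypothetical subset loosely modeled on TPC-DS naming, and the filter on ss_sold_date_sk merely imitates an explicit predicate on what would be a partition column (SQLite itself has no partitions). It is not the query text Cloudera actually ran.

```python
import sqlite3

# In-memory database with a tiny, hypothetical TPC-DS-like schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE item (i_item_sk INTEGER PRIMARY KEY, i_brand TEXT);
CREATE TABLE store_sales (
    ss_sold_date_sk INTEGER,
    ss_item_sk INTEGER,
    ss_ext_sales_price REAL
);
INSERT INTO date_dim VALUES (1, 2001), (2, 2001), (3, 2002);
INSERT INTO item VALUES (10, 'brand_a'), (11, 'brand_b');
INSERT INTO store_sales VALUES
    (1, 10, 5.0), (2, 10, 7.0), (2, 11, 3.0), (3, 11, 9.0);
""")

# Original style: comma-separated tables, join conditions in WHERE.
implicit_join = """
SELECT i.i_brand, SUM(ss.ss_ext_sales_price) AS revenue
FROM store_sales ss, date_dim d, item i
WHERE ss.ss_sold_date_sk = d.d_date_sk
  AND ss.ss_item_sk = i.i_item_sk
  AND d.d_year = 2001
GROUP BY i.i_brand
ORDER BY i.i_brand
"""

# Rewritten style: explicit SQL-92 JOIN ... ON with a hand-chosen join
# order, plus an explicit range on the (assumed) partition column.
explicit_join = """
SELECT i.i_brand, SUM(ss.ss_ext_sales_price) AS revenue
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE d.d_year = 2001
  AND ss.ss_sold_date_sk BETWEEN 1 AND 2
GROUP BY i.i_brand
ORDER BY i.i_brand
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()

# Both forms should return the same rows; only the query shape differs.
print(implicit_rows)
print(explicit_rows)
```

The point of the sketch is that such a rewrite changes how the query is expressed, not what it returns, so applying it to both engines keeps the comparison like for like.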
The dataset was 3 TB, on a typical configuration of five Hadoop DataNodes. The queries were varied: they included a range of standard joins and aggregations, as well as complex multi-level aggregations and subqueries.
The result of the test was that Impala ran 6 to 69 times faster than Hive across the query types tested.
Closing Thoughts
At this point you may have a question: benchmarks claiming to be 10 or more times faster than Hive are everywhere, and these tools are even benchmarked against one another, as in the following two examples:
HAWQ compared with Hive and Impala (see the article for more details)
Shark compared with Hive and Impala (see the blog post for more details)
So what do these comparisons mean? They are best understood as a product of the opportunities and challenges that arrived with YARN. The opportunity: the new resource manager lets different types of processing frameworks run on the same Hadoop cluster, and in such a booming ecosystem the benefit of winning a larger share is self-evident. The challenge: YARN's new features let more naturally integrated tools, such as Stinger, improve their performance, so the tools at an integration disadvantage have to speak up for themselves. That is why these comparisons against Hive keep appearing, when they are really comparisons against Stinger. Although the Hadoop 2.0 ecosystem has become more prosperous, the competitive pressure within it is just as evident.