Original address: http://www.parallellabs.com/2013/08/25/impala-big-data-analytics/
By Shing and Chen Guanxing
Big data processing is an important topic in cloud computing. Ever since Google proposed the MapReduce distributed processing framework, open-source software represented by Hadoop has been valued and favored by more and more companies. Systems built on Hadoop, such as HBase, Hive, and Pig, have mushroomed into a Hadoop ecosystem. Today we are going to talk about a new member of that ecosystem: Impala.

Impala Architecture Analysis
Impala is a new query system led by Cloudera. It provides SQL semantics for querying petabytes of big data stored in Hadoop's HDFS and HBase. Although the existing Hive system also provides SQL semantics, it uses the MapReduce engine and therefore remains a batch system, making interactive queries difficult. Impala's biggest selling point is its speed. So how does Impala achieve fast queries over big data? Before answering this question, we need to introduce Google's Dremel system, because Impala's initial design referenced Dremel.
Dremel is Google's interactive data analysis system. It is built on systems such as Google's GFS (Google File System) and supports Google's data analysis service BigQuery, among many others. Dremel has two main technical highlights. The first is columnar storage for nested data. Column storage is familiar from relational databases, where it reduces the amount of data processed during a query and effectively improves query efficiency; Dremel's column storage differs in that it applies not to traditional relational data but to nested structures. Dremel can transform records with nested structure into columnar form, read only the columns a query needs, filter them by the query conditions, and then reassemble the columns into nested records for output; the forward and reverse conversions between records and columns are implemented with an efficient state machine. The second highlight is a multi-level query tree, borrowed from the design of distributed search engines, which lets a task execute in parallel and aggregate results across thousands of nodes: the root node of the query tree receives a query and distributes it to the next layer of nodes, the bottom-level nodes perform the actual data reads and query execution, and results are then returned and aggregated up the tree. For more details on Dremel's implementation, readers can refer to the resources listed at the end of this article.
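The record-to-column conversion described above can be illustrated with a toy sketch. Note this is only a simplified illustration of the idea, not Dremel's actual encoding: it omits the repetition and definition levels that make the real format work for repeated and optional fields.

```python
# Toy illustration of Dremel-style "record shredding": nested records are
# split into per-field columns, and a query reads only the columns it needs.
# Real Dremel additionally stores repetition/definition levels per value.

def shred(records, paths):
    """Split nested dict records into flat columns, one list per field path."""
    columns = {p: [] for p in paths}
    for rec in records:
        for path in paths:
            value = rec
            for key in path.split('.'):
                value = value.get(key) if isinstance(value, dict) else None
            columns[path].append(value)
    return columns

def assemble(columns, row_idx):
    """Rebuild one nested record from the columns (the reverse conversion)."""
    rec = {}
    for path, col in columns.items():
        node = rec
        keys = path.split('.')
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = col[row_idx]
    return rec

records = [
    {"name": "doc1", "links": {"forward": 20}},
    {"name": "doc2", "links": {"forward": 40}},
]
cols = shred(records, ["name", "links.forward"])
# A query touching only links.forward scans one flat list, not whole records:
total = sum(v for v in cols["links.forward"] if v is not None)
print(total)              # 60
print(assemble(cols, 0))  # {'name': 'doc1', 'links': {'forward': 20}}
```

The point of the sketch is the access pattern: a scan or aggregation over one field reads a single contiguous column rather than deserializing every full record.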
Impala is, in effect, Hadoop's Dremel. The columnar storage format Impala uses is Parquet, which implements the column storage described in the Dremel paper and will in the future support Hive and add features such as dictionary encoding and run-length encoding. Impala's system architecture is shown in Figure 1. Impala reuses Hive's SQL interface, including operations such as SELECT, INSERT, and JOIN, but currently implements only a subset of Hive's SQL semantics (for example, UDFs are not yet supported), and table metadata is stored in Hive's Metastore. The Statestore is a sub-service of Impala that monitors the health of each node in the cluster, providing node registration, error detection, and so on. Impala runs a background service, impalad, on each node; impalad responds to external requests and performs the actual query processing. Impalad mainly contains three modules: the Query Planner, the Query Coordinator, and the Query Exec Engine. The Query Planner receives queries from SQL applications and ODBC and transforms each query into many subqueries; the Query Coordinator distributes these subqueries to the nodes; and the Query Exec Engine on each node executes its subqueries and returns the results, which are finally aggregated and returned to the user.
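The two encodings mentioned above are easy to illustrate. The following is a minimal sketch of the ideas only; Parquet's actual encodings (hybrid RLE/bit-packing, dictionary pages) are more involved than this.

```python
# Toy versions of two column encodings mentioned in the text. Both exploit
# the fact that values within one column tend to repeat, so a column often
# compresses far better than the same data stored row by row.

def rle_encode(column):
    """Run-length encoding: collapse consecutive repeats into (value, count)."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def dict_encode(column):
    """Dictionary encoding: store each distinct value once, plus small indices."""
    dictionary, indices, seen = [], [], {}
    for v in column:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

col = ["us", "us", "us", "cn", "cn", "us"]
print(rle_encode(col))   # [('us', 3), ('cn', 2), ('us', 1)]
print(dict_encode(col))  # (['us', 'cn'], [0, 0, 0, 1, 1, 0])
```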
Figure 1. Impala's system architecture diagram 
In Cloudera's tests, Impala's query efficiency improved by an order of magnitude compared to Hive. From a technical point of view, Impala performs well for the following main reasons:
1) Impala does not need to write intermediate results to disk, eliminating a large amount of I/O overhead.
2) It eliminates MapReduce job startup overhead. MapReduce task startup is very slow (by default, each heartbeat interval is 3 seconds), whereas Impala serves queries directly through its resident service processes, which is much faster.
3) Impala completely abandons the MapReduce paradigm, which is not well suited to SQL queries, in favor of an MPP parallel-database design in the spirit of Dremel. Starting from scratch allows more query optimizations and eliminates unnecessary shuffle, sort, and other overheads.
4) It uses LLVM to compile query-specific code at run time, avoiding the overhead a general-purpose execution engine pays for its generality.
5) It is implemented in C++ and includes many hardware-specific optimizations, such as the use of SSE instructions.
6) It uses an I/O scheduling mechanism that exploits data locality, dispatching reads to the machines holding the data whenever possible, which reduces network overhead.
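Point 4 is worth a small illustration. Impala uses LLVM to emit native code specialized to each query; the sketch below only mimics the idea at the Python level, generating a predicate function once per query instead of re-interpreting the predicate for every row. The function and query shapes here are hypothetical, chosen just to show the technique.

```python
# Toy analogue of runtime code generation: instead of an interpreter that
# re-dispatches on the predicate's structure for every row, build a function
# specialized for one predicate (e.g. age > 30) once, then run it per row.
# Impala does this with LLVM at the native-code level.

def compile_predicate(column, op, literal):
    """Generate a row filter specialized for a single comparison predicate."""
    src = f"def pred(row):\n    return row[{column!r}] {op} {literal!r}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["pred"]

rows = [{"age": 25}, {"age": 41}, {"age": 33}]
pred = compile_predicate("age", ">", 30)   # compiled once per query
print([r["age"] for r in rows if pred(r)])  # [41, 33]
```

The per-row work is now a single specialized comparison; in Impala the same idea removes virtual dispatch and type checks from the inner loop of query execution.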
Although Impala's design references Dremel, Impala has features of its own: for example, it supports not only the Parquet format but also text, SequenceFile, and other file formats commonly used in Hadoop. Another key point is that Impala is open source, and given Cloudera's leadership in the Hadoop world, its ecosystem is likely to grow rapidly. It is foreseeable that in the near future, Impala may matter as much to the big data processing field as Hadoop and Hive before it. Cloudera itself has said it expects Impala eventually to replace Hive entirely. Of course, it takes time for users to migrate from Hive to Impala, and Impala has only just released version 1.0; although it is said to run stably in production environments, there is still much room for improvement. It should be noted that Impala is not intended to replace the existing MapReduce system, but rather to be a powerful complement to it: in general, Impala is suitable for interactive queries whose output is moderate or relatively small, while for batch tasks over very large data volumes, MapReduce remains the better choice. One more piece of trivia: Marcel Kornacker, the Cloudera architect in charge of Impala, was previously responsible for developing the query engine of Google's F1 system, which shows that Google does contribute to big data's popularity in more ways than its papers.

Comparison of Impala with Shark, Drill, etc.
In addition, UC Berkeley's AMPLab has developed a big data analysis system called Shark. The June issue of Programmer this year carries an article specifically analyzing Shark and the related Spark system; interested readers can refer to it. In the long run, Shark aims to be an integrated data processing system supporting both large-scale SQL queries and advanced data analysis tasks. From an implementation standpoint, Shark builds a good fault-tolerance mechanism on the lineage of its Scala operators, so both long and short tasks can recover quickly from the last "snapshot point". By contrast, Impala lacks a sufficiently robust fault-tolerance mechanism: once a task running on it fails, it must start over, and this design is bound to cost something. Shark is also designed with memory as the first-class storage medium, which gives it some advantage in processing speed. In fact, AMPLab recently ran a comparative experiment covering Hive, Impala, Shark, and Amazon's commercial MPP database Redshift, comparing them on three types of tasks: scan queries, aggregation queries, and join queries. Figure 2 shows the aggregation-query performance comparison from the AMPLab report. As the figure shows, the commercial Redshift performs best; Impala and Shark each win in some cases, and both outperform Hive. For more experimental results, readers can refer to the resources listed at the end of this article.
Figure 2. Aggregation query performance comparison among Redshift, Impala, Shark, and Hive
In the author's humble opinion, technology is often not the most critical factor in big data analysis projects. For example, both MapReduce and HDFS in Hadoop originated at Google and are not especially original. In practice, an open-source project's ecosystem, community, development speed, and so on often largely determine how systems such as Impala and Shark develop. That is precisely why Cloudera decided to open-source Impala in the first place, hoping to use the power of the open-source community to promote the product; Shark was open source from the start, not to mention Apache Drill. In the final analysis, the question is whose ecosystem is stronger. Technically, a momentary lead is not enough to guarantee a project's ultimate success. While it is hard to say which product will become the de facto standard, the one thing we can be sure of is that big data analytics will keep advancing as new technologies evolve, which is always good news for users. For example, readers who follow the evolution of next-generation Hadoop (YARN) will notice that YARN already supports computational paradigms beyond MapReduce (such as Shark and Impala), so Hadoop will likely exist as an all-embracing platform in the future, offering a variety of data processing technologies, from second-level interactive queries to big data batch processing, to meet diverse user needs.

Future Prospects
In fact, besides open-source solutions such as Impala, Shark, and Drill, traditional vendors such as Oracle and EMC are not sitting idly by while open-source software erodes their market. EMC, for example, has launched the HAWQ system, claiming performance more than 10 times faster than Impala, and the aforementioned Amazon Redshift also offers better performance than Impala. Although open-source software has very strong momentum because of its cost advantage, traditional database vendors will still try to compete with products that are stronger on performance, stability, maintenance services, and other metrics, while also participating in open-source communities, leveraging open-source software to enrich their product lines and enhance their competitiveness, and meeting certain customer needs through higher value-added services. After all, these vendors have accumulated deep technology and experience in traditional fields such as parallel databases. There are even NewSQL startups such as NuoDB that claim to support both ACID and scalability. On the whole, big data analysis technology will become more mature, cheaper, and easier to use, and users will accordingly find it easier to mine valuable business information from their big data.

Resources
Impala key issues list: http://yuntai.1kapp.com/?p=1089
Hive principles and shortcomings: http://www.ccplat.com/?p=1035
Impala/Hive status analysis and prospects: http://yanbohappy.sinaapp.com/?p=220
What's next for Cloudera Impala: http://blog.cloudera.com/blog/2012/12/whats-next-for-cloudera-impala/
MapReduce: A major step backwards: http://t.cn/zQLFnWs
Google Dremel principles: how to analyze 1 PB in 3 seconds: http://www.yankay.com/google-dremel-rationale/
Isn't Cloudera Impala doing the same job as the Apache Drill incubator project? http://www.quora.com/Cloudera-Impala/Isnt-Cloudera-Impala-doing-the-same-job-as-Apache-Drill-incubator-project
Big Data Benchmark: https://amplab.cs.berkeley.edu/benchmark/
How does Impala compare to Shark: http://www.quora.com/apache-hadoop/how-does-impala-compare-to-shark
EMC explains HAWQ SQL performance: left hand Hive, right hand Impala: http://stor-age.zdnet.com.cn/stor-age/2013/0308/2147607.shtml

About the Author
Shing holds a Ph.D. in computer science from Tsinghua University; his main research interests include new applications of big data processing and cloud computing, and the design and optimization of distributed systems in new scenarios.
Chen Guanxing is a researcher at IBM China Research Lab, focusing on the co-design of hardware and software in large-scale distributed systems. His personal blog is Parallel Labs (www.parallellabs.com), and his Sina Weibo is @ Guan Cheng.
Impala: A New Generation of Open-Source Big Data Analytics Engine (reproduced)