Spark SQL is essentially an MPP-style engine built on a DAG execution model, whereas the core of Kylin is the cube (multidimensional cube). The difference between MPP and cube precomputation is restated below:
> The basic idea of MPP [1] is to add machines for parallel computation and thereby speed up queries. For example, scanning 800 million records might take a single machine 1 hour, but with 100 machines working in parallel it takes less than a minute. Columnar storage and some indexes can make queries return even faster. Note that the amount of online computation is not reduced: all 800 million records are still scanned once. The query is fast only because more machines take part.
> MOLAP cube [2][3] is a precomputation technique. The basic idea is to build a multidimensional index over the data in advance, so that a query scans only the index and never touches the raw data. A 3-dimensional index over 800 million records may hold only tens of thousands of records. Because the scale is so much smaller, the online computation shrinks accordingly and queries can be very fast. The index tables can also use columnar storage, parallel scanning, and other techniques common in MPP. However, the multidimensional index has to be precomputed for every combination of dimensions, so building it offline takes a large amount of computation and time, and the final index also occupies more disk space.
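To make the contrast concrete, here is a minimal Python sketch of the two ideas on a toy table. It is only an illustration, not how Spark SQL or Kylin are actually implemented: the table, the column names, and the helper functions `mpp_sum`, `build_cube`, and `cube_sum` are all made up for this example. The MPP-style function still reads every row and merely splits the scan into partitions, while the cube-style function answers the same query from aggregates computed ahead of time.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact table: (region, product, year, sales).
ROWS = [
    ("east", "phone", 2015, 10),
    ("east", "phone", 2016, 20),
    ("east", "laptop", 2015, 5),
    ("west", "phone", 2015, 7),
    ("west", "laptop", 2016, 3),
    ("west", "laptop", 2016, 4),
]
DIMS = ("region", "product", "year")


def mpp_sum(rows, filters, workers=3):
    """MPP style: every row is still scanned; the scan is only split up."""
    partitions = [rows[i::workers] for i in range(workers)]
    partials = []
    for part in partitions:  # each iteration stands in for one machine
        partials.append(sum(
            r[3] for r in part
            if all(r[DIMS.index(d)] == v for d, v in filters.items())
        ))
    return sum(partials)  # merge the partial results


def build_cube(rows):
    """Cube style: pre-aggregate sales for every combination of dimensions."""
    cube = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):
            cuboid = defaultdict(int)
            for r in rows:
                key = tuple(r[DIMS.index(d)] for d in dims)
                cuboid[key] += r[3]
            cube[dims] = dict(cuboid)
    return cube


def cube_sum(cube, filters):
    """Answer the query from the matching cuboid without reading raw rows."""
    dims = tuple(d for d in DIMS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(ROWS)  # the expensive offline step
query = {"region": "west", "product": "laptop"}
print(mpp_sum(ROWS, query), cube_sum(cube, query))
print(len(cube))  # number of cuboids built for 3 dimensions
```

Both functions return the same total for the query. The difference is where the work happens: `mpp_sum` pays for a full scan on every query, while `build_cube` pays once, offline, for all 2^3 = 8 dimension combinations, which is the extra build time and storage the quoted text describes.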
Besides the difference in whether data is preprocessed, Spark SQL and Kylin also favor different dataset sizes. If the data can largely fit in memory, Spark's in-memory cache gives Spark SQL good performance. For very large datasets, however, Spark cannot avoid frequent disk reads and writes, and performance can drop dramatically. By contrast, Kylin's cube precomputation greatly reduces the amount of data handled online, which makes it more advantageous for very large datasets.
http://wenda.chinahadoop.cn/question/867
What are the differences and advantages of Kylin compared to Spark SQL?