Apache Tajo is a Hadoop-based relational, distributed data warehouse system. From the start, Tajo was designed to deliver low latency, scalability, ad hoc queries, and aggregation through advanced database techniques, as a data warehouse that makes up for Hadoop's weaknesses in real-time and relational processing. Tajo also supports standard SQL, so you can operate on it through SQL. Tajo uses HDFS as its main storage layer and has its own query engine (judging by the code structure, much of it was written from scratch). Tajo can therefore directly control distributed execution (such as queries) and data flow. This gives it many query control strategies and room to optimize queries.
Features:
1. Fast, low-latency queries, supporting various SQL operations such as conditional queries, GROUP BY, sorting, and joins
2. ETL support
3. Support for various data formats, such as CSV, RCFile, RowFile (row-based file storage), and Trevni
4. Its own command-line interface, so you can operate Tajo directly with SQL
5. A Java client for operating Tajo programmatically
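To make the operation types in item 1 concrete, here is a small sketch that combines a join, a conditional filter, GROUP BY, and a sort in one query. It runs against Python's built-in sqlite3 purely as a stand-in: this is not Tajo, and the `customers` and `orders` tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Tajo here; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 20.0), (12, 3, 70.0), (13, 2, 5.0);
""")

# One query combining a join, a conditional filter, GROUP BY, and a sort.
QUERY = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    WHERE o.amount >= 10
    GROUP BY c.region
    ORDER BY total DESC
"""
rows = conn.execute(QUERY).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)
```

The SQL text itself is ordinary standard SQL; only the connection code is specific to the stand-in database.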
1. Background
Currently, there are many SQL engines on Hadoop. Broadly, they fall into two types:
(1) Translating SQL into MapReduce. A typical example is Apache Hive, which offers good scalability and fault tolerance but low performance. To make up for the shortcomings of SQL on MapReduce, Google proposed Tenzing (see reference [3]). Unlike Hive, Tenzing draws on the strengths of both MapReduce and databases. First, it optimizes traditional MapReduce (for example, map tasks can skip writing to disk and reduce tasks can skip sorting) to improve performance; a major advantage of building on MapReduce is that Tenzing gets good scalability and fault tolerance. The Tenzing paper describes it as follows:
"Thanks to MapReduce, Tenzing scales to thousands of cores and petabytes of data on cheap, unreliable hardware. We worked closely with the MapReduce team to implement and take advantage of MapReduce optimizations."
Second, it borrows from traditional databases and embeds a cost-based optimizer to optimize SQL query plans.
(2) Borrowing from distributed databases. Typical examples are Google Dremel, Apache Drill, and Cloudera Impala, which offer high performance (compared with Hive and similar systems) but poor scalability (in both cluster scale-out and the range of SQL supported) and poor fault tolerance. Google describes Dremel's intended use in the Dremel paper (see reference [4]) as follows:
"Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations."
In other words, Dremel is not meant to replace MR but to make up for its shortcomings. It is typically used to analyze data produced by MR (when the data volume is small and the demands on SQL expressiveness and framework fault tolerance are low).
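The first approach, translating SQL into MapReduce, can be sketched in a few lines. The single-process example below shows how a query such as `SELECT dept, SUM(salary) FROM emp WHERE salary > 1000 GROUP BY dept` breaks down into a map phase, a shuffle, and a reduce phase. It is only an illustration of the idea, not Hive's or Tenzing's actual translation; the `emp` table, its rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an `emp` table: (name, dept, salary).
EMP = [
    ("ann", "eng", 3000),
    ("bob", "eng", 900),
    ("cho", "ops", 2000),
    ("dee", "ops", 1500),
    ("eve", "eng", 2500),
]

def map_phase(row):
    """WHERE + projection: emit (group key, value) for rows passing the filter."""
    _name, dept, salary = row
    if salary > 1000:
        yield dept, salary

def shuffle(pairs):
    """Group mapper output by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())  # classic MapReduce also sorts by key

def reduce_phase(key, values):
    """GROUP BY aggregate: one output row per key."""
    return key, sum(values)

def run_query(rows):
    mapped = (pair for row in rows for pair in map_phase(row))
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]

result = run_query(EMP)
print(result)
```

Note that the sort inside `shuffle` is not needed to compute a SUM per group. That is the kind of step the Tenzing optimizations mentioned above (reduce tasks skipping the sort) are meant to avoid.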
Apache Tajo (for details, see references [1] and [2], the Tajo slides, and the Tajo paper) is an open-source, YARN-based distributed data warehouse from the database laboratory at Korea University, and it is currently a second-level Apache project. Tajo's design philosophy is similar to Tenzing's: it draws on the strengths of both MapReduce and databases, so it keeps Hive's scalability and fault tolerance while delivering much higher performance than Hive.
2. Tajo design architecture
Tajo adopts a master-worker architecture with the following components:
(1) TajoMaster: provides the query service to clients and manages each QueryMaster.
(2) QueryMaster: parses, optimizes, and executes a query. It works with multiple TaskRunner workers to compute the query.
In this architecture, Tajo builds its SQL engine with traditional database techniques: SQL parsing, query plan generation, query plan optimization, and query execution. Unlike a traditional database, however, when Tajo finally executes the query plan it borrows the MapReduce design and converts the plan into a series of tasks. Executing the query plan then amounts to executing these tasks, each of which is a unit of computation. Like map tasks and reduce tasks, a task can be re-executed and reports its progress. As a result, Tajo can directly reuse the fault tolerance and speculative execution mechanisms of MapReduce. In addition, Tajo uses YARN for resource management.
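The task-based execution described above can be sketched as follows. This is a toy, single-process model, not Tajo's implementation: the `Task` class, the `execute_plan` function, and the failure-injection set are all hypothetical, and speculative execution is left out. It only shows why re-executable tasks make fault tolerance simple: a failed attempt is just run again, and progress is the fraction of tasks completed.

```python
class Task:
    """One unit of work from a query plan (hypothetical, not Tajo's real class)."""

    def __init__(self, task_id, rows):
        self.task_id = task_id
        self.rows = rows

    def run(self, fail):
        # `fail` simulates a lost worker; a real worker would fail on its own.
        if fail:
            raise RuntimeError(f"worker lost while running task {self.task_id}")
        return sum(self.rows)  # this task's partial result


def execute_plan(tasks, failures, max_attempts=3):
    """Run every task, retrying failed attempts.

    `failures` is a set of (task_id, attempt) pairs that should fail, used only
    to inject faults. Returns (total, progress log, attempts used per task).
    """
    partials, progress, attempts = {}, [], {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            attempts[task.task_id] = attempt
            try:
                should_fail = (task.task_id, attempt) in failures
                partials[task.task_id] = task.run(fail=should_fail)
                break
            except RuntimeError:
                continue  # tasks have no side effects, so re-running is safe
        else:
            raise RuntimeError(f"task {task.task_id} failed {max_attempts} times")
        progress.append(len(partials) / len(tasks))  # progress report
    return sum(partials.values()), progress, attempts


tasks = [Task(0, [1, 2, 3]), Task(1, [4, 5]), Task(2, [6])]
failures = {(1, 1)}  # the first attempt of task 1 fails
total, progress, attempts = execute_plan(tasks, failures)
print(total, progress, attempts)
```

The query still produces the correct total even though one attempt failed, because only the failed task is repeated rather than the whole query.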
I introduced Tez in my previous post, "Apache Tez: A computing framework running on YARN that supports DAG jobs", where I discussed Hive + Tez; Hive optimized with Tez is a very promising project. The Tajo project has also discussed the possibility of using Tez as its underlying computing framework in the future:
"Besides, Tez has some overlapping functions with Tajo. However, Tez is in the pre-alpha stage and may be a prototype. When Tez becomes feasible, Tajo could use Tez as an underlying framework according to the applicability. However, Tajo will still use its row/native columnar execution engine and its optimizer. Tajo may be potentially the first application of Tez."
3. Summary
The systems that may replace Hive are ones like Tenzing and Tajo, not ones like Dremel or Impala. The latter fall far behind Hive, Tenzing, and Tajo in scalability, SQL expressiveness (mainly because of their nested storage model), and fault tolerance. Dremel is usually used in combination with MR; its design goal is not to replace MR but to make computation more efficient in certain scenarios. In addition, Dremel and Impala are computing systems that need computing resources, yet they are not integrated with YARN, the resource management system now under active development. This means that if you use Impala, you have to build a separate, dedicated cluster and cannot share resources. Even once Impala matures, if Hive's would-be replacements (such as Tajo) are not mature, most companies will keep using Hive for big data processing for a long time (this is where Hortonworks' Hive + Tez comes in), while Impala will only be used to further process Hive's output or for applications suited to particular scenarios (after all, these systems have limited SQL expressiveness and poor fault tolerance and scalability).
As for Tajo, activity on the project is very low. Only a few people from the database lab at Korea University are developing it, and it is still a long way from being production-ready. But it has taken the first step: it has become an Apache project, which allows more people to participate.
From: http://dongxicheng.org/mapreduce-nextgen/apache-tajo/