Summary: Differences between Hive, Hive on Spark, and Spark SQL

Source: Internet
Author: User
Keywords: cloud computing, Hive, Spark SQL, Hive on Spark
Hive on MapReduce

Hive on MapReduce execution process

Detailed analysis of the execution process

Step 1: The UI (user interface) calls the ExecuteQuery interface and sends an HQL query to the Driver.

Step 2: The Driver creates a session handle for the query and sends the statement to the Compiler, waiting for it to parse the query and generate an execution plan.

Steps 3 and 4: The Compiler fetches the relevant metadata from the Metastore; the metadata is used to type-check the expressions in the query tree and to prune partitions based on the query predicates, and the plan is built.

Step 6 (6.1, 6.2, 6.3): The plan the Compiler generates is a DAG of stages; each stage may involve a map/reduce job, a metadata operation, or an HDFS file operation. The Execution Engine submits every stage of the DAG to the corresponding component for execution.

Steps 7, 8, and 9: Within each task (mapper/reducer), query results are stored in HDFS as temporary files. The Execution Engine reads these temporary files directly from HDFS and returns their contents through the Driver's Fetch API.

Hive on MapReduce characteristics

In a relational database, a table's load schema is enforced when data is loaded (the load schema here refers to the file format in which the data is stored in the database). If the data being loaded does not conform to the schema, the relational database refuses to load it. This is called "schema on write": the data is checked against the schema at load time. Hive, by contrast, does not check the data while loading it, nor does it change the loaded data files; format checks are performed at query time instead. This is called "schema on read."

In practical applications, schema on write builds column indexes and compresses the data at load time, so loading is very slow, but once the data is loaded, querying it is fast. When our data is unstructured and the storage schema is unknown, however, relational data manipulation becomes much more troublesome, and that is where Hive can show its advantages.

An important feature of relational databases is that a row, or certain rows, of data can be updated or deleted. Hive does not support operations on specific rows; its data operations only support overwriting the original data or appending to it (a sketch appears at the end of this section). Hive also does not support transactions or indexes. Updates, transactions, and indexes are all relational-database features that Hive does not support, and does not intend to support, because Hive was designed for massive data processing: full-table scans are the norm, and operating on a few specific rows is very inefficient. For an update, Hive queries the original table, transforms the data, and finally stores it in a new table, which is very different from an update in a traditional database.

Hive can also make its own contribution to real-time querying on Hadoop through its integration with HBase. HBase supports fast lookups, but it does not support SQL-like statements; Hive can provide a SQL-parsing shell for HBase, so you can manipulate the HBase database with SQL-like statements.

Hive can be considered a MapReduce wrapper. The point of Hive is to turn SQL, which is easy for business analysts to write, into complex MapReduce programs, greatly lowering the barrier to learning Hadoop so that more users can use Hadoop for data mining and analysis.
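Here is a minimal sketch of that append-or-overwrite write model, expressed through the PySpark HiveContext used later in this article; the tables user_stats and staging and their columns are hypothetical, and a running SparkContext is assumed.

# A minimal sketch of Hive's append-or-overwrite write model.
# Table names (user_stats, staging) and columns (uid, cnt) are hypothetical.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-write-model")
hiveCtx = HiveContext(sc)

# Append rows to the existing data; Hive has no row-level UPDATE or DELETE.
hiveCtx.sql("INSERT INTO TABLE user_stats SELECT uid, cnt FROM staging")

# "Update" by rewriting the whole table from a query over the old data.
hiveCtx.sql("INSERT OVERWRITE TABLE user_stats "
            "SELECT uid, cnt FROM staging WHERE cnt > 0")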
Comparison: SQL vs. HiveQL

Feature              SQL                                 HiveQL
ANSI SQL             supported                           not fully supported
Updates              UPDATE \ INSERT \ DELETE            INSERT OVERWRITE \ INTO TABLE
Transactions         supported                           not supported
Schema               schema on write                     schema on read
Data storage         block devices, local file system    HDFS
Latency              low                                 high
Multi-table insert   not supported                       supported
Subqueries           fully supported                     only in the FROM clause
Views                updatable                           read-only
Scalability          low                                 high
Data scale           small                               large
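One row of the table deserves a concrete illustration: in HiveQL of this era, a subquery may only appear in the FROM clause. A hedged sketch, with a hypothetical logs table:

# The subquery restriction from the table above; `logs` and its columns
# are hypothetical placeholders.
query = """
    SELECT t.uid, t.cnt
    FROM (SELECT uid, COUNT(*) AS cnt
          FROM logs
          GROUP BY uid) t     -- the only place HiveQL allows a subquery
    WHERE t.cnt > 10
"""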

Spark SQL

Spark SQL introduction

Spark SQL's predecessor is Shark. To give technicians who were familiar with RDBMSs but did not understand MapReduce a quick-start tool, Hive came into being; it was at the time the only SQL-on-Hadoop tool running on Hadoop. However, MapReduce computation writes a large amount of intermediate data to disk, and the heavy I/O reduces efficiency. To improve the efficiency of SQL-on-Hadoop, Shark was born. But because Shark relied too heavily on Hive (for example, it used Hive's syntax parser, query optimizer, and so on), in 2014 the Spark team stopped Shark's development and put all of its resources into the Spark SQL project.

Spark SQL, as a member of the Spark ecosystem, kept developing; it is no longer limited to Hive, but merely compatible with Hive. Hive on Spark is a Hive development plan that makes Spark one of Hive's underlying engines; that is to say, Hive is no longer restricted to a single engine and can be backed by MapReduce, Tez, Spark, and so on.

Spark SQL's two components

SQLContext: Spark SQL encapsulates all of Spark's relational functionality in SQLContext. You can create a SQLContext from an existing SparkContext, as in the earlier examples.

DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is based on the data-frame concept in the R language and is similar to a table in a relational database. You can convert a DataFrame to an RDD by calling its rdd method, which returns the DataFrame's contents as an RDD of Row objects. A DataFrame can be created from the following data sources: an existing RDD, structured data files, JSON datasets, Hive tables, and external databases.
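A minimal sketch of the two components together, assuming an existing SparkContext named sc and a JSON file path inputFile (both placeholders; API shown is Spark 1.x style, matching the example later in this article):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)              # wraps Spark's relational functionality
df = sqlCtx.jsonFile(inputFile)      # DataFrame from a structured data file
rowRdd = df.rdd                      # DataFrame -> RDD of Row objects

rdd = sc.parallelize([("a", 1), ("b", 2)])
df2 = sqlCtx.createDataFrame(rdd, ["key", "value"])  # DataFrame from an existing RDD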

Spark SQL execution architecture

Like a statement in a relational database, a Spark SQL statement is made up of a Projection (a1, a2, a3), a Data Source (tableA), and a Filter (condition), which correspond respectively to the Result, Data Source, and Operation of a SQL query. In other words, a SQL statement is written in the order Result -> Data Source -> Operation.
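The three parts named above, annotated on a hypothetical statement:

query = """
    SELECT a1, a2, a3   -- Projection -> Result
    FROM tableA         -- Data Source
    WHERE condition     -- Filter -> Operation
"""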

The order in which a Spark SQL statement executes

1. Parse: read the SQL statement and identify which words are keywords (such as SELECT, FROM, WHERE), which are expressions, which form the Projection, and which form the Data Source, in order to judge whether the statement is well formed;


(Projection: simply put, the collection of columns selected by the SELECT; reference: SQL projection.)

2. Bind: bind the SQL statement to the database's data dictionary (columns, tables, views, and so on); if the referenced Projection, Data Source, and so on exist, the statement can be executed;

3. Optimize: a typical database provides several execution plans, each usually carrying running statistics, and the database chooses the optimal one;

4. Execute: carry out the plan in the order Operation -> Data Source -> Result. Sometimes a result can be returned without even reading the physical table; for example, if you rerun the SQL statement you just ran, the result may come straight from the database's buffer pool. (A sketch of inspecting these phases in Spark follows this list.)
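One way to watch these phases in Spark SQL itself, reusing the hiveCtx and the registered "tweets" table from the PySpark example later in this article: explain(True) prints the parsed, analyzed, and optimized logical plans as well as the physical plan.

frequent = hiveCtx.sql("SELECT text FROM tweets WHERE retweetCount > 10")
frequent.explain(True)   # extended=True shows all plan phases, not just the physical plan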

Hive on Spark

Hive on Spark was initiated by Cloudera as an open source project, with companies such as Intel and MapR involved. It was designed to make Spark a computing engine for Hive, submitting Hive queries to the Spark cluster as Spark jobs for computation. Through this project, you can improve the performance of Hive queries and give users who have already deployed Hive or Spark a more flexible choice, thereby further increasing the coverage of both Hive and Spark.

The difference between Hive on Spark and Spark SQL

Hive on Spark is structurally roughly similar to Spark SQL: the SQL engines differ, but the compute engine in both is Spark! Knock on the blackboard! That's the point!

Let's take a look at what using Hive on Spark feels like in PySpark.

# Initialize Spark SQL
# Import Spark SQL
from pyspark.sql import HiveContext, Row
# Use this instead when the Hive dependencies cannot be introduced:
# from pyspark.sql import SQLContext, Row
# Note: the point above is the key; the two come from the same package,
# so you can see how little difference there is

hiveCtx = HiveContext(sc)            # create the SQL context environment
input = hiveCtx.jsonFile(inputFile)  # basic query example
input.registerTempTable("tweets")    # register the input SchemaRDD (renamed DataFrame after Spark 1.3)
# Select tweets based on retweetCount (the retweet count)
topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")

We can see that SQLContext and HiveContext both come from the pyspark.sql package. From this alone you can tell that, in fact, Hive on Spark and Spark SQL do not differ much.

Structurally, Hive on Spark and Spark SQL are both translation layers that translate SQL into distributed, executable Spark programs. And both of their engines are Spark.

Both Spark SQL and Hive on Spark are solutions for implementing SQL on Spark. Spark previously had the Shark project as its SQL layer, but it was later scrapped and redone as Spark SQL. Spark SQL is the project of Databricks, the official company behind Spark, and it is the SQL implementation that the Spark project itself promotes. Hive on Spark came a little later than Spark SQL. Hive originally had no good support for engines other than MapReduce, and the Hive on Tez project first let Hive support a computation structure similar to Spark's (a DAG, rather than MapReduce's fixed stages). On that basis, Cloudera led the launch of Hive on Spark. The project is supported by IBM, Intel, and MapR (but not by Databricks).

Hive on MapReduce and Spark SQL usage scenarios

Hive on MapReduce scenarios

Hive exists so that people proficient in SQL skills, but unfamiliar with MapReduce, weak at programming, or poor at Java, can easily use SQL to query, summarize, and analyze large datasets in HDFS; after all, people proficient in SQL far outnumber those proficient in Java. Hive on MapReduce suits offline, non-real-time data.

Spark SQL scenarios

Spark can run in local, standalone, YARN, and Mesos modes, and also in clouds such as EC2. In addition, Spark's data sources are very broad: it can handle various types of data from HDFS, HBase, Hive, Cassandra, and Tachyon (a sketch of these deployment modes follows below). Spark SQL suits occasions with real-time or high-speed requirements.
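Hedged sketches of the deployment modes just mentioned; hostnames and ports are placeholders, and only one setMaster line should be active at a time (master strings shown in Spark 1.x syntax):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("local[4]")  # local mode
# conf.setMaster("spark://master:7077")   # standalone cluster
# conf.setMaster("yarn-client")           # YARN
# conf.setMaster("mesos://master:5050")   # Mesos
sc = SparkContext(conf=conf)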

Hive on MapReduce and Spark SQL performance comparison

Conclusion: Spark SQL and Hive on Spark take about the same time, but both are much faster than Hive on MapReduce. Official figures say Spark is 10-100 times faster than traditional MapReduce.
