Comparison of Spark SQL and Hive on Spark


This article briefly introduces the differences and connections between Spark SQL and Hive on Spark.

I. About Spark

Brief introduction

In the Hadoop ecosystem, Spark sits at the same level as MapReduce: both primarily solve the problem of distributed computation.

Architecture

Spark's architecture consists of four main components: Driver, Master, Worker, and Executor.

Spark features

· Spark can be deployed on YARN

· Spark natively supports access to the HDFS file system (see the sketch after this list)

· Spark is written in Scala
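As a minimal illustration of the HDFS point above (the hdfs:// paths are hypothetical), the following sketch reads a text file directly from HDFS and writes a filtered copy back; SparkContext.textFile and RDD.saveAsTextFile accept HDFS URIs directly:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsAccessExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HDFS Access Example")
    val sc = new SparkContext(conf)

    // Read a text file directly from HDFS (path is hypothetical)
    val lines = sc.textFile("hdfs://namenode:8020/data/customers.txt")

    // A trivial transformation: keep only non-empty lines
    val nonEmpty = lines.filter(_.trim.nonEmpty)

    // Write the result back to HDFS (path is hypothetical)
    nonEmpty.saveAsTextFile("hdfs://namenode:8020/data/customers_cleaned")

    sc.stop()
  }
}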

Deployment modes

1. Stand-alone (local) mode: mainly used for development and testing. Features: the Driver, Master, Worker, and Executor all run in the same JVM process.

2. Pseudo-cluster mode: mainly used for development and testing. Features: the Master and Worker run in the same JVM process; the Master, Worker, and Executor all run on the same machine and cannot span multiple machines.

3. Standalone cluster (also known as native cluster mode): can be used in production environments where the cluster is not very large. Features: the Master, Worker, and Executor each run in separate JVM processes.

4. YARN cluster: the ApplicationMaster role of the YARN ecosystem is taken over by the Spark ApplicationMaster developed by the Spark project; each NodeManager in the YARN ecosystem is equivalent to a Worker in the Spark ecosystem, and the NodeManager is responsible for starting Executors (the master URLs that select these modes are illustrated in the sketch after this list).

5. Mesos cluster: not studied in detail here.
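Which of these modes a Spark application runs in is largely determined by the master URL it is given. The following is only an illustrative sketch of that idea, not a deployment guide; in practice the master is usually supplied through spark-submit rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

object DeployModeExample {
  def main(args: Array[String]): Unit = {
    // The master URL selects the deployment mode; the values below are illustrative:
    //   "local[*]"                      -> stand-alone (local) mode, everything in one JVM
    //   "spark://master:7077"           -> standalone cluster mode
    //   "yarn-client" / "yarn-cluster"  -> YARN cluster (Spark 1.x master URLs)
    //   "mesos://master:5050"           -> Mesos cluster
    val conf = new SparkConf()
      .setAppName("Deploy Mode Example")
      .setMaster("local[*]") // hard-coded here only for the sketch

    val sc = new SparkContext(conf)
    println("Default parallelism: " + sc.defaultParallelism)
    sc.stop()
  }
}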

II. About Spark SQL

Brief introduction

Spark SQL is primarily used for structured data processing and for running SQL-like queries against Spark data. With Spark SQL, you can perform ETL on data in different formats (such as JSON, Parquet, or databases) and then run specific queries on it. In general, whenever Spark adds support for a new kind of application, it introduces a new context and a corresponding RDD type; for SQL, these are SQLContext and SchemaRDD. Note: since Spark 1.3, SchemaRDD has been renamed DataFrame, but it is still essentially RDD-like, because a DataFrame can be seamlessly converted into an RDD.

Architecture

For Spark to support SQL, it must complete three major stages: parsing (parser), optimization (optimizer), and execution (execution).

The processing order is roughly as follows (a sketch for inspecting these stages follows the list):

1. The SqlParser generates a LogicalPlan tree;

2. The Analyzer and Optimizer apply their rules to the LogicalPlan tree;

3. The final optimized LogicalPlan is turned into a Spark RDD plan;

4. Finally, the resulting RDD is handed to Spark for execution.
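A minimal sketch for inspecting these stages, assuming a SQLContext and a temporary table named People as registered in the full example later in this section (and Spark 1.3+, where DataFrame.explain(extended) is available):

// Build a query against the registered temporary table
val query = sqlContext.sql("SELECT name, age FROM People WHERE age >= 13 AND age <= 19")

// explain(true) prints the parsed logical plan, the analyzed plan,
// the optimized logical plan, and the physical (Spark) plan
query.explain(true)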

The two main components of Spark SQL

1. SQLContext: Spark SQL wraps all of Spark's relational functionality in the SQLContext. You can create a SQLContext from an existing SparkContext, as shown in the example that follows.

2. DataFrame: a DataFrame is a distributed collection of data organized into named columns. The DataFrame is based on the data frame concept in the R language and is similar to a table in a relational database. A DataFrame can be converted to an RDD by calling its rdd method, which returns the contents of the DataFrame as an RDD of rows (RDD[Row]). A DataFrame can be created from the following data sources: an existing RDD, structured data files, JSON datasets, Hive tables, or external databases (see the sketch after this list).
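As a small sketch of the DataFrame point above (assuming Spark 1.4+, where sqlContext.read is available, and a hypothetical people.json file), the following creates a DataFrame from a JSON dataset and converts it back to an RDD of rows via its rdd method:

// Create a DataFrame from a JSON dataset (path is hypothetical)
val df = sqlContext.read.json("/users/urey/data/people.json")

// Inspect the inferred schema and a few rows
df.printSchema()
df.show(5)

// Convert the DataFrame to an RDD of rows
val rowRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = df.rdd
println("Number of rows: " + rowRdd.count())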

Usage example

Below is a simple Scala program that loads customer data from a text file, creates a DataFrame from the dataset, and then runs DataFrame functions to perform a specific data selection query.

The contents of the text file Customers.txt are as follows:

tom,12

mike,13

Tony,34

lili,8

david,21

nike,18

bush,29

candy,42

The Scala code:

import org.apache.spark._

object Hello {

  // Case class that represents a customer
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("SparkSQL Demo")
    val sc = new SparkContext(conf)

    // First create the SQLContext object from the existing Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Import that allows implicit conversion of an RDD to a DataFrame
    import sqlContext.implicits._

    // Create a DataFrame of Person objects from the text file
    val people = sc.textFile("/users/urey/data/input2.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

    // Register the DataFrame as a temporary table
    people.registerTempTable("People")

    // SQL query: select the teenagers (age between 13 and 19)
    val teenagers = sqlContext.sql("SELECT name, age FROM People WHERE age >= 13 AND age <= 19")

    // Output the query results, accessing the columns of each result row in order
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

    sc.stop()
  }
}

As shown above, Spark SQL provides a very friendly interface for interacting with data from a variety of different data sources, using the familiar SQL query syntax. This is useful for non-technical project members such as data analysts and database administrators.

Summary

We have seen how Apache Spark SQL provides a SQL interface for interacting with Spark data using the familiar SQL query syntax. Spark SQL is a powerful library that enables non-technical team members in your organization, such as business analysts and data analysts, to perform data analysis.

III. About Hive on Spark

Background

Hive on Spark is an open source project initiated by Cloudera and jointly developed with companies such as Intel and MapR. It is designed to use Spark as Hive's compute engine and to submit Hive queries to the Spark cluster for computation as Spark jobs. The project improves the performance of Hive queries and provides more flexible options for users who have already deployed Hive or Spark, further increasing the adoption of both Hive and Spark.

Brief introduction

Hive on Spark evolved from Hive on MapReduce. Hive's overall solution is good, but the time from query submission to result return is long, mainly because Hive is natively based on MapReduce. If, instead of generating a MapReduce job, we generate a Spark job, we can take advantage of Spark's fast execution engine to shorten the response time of HiveQL.

Hive on Spark is now part of the Hive component (since the Hive 1.1 release).

Differences from Spark SQL

Both Spark SQL and Hive on Spark are solutions for implementing SQL on Spark. Spark originally had the Shark project as its SQL layer; it was later rewritten and became Spark SQL. Spark SQL is the project of Databricks, the company behind Spark, and is the SQL implementation that the Spark project itself primarily promotes. Hive on Spark came slightly later than Spark SQL. Hive originally did not support any engine other than MapReduce; the Hive on Tez project then allowed Hive to support a DAG-style plan structure (a non-MapReduce DAG) similar to Spark's. On this basis, Cloudera led the Hive on Spark effort, which is supported by IBM, Intel, and MapR (but not by Databricks).

Usage example

The usage is roughly similar to the Spark SQL example above, except that the SQL engine is different. Some of the core code is as follows:

// Create a HiveContext from the existing SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Create the table, load the data, and run a HiveQL query
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH '/users/urey/data/input2.txt' INTO TABLE src")
hql("FROM src SELECT key, value").collect().foreach(println)

Summary

Architecturally, both Hive on Spark and Spark SQL are a translation layer that translates a SQL statement into a distributed, executable Spark program. Take the following SQL as an example:

SELECT item_type, SUM(price)

FROM item

GROUP BY item_type;

The above SQL script is handed to Hive or a similar SQL engine, which then "tells" the compute engine to do the following: read the item table and extract the item_type and price fields; compute an initial partial sum of price (at first, each individual price is its own partial sum); because GROUP BY requires grouping by item_type, set the shuffle key to item_type, so that rows are distributed from the first set of nodes to the aggregation nodes and rows with the same item_type end up on the same aggregation node; finally, add together the partial sums of each group to obtain the final result. Both Hive and Spark SQL generally do the work described above.

It should be understood that Hive and Spark SQL are not themselves responsible for the computation; they only tell Spark what needs to be done and do not participate directly in the calculation.
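To make the "partial sum, shuffle by key, final sum" description concrete, here is a rough sketch of the kind of Spark program such a translation layer could boil the SQL above down to; the small in-memory RDD of (item_type, price) pairs is a stand-in for reading the real item table:

// Hypothetical input: an RDD of (item_type, price) pairs standing in for the item table
val item = sc.parallelize(Seq(
  ("book", 10.0), ("book", 15.0), ("pen", 2.0), ("pen", 3.0)
))

// reduceByKey computes partial sums on each node, shuffles by item_type,
// and then adds the partial sums of each group on the aggregation nodes
val totals = item.reduceByKey(_ + _)

totals.collect().foreach { case (itemType, sum) =>
  println(itemType + "\t" + sum)
}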


Reference: http://blog.csdn.net/yeruby/article/details/51448188

