Spark SQL with Hive


The previous article was a primer on Spark SQL, introducing the basics and the APIs, but it may have felt a step removed from everyday use.

There were two reasons for ending Shark:

1. Integration with Spark programs was subject to many limitations.

2. The Hive optimizer was not designed for Spark; the computational models are different, so using the Hive optimizer to optimize Spark programs runs into bottlenecks.

Here's a look at the architecture of Spark SQL:

Spark 1.1 will support a Spark SQL CLI when it is released; the Spark SQL CLI will connect to a Hive Thrift server to provide functionality similar to the Hive shell. (PS: the branch-1.0-jdbc branch is currently in Git but has not been officially released. I tried it for an afternoon and found it still has bugs, so wait patiently for the release!)

So, in the spirit of research, let's integrate with an existing Hive environment and execute Hive statements from the Spark shell.

One, Compile Spark with Hive support

There are two ways to compile Spark with Hive support using sbt:

1. Prefix the sbt command with the build variables:

SPARK_HADOOP_VERSION=0.20.2-cdh3u5 SPARK_HIVE=true sbt/sbt assembly

2. Modify the project/SparkBuild.scala file:

val DEFAULT_HADOOP_VERSION = "0.20.2-cdh3u5"
val DEFAULT_HIVE = true

Then execute sbt/sbt assembly.

Two, Spark SQL operating on Hive. Prerequisites: Hive is installed and working, and Hive's conf directory and Hadoop's conf directory have been added to the classpath in spark-env.sh.

Start Spark-shell

bin/spark-shell --master spark://10.1.8.210:7077 --driver-class-path /app/hadoop/hive-0.11.0-bin/lib/mysql-connector-java-5.1.13-bin.jar:/app/hadoop/hive-0.11.0-bin/lib/hadoop-lzo-0.4.15.jar

Create and import HiveContext:

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@...

scala> import hiveContext._
import hiveContext._


HiveContext provides a function for executing SQL: hql(text: String).
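As a rough sketch of the Spark 1.0.x API (the exact signature may differ in later releases), the method looks like:

def hql(hqlQuery: String): SchemaRDD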

Let's go to Hive and run show databases. Spark parses the HQL and generates a query plan, but the query is not executed at this point; it only runs when collect is called.
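In condensed form, the two steps look like this (the full console output is shown below):

scala> val show_databases = hql("show databases")   // parsed and planned, but nothing is executed yet
scala> show_databases.collect()                     // the query actually runs here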

scala> val show_databases = hql("show databases")
14/07/09 19:59:09 INFO storage.BlockManager: Removing broadcast 0
14/07/09 19:59:09 INFO storage.BlockManager: Removing block broadcast_0
14/07/09 19:59:09 INFO parse.ParseDriver: Parsing command: show databases
14/07/09 19:59:09 INFO parse.ParseDriver: Parse Completed
14/07/09 19:59:09 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/09 19:59:09 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/09 19:59:09 INFO analysis.Analyzer: Max iterations (2) reached for batch Check Analysis
14/07/09 19:59:09 INFO storage.MemoryStore: Block broadcast_0 of size 393044 dropped from memory (free 308713881)
14/07/09 19:59:09 INFO broadcast.HttpBroadcast: Deleted broadcast file: /tmp/spark-c29da0f8-c5e3-4fbf-adff-9aa77f9743b2/broadcast_0
14/07/09 19:59:09 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/07/09 19:59:09 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/07/09 19:59:09 INFO spark.ContextCleaner: Cleaned broadcast 0
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=Driver.run>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=TimeToSubmit>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=compile>
14/07/09 19:59:09 INFO exec.ListSinkOperator: 0 finished. closing...
14/07/09 19:59:09 INFO exec.ListSinkOperator: 0 forwarded 0 rows
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=parse>
14/07/09 19:59:09 INFO parse.ParseDriver: Parsing command: show databases
14/07/09 19:59:09 INFO parse.ParseDriver: Parse Completed
14/07/09 19:59:09 INFO ql.Driver: </PERFLOG method=parse start=1404907149927 end=1404907149928 duration=1>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=semanticAnalyze>
14/07/09 19:59:09 INFO ql.Driver: Semantic Analysis Completed
14/07/09 19:59:09 INFO ql.Driver: </PERFLOG method=semanticAnalyze start=1404907149928 end=1404907149977 duration=49>
14/07/09 19:59:09 INFO exec.ListSinkOperator: Initializing Self 0 OP
14/07/09 19:59:09 INFO exec.ListSinkOperator: Operator 0 OP initialized
14/07/09 19:59:09 INFO exec.ListSinkOperator: Initialization Done 0 OP
14/07/09 19:59:09 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
14/07/09 19:59:09 INFO ql.Driver: </PERFLOG method=compile start=1404907149925 end=1404907149980 duration=55>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=Driver.execute>
14/07/09 19:59:09 INFO ql.Driver: Starting command: show databases
14/07/09 19:59:09 INFO ql.Driver: </PERFLOG method=TimeToSubmit start=1404907149925 end=1404907149980 duration=55>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=runTasks>
14/07/09 19:59:09 INFO ql.Driver: <PERFLOG method=task.DDL.Stage-0>
14/07/09 19:59:09 INFO metastore.HiveMetaStore: 0: get_all_databases
14/07/09 19:59:09 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_all_databases
14/07/09 19:59:09 INFO exec.DDLTask: results : 1
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=task.DDL.Stage-0 start=1404907149980 end=1404907150032 duration=52>
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=runTasks start=1404907149980 end=1404907150032 duration=52>
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=Driver.execute start=1404907149980 end=1404907150032 duration=52>
14/07/09 19:59:10 INFO ql.Driver: OK
14/07/09 19:59:10 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=releaseLocks start=1404907150033 end=1404907150033 duration=0>
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=Driver.run start=1404907149925 end=1404907150033 duration=108>
14/07/09 19:59:10 INFO mapred.FileInputFormat: Total input paths to process : 1
14/07/09 19:59:10 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/07/09 19:59:10 INFO ql.Driver: </PERFLOG method=releaseLocks start=1404907150037 end=1404907150037 duration=0>
show_databases: org.apache.spark.sql.SchemaRDD =
SchemaRDD[16] at RDD at SchemaRDD.scala:100
== Query Plan ==
<Native command: executed by Hive>
To execute the query plan:

scala> show_databases.collect()
14/07/09 20:00:44 INFO spark.SparkContext: Starting job: collect at SparkPlan.scala:52
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Got job 2 (collect at SparkPlan.scala:52) with 1 output partitions (allowLocal=false)
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Final stage: Stage 2 (collect at SparkPlan.scala:52)
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Missing parents: List()
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[20] at map at SparkPlan.scala:52), which has no missing parents
14/07/09 20:00:44 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[20] at map at SparkPlan.scala:52)
14/07/09 20:00:44 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
14/07/09 20:00:44 INFO scheduler.TaskSetManager: Starting task 2.0:0 as TID 9 on executor 0: WEB01.DW (PROCESS_LOCAL)
14/07/09 20:00:44 INFO scheduler.TaskSetManager: Serialized task 2.0:0 as 1511 bytes in 0 ms
14/07/09 20:00:45 INFO scheduler.DAGScheduler: Completed ResultTask(2, 0)
14/07/09 20:00:45 INFO scheduler.TaskSetManager: Finished TID 9 in ... ms on WEB01.DW (progress: 1/1)
14/07/09 20:00:45 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/07/09 20:00:45 INFO scheduler.DAGScheduler: Stage 2 (collect at SparkPlan.scala:52) finished in 0.014 s
14/07/09 20:00:45 INFO spark.SparkContext: Job finished: collect at SparkPlan.scala:52, took 0.020520428 s
res5: Array[org.apache.spark.sql.Row] = Array([default])
Returns the default database.


Similarly, execute show tables:

Scala> hql ("Show Tables"). Collect ()
14/07/09 20:01:28 INFO Scheduler. taskschedulerimpl:removed TaskSet 3.0, whose tasks has all completed, from pool 14/07/09 20:01:28 INFO Scheduler. Dagscheduler:stage 3 (collect at sparkplan.scala:52) finished in 0.013 s14/07/09 20:01:28 INFO Spark. Sparkcontext:job Finished:collect at sparkplan.scala:52, took 0.019173851 sres7:array[org.apache.spark.sql.row] = Arra Y ([item], [SRC])
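Because hql returns a SchemaRDD, a query result can be reused with ordinary RDD operations or registered as a temporary table for further SQL. A minimal sketch, assuming the src table listed above has the usual key/value columns and using the Spark 1.0.x API (src_snapshot is a hypothetical table name):

scala> val src_rows = hql("SELECT key, value FROM src")   // assumes src has key/value columns
scala> src_rows.count()                                   // an ordinary RDD action
scala> src_rows.map(row => row(0)).take(3)                // each Row behaves like a sequence of column values
scala> src_rows.registerAsTable("src_snapshot")           // hypothetical temp table, queryable via hql again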

In theory, all of Hive's operations are supported, including UDFs.
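For example, a built-in Hive UDF such as upper() should be callable directly from hql; a minimal sketch, again assuming the src table above (a custom UDF already registered in Hive would be invoked the same way):

scala> hql("SELECT key, upper(value) FROM src LIMIT 3").collect()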


PS: Problems encountered:

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.

The solution: put the MySQL connector JAR on the driver classpath when starting spark-shell (the --driver-class-path argument in the command shown above).

Three, Summary:

Spark SQL is compatible with most of Hive's syntax and UDFs, but when it comes to processing the query plan, the Catalyst framework optimizes the execution plan for Spark's programming model, making execution much more efficient than Hive's. Since Spark 1.1 has not been released yet and there are still bugs, I will continue testing once the stable version is out.
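To see what Catalyst produces for a given statement, the queryExecution field of a SchemaRDD can be printed. This is a developer-facing detail of the 1.0.x API, so treat the following as a sketch (it again assumes the src table):

scala> val q = hql("SELECT key, count(1) FROM src GROUP BY key")
scala> println(q.queryExecution)   // prints the logical plan, the optimized plan, and the physical plan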


Full text end:)


This is an original article; when reposting, please credit the source: http://blog.csdn.net/oopsoom/article/details/37603261
