Probe into Spark SQL on Hive


Some time ago, after the Shark project stopped being updated, SQL on Spark split into two directions: Spark SQL on Hive and Hive on Spark. It will still be quite a while before Hive on Spark is usable, so I decided to try out Spark SQL on Hive, with the aim of gradually replacing our current MR-on-Hive jobs.

The version tried here is Spark 1.0.0. To get Hive support it must be recompiled; the build command becomes:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phive -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package

I wrote a fairly simple piece of code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SqlOnHive")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext._
hql("FROM tmp.test SELECT id LIMIT 1").foreach(println)
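Since hql() returns a SchemaRDD, the query result can also be handled with ordinary RDD operations. A minimal sketch, assuming the same tmp.test table and hiveContext as above:

// hedged sketch: hql() returns a SchemaRDD, so normal RDD operations apply
val rows = hiveContext.hql("SELECT id FROM tmp.test LIMIT 10")
println(rows.count())                 // number of rows returned
rows.collect().foreach(println)       // bring the rows to the driver and print them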

After compiling and exporting the jar file, I submitted it in standalone mode with java -cp; before submitting, copy the hive-site.xml file to the $SPARK_HOME/conf directory.

java -XX:PermSize=256m -cp /home/hadoop/hql.jar com.yintai.spark.sql.SqlOnHive spark://h031:7077

An exception was reported after submission:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:187)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 27 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
    ... 34 more

The workaround is to set the relevant environment variables in spark-env.sh:

SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native
SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/your/hadoop-lzo/java/libs
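As a quick sanity check (not part of the original test), something like the following can be run from a Scala/Spark shell on the same classpath to confirm that the codec class from the error above is now loadable; this is just a rough sketch:

// hedged sketch: check that the LZO codec class can now be loaded
try {
  Class.forName("com.hadoop.compression.lzo.LzoCodec")
  println("LzoCodec found on the classpath")
} catch {
  case _: ClassNotFoundException => println("LzoCodec still missing from the classpath")
}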

After modifying the environment variables and resubmitting, another error appeared:

14/07/23 10:25:19 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named tmp)
    at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
    at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
    at com.sun.proxy.$Proxy9.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
    at com.sun.proxy.$Proxy10.get_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy11.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
14/07/23 10:25:19 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tmp
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)

The reason for this error is that the Spark program could not load hive-site.xml, so it could not obtain the address of the remote metastore service; it fell back to a local Derby database and naturally could not find the metadata for the relevant databases and tables. Spark SQL loads hive-site.xml by instantiating the HiveConf class, the same way the Hive CLI does; the relevant code is as follows:

ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
if (classLoader == null) {
  classLoader = HiveConf.class.getClassLoader();
}
hiveDefaultURL = classLoader.getResource("hive-default.xml");

// Look for hive-site.xml on the CLASSPATH and log its location if found.
hiveSiteURL = classLoader.getResource("hive-site.xml");
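Since loading goes through the context class loader, a simple way to check whether hive-site.xml is actually visible to the driver is to ask the class loader for it directly. A rough sketch (not from the original post) that can be pasted into a Scala shell running on the driver's classpath:

// hedged sketch: check whether hive-site.xml is visible to the class loader
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
Option(loader.getResource("hive-site.xml")) match {
  case Some(url) => println("hive-site.xml loaded from: " + url)
  case None      => println("hive-site.xml is NOT on the classpath")
}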

When submitting with java -cp, the environment cannot be set up correctly. Version 1.0.0 added a new way of submitting with the spark-submit script, which I switched to:

/usr/lib/spark/bin/spark-submit --class com.yintai.spark.sql.SqlOnHive --master spark://h031:7077 --executor-memory 1g --total-executor-cores 1 /home/hadoop/hql.jar

During submission, the script sets the spark.executor.extraClassPath and spark.driver.extraClassPath properties in SparkConf, which ensures that the required configuration files are loaded correctly; this time the test succeeded.
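For reference, the executor-side property can also be set explicitly in code. A hedged sketch with placeholder paths (not from the original post); note that the driver-side classpath normally has to be supplied at launch time, e.g. via spark-submit or spark-defaults.conf, rather than from an already-running driver:

import org.apache.spark.SparkConf

// hedged sketch: extend the executor classpath explicitly (placeholder paths);
// spark.driver.extraClassPath only takes effect if set before the driver JVM starts
val conf = new SparkConf()
  .setAppName("SqlOnHive")
  .set("spark.executor.extraClassPath", "/etc/hive/conf:/path/to/your/hadoop-lzo/java/libs/*")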

Currently Spark SQL on Hive is compatible with most Hive syntax and UDFs and uses the Catalyst framework for SQL parsing, so it runs considerably more efficiently. However, the current version still has some bugs and stability problems, so further testing will have to wait for the next stable release.
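As an illustration of that compatibility (not something tested in the original post), a query using a built-in aggregate and GROUP BY should go through hql() in the same way, assuming the tmp.test table and hiveContext from earlier:

// hedged sketch, assuming tmp.test with an id column exists;
// count(*) and GROUP BY are parsed and planned by Catalyst, then run on Spark
hiveContext.hql("SELECT id, count(*) AS cnt FROM tmp.test GROUP BY id")
  .collect()
  .foreach(println)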


Resources

http://spark.apache.org/docs/1.0.0/sql-programming-guide.html

http://hsiamin.com/posts/2014/05/03/enable-lzo-compression-on-hadoop-pig-and-spark/

This article is from the "17 blog"; please keep this source: http://xiaowuliao.blog.51cto.com/3681673/1441737

