Probe into Spark SQL on Hive


Some time ago, after the Shark project stopped being updated, SQL on Spark split into two directions: Spark SQL on Hive and Hive on Spark. It will still be quite a while before Hive on Spark is usable, so I decided to try out Spark SQL on Hive, with the aim of gradually replacing our current MR-on-Hive jobs.

The version tried here is Spark 1.0.0. To get Hive support it must be recompiled; the build command becomes:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phive -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package

I wrote a fairly simple piece of code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SqlOnHive")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext._
hql("FROM tmp.test SELECT id LIMIT 1").foreach(println)
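Since hql() returns a SchemaRDD, the query result can also be handled with ordinary RDD operations. A minimal sketch, assuming the same tmp.test table and hiveContext as above:

// hedged sketch: hql() returns a SchemaRDD, so normal RDD operations apply
val rows = hiveContext.hql("SELECT id FROM tmp.test LIMIT 10")
println(rows.count())                 // number of rows returned
rows.collect().foreach(println)       // bring the rows to the driver and print them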

After compiling and exporting the jar file, I submitted it in standalone mode with java -cp; before submitting, copy the hive-site.xml file to the $SPARK_HOME/conf directory.

java -XX:PermSize=256m -cp /home/hadoop/hql.jar com.yintai.spark.sql.SqlOnHive spark://h031:7077

An exception was reported after submission:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:187)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 27 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
    ... 34 more

The workaround is to set the relevant environment variables in spark-env.sh:

SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native
SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/your/hadoop-lzo/java/libs
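As a quick sanity check (not part of the original test), something like the following can be run from a Scala/Spark shell on the same classpath to confirm that the codec class from the error above is now loadable; this is just a rough sketch:

// hedged sketch: check that the LZO codec class can now be loaded
try {
  Class.forName("com.hadoop.compression.lzo.LzoCodec")
  println("LzoCodec found on the classpath")
} catch {
  case _: ClassNotFoundException => println("LzoCodec still missing from the classpath")
}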

After modifying the environment variables and resubmitting, another error appeared:

14/07/23 10:25:19 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named tmp)
    at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
    at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
    at com.sun.proxy.$Proxy9.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
    at com.sun.proxy.$Proxy10.get_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy11.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
14/07/23 10:25:19 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tmp
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)

The reason for this error is that the Spark program could not load hive-site.xml, so it could not obtain the address of the remote metastore service; it fell back to a local Derby database and naturally could not find the metadata for the relevant databases and tables. Spark SQL loads hive-site.xml by instantiating the HiveConf class, the same way the Hive CLI does; the relevant code is as follows:

ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
if (classLoader == null) {
  classLoader = HiveConf.class.getClassLoader();
}
hiveDefaultURL = classLoader.getResource("hive-default.xml");

// Look for hive-site.xml on the CLASSPATH and log its location if found.
hiveSiteURL = classLoader.getResource("hive-site.xml");
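Since loading goes through the context class loader, a simple way to check whether hive-site.xml is actually visible to the driver is to ask the class loader for it directly. A rough sketch (not from the original post) that can be pasted into a Scala shell running on the driver's classpath:

// hedged sketch: check whether hive-site.xml is visible to the class loader
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
Option(loader.getResource("hive-site.xml")) match {
  case Some(url) => println("hive-site.xml loaded from: " + url)
  case None      => println("hive-site.xml is NOT on the classpath")
}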

When submitting with java -cp, the environment cannot be set up correctly. Version 1.0.0 added a new way of submitting with the spark-submit script, which I switched to:

/usr/lib/spark/bin/spark-submit --class com.yintai.spark.sql.SqlOnHive --master spark://h031:7077 --executor-memory 1g --total-executor-cores 1 /home/hadoop/hql.jar

During submission, the script sets the spark.executor.extraClassPath and spark.driver.extraClassPath properties in SparkConf, which ensures that the required configuration files are loaded correctly; this time the test succeeded.
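For reference, the executor-side property can also be set explicitly in code. A hedged sketch with placeholder paths (not from the original post); note that the driver-side classpath normally has to be supplied at launch time, e.g. via spark-submit or spark-defaults.conf, rather than from an already-running driver:

import org.apache.spark.SparkConf

// hedged sketch: extend the executor classpath explicitly (placeholder paths);
// spark.driver.extraClassPath only takes effect if set before the driver JVM starts
val conf = new SparkConf()
  .setAppName("SqlOnHive")
  .set("spark.executor.extraClassPath", "/etc/hive/conf:/path/to/your/hadoop-lzo/java/libs/*")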

Currently Spark SQL on Hive is compatible with most Hive syntax and UDFs and uses the Catalyst framework for SQL parsing, so it runs considerably more efficiently. However, the current version still has some bugs and stability problems, so further testing will have to wait for the next stable release.
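As an illustration of that compatibility (not something tested in the original post), a query using a built-in aggregate and GROUP BY should go through hql() in the same way, assuming the tmp.test table and hiveContext from earlier:

// hedged sketch, assuming tmp.test with an id column exists;
// count(*) and GROUP BY are parsed and planned by Catalyst, then run on Spark
hiveContext.hql("SELECT id, count(*) AS cnt FROM tmp.test GROUP BY id")
  .collect()
  .foreach(println)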


Resources

http://spark.apache.org/docs/1.0.0/sql-programming-guide.html

http://hsiamin.com/posts/2014/05/03/enable-lzo-compression-on-hadoop-pig-and-spark/

This article is from the "17 blog"; please keep this source: http://xiaowuliao.blog.51cto.com/3681673/1441737

