Some time ago, after the Shark project stopped being updated, SQL on Spark split into two directions: Spark SQL on Hive and Hive on Spark. It will still be quite a while before Hive on Spark is usable, so I decided to try out Spark SQL on Hive, with the goal of gradually replacing our current MR-on-Hive jobs.
The version tried here is Spark 1.0.0. To get Hive support you must recompile Spark, and the build command changes to the following:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phive -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package
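To confirm that Hive support actually made it into the build, one quick check is to look for the HiveContext class inside the generated assembly jar. This is only an illustrative sanity check; the assembly path below is an assumption for a default Maven build and may differ on your machine.

# Assumed output location for a Maven build of Spark 1.0.0 against Hadoop 2.3.0-cdh5.0.0;
# adjust the path to whatever your build actually produced.
jar tf assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.0.jar | grep HiveContext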
I wrote a fairly simple piece of code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SqlOnHive")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext._
hql("FROM tmp.test SELECT id LIMIT 1").foreach(println)
After compiling and exporting the jar file, I ran it against the standalone cluster and submitted it with plain java -cp. Before submitting, copy hive-site.xml into the $SPARK_HOME/conf directory.
java -XX:PermSize=256m -cp /home/hadoop/hql.jar com.yintai.spark.sql.SqlOnHive spark://h031:7077
An exception was thrown after submission:
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:187)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    ... (repeated RDD compute/iterator frames omitted)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 27 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
    ... 34 more
The workaround is to set the relevant environment variables in spark-env.sh:
SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native
SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/your/hadoop-lzo/java/libs
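For reference, here is a hedged example of what these lines might look like on a CDH 5 node where the hadoop-lzo package is installed under /usr/lib/hadoop/lib; both paths are assumptions and should be replaced with wherever hadoop-lzo actually lives on your cluster.

# Assumed CDH 5 locations for the hadoop-lzo native and Java libraries; verify on your nodes.
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hadoop/lib/hadoop-lzo.jar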
After modifying the environment variables and resubmitting, another error appeared:
14/07/23 10:25:19 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named tmp)
    at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
    at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
    ... (reflection and metastore proxy frames omitted)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
14/07/23 10:25:19 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tmp
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    ... (same execution path as above omitted)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
The reason for this error is that the Spark program could not load hive-site.xml, so it could not get the address of the remote metastore service; it could only look in the local Derby database and naturally failed to find the metadata for the database and table in question. Spark SQL loads hive-site.xml by instantiating the HiveConf class, the same way the Hive CLI does; the relevant code is as follows:
ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
if (classLoader == null) {
  classLoader = HiveConf.class.getClassLoader();
}
hiveDefaultURL = classLoader.getResource("hive-default.xml");
// Look for hive-site.xml on the CLASSPATH and log its location if found.
hiveSiteURL = classLoader.getResource("hive-site.xml");
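Since the lookup goes through the context classloader, a quick way to see whether hive-site.xml is actually visible to the driver is to do the same lookup yourself. This is a small illustrative snippet, not part of the original program:

// Illustrative check: if this prints null, HiveConf cannot see hive-site.xml on the
// classpath and Spark SQL will fall back to a local Derby metastore.
val hiveSite = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
println(s"hive-site.xml -> $hiveSite")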
When the application is launched directly with java -cp, these environment variables are not applied correctly. Version 1.0.0 added a new way of submitting applications, the spark-submit script, so I switched to that:
/usr/lib/spark/bin/spark-submit --class com.yintai.spark.sql.SqlOnHive --master spark://h031:7077 --executor-memory 1g --total-executor-cores 1 /home/hadoop/hql.jar
During submission the script sets the spark.executor.extraClassPath and spark.driver.extraClassPath properties in SparkConf, which ensures the required configuration files are loaded correctly, and this time the test succeeded.
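If you want to confirm from inside the application that spark-submit really injected these properties, one illustrative check (not from the original post) is to read them back from the SparkContext's configuration:

// Prints the classpath properties set by spark-submit; None means they were not injected.
println(sc.getConf.getOption("spark.executor.extraClassPath"))
println(sc.getConf.getOption("spark.driver.extraClassPath"))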
At the moment Spark SQL on Hive is compatible with most Hive syntax and UDFs. It uses the Catalyst framework for SQL parsing and runs considerably more efficiently, but the current version still has some bugs and stability problems, so further testing will have to wait for the next stable release.
Resources
http://spark.apache.org/docs/1.0.0/sql-programming-guide.html
http://hsiamin.com/posts/2014/05/03/enable-lzo-compression-on-hadoop-pig-and-spark/
This article is from the "17 blog"; please keep the source: http://xiaowuliao.blog.51cto.com/3681673/1441737