After a whole day of fiddling, I finally solved the RESULT3 error from the last section. As to why this error occurred, I'll keep you in suspense for a moment; first, let's look at how the problem was tracked down:
First, I found this article: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html which contains this paragraph:
The issue is, you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing this with HiveContext.
In fact, there are more problems than that. SparkSQL would conserve (15+5=20) columns in the final table, if I remember well. Therefore, doing a join on two tables which have the same columns would cause a double-column error.
Two points are mentioned here: (1) use HiveContext; (2) the cause of the error.
Well, since it says to use HiveContext, let's use HiveContext (this took me a long day):
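For reference, here is a minimal sketch of what the switch looks like in Spark 1.x code (the object name and table name are hypothetical; it assumes hive-site.xml is visible on the classpath and a Hive metastore is reachable):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveContextSketch")
    val sc = new SparkContext(conf)
    // HiveContext understands HiveQL; a plain SQLContext only parses a
    // smaller SQL subset and would reject the syntax from the last section.
    val hiveCtx = new HiveContext(sc)
    // some_table is a placeholder for a table registered in the Hive metastore
    hiveCtx.sql("SELECT * FROM some_table LIMIT 10").collect().foreach(println)
    sc.stop()
  }
}
```

This is only a sketch; it needs a running Spark installation and a configured metastore to actually execute.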
First of all, look at the prerequisites for using HiveContext; here I refer to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html
There are three requirements in this article:
1. Check that the $SPARK_HOME/lib directory contains the jar packages datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar.
2. Check that hive-site.xml has been copied from the $HIVE_HOME/conf directory into the $SPARK_HOME/conf directory.
3. When submitting the program, specify the database driver jar package on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.
Let's configure it as required. But after the configuration was complete, an error appeared (interactive mode):
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
My preliminary judgment was that the problem lay in Hive's connection to its metastore, so I added the metastore connection parameter to the hive-site.xml file:
<property>
<name>hive.metastore.uris</name>
<value>thrift://111.121.21.23:9083</value>
<description></description>
</property>
After specifying the parameter, I executed a query full of anticipation, and got another error (this one had me tangled up for a long time):
ERROR ObjectStore: Version information not found in metastore.
The error means that when using HiveContext, Spark needs to access Hive's metastore and read its version information, and it throws this exception if it can't get it. Quite a few solutions on the web say to add this parameter to the hive-site.xml file:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description></description>
</property>
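A quick way to rule out editing mistakes is to check that the flag really landed in the file. A small sketch (the file path here is a throwaway placeholder, not the real $HIVE_HOME/conf location; the here-doc stands in for your edited hive-site.xml):

```shell
# Write a minimal hive-site.xml to a demo path and confirm the
# schema.verification property is actually present after editing.
HIVE_SITE="/tmp/hive-site-demo.xml"
cat > "$HIVE_SITE" <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>
EOF
# Print the property and the line after it (its value)
grep -A 1 'hive.metastore.schema.verification' "$HIVE_SITE"
```

On the real cluster you would point HIVE_SITE at $HIVE_HOME/conf/hive-site.xml and skip the here-doc.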
After adding the parameter and restarting the Hive service, executing Spark's HiveContext still reported the error. So I compiled and packaged the program in the IDE and ran it on the server:
#!/bin/bash
cd /opt/huawei/bigdata/datasight_fm_baseplatform_v100r001c00_spark/spark/
./bin/spark-submit \
  --class hivecontexttest \
  --master local \
  --files /opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
  --archives datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
  --driver-class-path "/opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/lib/*" \
  /home/wlb/spark.jar
(Note: spark-submit options must come before the application jar, since anything after the jar is passed to the application itself, and the driver classpath is set with --driver-class-path, not --classpath.)
Helplessly, it then reported yet another error (really maddening!): java.net.UnknownHostException: hacluster
hacluster is the dfs.nameservices value of the Hadoop HDFS HA configuration.
So it could not resolve the host name hacluster and threw the exception. Continuing on, the answer the web gives is:
hdfs-site.xml needs to be copied to Spark's conf directory. Sure enough, after the copy was complete, the program's jar package finally ran successfully on the server.
But looking back, the error in shell mode was still: ERROR ObjectStore: Version information not found in metastore.
What was the cause? What is the difference between executing the jar package and running in shell mode?
Continuing: executing the HiveContext-based SQL in shell mode still reported this error. So I turned on Spark's debug logging to see if there was any valuable information; after a long while, I found no valuable logs. Searching online some more, every occurrence of this problem on the web was at WARN level, while mine was at ERROR level.
At this point I really had no leads. Well, since my jar package ran successfully, let's see what is different between executing the jar package and the shell mode.
The first thought was: why is the hive-site.xml parameter hive.metastore.schema.verification not taking effect? I had restarted the service; could it be that the parameter was never being read at all?
So I added the HIVE_HOME environment variable and executed again; still no luck, the parameter still wasn't picked up... I was on the verge of collapse. After a long time, it suddenly occurred to me: where is the spark-shell command I have been executing actually coming from? So take a look: which spark-shell
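The same diagnostic is worth running for any command that behaves differently from what its config suggests. A small sketch (using `sh` as a stand-in, since spark-shell may not be on the PATH of the machine you try this on):

```shell
# Resolve which binary the PATH actually points at, so you know whose
# conf/ directory is really being read when the command starts.
resolve() {
  command -v "$1" || echo "$1: not found on PATH"
}
resolve spark-shell
resolve sh
```

If `resolve spark-shell` prints a path you did not expect (for example, a client-program bin directory instead of $SPARK_HOME/bin), its sibling conf/ directory is the one that needs hive-site.xml and hdfs-site.xml.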
As if discovering something: the spark-shell came from the bin directory of the Spark client program (I had earlier set an environment variable so the command would be convenient to use; this is a Huawei product layout). In other words, my default environment variable was pointing at the Spark client program directory!
I had finally found the root of the problem. So: copy hive-site.xml and hdfs-site.xml into the client program's conf directory, restart the Hive service, and everything is OK!
After a while, I was still a little worried: was this really the cause? OK, test on another node: first, without this parameter in the client program directory, execution failed; after adding it, hive.metastore.schema.verification took effect!
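The fix boils down to making sure the conf directory next to whichever spark-shell you actually run contains both files. A sketch of the check (the directory path is a demo placeholder, and `touch` stands in for the real `cp` from $HIVE_HOME/conf and the Hadoop conf directory):

```shell
# Verify a Spark conf directory contains the two configs HiveContext
# needs; simulate the copy step first so the sketch is self-contained.
SPARK_CONF_DIR="/tmp/spark-conf-demo"
mkdir -p "$SPARK_CONF_DIR"
touch "$SPARK_CONF_DIR/hive-site.xml" "$SPARK_CONF_DIR/hdfs-site.xml"
for f in hive-site.xml hdfs-site.xml; do
  if [ -f "$SPARK_CONF_DIR/$f" ]; then
    echo "$f: present"
  else
    echo "$f: MISSING"
  fi
done
```

On a real node, point SPARK_CONF_DIR at the conf directory sitting next to the bin directory that `which spark-shell` reported.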
Done! Throughout the whole process, Spark's debug logging was turned on, but no valuable information ever appeared in the logs.
One more thing: to debug a Spark HiveContext program in the IDE, you need to add a resource directory (type: resources) under the main directory and put hive-site.xml and hdfs-site.xml into it,
and introduce three driver packages:
datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar
Almost forgot: I set out to solve the RESULT3 problem from the last section. That problem was in fact caused by SparkSQL's limited SQL syntax support. You might consider other approaches (do not nest subqueries inside IN), such as building multiple RDDs or using left/right joins (still to be tested).
Next section: how to configure the Scala IDE (I spent two days of the Qingming holiday fiddling with this, and summed up two approaches).
This article is from the "One Step at a Time" blog; please keep this source: http://snglw.blog.51cto.com/5832405/1634438
SparkSQL Preliminary Application (HiveContext Usage)