A First Taste of Spark SQL: Using HiveContext
After a full day, the result3 error from the previous section was finally solved. As to why this error occurs, let me hold the answer back for a moment and first describe how the problem was tracked down.
First of all, I found this article: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html, which contains this passage:
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
In fact, there are more problems than that. The sparkSQL will conserve (15 + 5 = 20) columns in the final table, if I remember well. Therefore, when you are doing a join on two tables which have the same columns, it will cause a double-column error.
Two points are mentioned here: (1) use HiveContext instead of SQLContext; (2) joining two tables that share column names can trigger a duplicate-column error, which is the cause of this error.
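In Spark 1.x (the version line this post deals with), the switch is a one-line change in spark-shell. A minimal sketch, assuming `sc` is the SparkContext that spark-shell provides and `some_table` is a hypothetical Hive table:

```scala
// SQLContext: limited SQL dialect, no Hive metastore access
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// HiveContext: full HiveQL support; reads hive-site.xml from the classpath
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// HiveQL that SQLContext's parser rejects often works here
hiveContext.sql("SELECT count(*) FROM some_table").collect().foreach(println)
```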
Well, then, let's switch to HiveContext (ugh, this took a long time):
First, what does using HiveContext require? Here I referred to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html
The article lists three requirements:
1. Check that the $SPARK_HOME/lib directory contains the datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar packages.
2. Check whether the $SPARK_HOME/conf directory has a hive-site.xml copied from the $HIVE_HOME/conf directory.
3. When submitting a program, put the database driver jar on the driver classpath, for example bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.
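Point 3, sketched as a command line. The paths, class name, and driver jar here are placeholders, assuming a MySQL-backed metastore:

```shell
# Hypothetical paths; adjust to your installation.
bin/spark-submit \
  --driver-class-path /path/to/mysql-connector-java.jar \
  --class YourMainClass \
  your-app.jar

# Alternatively, in conf/spark-env.sh:
# export SPARK_CLASSPATH=/path/to/mysql-connector-java.jar
```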
Then I configured everything as required. However, after the configuration was complete, an error was reported (in interactive mode):
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
My preliminary judgment was that it was a problem connecting to the Hive metastore, so I added the metastore connection parameter to the hive-site.xml file:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://22.214.171.124:9083</value>
</property>
With the parameter in place, I ran a query full of expectation, and it reported yet another error (this one kept me entangled for a long time):
ERROR ObjectStore: Version information not found in metastore.
This error indicates that when HiveContext starts, it tries to read version information from the Hive metastore; if the version information cannot be obtained, this exception is thrown. There are a lot of solutions online, most of which say to add a parameter to the hive-site.xml file:
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
After adding the parameter and restarting the Hive service, running Spark HiveContext still reported the error. So I compiled and packaged the program in the IDE and ran it on the server:
bin/spark-submit --class HiveContextTest \
  --master local \
  --files /opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
  /home/wlb/spark.jar \
  --archives datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
  --classpath /opt/huawei/Bigdata/hive-0.13.1/hive-0.13.1/lib/*.jar
Helplessly, it reported yet another error (truly maddening!): java.net.UnknownHostException: hacluster
hacluster is the Hadoop HA nameservice name configured in dfs.nameservices.
Well, for an exception saying the host name hacluster cannot be resolved, the consensus online is:
Copy hdfs-site.xml into Spark's conf directory. After the copy was done, the jar compiled from the program could finally run successfully on the server.
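The fix from this section and the earlier hive-site.xml step boil down to two copies. A sketch assuming the usual HIVE_HOME/HADOOP_HOME/SPARK_HOME layout (adjust the env vars to your installation):

```shell
# Hypothetical layout; both files must land in Spark's conf directory.
cp $HIVE_HOME/conf/hive-site.xml          $SPARK_HOME/conf/
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml  $SPARK_HOME/conf/
```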
But looking back, what about that earlier error: ERROR ObjectStore: Version information not found in metastore?
Why did it happen? And what is the difference between running the packaged jar and running in shell mode?
Continuing: running the same HiveContext-based SQL statement in shell mode still reported this error. I enabled Spark's debug logging and combed through the output for a long time, but found nothing of value. Searching the Internet again, the cases reported online were at WARN level, while mine was at ERROR.
At this point I had no ideas left. Well, since my jar package could be executed successfully, let's see how executing the jar differs from this mode.
The first thought was: why didn't the hive-site.xml parameter hive.metastore.schema.verification take effect? I had restarted the service. Was the parameter simply never being read?!
Well, I added the HIVE_HOME environment variable and tried again. Still no effect. So this parameter really wasn't being read... I was on the verge of collapse. After a long while, it suddenly occurred to me: where does my spark-shell command actually come from? Let's take a look: which spark-shell
It seems something had been found. The spark-shell came from the bin directory of the Spark client program (earlier, to make the commands convenient to use, we had set an environment variable; this is a Huawei product layout). That is, my default environment variables pointed to the Spark client program directory!!!
Having finally found the root of the problem, I copied hive-site.xml and hdfs-site.xml into the conf directory of the client program, restarted the Hive service, and everything was OK!
After a while, still not quite at ease, I asked: was it really this problem? So I tested on other nodes. First, with the parameter absent from the client program directory, execution failed. After adding it, hive.metastore.schema.verification took effect!
Success! Spark's debug logging was enabled throughout the whole process, yet no valuable information ever appeared in the logs.
By the way, to debug a Spark HiveContext program in an IDE, you need to add a resources directory (type: Resources) under the main directory and put hive-site.xml and hdfs-site.xml into it.
You also need to add three driver packages:
datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar
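For reference, a minimal standalone program to debug in the IDE; the object name and query are hypothetical, and hive-site.xml/hdfs-site.xml must be on the classpath via the resources directory described above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextTest {
  def main(args: Array[String]): Unit = {
    // local[*] so the program runs inside the IDE without a cluster
    val conf = new SparkConf().setAppName("HiveContextTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // hive-site.xml on the classpath tells HiveContext where the metastore is
    hiveContext.sql("SHOW TABLES").collect().foreach(println)

    sc.stop()
  }
}
```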
I almost forgot: the result3 problem from the previous section. Haha, this problem is actually caused by the limits of the SQL syntax SparkSQL supports. You can consider other approaches instead of nested IN subqueries, such as building multiple RDDs or using left/right joins (still to be tested).
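For example, a nested IN subquery that the parser of that era choked on can often be rewritten as a join. A sketch with hypothetical tables t1 and t2, using HiveQL's LEFT SEMI JOIN (which behaves like an IN filter):

```scala
// Rewrite of: SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)
// t1 and t2 are hypothetical tables registered in the metastore.
hiveContext.sql(
  "SELECT t1.* FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id"
).collect().foreach(println)
```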