SparkSQL Preliminary Application (Using HiveContext)

Source: Internet
Author: User
Tags: parse error, Scala, IDE

After tossing around for a whole day, I finally solved the RESULT3 error from the previous section. As for why this error occurred, I'll keep it a small mystery for now; first, let's look at how the problem was tracked down:


First, I found this thread: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html which contains this passage:

The issue is, you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.

In fact, there are more problems than that. SparkSQL would keep all (15+5=20) columns in the final table, if I remember correctly. Therefore, doing a join on two tables that share column names would cause a duplicate-column error.

Two points are mentioned here: (1) use HiveContext; (2) the likely cause of the error.
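As a quick illustration of the first point, here is a minimal sketch of that switch inside spark-shell (the table name is a made-up placeholder, not from the original post):

// In spark-shell, sc already exists; build a HiveContext instead of a SQLContext
// so that the richer HiveQL syntax (nested subqueries, etc.) can be parsed.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// The same statement against a plain SQLContext may fail with a SQL parse error.
hiveContext.sql("SELECT id, name FROM some_hive_table LIMIT 10").collect().foreach(println)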


Well, since HiveContext is the way to go, let's use HiveContext (this took a long day):

First of all, look at the requirements for using HiveContext; here I refer to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html

There are three requirements in this article:

1. Check that the $SPARK_HOME/lib directory contains datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar.

2. Check that hive-site.xml from the $HIVE_HOME/conf directory has been copied to the $SPARK_HOME/conf directory.

3. When submitting the program, put the metastore database's JDBC driver jar on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.

I configured everything as required, but after the configuration was complete, this error appeared (in interactive mode):

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

My preliminary judgment was that the problem lay in connecting to Hive's metastore, so I added the metastore connection parameter to the hive-site.xml file:

<property>
<name>hive.metastore.uris</name>
<value>thrift://111.121.21.23:9083</value>
<description></description>
</property>

After setting this parameter, I executed a query full of anticipation, and got another error (this one tangled me up for a long time):

ERROR ObjectStore: Version information not found in metastore.

The error means that when using HiveContext, Spark needs to reach Hive's metastore and read its version information; if it cannot, it throws this exception. Quite a few solutions on the web say to add this parameter to the hive-site.xml file:

<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description></description>
</property>

After adding the parameter and restarting the Hive service, I executed Spark's HiveContext again and still got the error. I then compiled and packaged the program with the IDE and ran it on the server:

#!/bin/bash
cd /opt/huawei/bigdata/datasight_fm_baseplatform_v100r001c00_spark/spark/
./bin/spark-submit \
--class HiveContextTest \
--master local \
--files /opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
/home/wlb/spark.jar \
--archives datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
--classpath /opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/lib/*.jar
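For reference, a minimal sketch of what the HiveContextTest class submitted above might look like (only the class name appears in the original post, so the query and structure here are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver matching --class HiveContextTest; any HiveQL statement
// works here once the metastore described in hive-site.xml is reachable.
object HiveContextTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveContextTest"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    sc.stop()
  }
}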


Helplessly, this reported yet another error (I nearly collapsed!): java.net.UnknownHostException: hacluster

hacluster is Hadoop's dfs.nameservices value (the HDFS HA nameservice name).

So the exception was thrown because the host name hacluster could not be resolved. Continuing, the answer found online was:

hdfs-site.xml needs to be copied to Spark's conf directory. Sure enough, after the copy was done, the program's jar finally ran successfully on the server.
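For reference, the relevant entry in hdfs-site.xml looks roughly like this (only the dfs.nameservices value hacluster is known from the error; the rest of the HA configuration is omitted):

<property>
<name>dfs.nameservices</name>
<value>hacluster</value>
<description></description>
</property>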

But looking back, there was still the earlier error: ERROR ObjectStore: Version information not found in metastore.

What was causing it? And what is the difference between executing the jar package and running in shell mode?

Continuing: running HiveContext-based SQL in shell mode still reported this error, so I turned on Spark's debug logging to look for anything valuable, and after a long time found nothing useful in the logs. Searching online again, every report of this problem on the Internet is at WARN level, while mine is at ERROR level.

At this point I was really out of ideas. Well, since running the jar package succeeded, let's see what the difference is between the jar execution and the interactive mode.

My first thought was: why is the hive-site.xml parameter hive.metastore.schema.verification not taking effect? I did restart the service; is the parameter simply not being read?!

I tried adding the HIVE_HOME environment variable and executing again; still no luck, the parameter was still not being picked up... I was on the verge of collapse. After a long time, it suddenly occurred to me: which spark-shell command was I actually executing? So let's check: which spark-shell


That revealed something: the spark-shell being executed was from the bin directory of the Spark client program (I had earlier set an environment variable so the command would be convenient to use; this is a Huawei product). In other words, my default PATH was pointing to the Spark client program directory!!!

Having finally found the root of the problem, I copied hive-site.xml and hdfs-site.xml to the client program's conf directory, restarted the Hive service, and everything was OK!

After a while I still had a small doubt: was this really the cause? So I tested on another node: first, with the parameter absent from the client program directory, execution failed; after adding it, hive.metastore.schema.verification took effect!


Done! Throughout the whole process Spark's debug logging was turned on, but no valuable information ever showed up in the logs.


By the way, to debug a Spark HiveContext program in the IDE, you need to add a resource directory (of type resources) under the main directory and put hive-site.xml and hdfs-site.xml into it,

and add the three DataNucleus driver jars to the classpath:

datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar
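If you manage the project with a build tool instead of adding jars by hand in Scala IDE, a hypothetical sbt equivalent of that setup might look like this (versions are illustrative for the Spark 1.x / Hive 0.13 era and are not taken from the original post):

// build.sbt sketch: Spark with Hive support plus the three DataNucleus jars.
name := "spark-hivecontext-test"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.3.1" % "provided",
  // The three driver packages listed above:
  "org.datanucleus" % "datanucleus-api-jdo" % "3.2.6",
  "org.datanucleus" % "datanucleus-core" % "3.2.10",
  "org.datanucleus" % "datanucleus-rdbms" % "3.2.9"
)

// hive-site.xml and hdfs-site.xml go into src/main/resources so they land on the classpath.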


Almost forgot: I originally set out to solve the RESULT3 problem from the previous section. That problem is actually due to SparkSQL's limited SQL syntax support. You might consider other approaches (avoiding subqueries nested inside IN), such as building multiple RDDs or using left/right joins (still to be tested), as in the sketch below.
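As an example of the join-based rewrite, here is a sketch with placeholder table and column names (untested, as noted above; hiveContext is the HiveContext created earlier):

// Original intent, rejected by the SQL parser:
//   SELECT * FROM result1 WHERE id IN (SELECT id FROM result2)
// Equivalent rewrite as a join; DISTINCT avoids duplicating rows from result1.
val result3 = hiveContext.sql(
  "SELECT r1.* FROM result1 r1 JOIN (SELECT DISTINCT id FROM result2) r2 ON r1.id = r2.id")
result3.collect().foreach(println)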


In the next section: how to configure the Scala IDE (a problem I spent two days of the Qingming holiday wrestling with; I've summed up two approaches).

This article is from the "one step. One Step" blog, be sure to keep this source http://snglw.blog.51cto.com/5832405/1634438
