SparkSQL Preliminary Application (Using HiveContext)

Source: Internet
Author: User
Tags: parse error, Scala, IDE

After tossing around for a whole day, I finally solved the RESULT3 error from the previous section. As for why this error occurred, I'll keep it a small mystery for now; first, let's look at how the problem was tracked down:


First, I found this thread: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-td16299.html which contains this passage:

The issue is, you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.

In fact, there are more problems than that. SparkSQL would keep all (15+5=20) columns in the final table, if I remember correctly. Therefore, doing a join on two tables that share column names would cause a duplicate-column error.

Two points are mentioned here: (1) use HiveContext; (2) the likely cause of the error.
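As a quick illustration of the first point, here is a minimal sketch of that switch inside spark-shell (the table name is a made-up placeholder, not from the original post):

// In spark-shell, sc already exists; build a HiveContext instead of a SQLContext
// so that the richer HiveQL syntax (nested subqueries, etc.) can be parsed.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// The same statement against a plain SQLContext may fail with a SQL parse error.
hiveContext.sql("SELECT id, name FROM some_hive_table LIMIT 10").collect().foreach(println)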


Well, since HiveContext is the way to go, let's use HiveContext (this took a long day):

First of all, look at the requirements for using HiveContext; here I refer to this article: http://www.cnblogs.com/byrhuangqiang/p/4012087.html

There are three requirements in this article:

1. Check that the $SPARK_HOME/lib directory contains datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar.

2. Check that hive-site.xml from the $HIVE_HOME/conf directory has been copied to the $SPARK_HOME/conf directory.

3. When submitting the program, put the metastore database's JDBC driver jar on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.

I configured everything as required, but after the configuration was complete, this error appeared (in interactive mode):

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

My preliminary judgment was that the problem lay in connecting to Hive's metastore, so I added the metastore connection parameter to the hive-site.xml file:

<property>
<name>hive.metastore.uris</name>
<value>thrift://111.121.21.23:9083</value>
<description></description>
</property>

After setting this parameter, I executed a query full of anticipation, and got another error (this one tangled me up for a long time):

ERROR ObjectStore: Version information not found in metastore.

The error means that when using HiveContext, Spark needs to reach Hive's metastore and read its version information; if it cannot, it throws this exception. Quite a few solutions on the web say to add this parameter to the hive-site.xml file:

<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description></description>
</property>

After adding the parameter and restarting the Hive service, I executed Spark's HiveContext again and still got the error. I then compiled and packaged the program with the IDE and ran it on the server:

#!/bin/bash
cd /opt/huawei/bigdata/datasight_fm_baseplatform_v100r001c00_spark/spark/
./bin/spark-submit \
--class HiveContextTest \
--master local \
--files /opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/conf/hive-site.xml \
/home/wlb/spark.jar \
--archives datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
--classpath /opt/huawei/bigdata/hive-0.13.1/hive-0.13.1/lib/*.jar
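For reference, a minimal sketch of what the HiveContextTest class submitted above might look like (only the class name appears in the original post, so the query and structure here are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver matching --class HiveContextTest; any HiveQL statement
// works here once the metastore described in hive-site.xml is reachable.
object HiveContextTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveContextTest"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    sc.stop()
  }
}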


Helplessly, this reported yet another error (I nearly collapsed!): java.net.UnknownHostException: hacluster

hacluster is Hadoop's dfs.nameservices value (the HDFS HA nameservice name).

So the exception was thrown because the host name hacluster could not be resolved. Continuing, the answer found online was:

hdfs-site.xml needs to be copied to Spark's conf directory. Sure enough, after the copy was done, the program's jar finally ran successfully on the server.
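For reference, the relevant entry in hdfs-site.xml looks roughly like this (only the dfs.nameservices value hacluster is known from the error; the rest of the HA configuration is omitted):

<property>
<name>dfs.nameservices</name>
<value>hacluster</value>
<description></description>
</property>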

But looking back, there was still the earlier error: ERROR ObjectStore: Version information not found in metastore.

What was causing it? And what is the difference between executing the jar package and running in shell mode?

Continuing: running HiveContext-based SQL in shell mode still reported this error, so I turned on Spark's debug logging to look for anything valuable, and after a long time found nothing useful in the logs. Searching online again, every report of this problem on the Internet is at WARN level, while mine is at ERROR level.

At this point I was really out of ideas. Well, since running the jar package succeeded, let's see what the difference is between the jar execution and the interactive mode.

My first thought was: why is the hive-site.xml parameter hive.metastore.schema.verification not taking effect? I did restart the service; is the parameter simply not being read?!

I tried adding the HIVE_HOME environment variable and executing again; still no luck, the parameter was still not being picked up... I was on the verge of collapse. After a long time, it suddenly occurred to me: which spark-shell command was I actually executing? So let's check: which spark-shell


That revealed something: the spark-shell being executed was from the bin directory of the Spark client program (I had earlier set an environment variable so the command would be convenient to use; this is a Huawei product). In other words, my default PATH was pointing to the Spark client program directory!!!

Having finally found the root of the problem, I copied hive-site.xml and hdfs-site.xml to the client program's conf directory, restarted the Hive service, and everything was OK!

After a while I still had a small doubt: was this really the cause? So I tested on another node: first, with the parameter absent from the client program directory, execution failed; after adding it, hive.metastore.schema.verification took effect!


Done! Throughout the whole process Spark's debug logging was turned on, but no valuable information ever showed up in the logs.


By the way, to debug a Spark HiveContext program in the IDE, you need to add a resource directory (of type resources) under the main directory and put hive-site.xml and hdfs-site.xml into it,

and add the three DataNucleus driver jars to the classpath:

datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar
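If you manage the project with a build tool instead of adding jars by hand in Scala IDE, a hypothetical sbt equivalent of that setup might look like this (versions are illustrative for the Spark 1.x / Hive 0.13 era and are not taken from the original post):

// build.sbt sketch: Spark with Hive support plus the three DataNucleus jars.
name := "spark-hivecontext-test"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.3.1" % "provided",
  // The three driver packages listed above:
  "org.datanucleus" % "datanucleus-api-jdo" % "3.2.6",
  "org.datanucleus" % "datanucleus-core" % "3.2.10",
  "org.datanucleus" % "datanucleus-rdbms" % "3.2.9"
)

// hive-site.xml and hdfs-site.xml go into src/main/resources so they land on the classpath.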


Almost forgot: I originally set out to solve the RESULT3 problem from the previous section. That problem is actually due to SparkSQL's limited SQL syntax support. You might consider other approaches (avoiding subqueries nested inside IN), such as building multiple RDDs or using left/right joins (still to be tested), as in the sketch below.
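As an example of the join-based rewrite, here is a sketch with placeholder table and column names (untested, as noted above; hiveContext is the HiveContext created earlier):

// Original intent, rejected by the SQL parser:
//   SELECT * FROM result1 WHERE id IN (SELECT id FROM result2)
// Equivalent rewrite as a join; DISTINCT avoids duplicating rows from result1.
val result3 = hiveContext.sql(
  "SELECT r1.* FROM result1 r1 JOIN (SELECT DISTINCT id FROM result2) r2 ON r1.id = r2.id")
result3.collect().foreach(println)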


In the next section: how to configure the Scala IDE (a problem I spent two days of the Qingming holiday wrestling with; I've summed up two approaches).

This article is from the "one step. One Step" blog, be sure to keep this source http://snglw.blog.51cto.com/5832405/1634438
