Spark on YARN with Hive: a practical case and FAQ

[TOC]

1 Scenario

In practice, the following scenario comes up:

Log data is written to HDFS, the ops team loads the HDFS data into Hive, and Spark is then used to parse the logs; Spark itself is deployed in spark on yarn mode.

In this scenario, the data in Hive has to be loaded through HiveContext in our Spark program, as sketched below.
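To make the scenario concrete, here is a minimal sketch of reading such a log table from Spark through HiveContext. The database, table, and column names (log_db, access_log, message) and the object name are made up for illustration; the real program used in this case follows in the next section, and the hive-site.xml requirement it depends on is covered in the deployment section.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: the log table already exists in Hive (loaded there by ops),
// and Spark reads it through HiveContext. Names below are illustrative only.
object LogParsingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogParsingSketch"))
    // Requires hive-site.xml to be shipped with the job (see the deployment section).
    val hiveContext = new HiveContext(sc)

    val logsDF = hiveContext.sql("select * from log_db.access_log")

    // A trivial "parse" step: count the lines that look like errors.
    val errorCount = logsDF.filter(logsDF("message").contains("ERROR")).count()
    println(s"error lines: $errorCount")

    sc.stop()
  }
}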

If you want to run this test yourself, the environment can be set up by following my earlier articles; the main pieces to configure are:

    • 1. Hadoop environment
      • The configuration of the Hadoop environment is covered in a previously written article;
    • 2. Spark environment
      • Spark only needs to be set up on the node that submits the job, because the spark on yarn deployment mode is used;
    • 3. Hive environment
      • Hive needs to be configured because the Spark task has to be submitted together with the hive-site.xml file; only with it can the metadata of the existing Hive environment be recognized;
      • In other words, in the spark on yarn deployment mode, the Hive configuration file alone is enough for HiveContext to read the metadata stored in MySQL and the Hive table data stored in HDFS;
      • The configuration of the Hive environment is covered in a previous article;

In fact, there is an earlier article on Spark Standalone with Hive that you can refer to: Spark SQL Note Consolidation (iii): Load/save functions and Spark SQL functions.

2 Writing the program and packaging

The test code here is fairly simple, as follows:

package cn.xpleaf.spark.scala.sql.p2

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author xpleaf
  */
object _01HiveContextOps {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
    val conf = new SparkConf()
      // .setMaster("local[2]")
      .setAppName(s"${_01HiveContextOps.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("show databases").show()
    hiveContext.sql("use mydb1")

    // Create the teacher_info table
    val sql1 = "create table teacher_info(\n" +
      "name string,\n" +
      "height double)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql1)

    // Create the teacher_basic table
    val sql2 = "create table teacher_basic(\n" +
      "name string,\n" +
      "age int,\n" +
      "married boolean,\n" +
      "children int)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql2)

    // Load data into the tables
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_info.txt' into table teacher_info")
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_basic.txt' into table teacher_basic")

    // Step 2: join the two tables
    val sql3 = "select\n" +
      "b.name,\n" +
      "b.age,\n" +
      "if(b.married, 'married', 'unmarried') as married,\n" +
      "b.children,\n" +
      "i.height\n" +
      "from teacher_info i\n" +
      "inner join teacher_basic b on i.name=b.name"
    val joinDF: DataFrame = hiveContext.sql(sql3)
    val joinRDD = joinDF.rdd
    joinRDD.collect().foreach(println)
    joinDF.write.saveAsTable("teacher")

    sc.stop()
  }

}

You can see that the program simply creates tables in Hive, loads data into them, joins the two tables, and saves the result to a Hive table.

After the program is written it can be packaged; note that the dependencies do not need to be bundled into the jar. The jar can then be uploaded to our environment; see the build sketch below.
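The jar name spark-process-1.0-SNAPSHOT.jar suggests the original project is built with Maven, so treat the following build.sbt only as an equivalent sketch of the "do not bundle the dependencies" point, not the actual build file: the Spark artifacts are marked provided so they stay out of the packaged jar, since the cluster already supplies them. Spark 1.6.2 is taken from the assembly jar visible in the upload log later in this article; the Scala version is an assumption.

// build.sbt -- illustrative sketch only (the real project appears to use Maven)
name := "spark-process"

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.6" // assumption; must match the Scala version of the cluster's Spark build

// "provided": available at compile time, supplied by the cluster at run time,
// and therefore not packaged into the jar that gets submitted.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.6.2" % "provided"
)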

3 Deployment

Write the submit script as follows:

[[email protected] jars]$ cat spark-submit-yarn.sh
/home/hadoop/app/spark/bin/spark-submit \
--class $2 \
--master yarn \
--deploy-mode cluster \
--executor-memory 1G \
--num-executors 1 \
--files $SPARK_HOME/conf/hive-site.xml \
--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
$1

Note that the --files and --jars options here are critical; they are explained below:

--files $HIVE_HOME/conf/hive-site.xml    // adds the Hive configuration file to the classpath of the Driver and Executors
--jars $HIVE_HOME/lib/mysql-connector-java-5.1.39.jar,…    // adds the jars that Hive depends on to the classpath of the Driver and Executors
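If you are unsure whether hive-site.xml actually reached the job, a quick sanity check can be dropped into the driver code. This helper is not part of the original program; it relies on the fact that, in yarn-cluster mode, files shipped with --files are localized into the container's working directory, which is on the classpath.

// Hypothetical helper (not in the original program): verify that hive-site.xml,
// shipped via --files, is visible on the classpath of the driver.
def hiveSiteOnClasspath(): Boolean = {
  val url = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
  if (url == null) {
    println("hive-site.xml NOT found on the classpath -- HiveContext would fall back to a local, empty metastore")
    false
  } else {
    println(s"hive-site.xml found at: $url")
    true
  }
}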

You can then execute the script to submit the task to yarn:

[[email protected] jars]$ ./spark-submit-yarn.sh spark-process-1.0-SNAPSHOT.jar cn.xpleaf.spark.scala.sql.p2._01HiveContextOps

4 Viewing results

Note that if you want to monitor the execution process, you need to configure the history servers (the MapReduce JobHistoryServer and the Spark HistoryServer); refer to the article I wrote earlier.

4.1 Yarn UI

4.2 Spark UI

4.3 Hive

You can start Hive and check the data written by our Spark program:

hive (mydb1)> show tables;
OK
t1
t2
t3_arr
t4_map
t5_struct
t6_emp
t7_external
t8_partition
t8_partition_1
t8_partition_copy
t9
t9_bucket
teacher
teacher_basic
teacher_info
test
tid
Time taken: 0.057 seconds, Fetched: 17 row(s)
hive (mydb1)> select * from teacher_info;
OK
zhangsan    175.0
lisi        180.0
wangwu      175.0
zhaoliu     195.0
zhouqi      165.0
weiba       185.0
Time taken: 1.717 seconds, Fetched: 6 row(s)
hive (mydb1)> select * from teacher_basic;
OK
zhangsan    23    false    0
lisi        24    false    0
wangwu      25    false    0
zhaoliu     26    true     1
zhouqi      27    true     2
weiba       28    true     3
Time taken: 0.115 seconds, Fetched: 6 row(s)
hive (mydb1)> select * from teacher;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
zhangsan    23    unmarried    0    175.0
lisi        24    unmarried    0    180.0
wangwu      25    unmarried    0    175.0
zhaoliu     26    married      1    195.0
zhouqi      27    married      2    165.0
weiba       28    married      3    185.0
Time taken: 0.134 seconds, Fetched: 6 row(s)
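The same check can also be done from Spark itself. The snippet below is a hypothetical follow-up (the object name is made up): it reads back the teacher table that saveAsTable created and prints its schema and row count, which should match the six rows shown in the Hive CLI output above.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical verification job: read back the table written by saveAsTable.
object _02VerifyTeacherTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VerifyTeacherTable"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("use mydb1")

    val teacherDF = hiveContext.table("teacher")
    teacherDF.printSchema()                             // name, age, married, children, height
    println(s"rows in teacher: ${teacherDF.count()}")   // expected: 6
    teacherDF.show()

    sc.stop()
  }
}
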
5 Problems and Solutions

1. User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Note that our Spark deployment mode is yarn, and the YARN cluster itself does not carry the Spark and Hive dependencies, so when submitting the task you must explicitly specify the jars to upload:

--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \

In fact, pay attention to the console output when submitting the task:

18/10/09 10:57:44 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-assembly-1.6.2-hadoop2.6.0.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/jars/spark-process-1.0-SNAPSHOT.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-process-1.0-SNAPSHOT.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/mysql-connector-java-5.1.39.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/mysql-connector-java-5.1.39.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-api-jdo-3.2.6.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-api-jdo-3.2.6.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-core-3.2.10.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-core-3.2.10.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-rdbms-3.2.9.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-rdbms-3.2.9.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/conf/hive-site.xml -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/hive-site.xml
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/tmp/spark-6f582e5c-3eef-4646-b8c7-0719877434d8/__spark_conf__103916311924336720.zip -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/__spark_conf__103916311924336720.zip

You can see that before running the task, Spark uploads the relevant jars and configuration files to YARN's staging directory on HDFS.

2. User class threw exception: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: mydb1

mydb1 does not exist, which means the metadata of our existing Hive environment was not read. This happens when the hive-site.xml configuration file is not specified at submission time, i.e. the following option is missing:

--files $SPARK_HOME/conf/hive-site.xml \
