[TOC]
1 Scenario
In practice we ran into the following scenario:
Log data lands in HDFS; the ops team loads the HDFS data into Hive, and Spark is then used to parse the logs, with Spark deployed in spark-on-yarn mode.
In this scenario, our Spark program needs to load the data in Hive through a HiveContext.
If you want to reproduce the test yourself, you can refer to my earlier articles for the environment setup; the following mainly needs to be configured:
- 1. Hadoop environment
- The Hadoop environment setup is covered in a previously written article;
- 2. Spark environment
- The Spark environment only needs to be configured on the node that submits the job, because spark on yarn is used;
- 3. Hive environment
- The Hive environment needs to be configured because the Spark job must be submitted together with the hive-site.xml file; only then can it recognize the metadata of the existing Hive environment;
- So in fact, in the spark on yarn deployment mode, only the Hive configuration file is needed for HiveContext to read both the metadata stored in MySQL and the Hive table data stored on HDFS;
- The Hive environment setup is also covered in a previous article;
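To make concrete what the submitted hive-site.xml carries, here is a minimal sketch of the metastore-related properties. The hostname, database name, and credentials are assumptions for illustration; use the values from your own Hive setup:

```xml
<!-- hypothetical hive-site.xml fragment; host, database, user, and password
     are placeholders and depend on your own metastore setup -->
<configuration>
  <!-- JDBC connection to the MySQL database that stores Hive's metadata -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```

With these properties shipped alongside the job, HiveContext can reach the MySQL-backed metastore and resolve databases such as mydb1.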
In fact, there is an earlier article on Spark Standalone with Hive that you can refer to: Spark SQL Note Consolidation (iii): Load/save functions and Spark SQL functions.
2 Writing the program and packaging
As a test case, the code here is fairly simple, as follows:
```scala
package cn.xpleaf.spark.scala.sql.p2

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author xpleaf
  */
object _01HiveContextOps {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
    val conf = new SparkConf()
      // .setMaster("local[2]")
      .setAppName(s"${_01HiveContextOps.getClass.getSimpleName}")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("show databases").show()
    hiveContext.sql("use mydb1")

    // Create the teacher_info table
    val sql1 = "create table teacher_info(\n" +
      "name string,\n" +
      "height double)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql1)

    // Create the teacher_basic table
    val sql2 = "create table teacher_basic(\n" +
      "name string,\n" +
      "age int,\n" +
      "married boolean,\n" +
      "children int)\n" +
      "row format delimited\n" +
      "fields terminated by ','"
    hiveContext.sql(sql2)

    // Load data into the tables
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_info.txt' into table teacher_info")
    hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_basic.txt' into table teacher_basic")

    // Join the data of the two tables
    val sql3 = "select\n" +
      "b.name,\n" +
      "b.age,\n" +
      "if(b.married, 'married', 'unmarried') as married,\n" +
      "b.children,\n" +
      "i.height\n" +
      "from teacher_info i\n" +
      "inner join teacher_basic b on i.name = b.name"
    val joinDF: DataFrame = hiveContext.sql(sql3)
    val joinRDD = joinDF.rdd
    joinRDD.collect().foreach(println)

    // Save the joined result as a Hive table
    joinDF.write.saveAsTable("teacher")

    sc.stop()
  }
}
```
As you can see, the program simply creates tables in Hive, loads data, joins the two tables, and saves the result back to a Hive table.
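The inner join that sql3 expresses can be sketched in plain Scala collections, without any Spark dependency, to show exactly what it computes. The case classes and the two sample rows below are illustrative stand-ins taken from the sample data shown later:

```scala
// Plain-Scala sketch of the inner join performed by sql3, using two sample rows.
// TeacherInfo / TeacherBasic mirror the two Hive tables' columns.
case class TeacherInfo(name: String, height: Double)
case class TeacherBasic(name: String, age: Int, married: Boolean, children: Int)

object JoinSketch {
  val info = Seq(TeacherInfo("zhangsan", 175.0), TeacherInfo("lisi", 180.0))
  val basic = Seq(TeacherBasic("zhangsan", 23, married = false, children = 0),
                  TeacherBasic("lisi", 24, married = false, children = 0))

  // Inner join on name, projecting the same columns as sql3,
  // including the if(b.married, 'married', 'unmarried') translation.
  def join(): Seq[(String, Int, String, Int, Double)] =
    for {
      b <- basic
      i <- info if i.name == b.name
    } yield (b.name, b.age, if (b.married) "married" else "unmarried", b.children, i.height)
}
```

This is only a mental model of the SQL; in the real job, Hive and Spark perform the join over the HDFS-backed tables.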
Package it once the code is written, and note that you do not need to bundle the dependencies into the jar. Then upload the jar to our environment.
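One way to keep the dependencies out of the jar is to mark them as provided in the build. The fragment below is a hypothetical pom.xml sketch; the artifact name and Scala/Spark versions (2.10 / 1.6.2, inferred from the spark-assembly jar seen later in the logs) are assumptions to adapt to your own build:

```xml
<!-- hypothetical pom.xml fragment: Spark and Hive jars already exist on the
     cluster, so they are marked provided and excluded from the packaged jar -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.10</artifactId>
  <version>1.6.2</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, the jar stays small and the cluster's own Spark/Hive classes are used at runtime.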
3 deployment
Write the submit script as follows:
```shell
[[email protected] jars]$ cat spark-submit-yarn.sh
/home/hadoop/app/spark/bin/spark-submit \
--class $2 \
--master yarn \
--deploy-mode cluster \
--executor-memory 1G \
--num-executors 1 \
--files $SPARK_HOME/conf/hive-site.xml \
--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
$1
```
Note that the `--files` and `--jars` options are critical; they work as follows:

```shell
--files $HIVE_HOME/conf/hive-site.xml                         # add Hive's configuration file to the classpath of the Driver and Executors
--jars $HIVE_HOME/lib/mysql-connector-java-5.1.39.jar,….      # add the jars Hive depends on to the classpath of the Driver and Executors
```
You can then execute the script to submit the task to yarn:
```shell
[[email protected] jars]$ ./spark-submit-yarn.sh spark-process-1.0-SNAPSHOT.jar cn.xpleaf.spark.scala.sql.p2._01HiveContextOps
```
4 Viewing results
Note that if you want to monitor the execution process, you need to configure the history servers (the MapReduce JobHistoryServer and the Spark HistoryServer); refer to the article I wrote earlier.
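For reference, the Spark side of that monitoring setup boils down to event logging plus a log directory for the history server. This is a minimal sketch, assuming the `hdfs://ns1/spark-logs` directory already exists; adjust the path to your own cluster:

```properties
# hypothetical spark-defaults.conf fragment for the Spark HistoryServer
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://ns1/spark-logs
spark.history.fs.logDirectory    hdfs://ns1/spark-logs
```

After that, `sbin/start-history-server.sh` serves the finished applications' UIs, which is what makes the Spark UI below viewable after the yarn-cluster job exits.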
4.1 Yarn UI
4.2 Spark UI
4.3 Hive
You can start hive and then view the data loaded by our Spark program:
```shell
hive (mydb1)> show tables;
OK
t1
t2
t3_arr
t4_map
t5_struct
t6_emp
t7_external
t8_partition
t8_partition_1
t8_partition_copy
t9
t9_bucket
teacher
teacher_basic
teacher_info
test
tid
Time taken: 0.057 seconds, Fetched: 17 row(s)
hive (mydb1)> select * from teacher_info;
OK
zhangsan	175.0
lisi	180.0
wangwu	175.0
zhaoliu	195.0
zhouqi	165.0
weiba	185.0
Time taken: 1.717 seconds, Fetched: 6 row(s)
hive (mydb1)> select * from teacher_basic;
OK
zhangsan	23	false	0
lisi	24	false	0
wangwu	25	false	0
zhaoliu	26	true	1
zhouqi	27	true	2
weiba	28	true	3
Time taken: 0.115 seconds, Fetched: 6 row(s)
hive (mydb1)> select * from teacher;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
zhangsan	23	unmarried	0	175.0
lisi	24	unmarried	0	180.0
wangwu	25	unmarried	0	175.0
zhaoliu	26	married	1	195.0
zhouqi	27	married	2	165.0
weiba	28	married	3	185.0
Time taken: 0.134 seconds, Fetched: 6 row(s)
```
5 Problems and Solutions
1.User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Note that our Spark deployment mode is yarn, and the YARN environment does not ship with the Spark and Hive dependencies, so when submitting the task you must specify the jar dependencies to upload:
```shell
--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
```
In fact, when submitting the task, watch the console output carefully:
```shell
18/10/09 10:57:44 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-assembly-1.6.2-hadoop2.6.0.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/jars/spark-process-1.0-SNAPSHOT.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-process-1.0-SNAPSHOT.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/mysql-connector-java-5.1.39.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/mysql-connector-java-5.1.39.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-api-jdo-3.2.6.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-api-jdo-3.2.6.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-core-3.2.10.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-core-3.2.10.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-rdbms-3.2.9.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-rdbms-3.2.9.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/conf/hive-site.xml -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/hive-site.xml
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/tmp/spark-6f582e5c-3eef-4646-b8c7-0719877434d8/__spark_conf__103916311924336720.zip -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/__spark_conf__103916311924336720.zip
```
You can see that before running the job, it uploads the related Spark jars to YARN's environment, that is, to HDFS.
2.User class threw exception: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: mydb1
mydb1 "does not exist", which means the metadata of our existing Hive environment was not read; that is because the hive-site.xml configuration file was not specified when submitting the task, as follows:
```shell
--files $SPARK_HOME/conf/hive-site.xml \
```
Spark on YARN with Hive: a practical case and common problems