Step by step: how to deploy a Spark version different from the CDH-bundled one in an existing CDH cluster


First of all, download the Spark source code: find the source for your CDH release at http://archive.cloudera.com/cdh5/cdh/5/, then compile and package it yourself. For details on how to compile and package it, see my earlier article:

http://blog.csdn.net/xiao_jun_0820/article/details/44178169


After the build finishes you should have a tarball named something like spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz (the exact name depends on the version you downloaded).
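For reference, the packaging step on a Spark 1.6 source tree is usually driven by make-distribution.sh; a minimal sketch (the Maven profiles and hadoop.version here are assumptions, adjust them to your source tree):

./make-distribution.sh --name custom-spark --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1

The --name custom-spark flag is what produces the -bin-custom-spark suffix in the tarball name.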


Then upload the tarball to the node and extract it into the /opt directory; the extracted directory should be spark-1.6.0-cdh5.7.1-bin-custom-spark. That name is long, so give it a soft link:

ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark
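Putting the whole step together, assuming the tarball was uploaded to /tmp (an assumption; adjust the path):

cd /opt
tar -zxf /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz    # unpack the custom build
ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark    # short soft link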

Then cd into the /opt/spark/conf directory and delete all the .template files there (they are useless; a cleanup sketch follows the two link commands below). Now the key part: under the conf directory, create two soft links, yarn-conf and log4j.properties, pointing at the corresponding entries under CDH's Spark configuration directory (the default is /etc/spark/conf unless you have changed it):

ln -s /etc/spark/conf/yarn-conf yarn-conf

ln -s /etc/spark/conf/log4j.properties log4j.properties
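The template cleanup mentioned above, plus a quick check that the links landed, might look like this:

cd /opt/spark/conf
rm -f *.template      # the template files are of no use here
ls -l                 # should now list the yarn-conf and log4j.properties links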

Then copy the three files classpath.txt, spark-defaults.conf, and spark-env.sh from /etc/spark/conf into your own Spark conf directory, /opt/spark/conf in this example. The final /opt/spark/conf directory contains five entries:
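Those five entries should be the two soft links (yarn-conf and log4j.properties) plus the three copied files. The copy itself, for example:

cp /etc/spark/conf/classpath.txt /opt/spark/conf/
cp /etc/spark/conf/spark-defaults.conf /opt/spark/conf/
cp /etc/spark/conf/spark-env.sh /opt/spark/conf/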


Next, edit the classpath.txt file and locate the Spark-related jars inside; there should be two:

/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar

One is the Spark YARN shuffle jar (used if dynamic resource allocation is enabled); your own packaged Spark also ships it under SPARK_HOME/lib, so replace the path with your own jar's path: /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar. The other is the Spark Streaming Flume sink jar; I don't use it, so I deleted the line. If you need it, change it to your own jar's path as well.
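If you prefer to script the classpath.txt edit, a sed sketch (the patterns assume the parcel paths shown above match what is actually in your file):

sed -i 's#^.*/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar$#/opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar#' /opt/spark/conf/classpath.txt   # repoint the shuffle jar
sed -i '\#spark-streaming-flume-sink#d' /opt/spark/conf/classpath.txt   # drop the Flume sink line if unused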


Next, modify the spark-defaults.conf file. The version CDH generates should look like this:

spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
#spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.eventLog.dir=hdfs://name84:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://name84:18088
spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client

All you need to change is the spark.yarn.jar path; point it at your own assembly jar:

spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar


Then modify spark-env.sh, changing the original export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark to export SPARK_HOME=/opt/spark.
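Both one-line edits can be scripted as well; a sketch, assuming the paths used above:

# repoint spark.yarn.jar at the custom assembly jar
sed -i 's#^spark.yarn.jar=.*#spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar#' /opt/spark/conf/spark-defaults.conf
# repoint SPARK_HOME at the custom install
sed -i 's#^export SPARK_HOME=.*#export SPARK_HOME=/opt/spark#' /opt/spark/conf/spark-env.sh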


OK, all modifications are complete.

Install your own Spark binary package on every machine in the same way.
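A distribution loop along these lines could save some typing (node01..node03 are hypothetical hostnames; the conf directory setup still has to be repeated on each host):

for host in node01 node02 node03; do    # hypothetical hostnames
  scp /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz $host:/tmp/
  ssh $host "tar -zxf /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz -C /opt && ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark"
done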


There are two points to note. log4j.properties and the yarn-conf directory are soft links, so configuration changes you make in Cloudera Manager also take effect in your newly installed Spark. But spark-defaults.conf and spark-env.sh are not soft links. spark-env.sh rarely changes, so that hardly matters; spark-defaults.conf, however, contains configuration that will not be synchronized when you change it in CM, so you have to update it by hand. For example, if you move the history server to a different machine, you must update that setting in your copy yourself.


Making spark-defaults.conf a soft link as well should also work; the only thing that would change is the spark.yarn.jar setting, which should be possible to set back at submit time via --conf spark.yarn.jar=xxxx in the spark-submit command. I have not tried this yet.
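If you do try that route, the override would presumably look like this (untested, as noted above):

/opt/spark/bin/spark-submit --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar --class com.kingnet.framework.StreamingRunnerPro --master yarn-client /opt/spark/lib/dm-streaming-pro.jar test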


Try submitting a job to test it:

/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-client --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/dm-streaming-pro.jar test

Or in yarn-cluster mode:

/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 hdfs://name84:8020/install/dm-streaming-pro.jar test


Reference: http://spark.apache.org/docs/1.6.3/hadoop-provided.html

Using Spark's "Hadoop Free" Build

Spark uses Hadoop client libraries for HDFS and YARN. Starting in version Spark 1.4, the project packages "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop's package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.

This page describes how to connect Spark to Hadoop for different types of distributions.

Apache Hadoop

For Apache distributions, you can use Hadoop's classpath command. For instance:

### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
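On a CDH node, the same idea would presumably use the parcel's hadoop binary; a sketch (/opt/cloudera/parcels/CDH is the conventional symlink to the active parcel):

# in /opt/spark/conf/spark-env.sh, for a Hadoop-free Spark build on CDH
export SPARK_DIST_CLASSPATH=$(/opt/cloudera/parcels/CDH/bin/hadoop classpath)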

