Step by step: how to deploy a Spark version different from the CDH-bundled one in an existing CDH cluster


First of all, download the Spark source code: find the source for your CDH release at http://archive.cloudera.com/cdh5/cdh/5/, then compile and package it yourself. For details on how to compile and package it, see my earlier article:

http://blog.csdn.net/xiao_jun_0820/article/details/44178169


After the build finishes you should have a tarball named something like spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz (the exact name depends on the version you downloaded).
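For reference, the packaging step on a Spark 1.6 source tree is usually driven by make-distribution.sh; a minimal sketch (the Maven profiles and hadoop.version here are assumptions, adjust them to your source tree):

./make-distribution.sh --name custom-spark --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1

The --name custom-spark flag is what produces the -bin-custom-spark suffix in the tarball name.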


Then upload the tarball to the node and extract it into the /opt directory; the extracted directory should be spark-1.6.0-cdh5.7.1-bin-custom-spark. That name is long, so give it a soft link:

ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark
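Putting the whole step together, assuming the tarball was uploaded to /tmp (an assumption; adjust the path):

cd /opt
tar -zxf /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz    # unpack the custom build
ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark    # short soft link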

Then cd into the /opt/spark/conf directory and delete all the .template files there (they are useless; a cleanup sketch follows the two link commands below). Now the key part: under the conf directory, create two soft links, yarn-conf and log4j.properties, pointing at the corresponding entries under CDH's Spark configuration directory (the default is /etc/spark/conf unless you have changed it):

ln -s /etc/spark/conf/yarn-conf yarn-conf

ln -s /etc/spark/conf/log4j.properties log4j.properties
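The template cleanup mentioned above, plus a quick check that the links landed, might look like this:

cd /opt/spark/conf
rm -f *.template      # the template files are of no use here
ls -l                 # should now list the yarn-conf and log4j.properties links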

Then copy the three files classpath.txt, spark-defaults.conf, and spark-env.sh from /etc/spark/conf into your own Spark conf directory, /opt/spark/conf in this example. The final /opt/spark/conf directory contains five entries:
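Those five entries should be the two soft links (yarn-conf and log4j.properties) plus the three copied files. The copy itself, for example:

cp /etc/spark/conf/classpath.txt /opt/spark/conf/
cp /etc/spark/conf/spark-defaults.conf /opt/spark/conf/
cp /etc/spark/conf/spark-env.sh /opt/spark/conf/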


Next, edit the classpath.txt file and locate the Spark-related jars inside; there should be two:

/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar

One is the Spark YARN shuffle jar (used if dynamic resource allocation is enabled); your own packaged Spark also ships it under SPARK_HOME/lib, so replace the path with your own jar's path: /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar. The other is the Spark Streaming Flume sink jar; I don't use it, so I deleted the line. If you need it, change it to your own jar's path as well.
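If you prefer to script the classpath.txt edit, a sed sketch (the patterns assume the parcel paths shown above match what is actually in your file):

sed -i 's#^.*/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar$#/opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar#' /opt/spark/conf/classpath.txt   # repoint the shuffle jar
sed -i '\#spark-streaming-flume-sink#d' /opt/spark/conf/classpath.txt   # drop the Flume sink line if unused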


Next, modify the spark-defaults.conf file. The version CDH generates should look like this:

spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
#spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.eventLog.dir=hdfs://name84:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://name84:18088
spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client

All you need to change is the spark.yarn.jar path; point it at your own assembly jar:

spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar


Then modify spark-env.sh, changing the original export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark to export SPARK_HOME=/opt/spark.
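Both one-line edits can be scripted as well; a sketch, assuming the paths used above:

# repoint spark.yarn.jar at the custom assembly jar
sed -i 's#^spark.yarn.jar=.*#spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar#' /opt/spark/conf/spark-defaults.conf
# repoint SPARK_HOME at the custom install
sed -i 's#^export SPARK_HOME=.*#export SPARK_HOME=/opt/spark#' /opt/spark/conf/spark-env.sh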


OK, all modifications are complete.

Install your own Spark binary package on every machine in the same way.
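A distribution loop along these lines could save some typing (node01..node03 are hypothetical hostnames; the conf directory setup still has to be repeated on each host):

for host in node01 node02 node03; do    # hypothetical hostnames
  scp /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz $host:/tmp/
  ssh $host "tar -zxf /tmp/spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz -C /opt && ln -s /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark"
done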


There are two points to note. log4j.properties and the yarn-conf directory are soft links, so configuration changes you make in Cloudera Manager also take effect in your newly installed Spark. But spark-defaults.conf and spark-env.sh are not soft links. spark-env.sh rarely changes, so that hardly matters; spark-defaults.conf, however, contains configuration that will not be synchronized when you change it in CM, so you have to update it by hand. For example, if you move the history server to a different machine, you must update that setting in your copy yourself.


Making spark-defaults.conf a soft link as well should also work; the only thing that would change is the spark.yarn.jar setting, which should be possible to set back at submit time via --conf spark.yarn.jar=xxxx in the spark-submit command. I have not tried this yet.
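If you do try that route, the override would presumably look like this (untested, as noted above):

/opt/spark/bin/spark-submit --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar --class com.kingnet.framework.StreamingRunnerPro --master yarn-client /opt/spark/lib/dm-streaming-pro.jar test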


Try submitting a job to test it:

/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-client --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/dm-streaming-pro.jar test

Or in yarn-cluster mode:

/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 hdfs://name84:8020/install/dm-streaming-pro.jar test


Reference: http://spark.apache.org/docs/1.6.3/hadoop-provided.html

Using Spark's "Hadoop Free" Build

Spark uses Hadoop client libraries for HDFS and YARN. Starting in version Spark 1.4, the project packages "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop's package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.

This page describes how to connect Spark to Hadoop for different types of distributions.

Apache Hadoop

For Apache distributions, you can use Hadoop's classpath command. For instance:

### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
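On a CDH node, the same idea would presumably use the parcel's hadoop binary; a sketch (/opt/cloudera/parcels/CDH is the conventional symlink to the active parcel):

# in /opt/spark/conf/spark-env.sh, for a Hadoop-free Spark build on CDH
export SPARK_DIST_CLASSPATH=$(/opt/cloudera/parcels/CDH/bin/hadoop classpath)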

