First of all, you need to download the Spark source code. Find the source for your version at http://archive.cloudera.com/cdh5/cdh/5/, then compile and package it yourself. For how to compile and package it, you can refer to my earlier article:
http://blog.csdn.net/xiao_jun_0820/article/details/44178169
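For reference, the packaging command run from the Spark source root looks roughly like the following (a sketch only; the exact Maven profiles and Hadoop version depend on your environment, and the linked article has the full steps):
./make-distribution.sh --name custom-spark --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1 -Phive -Phive-thriftserver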
After the build you should end up with a compressed package named something like spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz (the name varies with the version you downloaded).
Then upload it to the node and extract it into the /opt directory. The extracted directory should be spark-1.6.0-cdh5.7.1-bin-custom-spark. The name is too long, so give it a symlink:
ln -s spark-1.6.0-cdh5.7.1-bin-custom-spark spark
Then cd into the /opt/spark/conf directory and delete the template files inside; they are useless. Now the key step: in this conf directory, create two symlinks, yarn-conf and log4j.properties, pointing to the corresponding entries under CDH's Spark configuration directory (the default is /etc/spark/conf unless you have changed it):
ln -s /etc/spark/conf/yarn-conf yarn-conf
ln -s /etc/spark/conf/log4j.properties log4j.properties
Then copy the three files classpath.txt, spark-defaults.conf, and spark-env.sh from the /etc/spark/conf directory into your own Spark conf directory, which in this example is /opt/spark/conf. The final /opt/spark/conf directory then contains 5 files: yarn-conf, log4j.properties, classpath.txt, spark-defaults.conf, and spark-env.sh.
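The copy itself is just one command (assuming the default /etc/spark/conf location):
cp /etc/spark/conf/classpath.txt /etc/spark/conf/spark-defaults.conf /etc/spark/conf/spark-env.sh /opt/spark/conf/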
Edit the classpath.txt file and locate the Spark-related jars inside; there should be two of them:
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar
One is the spark-yarn-shuffle jar (it is used when dynamic resource allocation is enabled), which also exists under the lib directory of the Spark you packaged yourself, so replace its path with your own jar path: /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar. The other is the Spark Streaming Flume sink jar; I don't use it, so I deleted that line. If you do need it, change it to your own jar path as well.
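If you prefer to script this edit, something like the following sed commands should work (a sketch; double-check the parcel path and jar names in your own classpath.txt first):
# point the yarn-shuffle line at your own jar
sed -i 's#^.*spark-1.6.0-cdh5.7.1-yarn-shuffle.jar$#/opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar#' /opt/spark/conf/classpath.txt
# drop the flume sink jar if you don't use it
sed -i '/spark-streaming-flume-sink/d' /opt/spark/conf/classpath.txt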
Next, modify the spark-defaults.conf file. The one that ships with CDH should look like this:
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
#spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.eventLog.dir=hdfs://name84:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://name84:18088
spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client
The only thing you need to change is the spark.yarn.jar path; point it at your own assembly jar:
spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar
Then modify spark-env.sh, changing the original export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark to export SPARK_HOME=/opt/spark.
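This edit can also be scripted, roughly like this (a sketch; check how SPARK_HOME is set in your copy of spark-env.sh before running it):
sed -i 's#^export SPARK_HOME=.*#export SPARK_HOME=/opt/spark#' /opt/spark/conf/spark-env.sh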
OK, all modifications are complete.
Install your own Spark binary package on every machine in the same way.
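One way to distribute it (a sketch, assuming passwordless SSH and hypothetical hostnames node02 and node03; the conf symlinks resolve on each node as long as /etc/spark/conf exists there):
for host in node02 node03; do
  rsync -a /opt/spark-1.6.0-cdh5.7.1-bin-custom-spark /opt/spark "$host":/opt/
done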
There are two points to note. log4j.properties and the yarn-conf directory are symlinks, so configuration changes made in Cloudera Manager are picked up automatically by your newly installed Spark. But spark-defaults.conf and spark-env.sh are not symlinks. spark-env.sh generally never changes, so that doesn't matter, but spark-defaults.conf contains configuration that will not be synchronized if you change it in CM; you have to update those settings by hand. For example, if you move the History Server to a different machine, you must update this configuration yourself.
Making spark-defaults.conf a symlink as well should also work; the only issue is that the spark.yarn.jar setting would then be different, and you should be able to set it back on spark-submit with --conf spark.yarn.jar=xxxx. I have not tried this yet.
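If you do try it, the submission would presumably look something like this (untested, as noted above; the class and jar are the ones used in the example below):
/opt/spark/bin/spark-submit --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar --class com.kingnet.framework.StreamingRunnerPro --master yarn-client --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/dm-streaming-pro.jar test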
Now try submitting a job to test it:
/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-client --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /opt/spark/lib/dm-streaming-pro.jar test
Or in yarn-cluster mode:
/opt/spark/bin/spark-submit --class com.kingnet.framework.StreamingRunnerPro --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 hdfs://name84:8020/install/dm-streaming-pro.jar test
Reference: http://spark.apache.org/docs/1.6.3/hadoop-provided.html
Using Spark's "Hadoop free" Build
Spark uses Hadoop client libraries for HDFS and YARN. Starting in version Spark 1.4, the project packages "Hadoop free" builds that lets you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop's package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.
This page describes how to connect Spark to Hadoop for different types of distributions.
Apache Hadoop
For Apache distributions, you can use Hadoop's classpath command. For instance:
### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)