In production, Spark is usually deployed on a cluster of Linux machines. Installing Spark on Linux requires that dependencies such as the JDK and Scala be installed first.
Because Spark is a computing framework, the cluster also needs a persistence layer that stores the data, such as HDFS, Hive, or Cassandra; applications are then launched through the startup scripts.
1. Install the JDK
Oracle JDK download address: http://www.oracle.com/technetwork/java/javase/downloads/index.html
Configure the environment variables:
vim ~/.bash_profile
Add the following content:
export JAVA_HOME=/opt/jdk1.8.0_65
export CLASSPATH=$JAVA_HOME/lib/
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
Execute source ~/.bash_profile to make the environment variables take effect.
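To confirm the JDK is visible on the PATH after sourcing, a quick check (the exact build string depends on your download):

java -version
# Expect output mentioning version "1.8.0_65"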
2. Install Scala
Scala download address: http://www.scala-lang.org/download/
Configure the environment variables, adding the following:
export SCALA_HOME=/data/spark/scala-2.12.3/
export PATH=$PATH:$SCALA_HOME/bin
Execute source ~/.bash_profile to make the environment variables take effect.
Run scala -version; normal output indicates success.
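For reference, a successful check prints something along these lines (the copyright line varies by release):

scala -version
# Scala code runner version 2.12.3 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc.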
3. Install Hadoop
Host Name | IP Address    | JDK      | User
Master    | 10.116.33.109 | 1.8.0_65 | root
Slave1    | 10.27.185.72  | 1.8.0_65 | root
Slave2    | 10.25.203.67  | 1.8.0_65 | root
Download address for Hadoop: http://hadoop.apache.org/
Configure the hosts file (same operation on every node): vim /etc/hosts
10.116.33.109 Master
10.27.185.72 Slave1
10.25.203.67 Slave2
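A quick sanity check that the names resolve (run from any node):

# Each ping should report the IP configured in /etc/hosts
ping -c 1 Master
ping -c 1 Slave1
ping -c 1 Slave2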
SSH passwordless authentication configuration (reference: Linux SSH password-free login). On the master node, you must verify that you can log on to every node without a password, or later steps will fail with errors:
ssh Master
ssh Slave1
ssh Slave2
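If passwordless login is not yet in place, a minimal sketch using the root account from the table above (run on the master):

# Generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P ''
# Distribute the public key to every node, the master included
ssh-copy-id root@Master
ssh-copy-id root@Slave1
ssh-copy-id root@Slave2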
Hadoop cluster setup. After extracting the hadoop-2.7.2.tar.gz file, configure the environment variables:
vim ~/.bash_profile
export HADOOP_HOME=/data/spark/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_ROOT_LOGGER=INFO,console
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Execute source ~/.bash_profile to make the environment variables take effect. These environment variables are configured identically on all nodes.
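To confirm the variables took effect, hadoop version should report the release:

hadoop version
# Hadoop 2.7.2
# ... (build and checksum details follow)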
Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_65/
Even if the environment variable is already configured, it must be set here as well; otherwise startup fails with "JAVA_HOME is not set and could not be found."
Modify $HADOOP_HOME/etc/hadoop/slaves:
Slave1
Slave2
Modify $HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/spark/hadoop-2.7.2/tmp</value>
  </property>
</configuration>
Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/spark/hadoop-2.7.2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/spark/hadoop-2.7.2/hdfs/data</value>
  </property>
</configuration>
Modify $HADOOP_HOME/etc/hadoop/mapred-site.xml (first cp mapred-site.xml.template mapred-site.xml):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>Master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Master:19888</value>
  </property>
</configuration>
Modify $HADOOP_HOME/etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>Master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>Master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>Master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>Master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>Master:8088</value>
  </property>
</configuration>
Copy the Hadoop folder of the master node to Slave1 and Slave2.
scp -r hadoop-2.7.2 Slave1:/data/spark/
scp -r hadoop-2.7.2 Slave2:/data/spark/
Start the cluster on the master node, formatting the NameNode before starting:
hadoop namenode -format
Start:
$HADOOP_HOME/sbin/start-all.sh
To check, execute jps on each node.
The master should show the NameNode process and each slave the DataNode process. The Hadoop management interface is at http://Master:8088/.
Pitfall: if the server hostnames were never changed and only the hosts file maps the node names, all kinds of subsequent tasks fail, mainly because the server IP address cannot be obtained through the hostname. Symptoms include MapReduce jobs stuck in ACCEPTED and never RUNNING.
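For reference, a healthy cluster looks roughly like this (PIDs will differ; the SecondaryNameNode runs on the master per the hdfs-site.xml above):

# On Master
jps
# 2321 NameNode
# 2514 SecondaryNameNode
# 2675 ResourceManager

# On Slave1 / Slave2
jps
# 1874 DataNode
# 1990 NodeManager

# Quick HDFS smoke test from the master
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -ls /tmp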
4. Install Spark
Spark download address: http://spark.apache.org/
This example uses spark-2.2.0-bin-hadoop2.7.tgz. Configure the environment variables with the following content:
export SPARK_HOME=/data/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
Enter the $SPARK_HOME/conf directory and copy the templates: cp spark-env.sh.template spark-env.sh; cp slaves.template slaves. Then configure the spark-env.sh file, adding the following:
export SCALA_HOME=/data/spark/scala-2.12.3/
export JAVA_HOME=/opt/jdk1.8.0_65
export SPARK_MASTER_IP=10.116.33.109
export SPARK_WORKER_MEMORY=128m
export HADOOP_CONF_DIR=/data/spark/hadoop-2.7.2/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/data/spark/hadoop-2.7.2/bin/hadoop classpath)
export SPARK_LOCAL_IP=10.116.33.109
export SPARK_MASTER_HOST=10.116.33.109
SPARK_MASTER_HOST must be configured; otherwise the slave nodes fail with "Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:7077".
Modify $SPARK_HOME/conf/slaves, adding the following:
Master
Slave1
Slave2
Copy the configured Spark directory to the Slave1 and Slave2 nodes.
scp -r $SPARK_HOME root@Slave1:$SPARK_HOME
scp -r $SPARK_HOME root@Slave2:$SPARK_HOME
Start the cluster on the master node
$SPARK_HOME/sbin/start-all.sh
To check whether the cluster started successfully, run jps on each node:
The master node shows a new Master process.
The slave nodes show a new Worker process.
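As a final smoke test, one option is to submit the bundled SparkPi example to the standalone master. The master URL matches the SPARK_MASTER_HOST configured above, the executor memory is kept below the 128m SPARK_WORKER_MEMORY limit, and the examples jar name assumes the stock spark-2.2.0-bin-hadoop2.7 layout; adjust if your build differs:

$SPARK_HOME/bin/spark-submit \
  --master spark://10.116.33.109:7077 \
  --class org.apache.spark.examples.SparkPi \
  --executor-memory 100m \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10
# Expect a line like: Pi is roughly 3.14...

The standalone web UI at http://10.116.33.109:8080/ should also list all three registered workers.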