To make installing Hadoop and Spark easier to reproduce, I am writing the steps down today.
First, decide based on how many machines you have on hand whether to install pseudo-distributed or fully distributed Hadoop + Spark; the two setups are recorded separately below.
1. Pseudo-Distributed Installation
In a pseudo-distributed installation, Hadoop's NameNode, SecondaryNameNode, DataNode, and so on all run on a single machine, and the same goes for Spark; this setup is generally used for development environments.
1.1 Preparatory work
System preparation: an Ubuntu 16.04 machine, preferably connected to the Internet
Prepare the four installation packages: jdk-8u111-linux-x64.tar.gz, scala-2.12.0.tgz, hadoop-2.7.3.tar.gz, spark-2.0.2-bin-hadoop2.7.tgz
1.2 Configuring SSH password-free login
SSH is what allows the machines in a cluster to exchange data freely. After the installation is complete, try to SSH into the local machine and check whether a password is still required.
sudo apt-get install ssh openssh-server             # install SSH
ssh-keygen -t rsa -P ""                             # generate an RSA key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for this machine
sudo service ssh start                              # start the SSH service
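To confirm that the password-free login works, a quick check (not part of the original command list) is simply:

ssh localhost    # should open a shell without prompting for a password
exit             # return to the previous shell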
1.3 Extract four packages and configure environment variables
Extract the four packages:
tar -zxvf jdk-8u111-linux-x64.tar.gz
sudo mv jdk1.8.0_111 /usr/lib/                # unpack the JDK and move it to /usr/lib/
tar -zxvf scala-2.12.0.tgz
sudo mv scala-2.12.0 /usr/lib/                # unpack Scala and move it to /usr/lib/
tar -zxvf hadoop-2.7.3.tar.gz                 # unpack the Hadoop package
tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz       # unpack the Spark package
To configure environment variables:
The current user's environment variables live in ~/.profile, while the root user's live in /etc/profile. Here we configure the environment variables for the current user.
vim ~/.profile                                # open the environment variable file
# add the following variables
export JAVA_HOME=/usr/lib/jdk1.8.0_111
export SCALA_HOME=/usr/lib/scala-2.12.0
export HADOOP_HOME=/home/user/hadoop-2.7.3
export SPARK_HOME=/home/user/spark-2.0.2-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# after saving, make the changes take effect immediately
source ~/.profile
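A quick way to confirm that the variables took effect (these version checks are standard for each tool, though not part of the original write-up):

java -version       # should report 1.8.0_111
scala -version      # should report 2.12.0
hadoop version      # should report 2.7.3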
1.4 Configuring Hadoop
There are three files to configure: core-site.xml, mapred-site.xml, and hdfs-site.xml.
Add the following to core-site.xml:
vim hadoop-2.7.3/etc/hadoop/core-site.xml     # open the file
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/user/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Add the following to mapred-site.xml:
cp hadoop-2.7.3/etc/hadoop/mapred-site.xml.template hadoop-2.7.3/etc/hadoop/mapred-site.xml    # copy the template
vim hadoop-2.7.3/etc/hadoop/mapred-site.xml   # open the file
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
Add the following to hdfs-site.xml, where dfs.replication is the number of machines (1 here) and user is the current user name:
vim hadoop-2.7.3/etc/hadoop/hdfs-site.xml     # open the file
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/user/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/user/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
If Hadoop cannot find JAVA_HOME when it starts, set it explicitly in hadoop-2.7.3/etc/hadoop/hadoop-env.sh: export JAVA_HOME=/usr/lib/jdk1.8.0_111
1.5 Configuring Spark
For Spark, only the spark-env.sh file needs to be configured.
vim /home/user/spark-2.0.2-bin-hadoop2.7/conf/spark-env.sh    # open the file
export JAVA_HOME=/usr/lib/jdk1.8.0_111
export SCALA_HOME=/usr/lib/scala-2.12.0
export SPARK_MASTER_HOST=master
export HADOOP_CONF_DIR=/home/user/hadoop-2.7.3/etc/hadoop/
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=16
# there are many more configuration options; see the hints inside the file
1.6 Starting Hadoop and Spark
First format Hadoop's HDFS (distributed file system); this step is necessary, otherwise the NameNode cannot start. However, do not format it every time you start Hadoop, or the data and name directories become incompatible and the DataNode cannot start. If that happens, delete the VERSION file under tmp/data/current/ and reformat HDFS.
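As a sketch, assuming the data directory configured in 1.4 (/home/user/hadoop/tmp/dfs/data; adjust the path if yours differs), the cleanup would look like:

rm /home/user/hadoop/tmp/dfs/data/current/VERSION    # remove the stale VERSION file so the DataNode can rejoin after a reformat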
The commands to format HDFS and to start Hadoop and Spark are:
$HADOOP_HOME/bin/hdfs namenode -format    # format HDFS
$HADOOP_HOME/sbin/start-all.sh            # start Hadoop
$SPARK_HOME/sbin/start-all.sh             # start Spark
After startup, run the jps command. If Hadoop's NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are running, and Spark's Master and Worker are running as well, the cluster has started successfully; none of these processes may be missing.
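For reference, on a healthy pseudo-distributed node the jps listing would typically contain the process names below (PIDs omitted; this is an expectation, not captured output):

jps
# Hadoop: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
# Spark:  Master, Worker
# plus the Jps process itself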
At this point you can open http://localhost:50070 for the Hadoop cluster web UI and http://localhost:8080 for the Spark cluster web UI.
2. Fully Distributed Installation
As the name implies, a fully distributed installation is a true cluster deployment, typically used in production environments.
2.1 Preparatory work
System preparation: one Ubuntu 16.04 machine as the master (IP: 192.168.1.1), preferably connected to the Internet, and one or more Ubuntu 16.04 machines as slave nodes (IPs: 192.168.1.2 ...)
Prepare the four installation packages: jdk-8u111-linux-x64.tar.gz, scala-2.12.0.tgz, hadoop-2.7.3.tar.gz, spark-2.0.2-bin-hadoop2.7.tgz
2.2 Configuring SSH password-free login
SSH is what allows the machines in a cluster to exchange data freely. After the installation is complete, try to SSH into the local machine and check whether a password is still required.
The steps here are the same as in 1.2, except that the RSA public key on the master is additionally copied to the other machines so that they can reach each other without passwords.
scp ~/.ssh/id_rsa.pub user@192.168.1.2:/home/user/    # copy the RSA public key to the slave
ssh slave01                                           # log in to slave01
mkdir -p ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys            # register the public key
exit                                                  # log out of the slave
ssh slave01                                           # reconnect to check that login now succeeds without a password
Repeat the same operation for every slave node, so that in the end the master can SSH into any slave node without a password; a small loop such as the sketch below can save some typing.
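A minimal sketch, assuming the user name user and the slave IPs from 2.1 (each iteration still asks for that slave's password once):

for host in 192.168.1.2 192.168.1.3 192.168.1.4; do
    scp ~/.ssh/id_rsa.pub user@$host:/home/user/                                     # copy the public key over
    ssh user@$host 'mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys'   # register it on the slave
done

The ssh-copy-id utility bundled with OpenSSH does the same thing in a single command, if it is available.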
In addition, the hosts file on each machine needs to be configured so that a machine can be reached over SSH by its hostname instead of an explicit IP address.
sudo vim /etc/hosts    # open the hosts file
# add the following IP-to-hostname mappings
192.168.1.1 master
192.168.1.2 slave01
192.168.1.3 slave02
192.168.1.4 slave03
...
Do the same for all slave nodes in turn.
2.3 Extract four packages and configure environment variables
Extract the four packages and configure the environment variables on the master node exactly as in 1.3, and then use the scp command to copy the configured JDK and Scala directories wholesale to the other slave nodes, as sketched below.
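A rough sketch of that copy, assuming the user name user, the slave address 192.168.1.2 from 2.1, and that the remote user can sudo:

scp -r /usr/lib/jdk1.8.0_111 user@192.168.1.2:/home/user/                    # copy the JDK to the slave's home directory
scp -r /usr/lib/scala-2.12.0 user@192.168.1.2:/home/user/                    # copy Scala likewise
ssh -t user@192.168.1.2 'sudo mv ~/jdk1.8.0_111 ~/scala-2.12.0 /usr/lib/'    # move both into /usr/lib/ on the slave

Repeat the copy for every slave node; you may also want to replicate the ~/.profile settings from 1.3 on the slaves so the tools are on the PATH there too.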
2.4 Configuring Hadoop
There are four files to configure: core-site.xml, mapred-site.xml, hdfs-site.xml, and slaves.
The first three are the same as in 1.4, except that localhost is changed to master (the hostname of the master node) and the replication value is changed to the actual number of machines; the slaves file lists the worker hostnames, as sketched below.
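The slaves file simply lists the hostnames of the worker machines, one per line; a minimal example using the hostnames from 2.2:

vim hadoop-2.7.3/etc/hadoop/slaves    # open the file
slave01
slave02
slave03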
Finally, the configured Hadoop directory also needs to be copied wholesale to the other slave nodes with the scp command, as sketched below.
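A minimal sketch, assuming the user name user and the hostnames configured in 2.2:

scp -r hadoop-2.7.3 user@slave01:/home/user/    # repeat for slave02, slave03, ...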
2.5 Configuring Spark
Spark again only needs the spark-env.sh file configured, and this step is the same as 1.5. Finally, copy the configured Spark directory (spark-2.0.2-bin-hadoop2.7) to the other slave nodes with the same scp -r pattern as in 2.4.
2.6 Starting Hadoop and Spark
Start Hadoop and Spark on the master node with the same commands as in 1.6, then verify with jps that the cluster started successfully.
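Roughly what to expect from jps with this layout (an expectation sketch, not captured output; Spark standalone starts its Workers on whichever hosts are listed in Spark's conf/slaves file, which defaults to localhost):

jps            # on the master: NameNode, SecondaryNameNode, ResourceManager, and Spark's Master
ssh slave01    # log in to a slave (hostname from 2.2)
jps            # on the slave: DataNode, NodeManager, and a Spark Worker if the slave is listed in conf/slaves
exit           # back to the master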
With that done, you can develop in Eclipse or IntelliJ on a single machine and run jobs across the whole cluster!
(This guide applies to installing JDK 1.8 + Scala + Hadoop 2.7.3 + Spark 2.0.2 on Ubuntu 14.04 or 16.04.)