To make installing Hadoop and Spark easier to reproduce, I am writing the steps down today.
First, decide based on how many machines you have on hand whether to install pseudo-distributed or fully distributed Hadoop + Spark; the two setups are recorded separately below.
1. Pseudo-Distributed Installation
In a pseudo-distributed installation, Hadoop's NameNode, SecondaryNameNode, DataNode, and so on all run on a single machine, and the same goes for Spark; this setup is generally used for development environments.
1.1 Preparatory work
System preparation: an Ubuntu 16.04 machine, preferably connected to the Internet
Prepare the four installation packages: jdk-8u111-linux-x64.tar.gz, scala-2.12.0.tgz, hadoop-2.7.3.tar.gz, spark-2.0.2-bin-hadoop2.7.tgz
1.2 Configuring SSH password-free login
SSH is what allows the machines in a cluster to exchange data freely. After the installation is complete, try to SSH into the local machine and check whether a password is still required.
sudo apt-get install ssh openssh-server             # install SSH
ssh-keygen -t rsa -P ""                             # generate an RSA key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for this machine
sudo service ssh start                              # start the SSH service
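To confirm that the password-free login works, a quick check (not part of the original command list) is simply:

ssh localhost    # should open a shell without prompting for a password
exit             # return to the previous shell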
1.3 Extract four packages and configure environment variables
Extract the four packages:
tar -zxvf jdk-8u111-linux-x64.tar.gz
sudo mv jdk1.8.0_111 /usr/lib/                # unpack the JDK and move it to /usr/lib/
tar -zxvf scala-2.12.0.tgz
sudo mv scala-2.12.0 /usr/lib/                # unpack Scala and move it to /usr/lib/
tar -zxvf hadoop-2.7.3.tar.gz                 # unpack the Hadoop package
tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz       # unpack the Spark package
To configure environment variables:
The current user's environment variables live in ~/.profile, while the root user's live in /etc/profile. Here we configure the environment variables for the current user.
vim ~/.profile                                # open the environment variable file
# add the following variables
export JAVA_HOME=/usr/lib/jdk1.8.0_111
export SCALA_HOME=/usr/lib/scala-2.12.0
export HADOOP_HOME=/home/user/hadoop-2.7.3
export SPARK_HOME=/home/user/spark-2.0.2-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# after saving, make the changes take effect immediately
source ~/.profile
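A quick way to confirm that the variables took effect (these version checks are standard for each tool, though not part of the original write-up):

java -version       # should report 1.8.0_111
scala -version      # should report 2.12.0
hadoop version      # should report 2.7.3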
1.4 Configuring Hadoop
There are three files to configure: core-site.xml, mapred-site.xml, and hdfs-site.xml.
Add the following to core-site.xml:
vim hadoop-2.7.3/etc/hadoop/core-site.xml     # open the file
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/user/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Add the following to mapred-site.xml:
cp hadoop-2.7.3/etc/hadoop/mapred-site.xml.template hadoop-2.7.3/etc/hadoop/mapred-site.xml    # copy the template
vim hadoop-2.7.3/etc/hadoop/mapred-site.xml   # open the file
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
Add the following to hdfs-site.xml, where dfs.replication is the number of machines (1 here) and user is the current user name:
vim hadoop-2.7.3/etc/hadoop/hdfs-site.xml     # open the file
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/user/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/user/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
If Hadoop cannot find JAVA_HOME when it starts, set it explicitly in hadoop-2.7.3/etc/hadoop/hadoop-env.sh: export JAVA_HOME=/usr/lib/jdk1.8.0_111
1.5 Configuring Spark
For Spark, only the spark-env.sh file needs to be configured.
vim /home/user/spark-2.0.2-bin-hadoop2.7/conf/spark-env.sh    # open the file
export JAVA_HOME=/usr/lib/jdk1.8.0_111
export SCALA_HOME=/usr/lib/scala-2.12.0
export SPARK_MASTER_HOST=master
export HADOOP_CONF_DIR=/home/user/hadoop-2.7.3/etc/hadoop/
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=16
# there are many more configuration options; see the hints inside the file
1.6 Starting Hadoop and Spark
First format Hadoop's HDFS (distributed file system); this step is necessary, otherwise the NameNode cannot start. However, do not format it every time you start Hadoop, or the data and name directories become incompatible and the DataNode cannot start. If that happens, delete the VERSION file under tmp/data/current/ and reformat HDFS.
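As a sketch, assuming the data directory configured in 1.4 (/home/user/hadoop/tmp/dfs/data; adjust the path if yours differs), the cleanup would look like:

rm /home/user/hadoop/tmp/dfs/data/current/VERSION    # remove the stale VERSION file so the DataNode can rejoin after a reformat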
The commands to format HDFS and to start Hadoop and Spark are:
$HADOOP_HOME/bin/hdfs namenode -format    # format HDFS
$HADOOP_HOME/sbin/start-all.sh            # start Hadoop
$SPARK_HOME/sbin/start-all.sh             # start Spark
After startup, run the jps command. If Hadoop's NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager are running, and Spark's Master and Worker are running as well, the cluster has started successfully; none of these processes may be missing.
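For reference, on a healthy pseudo-distributed node the jps listing would typically contain the process names below (PIDs omitted; this is an expectation, not captured output):

jps
# Hadoop: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
# Spark:  Master, Worker
# plus the Jps process itself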
At this point you can open http://localhost:50070 for the Hadoop cluster web UI and http://localhost:8080 for the Spark cluster web UI.
2. Fully Distributed Installation
As the name implies, a fully distributed installation is a true cluster deployment, typically used in production environments.
2.1 Preparatory work
System preparation: one Ubuntu 16.04 machine as the master (IP: 192.168.1.1), preferably connected to the Internet, and one or more Ubuntu 16.04 machines as slave nodes (IPs: 192.168.1.2 ...)
Prepare the four installation packages: jdk-8u111-linux-x64.tar.gz, scala-2.12.0.tgz, hadoop-2.7.3.tar.gz, spark-2.0.2-bin-hadoop2.7.tgz
2.2 Configuring SSH password-free login
SSH is what allows the machines in a cluster to exchange data freely. After the installation is complete, try to SSH into the local machine and check whether a password is still required.
The steps here are the same as in 1.2, except that the RSA public key on the master is additionally copied to the other machines so that they can reach each other without passwords.
scp ~/.ssh/id_rsa.pub user@192.168.1.2:/home/user/    # copy the RSA public key to the slave
ssh slave01                                           # log in to slave01
mkdir -p ~/.ssh
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys            # register the public key
exit                                                  # log out of the slave
ssh slave01                                           # reconnect to check that login now succeeds without a password
Repeat the same operation for every slave node, so that in the end the master can SSH into any slave node without a password; a small loop such as the sketch below can save some typing.
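A minimal sketch, assuming the user name user and the slave IPs from 2.1 (each iteration still asks for that slave's password once):

for host in 192.168.1.2 192.168.1.3 192.168.1.4; do
    scp ~/.ssh/id_rsa.pub user@$host:/home/user/                                     # copy the public key over
    ssh user@$host 'mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys'   # register it on the slave
done

The ssh-copy-id utility bundled with OpenSSH does the same thing in a single command, if it is available.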
In addition, the hosts file on each machine needs to be configured so that a machine can be reached over SSH by its hostname instead of an explicit IP address.
sudo vim /etc/hosts    # open the hosts file
# add the following IP-to-hostname mappings
192.168.1.1 master
192.168.1.2 slave01
192.168.1.3 slave02
192.168.1.4 slave03
...
Do the same for all slave nodes in turn.
2.3 Extract four packages and configure environment variables
Extract the four packages and configure the environment variables on the master node exactly as in 1.3, and then use the scp command to copy the configured JDK and Scala directories wholesale to the other slave nodes, as sketched below.
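A rough sketch of that copy, assuming the user name user, the slave address 192.168.1.2 from 2.1, and that the remote user can sudo:

scp -r /usr/lib/jdk1.8.0_111 user@192.168.1.2:/home/user/                    # copy the JDK to the slave's home directory
scp -r /usr/lib/scala-2.12.0 user@192.168.1.2:/home/user/                    # copy Scala likewise
ssh -t user@192.168.1.2 'sudo mv ~/jdk1.8.0_111 ~/scala-2.12.0 /usr/lib/'    # move both into /usr/lib/ on the slave

Repeat the copy for every slave node; you may also want to replicate the ~/.profile settings from 1.3 on the slaves so the tools are on the PATH there too.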
2.4 Configuring Hadoop
There are four files to configure: core-site.xml, mapred-site.xml, hdfs-site.xml, and slaves.
The first three are the same as in 1.4, except that localhost is changed to master (the hostname of the master node) and the replication value is changed to the actual number of machines; the slaves file lists the worker hostnames, as sketched below.
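The slaves file simply lists the hostnames of the worker machines, one per line; a minimal example using the hostnames from 2.2:

vim hadoop-2.7.3/etc/hadoop/slaves    # open the file
slave01
slave02
slave03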
Finally, the configured Hadoop directory also needs to be copied wholesale to the other slave nodes with the scp command, as sketched below.
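A minimal sketch, assuming the user name user and the hostnames configured in 2.2:

scp -r hadoop-2.7.3 user@slave01:/home/user/    # repeat for slave02, slave03, ...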
2.5 Configuring Spark
Spark again only needs the spark-env.sh file configured, and this step is the same as 1.5. Finally, copy the configured Spark directory (spark-2.0.2-bin-hadoop2.7) to the other slave nodes with the same scp -r pattern as in 2.4.
2.6 Starting Hadoop and Spark
Start Hadoop and Spark on the master node with the same commands as in 1.6, then verify with jps that the cluster started successfully.
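Roughly what to expect from jps with this layout (an expectation sketch, not captured output; Spark standalone starts its Workers on whichever hosts are listed in Spark's conf/slaves file, which defaults to localhost):

jps            # on the master: NameNode, SecondaryNameNode, ResourceManager, and Spark's Master
ssh slave01    # log in to a slave (hostname from 2.2)
jps            # on the slave: DataNode, NodeManager, and a Spark Worker if the slave is listed in conf/slaves
exit           # back to the master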
With that done, you can develop in Eclipse or IntelliJ on a single machine and run jobs across the whole cluster!
(This guide applies to installing JDK 1.8 + Scala + Hadoop 2.7.3 + Spark 2.0.2 on Ubuntu 14.04 or 16.04.)