Please indicate the source when reprinting: http://www.cnblogs.com/lighten/p/6106891.html
I recently began learning about big data; the best-known open-source platform is Hadoop. This post records the process of setting up the current version of Hadoop on Ubuntu. During the setup I found a very clear and comprehensive article; this post trims some of its less important content and refines the rest. Click here to view: original.
1. Installing the JDK
Hadoop is a big data platform developed in Java, so it naturally requires a Java runtime environment. That said, Hadoop is not tied to the Java language; Hadoop development supports many languages.
Installing the Java runtime environment is covered in another article and is not repeated here: Ubuntu 16.04 installing the JDK.
2. Configure SSH and passwordless login
Hadoop logs in to nodes over SSH, so SSH must be installed on Linux. The client is usually already installed; only the server needs to be installed:
sudo apt-get install openssh-server
Test by logging in to this machine with ssh localhost; type yes and you should be able to log in. But entering the password every time is cumbersome, and on a cluster it would be a disaster, so configure passwordless login instead.
There are three steps in total:
1. Generate the key pair: ssh-keygen -t rsa. Two files are generated under the ~/.ssh folder: id_rsa (the private key) and id_rsa.pub (the public key).
2. Import the public key to the authentication file and change the permissions:
1) Import on this machine: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
2) Import the server:
First copy the public key to the server:
scp ~/.ssh/id_rsa.pub xxx@<server-ip>:/home/xxx/id_rsa.pub
The public key is then imported into the authentication file, and this step is done on the server:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
Finally, change permissions on the server:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
3) Test: ssh localhost. The first time you need to type yes; afterwards it is no longer required.
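The permission bits in step 2 matter: sshd silently rejects the key if ~/.ssh or authorized_keys is too open. The three steps above can be sketched as follows; this demo runs against a throwaway directory standing in for ~/.ssh, so it can be tried safely (the -N "" option generates a key without a passphrase, which is what you want for passwordless login):

```shell
# Throwaway directory standing in for ~/.ssh
DEMO=$(mktemp -d)
SSH_DIR="$DEMO/.ssh"
mkdir -p "$SSH_DIR"

# 1. Generate the key pair (no passphrase, quiet mode)
ssh-keygen -t rsa -N "" -f "$SSH_DIR/id_rsa" -q

# 2. Import the public key into the authentication file
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"

# 3. Tighten permissions -- sshd refuses the key otherwise
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"

# Show the resulting permission bits
stat -c '%a' "$SSH_DIR" "$SSH_DIR/authorized_keys"
```

For the real setup, replace the throwaway directory with ~/.ssh on the target machine.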
3. Installing Hadoop
1. Download the Hadoop installation package: click here. The binary package is sufficient. You can also download it with the wget command.
2. Unpack it and move it to the directory where you want it:
tar -zxvf hadoop-2.7.3.tar.gz
mv ./hadoop-2.7.3 /opt/hadoop
3. Create the hadoop group, add your user to it, and grant execute permissions
sudo addgroup hadoop
sudo usermod -aG hadoop xxx    # add the current user to the hadoop group
sudo gedit /etc/sudoers        # add the hadoop group to sudoers
After the line root ALL=(ALL) ALL, add: hadoop ALL=(ALL) ALL
sudo chmod -R 755 /opt/hadoop
sudo chown -R xxx:hadoop /opt/hadoop    # otherwise SSH will deny access
These are the generally required operations. The original article also performed some other configuration; if you run into problems, check whether they are caused by those settings: click here.
4. Modify the configuration file. As with the JDK installation, you can choose which file to modify; here /etc/profile is modified:
export HADOOP_HOME=/opt/hadoop
export PATH=.:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:$PATH
source /etc/profile
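To sanity-check the profile additions, the exports can be repeated in a shell and verified directly (a minimal sketch; it assumes the /opt/hadoop install location used above and does not require Hadoop itself to be present):

```shell
# Re-apply the /etc/profile additions in the current shell
export HADOOP_HOME=/opt/hadoop
export PATH=.:${JAVA_HOME}/bin:${HADOOP_HOME}/bin:$PATH

# Verify the variable and that Hadoop's bin directory is on PATH
echo "HADOOP_HOME=$HADOOP_HOME"
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "PATH ok" ;;
  *) echo "PATH missing $HADOOP_HOME/bin" ;;
esac
```

If "PATH ok" is not printed, the profile was not sourced in the current shell, or the install path differs from the one set in HADOOP_HOME.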
The original article also includes many other settings that I did not configure for now; if you hit problems, they can serve as a reference. Click here.
5. Test whether the configuration is successful
hadoop version
6. Hadoop stand-alone configuration (non-distributed mode)
Hadoop runs in non-distributed mode by default, so no additional configuration is required. You can run a demo to check that everything is configured correctly.
cd /opt/hadoop
mkdir input
cp README.txt input
bin/hadoop jar share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.3-sources.jar org.apache.hadoop.examples.WordCount input output
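For reference, the counting that the WordCount example performs can be approximated with ordinary shell tools (a rough stand-in on a made-up two-line input; the real job tokenizes text similarly but runs as a MapReduce job and writes its results under the output directory):

```shell
# Approximate what the WordCount example computes, on a tiny sample input
mkdir -p demo_input
printf 'hello hadoop\nhello world\n' > demo_input/README.txt

# Split on whitespace, then count occurrences of each word
tr -s ' \t' '\n\n' < demo_input/README.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
# prints:
#   hadoop  1
#   hello   2
#   world   1
```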
7.hadoop Pseudo-distributed configuration
Pseudo-distributed mode only requires changing two files. The configuration files are in etc/hadoop under the Hadoop directory.
The first is core-site.xml. Set the temporary directory location; otherwise it defaults to /tmp/hadoop-hadoop, a folder that may be wiped by the system on reboot, so the configured path should be changed.
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Next is hdfs-site.xml. Pseudo-distributed mode has only one node, so the replication factor must be set to 1. The directories for the NameNode and DataNode are also configured here.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
Then run the format command to format the NameNode: ./bin/hdfs namenode -format
Start HDFS: ./sbin/start-dfs.sh. If SSH asks for confirmation, type yes.
Run the jps command to check whether startup succeeded.
Visit http://localhost:50070 to view node information.
Stop HDFS: ./sbin/stop-dfs.sh
The above configures HDFS. Next comes the MapReduce-related configuration; skipping it does not affect much, but you lose resource scheduling. Hadoop 2.x uses YARN for task scheduling and management, which is the biggest difference from the 1.x versions.
cp ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml
vim ./etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Modify YARN's configuration file, yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Start YARN (HDFS must be started first): ./sbin/start-yarn.sh
Start the history server so that task runs can be viewed in the web interface: ./sbin/mr-jobhistory-daemon.sh start historyserver
When YARN is not enabled, tasks run as mapred.LocalJobRunner; with YARN enabled, they run as mapred.YarnRunner. One benefit of starting YARN is that you can monitor tasks through the web interface: http://localhost:8088/cluster.
8. Distributed deployment: without a second computer I did not attempt it; for details see: here.
4. Afterword
Since the machine was configured before this was written, some details may inevitably be missing. If you find any problems, please point them out.