Hadoop installation memo
This memo follows Liu Peng's "Practical Hadoop" and the instructions for Hadoop 0.20.2.
First, understand the background processes (daemons) that make up Hadoop and their roles: NameNode, Secondary NameNode, JobTracker, TaskTracker, and DataNode.
NameNode: manages the HDFS namespace. It decides how files are split into blocks and which nodes each block is placed on, and it centrally manages this metadata in memory and coordinates file system I/O. The process is deployed on the master node. It is a single point of failure: if the NameNode goes down, the entire system is unavailable.
Secondary NameNode: despite the name, it is a helper rather than a standby NameNode. Each cluster has one; it communicates with the NameNode and regularly stores snapshots of the HDFS metadata. When the NameNode fails, these snapshots can be used to recover its state. It is also deployed on the master node.
JobTracker: responsible for scheduling jobs. It decides which nodes process which files and listens for the heartbeats sent by the TaskTrackers; if a heartbeat stops arriving or a task fails, the task is rescheduled. Each cluster has only one JobTracker, deployed on the master node.
The preceding three processes run on the master node, while the TaskTracker and DataNode processes run on every slave node in the cluster.
DataNode: reads and writes HDFS data blocks on the local file system. When a client wants to read or write a data block, the NameNode tells it which DataNode to go to; the client then communicates directly with that DataNode server to operate on the block.
TaskTracker: also located on the slave nodes, it is responsible for executing individual tasks. Each slave node runs exactly one TaskTracker, but a TaskTracker can spawn multiple Java virtual machines to process several map and reduce tasks in parallel. The TaskTracker also interacts with the JobTracker: the JobTracker assigns it tasks and monitors its heartbeat, and if the heartbeat stops, the TaskTracker is considered dead and its tasks are reassigned to other TaskTrackers.
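In Hadoop 1.x-era releases, which daemon runs where is controlled by two plain-text files in the conf directory. A minimal sketch (the hostnames master, slave1, and slave2 are placeholders for your own machines):

```
# conf/masters — host(s) that run the Secondary NameNode
master

# conf/slaves — hosts that each run a DataNode and a TaskTracker
slave1
slave2
```

The NameNode and JobTracker themselves are not listed here; their addresses come from fs.default.name in core-site.xml and mapred.job.tracker in mapred-site.xml, and the daemons start on whichever host you run the start scripts.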
To summarize the deployment: NameNode, Secondary NameNode, and JobTracker run on the master node, while each slave node runs a DataNode and a TaskTracker.
You can follow the installation steps in the book, but pay attention to the following points.
On both the master and the slaves, create a dedicated user, grid, to run Hadoop, and set up passwordless SSH login for it: generate a key pair on each machine, then merge the contents of the public key files from all machines into a single authorized_keys file so that ssh no longer prompts for a password.
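The per-node part of this setup can be sketched as follows (run as the grid user on each machine; in a real cluster you would additionally append every other node's id_rsa.pub to the same authorized_keys file, for example with ssh-copy-id, so the master can reach each slave):

```shell
# Make sure the .ssh directory exists before generating keys.
mkdir -p ~/.ssh

# Generate an RSA key pair with an empty passphrase, unless one already exists.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Authorize this node's own key; the other nodes' public keys get
# appended to the same file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# sshd ignores the file unless permissions are strict.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

After the keys are distributed, ssh slave1 from the master should log in without a password prompt.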
When starting Hadoop, be sure to log on as the grid user and work in the grid user's home directory. Permission issues sometimes occur; in that case, set the owner of the hadoop folder on the master and slaves to the grid user and group by running, as root: chown -R grid:grid /home/grid/hadoop-1.2.1 (adjust the path to wherever Hadoop is installed).
Then run start-all.sh under the bin directory of the Hadoop folder; the script prints a startup line for each daemon, indicating that the startup succeeded.
You can then check which processes are running. Run the jps tool that ships with the JDK on the master, and the Hadoop daemons appear in the list.
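As a rough sketch of what a healthy cluster looks like on this Hadoop generation (the process IDs shown are illustrative and will differ on your machines):

```
# On the master:
$ jps
2287 NameNode
2425 SecondaryNameNode
2518 JobTracker
2750 Jps

# On each slave:
$ jps
1982 DataNode
2079 TaskTracker
2211 Jps
```

If one of the daemons is missing from the list, check its log file under the Hadoop logs directory before proceeding.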
Run the same command on the slave nodes.
Now, Hadoop has been installed successfully.