Apache Hadoop 1.1.1 + Apache Oozie 3.3.2 detailed installation process (test)


Preface:

1. Build an Apache HADOOP cloud computing platform environment

1.1 Introduction to cluster environment

The Hadoop cluster environment consists of a single physical machine running several virtual machines, in a typical master/slave structure: one Master node and a number of Slave nodes. In a master/slave structure, the master node is generally responsible for cluster management, task scheduling, and load balancing, while the slave nodes carry out the computation and storage tasks assigned by the master.

The specific hardware, software, and network configurations of the cluster are shown in Table 3.1.

Table 3.1 Detailed cluster hardware, software, and network configuration

No.   Host Name   Network Address   Operating System
1     Master      192.168.137.2     CentOS 6.4 x64
2     Slave1      192.168.137.3     CentOS 6.4 x64
3     Slave2      192.168.137.4     CentOS 6.4 x64

1.2 Hadoop Environment Construction

1.2.1 introduction to Apache Hadoop

Hadoop is a distributed computing framework developed as an Apache open-source project. It runs applications on clusters built from large numbers of inexpensive hardware devices and provides stable, reliable interfaces to those applications; its goal is to build a distributed system with high reliability and good scalability. As cloud computing technology has become more widespread, the project has been adopted by more and more individuals and enterprises. The core of the Hadoop project consists of HDFS, MapReduce, and HBase, which are open-source implementations of Google's core cloud computing technologies: GFS (Google File System), MapReduce, and Bigtable.

1.2.2 Apache Hadoop installation preparation

1. hosts Configuration

This step is required because the host names and the user used by the environment are assigned and specified here.

2. Set the host name

First, you need to set the IP Address:

# ifconfig    // query the current IP address

Then, assign each machine its IP address through the (virtual) network configuration; the specific process is not described here.

Now configure the host name of each computer with the following command:

# vim /etc/sysconfig/network

Add the following content:

HOSTNAME=Master.Hadoop

(Note: On the Slave machines, change this to the corresponding Slave1.Hadoop, Slave2.Hadoop, ...)

3. Configure the hosts file

# vim /etc/hosts

Add the following lines:

192.168.137.2 Master.Hadoop
192.168.137.3 Slave1.Hadoop
192.168.137.4 Slave2.Hadoop


4. Verify that the settings have taken effect.

On Master.Hadoop, ping the "Slave1.Hadoop" server to check that it can be reached:

Master.Hadoop $ ping 192.168.137.3

(Note: Do the same test on the other machines.)

5. Add a user

# adduser hadoop
# passwd hadoop    // set the hadoop user's password

1.3.3 SSH password-less authentication settings (all of the following settings are performed as the hadoop user)

While Hadoop is running, the remote Hadoop daemons have to be managed. After Hadoop starts, the NameNode starts and stops the daemons on each DataNode through SSH (Secure Shell). Commands executed between nodes therefore must not prompt for a password, so SSH has to be configured for password-free public key authentication: the NameNode can then log on to each DataNode over SSH without a password and start the DataNode processes, and, likewise, a DataNode can log on to the NameNode without a password.
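For reference only (this is not one of the original steps): on CentOS 6 the openssh-clients package also provides ssh-copy-id, which automates appending a public key to a remote authorized_keys file. A minimal sketch of the whole key distribution, assuming the host names from Table 3.1 and run as the hadoop user on the Master, would look roughly like this:

$ ssh-keygen -t rsa -P ''            # accept the default path, empty passphrase
$ ssh-copy-id hadoop@Slave1.Hadoop   # appends the public key to Slave1's authorized_keys
$ ssh-copy-id hadoop@Slave2.Hadoop
$ ssh Slave1.Hadoop                  # should now log on without a password

The manual steps below achieve the same result and are kept because they also show where the files end up.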

1. Install and start ssh

You can run the following command to check whether ssh and rsync have been installed:

$ rpm -qa | grep openssh
$ rpm -qa | grep rsync

2. Generate a key pair on the Master machine

Run the following command on the Master node:

$ ssh-keygen -t rsa -P ''

(Note: the -P option is followed by two single quotation marks, i.e. an empty passphrase.)

This command generates a password-less key pair. When asked for the storage path, press Enter to accept the default. The generated key pair, id_rsa and id_rsa.pub, is stored in the "~/.ssh" directory.

Check that a ".ssh" folder now exists in the hadoop user's home directory and that it contains the two newly generated key files. Then, on the Master node, append id_rsa.pub to the authorized keys with the following command:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Two things need to be done before verification. The first is to change the permissions of the "authorized_keys" file (this permission setting is very important, because insecure permissions prevent RSA key authentication from working). The second is to edit "/etc/ssh/sshd_config" as the root user to enable password-less logon.

3. Modify the permission of the file "authorized_keys"

Run the following commands:

$ chmod 600 ~/.ssh/authorized_keys

4. Set SSH Configuration

Log on as the root user and modify the following lines in the SSH configuration file "/etc/ssh/sshd_config" by removing the leading "#" comments:

RSAAuthentication yes # enable RSA authentication
PubkeyAuthentication yes # enable public/private key pair authentication
AuthorizedKeysFile .ssh/authorized_keys # public key file path


After changing the settings, remember to restart the SSH service so that they take effect:

# service sshd restart

Log out of the root account and verify as the ordinary hadoop user that the logon works:

$ ssh localhost

5. Generate a key pair on the Slave machines

Run the following commands on Slave1.Hadoop:

$ ssh-keygen -t rsa -P ''

This generates the "~/.ssh" directory on the Slave1.Hadoop machine. Then append the public key generated on this machine to its own authorized_keys file:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Then, change the permissions of the authorized_keys file.

$ chmod 600 ~/.ssh/authorized_keys

(Note: If there are multiple Slave machines, perform the same operation on each of them: Slave2.Hadoop, Slave3.Hadoop, ...)

6. Exchange authorized keys between the Master and the Slaves.

Master.Hadoop $ cd ~/.ssh

Master.Hadoop $ scp ./id_rsa.pub hadoop@192.168.137.3:~/

Slave1.Hadoop $ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

Slave1.Hadoop $ scp ~/.ssh/id_rsa.pub hadoop@192.168.137.2:~/

Master.Hadoop $ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

(Note: If there are multiple Slave machines, perform the same exchange between each Slave and the Master.)

On each Slave machine as well, log on as the root user and modify the following lines in the SSH configuration file "/etc/ssh/sshd_config" by removing the leading "#" comments:

RSAAuthentication yes # enable RSA authentication
PubkeyAuthentication yes # enable public/private key pair authentication
AuthorizedKeysFile .ssh/authorized_keys # public key file path (the same file as generated above)

After changing the settings, remember to restart the SSH service so that they take effect:

# service sshd restart

Log out of the root account and verify as the ordinary hadoop user that the logon works:

$ ssh localhost

7. Verify that the ssh password-less login setting is successful.

Master.Hadoop $ ssh Slave1.Hadoop

If no password is required to log on to the Slave1.Hadoop host, the setup is successful; otherwise one of the previous settings is wrong and should be checked carefully.

Similarly, check that logging on from the Slave to the Master also works without a password:

Slave1.Hadoop $ ssh Master.Hadoop

1.3.4 Java environment installation (the JDK must be installed on all hosts; the following operations are performed as the root user)

1. Download jdk

Download jdk1.7.0_21 from the official Oracle website and put it in the root directory /.

2. Install jdk

Run the following commands:

# cd /usr
# mkdir java
# cp /jdk-7u21-linux-x64.rpm /usr/java/
# cd /usr/java
# rpm -ivh jdk-7u21-linux-x64.rpm

The rpm installation then proceeds.

3. Set java environment variables

Edit the "/etc/profile" file and add the Java "JAVA_HOME", "CLASSPATH", and "PATH" content to it.

First, edit the "/etc/profile" file.

# vim /etc/profile

Second, add Java environment variables

Add the following content at the end of the "/etc/profile" file:

# set java environment
export JAVA_HOME=/usr/java/jdk1.7.0_21
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin


Save and exit. Execute the following command to make the configuration take effect immediately.

# source /etc/profile

4. Verify that the installation and configuration are successful.

Enter the following commands:

# java -version

Then write a small Java program and check that it compiles and runs correctly. If it does, the configuration is successful; otherwise something in the Java configuration is wrong and needs to be checked.
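As an illustration of that check (not from the original article; the file name Hello.java is just an example), the test can be done directly from the shell:

$ cat > Hello.java <<'EOF'
public class Hello {
    public static void main(String[] args) {
        System.out.println("JAVA environment OK");
    }
}
EOF
$ javac Hello.java    # should produce Hello.class without errors
$ java Hello          # should print: JAVA environment OK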

1.3.5 Hadoop cluster Installation

Hadoop must be installed on all machines. First install Hadoop on the Master server and then repeat the steps on the other servers. The installation and configuration of Hadoop is performed as the "root" user.

Log on to the "Master.Hadoop" machine as the root user, locate the "hadoop-1.1.1.tar.gz" that was uploaded to the "/" directory via SSH, and copy this file to the "/usr/" directory.

1. Enter the "/usr/" directory and run the following commands to decompress "hadoop-1.1.1.tar.gz", rename the result to "hadoop", give the ordinary user hadoop permission on this folder, and then delete the "hadoop-1.1.1.tar.gz" installation package.

# cd /usr
# tar -zxvf hadoop-1.1.1.tar.gz
# mv hadoop-1.1.1 hadoop
# chown -R hadoop:hadoop hadoop
# rm -r hadoop-1.1.1.tar.gz

2. Add the Hadoop installation path to "/etc/profile" by modifying the "/etc/profile" file (the same file used for the Java environment variables); the exact lines to append are shown in step 4 below.

3. Create the "tmp" folder in "/usr/hadoop".

# mkdir /usr/hadoop/tmp

4. Configure "/etc/profile"

# vim /etc/profile

# set hadoop path
export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin


Reload "/etc/profile" so that the change takes effect:

# source /etc/profile

5. Configure hadoop

To configure the Hadoop files, first configure hadoop-env.sh.

The hadoop-env.sh file is located in the /usr/hadoop/conf directory.

Add the following content at the end of the file.

# set java environment
export JAVA_HOME=/usr/java/jdk1.7.0_21


The Hadoop configuration files are in the conf directory. In earlier versions the configuration was mainly in Hadoop-default.xml and Hadoop-site.xml. Because Hadoop has developed rapidly and the code base has grown dramatically, development was split into three parts, core, hdfs and map/reduce, and the configuration file was likewise split into three files: core-site.xml, hdfs-site.xml and mapred-site.xml. core-site.xml and hdfs-site.xml are the configuration files from the HDFS point of view; core-site.xml and mapred-site.xml are the configuration files from the MapReduce point of view.

Next, configure the core-site.xml file. Modify the Hadoop core configuration file core-site.xml, which sets the address and port of HDFS:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.137.2:9000</value>
    </property>
</configuration>

(Note: Create the tmp folder in the /usr/hadoop directory first.)

(Note: If the hadoop.tmp.dir parameter is not configured, the system default temporary directory /tmp/hadoop-hadoop is used. That directory is wiped on every reboot, so the format step would have to be re-executed; otherwise errors occur.)

Then configure the hdfs-site.xml file, which modifies the HDFS configuration in Hadoop; here the number of replicas is set to 1:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

(Note: replication is the number of data copies; the default is 3. If there are fewer than 3 Slaves, an error is reported.)


Finally, configure the mapred-site.xml file, which modifies the MapReduce configuration in Hadoop and sets the address and port of the JobTracker:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>http://192.168.137.2:9001</value>
    </property>
</configuration>


Next, configure the masters file and the slaves file. In the masters file, remove "localhost" and enter the IP address of the Master node in the cluster, 192.168.137.2. In the slaves file (used only on the Master host), remove "localhost" and add the IP addresses of all Slave machines in the cluster, one per line (a quick way to write both files is sketched after the list below):

192.168.137.3

192.168.137.4
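For reference, a minimal sketch of writing both files on the Master, following the addresses in Table 3.1 (the conf path assumes the /usr/hadoop layout used in this article):

# echo 192.168.137.2 > /usr/hadoop/conf/masters
# printf '192.168.137.3\n192.168.137.4\n' > /usr/hadoop/conf/slaves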


The Hadoop configuration on the Master machine is now complete; what remains is to configure Hadoop on the Slave machines. Copy the configured "/usr/hadoop" folder from the Master to the "/usr" directory of every Slave server (strictly speaking the slaves file is not needed on a Slave machine, but it is simplest to copy the whole folder). Use the following command format. (Note: the copy can be performed as the hadoop user or as root.)

# scp -r /usr/hadoop root@192.168.137.3:/usr/
# scp -r /usr/hadoop root@192.168.137.4:/usr/

Whether root or the hadoop user performs the copy makes little difference: although the hadoop user owns the "/usr/hadoop" folder on the Master, the hadoop user on Slave1 has no write permission on "/usr" and therefore cannot create folders there. That is why, regardless of which user runs the copy, the remote side is written in the "root@machine-IP" form. Because the password-free SSH connection was set up only for the hadoop user, running "scp" as root will prompt for the root password of the "Slave1.Hadoop" server.

The hadoop folder has now been copied, but it is owned by root, so the hadoop user on the "Slave1.Hadoop" server must be given permission on "/usr/hadoop". Log on to "Slave1.Hadoop" as the root user and run the following command:

# chown -R hadoop:hadoop /usr/hadoop

Then, on "Slave1.Hadoop", modify the "/etc/profile" file (the same file used for the Java environment variables), add the following lines to the end, and make them take effect (source /etc/profile):

# set hadoop environment
export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin


1.3.6 startup and Verification

1. Format the HDFS file system

Run the following command as the ordinary hadoop user on "Master.Hadoop". (Note: this is done only once; subsequent starts do not require formatting, only start-all.sh.)

$ hadoop namenode -format

2. Start hadoop

Before starting the cluster, disable the firewall on all machines in the cluster; otherwise the DataNodes will start and then shut themselves down automatically.

# service iptables stop    (run as root on every node)

Run the following command.

$ start-all.sh

After Hadoop starts successfully, a dfs folder is created inside the tmp folder on the Master, and dfs and mapred folders are created inside the tmp folder on each Slave.
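As an optional sanity check (not part of the original walkthrough), the running daemons and the HDFS state can be inspected; jps ships with the JDK, and the exact daemon list depends on the configuration:

$ jps                      # on the Master: NameNode, SecondaryNameNode, JobTracker
$ jps                      # on each Slave: DataNode, TaskTracker
$ hadoop dfsadmin -report  # number of live DataNodes and HDFS capacity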

So far, the hadoop cloud computing platform has been configured.


2. OOZIE installation Configuration

2.1 OOZIE Introduction

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following content:

- Workflow definitions
- Currently running workflow instances, including instance states and variables

An Oozie workflow is a set of actions (for example Map/Reduce jobs and Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions execute. hPDL (an XML process definition language) is used to describe this graph.

hPDL is a fairly compact language that uses only a limited number of control and action nodes. Control nodes define the execution flow and include the start, end, and fail nodes of the workflow, as well as the mechanisms that control the workflow's execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers a computation or processing task. Oozie supports the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflows (SSH actions have been removed since Oozie schema 0.2).
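To make the hPDL description concrete, here is a minimal illustrative workflow definition with a single map-reduce action. It is a sketch only: the name wf-example and the parameters ${jobTracker}, ${nameNode}, ${inputDir} and ${outputDir} are placeholders rather than anything defined in this article, and the mapper and reducer classes are omitted:

$ cat > workflow.xml <<'EOF'
<workflow-app name="wf-example" xmlns="uri:oozie:workflow:0.2">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

In practice the workflow.xml file is placed in an HDFS directory and that directory is referenced by the workflow job's application path.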

None of the computation and processing tasks triggered by action nodes run inside Oozie itself; they are executed by the Hadoop Map/Reduce framework. This approach lets Oozie rely on the existing Hadoop mechanisms for load balancing and failover. These tasks run asynchronously (with the exception of file system actions, which are handled synchronously). This means that for most types of computation or processing tasks triggered by a workflow action, the workflow has to wait until the task completes before it can transition to the next node in the workflow. Oozie can detect the completion of a computation or processing task in two different ways: callbacks and polling. When Oozie starts a task, it provides the task with a unique callback URL, and the task sends a notification to that URL when it finishes. When the task cannot trigger the callback URL (for example, because of a transient network outage), or when the type of task cannot trigger the callback URL on completion, Oozie has a polling mechanism to make sure the task reaches completion.

Oozie workflows can be parameterized (using variables such as ${inputDir} in the workflow definition). The parameter values must be provided when the workflow job is submitted. With suitable parameterization (for example, using different output directories), several identical workflow jobs can run concurrently.
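As a hedged example of such parameterization (the property values and HDFS paths below are illustrative only and simply reuse the addresses configured earlier in this article), a job.properties file supplies the parameter values at submission time:

$ cat > job.properties <<'EOF'
nameNode=hdfs://192.168.137.2:9000
jobTracker=192.168.137.2:9001
inputDir=${nameNode}/user/hadoop/input
outputDir=${nameNode}/user/hadoop/output
oozie.wf.application.path=${nameNode}/user/hadoop/apps/wf-example
EOF
$ oozie job -oozie http://192.168.137.2:11000/oozie -config job.properties -run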

Some workflows are triggered on demand, but in most cases it is necessary to run them based on time intervals and/or data availability and/or external events. The Oozie Coordinator system allows execution plans for workflows to be defined in terms of these parameters. The coordinator lets workflow execution triggers be modeled as predicates, which can refer to data, time, and/or external events; a workflow job is started when the predicate is satisfied.

It is also often necessary to chain workflow jobs that run regularly but at different intervals, so that the outputs of several workflow runs become the input of the next workflow. Chaining workflows in this way lets the system treat them as a data application pipeline, and Oozie supports building such pipelines.

2.2 OOZIE Installation Process

1. Download Oozie and decompress oozie-3.3.2.tar.gz into the root directory:

This produces the "oozie-3.3.2" folder; keep it under that directory and rename it oozie.

Run the following command:

# ./bin/mkdistro.sh -DskipTests

2. Download the ext-2.2.zip file (the ExtJS 2.2 library) and put it under the same directory.

3. Set the HADOOP configuration file

Add the following content to the /usr/hadoop/conf/core-site.xml file:

<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>192.168.137.2</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>hadoop</value>
</property>


After the modification, create the folder /usr/oozie/libext/ and put ext-2.2.zip into it.

4. Copy the oozie folder to the /usr/ directory.

5. Copy the Hadoop JAR packages into Oozie.

Copy all the JAR packages under the ./hadooplibs/hadoop-1/ folder to the libext folder that was just created, for example:
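A possible concrete form of this step, assuming the /usr/oozie and /usr/oozie/libext paths used in this article:

# cp /usr/oozie/hadooplibs/hadoop-1/*.jar /usr/oozie/libext/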

6. Unzip the ext-2.2.zip file into the ./webapp/ directory.

7. Create the oozie.war file. Run the following command:

# /usr/oozie/bin/oozie-setup.sh -extjs /oozie/webapp/src/main/webapps/ext-2.2.zip

The output shows: Specified Oozie WAR '/usr/oozie.war' already contains ExtJS library files; you can continue.

8. Set the OOZIE configuration file.

Modify the file /usr/oozie/conf/oozie-site.xml and find the following property:

<property>
    <name>oozie.service.JPAService.create.db.schema</name>
    <value>false</value>
    <description>
        Creates Oozie DB.
        If set to true, it creates the DB schema if it does not exist. If the DB schema exists it is a NOP.
        If set to false, it does not create the DB schema. If the DB schema does not exist it fails start up.
    </description>
</property>

Change the value "false" to "true".

9. Load the database.

Run the following command:

# /usr/oozie/bin/ooziedb.sh create -sqlfile oozie.sql -run

The following output appears:

setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"

Validate DB Connection
DONE
Check DB schema does not exist
DONE
Check OOZIE_SYS table does not exist
DONE
Create SQL schema
DONE
Create OOZIE_SYS table
DONE

Oozie DB has been created for Oozie version '3.3.2'

The SQL commands have been written to: oozie.sql

This means the database was created successfully; an oozie.sql file can now be seen in the directory.

10. Add the original Hadoop packages: copy the two JAR packages hadoop-core-1.1.1.jar and commons-configuration-1.6.jar to the directory ./oozie-server/webapps/oozie/WEB-INF/lib/, for example:
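A possible concrete form of this step (the source locations assume the /usr/hadoop installation from section 1; in the Hadoop 1.1.1 distribution hadoop-core-1.1.1.jar usually sits at the top level and commons-configuration-1.6.jar under lib/):

# cp /usr/hadoop/hadoop-core-1.1.1.jar /usr/oozie/oozie-server/webapps/oozie/WEB-INF/lib/
# cp /usr/hadoop/lib/commons-configuration-1.6.jar /usr/oozie/oozie-server/webapps/oozie/WEB-INF/lib/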

11. Copy oozie.war to ./oozie-server/webapps/.

12. Change the permissions of the oozie folder and all of its sub-files:

# chown -R hadoop:hadoop oozie

13. Start Oozie by running the following command:

$ ./bin/oozied.sh run

14. View the web console.

Enter the following command:

$ oozie admin -oozie http://192.168.137.2:11000/oozie -status

The following result is displayed:

System mode: NORMAL

The Oozie management interface can then be viewed by entering http://192.168.137.2:11000/oozie in a browser, as shown in Figure 2.1:


Figure 2.1 Oozie management interface
