Use yum source to install the CDH Hadoop Cluster


This document records the process of installing a CDH Hadoop cluster with yum, including HDFS, Yarn, Hive, and HBase. It is written against CDH 5.4, so the steps below apply to that version.

0. Environment Description

System Environment:

  • Operating System: CentOS 6.6
  • Hadoop version: CDH 5.4
  • JDK version: 1.7.0_71
  • Run user: root

The role of each node in the cluster is planned as follows:

192.168.56.121        cdh1     NameNode, ResourceManager, HBase, Hive metastore, Impala Catalog, Impala statestore, Sentry
192.168.56.122        cdh2     DataNode, SecondaryNameNode, NodeManager, HBase, Hive Server2, Impala Server
192.168.56.123        cdh3     DataNode, HBase, NodeManager, Hive Server2, Impala Server

cdh1 acts as the master node; the other nodes act as slave nodes.

1. Preparations

Before installing the Hadoop cluster, make the following preparations. When modifying configuration files, it is recommended to edit the file on one node and then synchronize it to the others: for HDFS and YARN, edit and synchronize from the NameNode node; for HBase, pick any one node to synchronize from. Because you will synchronize configuration files and start services on multiple nodes, configuring SSH password-less login is recommended.
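As a minimal sketch of password-less SSH setup (assuming the root user and the hostnames planned above):

```shell
# on the node where you edit configuration files, generate a key pair (empty passphrase)
$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# copy the public key to every other node
$ ssh-copy-id root@cdh2
$ ssh-copy-id root@cdh3
```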

1.1 configure hosts

CDH requires IPv4 and does not support IPv6, so disable IPv6:

$ vim /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Make it take effect:

$ sysctl -p

Finally, confirm whether it is disabled:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1

1. Set the hostname. Take cdh1 as an example:

$ hostname cdh1

2. Ensure /etc/hosts contains the IP address and FQDN of every node. If you are using DNS, keeping this information in /etc/hosts is not strictly necessary, but it is a best practice.
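For this cluster, /etc/hosts on every node could look like the following (IP addresses taken from the plan above):

```
127.0.0.1        localhost
192.168.56.121   cdh1
192.168.56.122   cdh2
192.168.56.123   cdh3
```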

3. Ensure /etc/sysconfig/network contains HOSTNAME=cdh1.

4. Check the network: run the following commands to verify that the hostname and its IP address are configured correctly.

Run uname -a and check that the hostname matches the output of the hostname command:

$ uname -a
Linux cdh1 2.6.32-358.23.2.el6.x86_64 #1 SMP Wed Oct 16 18:37:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ hostname
cdh1

Run /sbin/ifconfig to view the IP address:

$ ifconfig
eth1      Link encap:Ethernet  HWaddr 08:00:27:75:E0:95
          inet addr:192.168.56.121  Bcast:192.168.56.255  Mask:255.255.255.0
......

Install bind-utils before running the host command:

$ yum install bind-utils -y

Run the following command to check whether the hostname and ip address match:

$ host -v -t A `hostname`
Trying "cdh1"
...
;; ANSWER SECTION:
cdh1.            60    IN    A    192.168.56.121

5. When configuring node names in any Hadoop configuration file, use the hostname rather than the IP address.

1.2 disable Firewall

$ setenforce 0
$ vim /etc/sysconfig/selinux    # set SELINUX=disabled
# clear iptables
$ iptables -F
1.3 clock synchronization

Build the clock synchronization server

Select the cdh1 node as the clock synchronization server and the other nodes as the client synchronization time to this node. Install ntp:

$ yum install ntp

On cdh1, modify the configuration file /etc/ntp.conf:

restrict default ignore    # by default, do not allow ntp to be modified or queried, and do not accept special packets
restrict 127.0.0.1         # give the local machine full permissions
restrict 192.168.56.0 mask 255.255.255.0 notrap nomodify    # allow machines on the local network to synchronize time
server 192.168.56.121      # local clock
driftfile /var/lib/ntp/drift
fudge 127.127.1.0 stratum 10

Start ntp:

# enable starting at boot
$ chkconfig ntpd on
$ service ntpd start

ntpq is used to monitor ntpd. It communicates with the NTP server using standard NTP mode 6 control messages.

ntpq -p queries the NTP servers on the network and displays the relationship between the client and each server:

$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(1)        .LOCL.           5 l    6   64    1    0.000    0.000   0.000
  • "*": the NTP server currently used for synchronization; the most accurate source.
  • "+": an NTP server that responded to this query request.
  • blank (space): an NTP server that did not respond.
  • "remote": the name of the NTP server that responded to the request.
  • "refid": the upstream server that this NTP server synchronizes with.
  • "st": the stratum level of the responding NTP server.
  • "when": the number of seconds since the last successful request.
  • "poll": the polling interval, in seconds, between requests to the server.
  • "offset": the time offset, in milliseconds (ms), between the host and the synchronized time source.

Client Configuration

Perform the following operations on the cdh2 and cdh3 nodes:

$ ntpdate cdh1

ntpd usually takes about five minutes after startup before it can provide time service, so right after it starts, clients cannot synchronize and the error "no server suitable for synchronization found" is returned. Wait about five minutes after starting and try again.

You can use the crond service for scheduled time calibration.

# calibrate against the network time server automatically at 1:00 every day
00 1 * * * root /usr/sbin/ntpdate 192.168.56.121 > /root/ntpdate.log 2>&1

1.4 install jdk

JDK 1.7 is required for CDH 5.4. For the JDK installation process, see the online documentation.
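As a rough sketch of a manual install, assuming an Oracle JDK 1.7.0_71 rpm has already been downloaded (the file name and install path are illustrative):

```shell
# install the rpm and export JAVA_HOME for all users
$ rpm -ivh jdk-7u71-linux-x64.rpm
$ cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/java/jdk1.7.0_71
export PATH=$JAVA_HOME/bin:$PATH
EOF
$ source /etc/profile
$ java -version
```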

1.5 set local yum Source

The official CDH yum repo files are at http://archive.cloudera.com/cdh4/RedHat/6/x86_64/cdh/cloudera-cdh4.repo and http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo. Adjust the baseurl path in the file according to the CDH version you are installing.

You can download the cdh4 repository archive or the cdh5 repository archive from here.

Because I am using CentOS, I download the cdh5 centos6 archive, extract it to the FTP service path, and configure a local CDH yum repo:

[hadoop]
name=hadoop
baseurl=ftp://cdh1/cdh/5/
enabled=1
gpgcheck=0

For the operating system's yum source, it is recommended to download the CentOS DVD and configure a local yum repo.
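One way to do this, sketched here assuming the CentOS 6.6 DVD ISO is available on the node (the ISO file name and mount point are illustrative):

```shell
# mount the ISO and point a repo file at it
$ mkdir -p /mnt/cdrom
$ mount -o loop CentOS-6.6-x86_64-bin-DVD1.iso /mnt/cdrom
$ cat > /etc/yum.repos.d/centos-local.repo <<'EOF'
[centos-local]
name=centos-local
baseurl=file:///mnt/cdrom
enabled=1
gpgcheck=0
EOF
```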

2. install and configure HDFS

According to the node plan at the beginning of the article, cdh1 is the NameNode node, cdh2 is the SecondaryNameNode node, and cdh2 and cdh3 are DataNode nodes.

Install hadoop-hdfs-namenode on the cdh1 node:

$ yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-namenode

Install hadoop-hdfs-secondarynamenode on the cdh2 node:

$ yum install hadoop-hdfs-secondarynamenode -y

Install hadoop-hdfs-datanode on the cdh2 and cdh3 nodes:

$ yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y

For how to configure NameNode HA, see configure hdfs ha in CDH. We recommend that you do not configure hdfs ha for the moment.

2.1 modify the hadoop configuration file

In /etc/hadoop/conf/core-site.xml, set the fs.defaultFS property. This property specifies the NameNode and whether the default file system is file or hdfs, in the format hdfs://<namenode host>:<namenode port>/. The default file system is file:///:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdh1:8020</value>
</property>

In /etc/hadoop/conf/hdfs-site.xml, set the dfs.permissions.superusergroup property, which specifies the HDFS superuser group. The default is hdfs; here it is changed to hadoop:

<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hadoop</value>
</property>

For more configuration information, see Apache Cluster Setup.

2.2 specify the local file directory

The default file path and permission requirements in hadoop are as follows:

Parameter                     Owner       Permissions   Default path
hadoop.tmp.dir                hdfs:hdfs   drwx------    /var/hadoop
dfs.namenode.name.dir         hdfs:hdfs   drwx------    file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir         hdfs:hdfs   drwx------    file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir   hdfs:hdfs   drwx------    file://${hadoop.tmp.dir}/dfs/namesecondary

This means you can configure only hadoop.tmp.dir in hdfs-site.xml, or configure each of the paths above separately. Here they are configured separately; hdfs-site.xml is configured as follows:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/dfs/nn</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/dfs/dn</value>
</property>

On the NameNode, manually create the local directory for dfs.name.dir (or dfs.namenode.name.dir):

$ mkdir -p /data/dfs/nn

On each DataNode, manually create the local directory for dfs.data.dir (or dfs.datanode.data.dir):

$ mkdir -p /data/dfs/dn

Modify the directory owner:

$ chown -R hdfs:hdfs /data/dfs/nn /data/dfs/dn

The Hadoop process sets the permissions of dfs.data.dir (or dfs.datanode.data.dir) and dfs.name.dir (or dfs.namenode.name.dir) to 755 by default; they must be manually set to 700:

$ chmod 700 /data/dfs/nn
# or
$ chmod go-rx /data/dfs/nn

Note: a DataNode can be configured with multiple local directories. The dfs.datanode.failed.volumes.tolerated parameter specifies how many of those directories may fail before the DataNode stops serving.
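For example, to let a DataNode with several data directories keep running with at most one failed directory (the value 1 is illustrative), hdfs-site.xml could contain:

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```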

2.3 configure SecondaryNameNode

To configure the SecondaryNameNode, add the following parameters to /etc/hadoop/conf/hdfs-site.xml:

dfs.namenode.checkpoint.check.period
dfs.namenode.checkpoint.txns
dfs.namenode.checkpoint.dir
dfs.namenode.checkpoint.edits.dir
dfs.namenode.num.checkpoints.retained

Add the following configuration to /etc/hadoop/conf/hdfs-site.xml to set cdh2 as the SecondaryNameNode:

<property>
  <name>dfs.secondary.http.address</name>
  <value>cdh2:50090</value>
</property>

To set up multiple SecondaryNameNodes, see multi-host-secondarynamenode-configuration.

2.4 enable the recycle bin Function

The recycle bin (trash) function is disabled by default; enabling it is recommended. Add the following two parameters to /etc/hadoop/conf/core-site.xml:

  • fs.trash.interval: the retention time, in minutes. The default is 0, which disables the recycle bin. This value determines how long deleted files stay in the recycle bin. When configured on the server, the client setting is ignored; when disabled on the server, the client setting is checked instead.
  • fs.trash.checkpoint.interval: the interval, in minutes, between recycle-bin checkpoints. The default is 0. It must not be larger than fs.trash.interval and is configured on the server. If set to 0, the value of fs.trash.interval is used.
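For example, to keep deleted files for one day and create recycle-bin checkpoints every hour (both values illustrative), core-site.xml could contain:

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>60</value>
</property>
```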
2.5 (optional) Configure Load Balancing for DataNode Storage

Configure the following three parameters in /etc/hadoop/conf/hdfs-site.xml:

  • dfs.datanode.fsdataset.volume.choosing.policy
  • dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
  • dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction

For more information, see Optionally configure DataNode storage balancing.
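As an illustration, a configuration that switches to the available-space volume-choosing policy might look like the following; the 10 GB threshold and 0.75 fraction shown here are assumed example values, not taken from this article:

```xml
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```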

2.6 enable WebHDFS

Install it on the NameNode node:

$ yum install hadoop-httpfs -y

Then modify the/etc/hadoop/conf/core-site.xml configuration proxy User:

<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>

2.7 configure LZO

Download the repo file to /etc/yum.repos.d/:

  • If you have installed CDH4, please download Red Hat/CentOS 6
  • If you have installed CDH5, please download Red Hat/CentOS 6

Then install lzo:

$ yum install hadoop-lzo* impala-lzo  -y

Finally, add the following configuration to /etc/hadoop/conf/core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
