Using a yum Repository to Install a CDH Hadoop Cluster
This document records the process of using yum to install a CDH Hadoop cluster, including HDFS, Yarn, Hive, and HBase. It uses CDH 5.4, so the steps below apply to the CDH 5.4 release.
0. Environment Description
System environment:
- Operating system: CentOS 6.6
- Hadoop version: CDH 5.4
- JDK version: 1.7.0_71
- Running user: root
The role of each node in the cluster is planned as follows:
```
192.168.56.121  cdh1  NameNode, ResourceManager, HBase, Hive metastore, Impala Catalog, Impala statestore, Sentry
192.168.56.122  cdh2  DataNode, SecondaryNameNode, NodeManager, HBase, Hive Server2, Impala Server
192.168.56.123  cdh3  DataNode, HBase, NodeManager, Hive Server2, Impala Server
```
cdh1 acts as the master node; the other nodes act as slave nodes.
1. Preparations
Before installing the Hadoop cluster, complete the following preparations. When modifying a configuration file, it is recommended to modify it on one node and then synchronize it to the others: for HDFS and Yarn, modify and synchronize from the NameNode node; for HBase, pick any one node to synchronize from. To synchronize configuration files and start services across multiple nodes conveniently, configure passwordless ssh login.
1.1 Configure hosts
CDH requires IPv4; IPv6 is not supported. Disable IPv6 as follows:
```
$ vim /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
```
Make it take effect:
$ sysctl -p
Finally, confirm whether it is disabled:
```
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
```
1. Set the hostname. Take cdh1 as an example:
$ hostname cdh1
2. Ensure that `/etc/hosts` contains the IP address and FQDN of every node (a sample is shown after this list). If you are using DNS, keeping this information in `/etc/hosts` is not required, but it is a best practice.
3. Ensure that `/etc/sysconfig/network` contains `HOSTNAME=cdh1`.
4. Check the network: verify that the hostname and the corresponding IP address are configured correctly. Run `uname -a` and check that the hostname matches the output of the `hostname` command:
```
$ uname -a
Linux cdh1 2.6.32-358.23.2.el6.x86_64 #1 SMP Wed Oct 16 18:37:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ hostname
cdh1
```
Run `/sbin/ifconfig` to view the IP address:
```
$ ifconfig
eth1      Link encap:Ethernet  HWaddr 08:00:27:75:E0:95
          inet addr:192.168.56.121  Bcast:192.168.56.255  Mask:255.255.255.0
......
```
Install bind-utils before running the host command:
$ yum install bind-utils -y
Run the following command to check whether the hostname and ip address match:
```
$ host -v -t A `hostname`
Trying "cdh1"...
;; ANSWER SECTION:
cdh1.            60    IN    A    192.168.56.121
```
5. When specifying node names in any Hadoop configuration file, use the hostname rather than the IP address.
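For step 2 above, a minimal `/etc/hosts` for this cluster might look like the following (adjust it if your nodes have full domain names):

```
127.0.0.1       localhost
192.168.56.121  cdh1
192.168.56.122  cdh2
192.168.56.123  cdh3
```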
1.2 Disable the firewall and SELinux
```
$ setenforce 0
$ vim /etc/sysconfig/selinux   # set SELINUX=disabled
# clear iptables rules
$ iptables -F
```
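Note that `iptables -F` only flushes the rules for the current boot. To keep the firewall from coming back after a reboot on CentOS 6, you can also stop and disable the service (a reasonable extra step, not part of the original instructions):

```
$ service iptables stop
$ chkconfig iptables off
```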
1.3 Clock synchronization

Build the clock synchronization server
Select the cdh1 node as the clock synchronization server and the other nodes as the client synchronization time to this node. Install ntp:
$ yum install ntp
Modify the configuration file `/etc/ntp.conf` on cdh1:
```
restrict default ignore                                   # by default, do not allow modification or query, and do not accept special packets
restrict 127.0.0.1                                        # give the local machine full permissions
restrict 192.168.56.0 mask 255.255.255.0 nomodify notrap  # allow hosts on this subnet to synchronize time
server 127.127.1.0                                        # local clock
driftfile /var/lib/ntp/drift
fudge 127.127.1.0 stratum 10
```
Start ntp:
```
# enable at boot
$ chkconfig ntpd on
$ service ntpd start
```
ntpq is used to monitor ntpd; it communicates with the NTP daemon using standard NTP mode 6 control messages. `ntpq -p` queries the NTP servers on the network and displays the relationship between the client and each server:
```
$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(1)        .LOCL.           5 l    6   64    1    0.000    0.000   0.000
```
- "*": The NTP server for the response and the most accurate server.
- "+": The NTP server that responds to this query request.
- "Blank (Space)": NTP server with no response.
- "Remote": the name of the NTP server that responds to this request.
- "Refid": name of the higher-level server used by the NTP server.
- "St": the level of the NTP server that is responding to the request.
- "When": the number of seconds since the last successful request.
- "Poll": the number of seconds between the current request clock.
- Offset: The Time offset between the host and the synchronized time source through the NTP clock, in milliseconds (MS ).
Client Configuration
Perform the following operations on the cdh2 and cdh3 nodes:
$ ntpdate cdh1
ntpd usually needs about five minutes after startup before it begins providing time service, so running ntpdate right after starting ntpd fails with the error "no server suitable for synchronization found". Wait about five minutes and try again.
You can use the crond service for scheduled time calibration.
```
# automatically calibrate the network time at 1:00 every day
00 1 * * * root /usr/sbin/ntpdate 192.168.56.121 > /root/ntpdate.log 2>&1
```
1.4 Install the JDK
CDH 5.4 requires JDK 1.7. For the JDK installation process, see the online documentation.
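For reference, a minimal sketch of installing the Oracle JDK on CentOS 6 follows; the rpm file name and install path are assumptions based on Oracle's usual packaging, so adjust them to the package you actually downloaded:

```
# assumed rpm name for JDK 1.7.0_71; substitute your actual download
$ rpm -ivh jdk-7u71-linux-x64.rpm
# Oracle's rpm normally installs under /usr/java
$ cat > /etc/profile.d/java.sh <<'EOF'
export JAVA_HOME=/usr/java/jdk1.7.0_71
export PATH=$JAVA_HOME/bin:$PATH
EOF
$ source /etc/profile.d/java.sh
$ java -version
```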
1.5 Set up a local yum repository
The official CDH yum repo files are at http://archive.cloudera.com/cdh4/RedHat/6/x86_64/cdh/cloudera-cdh4.repo and http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo. Modify the baseurl path in the file according to the CDH version you are installing.
You can also download the cdh4 repository archive or the cdh5 repository archive from there.
Since I am running CentOS, I downloaded the cdh5 centos6 archive, decompressed it into the ftp service path, and configured the local CDH yum repository:
```
[hadoop]
name=hadoop
baseurl=ftp://cdh1/cdh/5/
enabled=1
gpgcheck=0
```
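This repo file must be present on every node. Assuming it is saved as `/etc/yum.repos.d/cdh.repo` (a hypothetical file name), you can copy it out with scp:

```
$ scp /etc/yum.repos.d/cdh.repo cdh2:/etc/yum.repos.d/
$ scp /etc/yum.repos.d/cdh.repo cdh3:/etc/yum.repos.d/
```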
For the operating system's own yum repository, it is recommended to download the CentOS DVD and configure a local yum source from it.
2. Install and configure HDFS
According to the node plan at the beginning of the article, cdh1 is the NameNode node, cdh2 is the SecondaryNameNode node, and cdh2 and cdh3 are DataNode nodes.
Install hadoop-hdfs-namenode on the cdh1 node:
$ yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-namenode
Install hadoop-hdfs-secondarynamenode on the cdh2 node:
$ yum install hadoop-hdfs-secondarynamenode -y
Install hadoop-hdfs-datanode on cdh2 and cdh3 nodes
$ yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
For how to configure NameNode HA, see Configure HDFS HA in CDH. It is recommended not to configure HDFS HA for now.
2.1 Modify the Hadoop configuration files
In `/etc/hadoop/conf/core-site.xml`, set the `fs.defaultFS` property. This property specifies which node is the NameNode and whether the file system is file or hdfs, in the format `hdfs://<namenode host>:<namenode port>/`. The default file system is `file:///`:
```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdh1:8020</value>
</property>
```
In `/etc/hadoop/conf/hdfs-site.xml`, set the `dfs.permissions.superusergroup` property, which specifies the HDFS superuser group. The default value is hdfs; you can change it to hadoop:
```
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hadoop</value>
</property>
```
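If you use a group other than the default, make sure it actually exists on every node; for example (an assumption, in case the packages did not already create it):

```
$ groupadd hadoop
# optionally, make the hdfs user a member of the superuser group
$ usermod -a -G hadoop hdfs
```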
For more configuration information, see Apache Cluster Setup.
2.2 Specify the local file directories
The default file path and permission requirements in hadoop are as follows:
| Parameter | Default owner | Permission | Default path |
| --- | --- | --- | --- |
| hadoop.tmp.dir | hdfs:hdfs | drwx------ | /var/hadoop |
| dfs.namenode.name.dir | hdfs:hdfs | drwx------ | file://${hadoop.tmp.dir}/dfs/name |
| dfs.datanode.data.dir | hdfs:hdfs | drwx------ | file://${hadoop.tmp.dir}/dfs/data |
| dfs.namenode.checkpoint.dir | hdfs:hdfs | drwx------ | file://${hadoop.tmp.dir}/dfs/namesecondary |
This means you can configure only `hadoop.tmp.dir`, or configure each of the preceding paths separately. Here the paths are configured separately, and hdfs-site.xml is configured as follows:
```
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/dfs/nn</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/dfs/dn</value>
</property>
```
On the NameNode, manually create the local directory specified by `dfs.name.dir` or `dfs.namenode.name.dir`:
$ mkdir -p /data/dfs/nn
On each DataNode, manually create the local directory specified by `dfs.data.dir` or `dfs.datanode.data.dir`:
$ mkdir -p /data/dfs/dn
Modify the directory owner:
$ chown -R hdfs:hdfs /data/dfs/nn /data/dfs/dn
The Hadoop process automatically sets the correct permissions on `dfs.data.dir` or `dfs.datanode.data.dir`, but `dfs.name.dir` or `dfs.namenode.name.dir` defaults to 755 and must be manually changed to 700:
```
$ chmod 700 /data/dfs/nn
# or
$ chmod go-rx /data/dfs/nn
```
Note: a DataNode can be configured with multiple local directories. The `dfs.datanode.failed.volumes.tolerated` parameter sets how many of those directories are allowed to fail; if more directories than this fail, the DataNode stops offering service.
2.3 Configure the SecondaryNameNode
To configure the SecondaryNameNode, add the following parameters to `/etc/hadoop/conf/hdfs-site.xml`:
- dfs.namenode.checkpoint.check.period
- dfs.namenode.checkpoint.txns
- dfs.namenode.checkpoint.dir
- dfs.namenode.checkpoint.edits.dir
- dfs.namenode.num.checkpoints.retained
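A sketch with illustrative values (the first two are the usual Apache defaults, and the checkpoint directory is an assumption consistent with the /data/dfs layout used above; tune them for your cluster):

```
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///data/dfs/namesecondary</value>
</property>
```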
Then add the following configuration to `/etc/hadoop/conf/hdfs-site.xml` to set cdh2 as the SecondaryNameNode:
```
<property>
  <name>dfs.secondary.http.address</name>
  <value>cdh2:50090</value>
</property>
```
To set up multiple SecondaryNameNodes, see multi-host-secondarynamenode-configuration.
2.4 Enable the recycle bin (trash) feature
The recycle bin feature is disabled by default. It is recommended to enable it. Add the following two parameters to `/etc/hadoop/conf/core-site.xml`:
- `fs.trash.interval`: the time, in minutes, that deleted files stay in the recycle bin. The default value is 0, which disables the recycle bin. If this parameter is configured on the server, the client configuration is ignored; if it is disabled on the server, the client configuration is checked instead.
- `fs.trash.checkpoint.interval`: the interval, in minutes, between recycle-bin checkpoints. The default value is 0. The value must not be larger than `fs.trash.interval`. This value is configured on the server; if it is set to 0, the value of `fs.trash.interval` is used.
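For example, to keep deleted files for one day (1440 minutes is an illustrative value, not from the original):

```
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>0</value>
</property>
```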
2.5 (Optional) Configure load balancing for DataNode storage
Configure the following three parameters in `/etc/hadoop/conf/hdfs-site.xml`:

- dfs.datanode.fsdataset.volume.choosing.policy
- dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
- dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction
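A sketch of what this can look like, selecting the available-space policy; the threshold and fraction values shown are the commonly cited defaults (10 GB and 0.75), given here only as an illustration:

```
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```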
For more information, see Optionally configure DataNode storage balancing.
2.6 Enable WebHDFS
Install the package on the NameNode node:
$ yum install hadoop-httpfs -y
Then modify `/etc/hadoop/conf/core-site.xml` to configure the proxy user:
```
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>
```
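After restarting the services, a quick smoke test with curl (assuming the httpfs service listens on its default port 14000):

```
$ curl "http://cdh1:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"
```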
2.7 Configure LZO
Download the repo file to `/etc/yum.repos.d/`:
- If you have installed CDH4, please download Red Hat/CentOS 6
- If you have installed CDH5, please download Red Hat/CentOS 6
Then install lzo:
$ yum install hadoop-lzo* impala-lzo -y
Finally, add the following configuration to `/etc/hadoop/conf/core-site.xml`:
```
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```