Hadoop Big Data Deployment
I. System Environment Configuration
1. Disable the firewall and SELinux
Disable Firewall:
systemctl stop firewalld
systemctl disable firewalld
Set SELinux to disabled:
# cat /etc/selinux/config
SELINUX=disabled
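This setting only takes effect after a reboot. A minimal way to switch SELinux off immediately on the running system and confirm the result (assuming it is currently enforcing):
# setenforce 0
# getenforce
Permissive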
2. Configure the NTP Time Server
# yum -y install ntpdate
# crontab -l
*/5 * * * * /usr/sbin/ntpdate 192.168.1.1 >/dev/null 2>&1
Replace 192.168.1.1 with the IP address of an available time server.
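If the cron entry has not been created yet, one non-interactive way to append it and run a first sync (a sketch; substitute your own time server IP):
# (crontab -l 2>/dev/null; echo "*/5 * * * * /usr/sbin/ntpdate 192.168.1.1 >/dev/null 2>&1") | crontab -
# /usr/sbin/ntpdate 192.168.1.1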
3. Modify System Restrictions
# cat /etc/security/limits.conf
* soft nproc 100000
* hard nproc 100000
* soft nofile 102400
* hard nofile 102400
hadoop soft nproc 100000
hadoop hard nproc 100000
hadoop soft nofile 102400
hadoop hard nofile 102400
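The new limits apply to new login sessions only. A quick check after logging in again as the hadoop user, using standard shell builtins:
$ ulimit -n    # open file limit, should show 102400
$ ulimit -u    # process limit, should show 100000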
4. Create a hadoop user
groupadd -g 1002 hadoop
useradd -u 1002 -g hadoop hadoop
5. Configure hosts
# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.24.43 hadoop1
192.168.24.216 hadoop2
192.168.24.7 hadoop3
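The same hosts file must exist on every node. A minimal sketch for copying it from hadoop1 to the other two nodes, assuming root SSH access between the machines at this point:
# for h in hadoop2 hadoop3; do scp /etc/hosts root@$h:/etc/hosts; done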
6. Distribute the Public Key
# su - hadoop
$ ssh-keygen
$ ssh-copy-id hadoop@hadoop1
$ ssh-copy-id hadoop@hadoop2
$ ssh-copy-id hadoop@hadoop3
Repeat this on every node so that each node's public key exists on all nodes; passwordless SSH between the nodes is required.
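A quick loop to confirm that passwordless SSH works from the current node (run as the hadoop user):
$ for h in hadoop1 hadoop2 hadoop3; do ssh $h hostname; done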
7. Install JDK
# yum -y install jdk-8u171-linux-x64.rpm
# java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
8. Install Scala
Scala is a multi-paradigm programming language designed to integrate features of object-oriented and functional programming. Scala runs on the Java Virtual Machine and is compatible with existing Java programs: Scala source code is compiled into Java bytecode, so it runs on the JVM and can call existing Java class libraries.
cd /app
tar -zxvf /home/Software/scala-2.11.12.tgz -C .
mv scala-2.11.12 scala
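A quick check that the unpacked distribution works (SCALA_HOME is added to the shell environment later, in the Hadoop section):
$ /app/scala/bin/scala -version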
9. Install snappy
Snappy is a compression/decompression library. It does not aim for maximum compression or compatibility with other compression libraries; instead, it aims for very high speed with a reasonable compression ratio. For example, compared with the fastest zlib mode, Snappy is about an order of magnitude faster for most inputs, but the compressed files it produces are 20% to 100% larger.
yum -y install automake autoconf libtool openssl openssl-devel gcc gcc-c++
tar -zxvf snappy-1.1.3.tar.gz
cd snappy-1.1.3
./autogen.sh
./configure
make && make install
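With the default configure prefix the library is installed under /usr/local/lib; a simple check that the shared objects are in place:
$ ls -l /usr/local/lib/libsnappy*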
10. Install lzo and lzop
LZO is a lossless compression library written in ANSI C. It provides very fast compression and decompression, and decompression requires no extra memory. Even when data is compressed slowly at a high compression ratio, it can still be decompressed very quickly. LZO is released under the GNU GPL.
LZO is well suited to real-time compression and decompression, that is, scenarios that care more about speed than about compression ratio.
Because LZO is written in ANSI C, the compressed data format is also designed to be cross-platform.
tar -xvf lzo-2.06.tar.gz
cd lzo-2.06
./configure --enable-shared
make && make install
Lzop is a program built on the lzo library; it can compress and decompress files directly from shell commands.
tar -xvf lzop-1.03.tar.gz
cd lzop-1.03
./configure
make && make install
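A short round-trip test of the lzop binary (test.log is just a hypothetical sample file; the options are standard lzop flags):
$ lzop test.log          # creates test.log.lzo, keeps the original file
$ lzop -t test.log.lzo   # verify the compressed file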
II. ZooKeeper Cluster
ZooKeeper has three installation modes: standalone mode, a single-node installation; pseudo-cluster mode, which starts multiple ZooKeeper instances on one host; and cluster mode, which requires an odd number of servers (at least three), with one ZooKeeper instance running on each server.
1. Unzip and install ZooKeeper
su - hadoop
mkdir /app
tar -zxvf zookeeper-3.4.10.tar.gz -C /app/
cd /app
sudo mv zookeeper-3.4.10 zookeeper
mkdir zookeeper/data zookeeper/logs
2. Modify the zoo.cfg file
$ vim /app/zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=20
syncLimit=10
dataDir=/app/zookeeper/data
dataLogDir=/app/zookeeper/logs
clientPort=2181
server.1=hadoop1:2888:3888
server.2=hadoop2:2888:3888
server.3=hadoop3:2888:3888
initLimit: the maximum time allowed for a follower to connect and sync with the leader during initialization, measured in ticks; 20*2000 ms means 40 seconds.
syncLimit: the maximum time allowed for message exchange between the leader and a follower, here 10*2000 ms, that is, 20 seconds.
server.X=A:B:C, where X is a number identifying the server, A is the server's IP address or hostname, B is the port the server uses to exchange messages with the leader, and C is the port used for leader election.
3. Modify myid
In /app/zookeeper/data/, create a myid file containing the X from the corresponding server.X entry in the configuration file above.
$ cat /app/zookeeper/data/myid
1
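On the other two nodes the file must contain 2 and 3 respectively. A possible one-liner to set all three from one node, assuming ZooKeeper is unpacked at the same path everywhere and passwordless SSH is already configured:
$ for i in 1 2 3; do ssh hadoop$i "echo $i > /app/zookeeper/data/myid"; done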
4. Modify the log output path of zookeeper:
Modify /app/zookeeper/bin/zkEnv.sh and change ZOO_LOG_DIR to the path set in the configuration file, /app/zookeeper/logs:
if [ "x${ZOO_LOG_DIR}" = "x" ]then ZOO_LOG_DIR="/app/zookeeper/logs"fi
5. Start and debug zookeeper
Start:
$ zkServer.sh start
View status:
$ zkServer.sh status
$ zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
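A further connectivity check can be done with the CLI client bundled with ZooKeeper, connecting to one of the servers configured above:
$ zkCli.sh -server hadoop1:2181
[zk: hadoop1:2181(CONNECTED) 0] ls /
[zookeeper]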
III. Hadoop HA Installation
Hadoop is divided into two major versions, 1.0 and 2.0; you can look up the specific differences yourself. This article uses Hadoop 2.0, whose ecosystem mainly includes the following core projects: HDFS, YARN, and MapReduce.
1. Unzip and install
sudo tar -zxvf hadoop-2.9.1.tar.gz -C /app/
$ pwd
/app/hadoop/etc/hadoop
$ ls
capacity-scheduler.xml      httpfs-env.sh            mapred-env.sh
configuration.xsl           httpfs-log4j.properties  mapred-queues.xml.template
container-executor.cfg      httpfs-signature.secret  mapred-site.xml
core-site.xml               httpfs-site.xml          mapred-site.xml.template
hadoop-env.cmd              kms-acls.xml             slaves
hadoop-env.sh               kms-env.sh               ssl-client.xml.example
hadoop-metrics2.properties  kms-log4j.properties     ssl-server.xml.example
hadoop-metrics.properties   kms-site.xml             yarn-env.cmd
hadoop-policy.xml           log4j.properties         yarn-env.sh
hdfs-site.xml               mapred-env.cmd           yarn-site.xml
2. Modify hadoop environment variables (hadoop-env.sh)
export HADOOP_HEAPSIZE=16196
export JAVA_HOME=/usr/java/jdk1.8.0_171-amd64
export JAVA_LIBRARY_PATH=/app/hadoop-2.9.1/lib/native
export HADOOP_OPTS="-Djava.library.path=/app/hadoop-2.9.1/lib/native"
Note: in a CentOS 6 environment the value assigned to HADOOP_OPTS must be enclosed in double quotation marks; otherwise the variable cannot be resolved when the services are started later.
3. Modify core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://myhadoop</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>
4. Modify hdfs-site.xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>myhadoop</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.myhadoop</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.myhadoop.nn1</name>
    <value>hadoop1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.myhadoop.nn2</name>
    <value>hadoop2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.myhadoop.nn1</name>
    <value>hadoop1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.myhadoop.nn2</name>
    <value>hadoop2:50070</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/app/hadoop/qjournal</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/myhadoop</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.myhadoop</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/app/hadoop/dfs/name,file:/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/app/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>1024</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>8096</value>
  </property>
</configuration>
5. Modify yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>hadoop1:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>hadoop2:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
</configuration>
6. Modify mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop1:19888</value>
  </property>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>hdfs://hadoop1:8021</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>50</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx4096M</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx8192M</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.child.env</name>
    <value>JAVA_LIBRARY_PATH=/app/hadoop-2.9.1/lib/native</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>100</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.6</value>
  </property>
</configuration>
7. Modify yarn-env.sh and add environment variables
Set the YARN heap size by appending the following line to the end of the yarn-env.sh file:
YARN_HEAPSIZE=4000
Add environment variables:
$ tail .bash_profile
export JAVA_HOME=/usr/java/jdk1.8.0_171-amd64
export HADOOP_HOME=/app/hadoop
export ZOOKEEPER_HOME=/app/zookeeper
export LIBRARY_PATH=$HADOOP_HOME/lib/native
export SCALA_HOME=/app/scala
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$SCALA_HOME/bin
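After editing, reload the profile and confirm that Hadoop and its native libraries (including the snappy and lzo codecs built earlier) are picked up; hadoop version and hadoop checknative are standard Hadoop commands:
$ source ~/.bash_profile
$ hadoop version
$ hadoop checknative -a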
8. Cluster startup and monitoring
Install psmisc; otherwise automatic failover fails:
yum -y install psmisc
Start the cluster:
# 1. Run zkServer.sh start on all ZooKeeper nodes
$ zkServer.sh start
# 1.1 Execute on one of the NameNodes to initialize the HA state in ZooKeeper.
#     This creates a znode under which the automatic failover system stores its data.
$ hdfs zkfc -formatZK
# 1.2 If you are setting up a fresh HDFS cluster, first run the format command on one of the NameNodes.
$ hdfs namenode -format
# 2. Start the HDFS services
$ start-dfs.sh
# 2.1 Note: if you manage your cluster services manually, you must start the zkfc daemon on each NameNode:
$ hadoop-daemon.sh --script hdfs start zkfc
# 3. Start the ResourceManager
$ start-yarn.sh
# 4. Start the standby ResourceManager
$ yarn-daemon.sh start resourcemanager
# Other commands:
# Start/stop a NameNode
$ hadoop-daemon.sh start/stop namenode
# Start/stop a DataNode
$ hadoop-daemon.sh start/stop datanode
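For the very first start of an HA cluster with QJM, the JournalNodes must be running before the NameNode is formatted, and the second NameNode is initialized from the first rather than formatted again. A possible order, as a sketch, assuming nn1 runs on hadoop1 and nn2 on hadoop2:
# On hadoop1, hadoop2 and hadoop3: start the JournalNodes
$ hadoop-daemon.sh start journalnode
# On hadoop1: format HDFS and the ZooKeeper failover state, then start the NameNode
$ hdfs namenode -format
$ hdfs zkfc -formatZK
$ hadoop-daemon.sh start namenode
# On hadoop2: copy the metadata from the active NameNode instead of formatting
$ hdfs namenode -bootstrapStandby
# On hadoop1: start the remaining services
$ start-dfs.sh
$ start-yarn.sh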
View status:
# View the processes running on each node
$ jps
# Check the state of all NameNodes
$ hdfs haadmin -getAllServiceState
# Check the state of nn1/nn2 individually
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2
# View the active/standby state of the ResourceManager cluster
$ yarn rmadmin -getAllServiceState
hadoop1:8033 active
hadoop2:8033 standby
# View the state of each node in the ResourceManager cluster
$ yarn rmadmin -getServiceState rm1
active
$ yarn rmadmin -getServiceState rm2
standby
Hadoop Cluster Monitoring Port:
NameNode: http://namenode_host:50070
ResourceManager: http://resourcemanager_host:8088
MapReduce JobHistory Server: http://jobhistoryserver_host:19888
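A quick reachability check of the web interfaces from the command line (assuming hadoop1 currently hosts the active NameNode and ResourceManager, as configured above):
$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop1:50070/
$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop1:8088/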