Practice 1: Install Hadoop as a single-node, pseudo-distributed CDH4 cluster


Hadoop consists of two parts:

Distributed File System (HDFS)

Distributed computing framework MapReduce

The distributed file system (HDFS) provides distributed storage for large-scale data, while MapReduce is built on top of it to perform distributed computation on the data stored there.


The functions of each node type are described below.

Namenode:

1. There is only one namenode in a Hadoop cluster. It is the hub of the whole system: it manages the HDFS directory tree and the related file metadata. That information is stored on the local disk in two files, fsimage (the HDFS metadata image) and editlog (the HDFS change log), from which the namespace is rebuilt when HDFS restarts. The namenode also monitors the health of every datanode; if a datanode fails, it is removed from HDFS and the data it held is re-replicated elsewhere.

Secondary namenode:

1. The most important task of the secondary namenode is not to keep a hot backup of the namenode's metadata, but to periodically merge the fsimage and edits logs and send the result back to the namenode. This reduces the load on the namenode: rather than merging fsimage and edits and writing the result to disk itself, the namenode hands that work over to the secondary namenode.

Datanode:

1. A datanode runs on every slave node and is responsible for the actual data storage, regularly reporting its block information to the namenode. A datanode organizes file content in fixed-size blocks; the default block size is 64 MB (the same as GFS). When a user uploads a file larger than 64 MB, the file is split into several blocks stored on different datanodes, which makes the data easier to distribute. To keep the data reliable, each block is written to several different datanodes (3 by default in the configuration) in a pipeline; a way to inspect this is sketched just below.
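Once the cluster is running, the block layout and replication just described can be inspected with Hadoop's fsck tool. A minimal sketch, assuming a file has already been uploaded (the path /user/hduser/bigfile.dat is only a placeholder):

# Show the file's blocks, their sizes, and the datanodes holding each replica (path is hypothetical)
hadoop fsck /user/hduser/bigfile.dat -files -blocks -locations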

Jobtracker:

1. Hadoop's map/reduce scheduler: it communicates with the tasktrackers to assign compute tasks and tracks the progress of each task.

Tasktracker:

1. Runs on each slave node and is responsible for actually starting and executing the map and reduce tasks assigned to it.


Hadoop is composed of two logically independent clusters: an HDFS cluster and a MapReduce cluster.

By analogy: computer = CPU + hard disk, and Hadoop = MapReduce + HDFS.

MapReduce itself consists of two functions, map and reduce: the map function first extracts key-value pairs from the input, and the reduce function then aggregates and counts them, as the sketch below illustrates.
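The same flow can be illustrated locally with ordinary shell pipes; this is only a sketch of the word-count idea, not Hadoop code, and it assumes a text file named input.txt:

# map     -> tr: emit one word (key) per line
# shuffle -> sort: bring identical keys together
# reduce  -> uniq -c: count the occurrences of each key
cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c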

Address planning: 192.168.0.200
Versions used: Hadoop hadoop-2.0.0-cdh4.7.0.tar.gz; JDK 7u60 (jdk-7u60-ea-bin-b15-linux-x64-16_apr_2014.tar.gz)

1. Download the JDK and CDH4.7 packages

All components of the Hadoop ecosystem are available as CDH packages.
JDK download: download/jdk7u60/archive/B15/binaries/jre-7u60-ea-bin-b15-linux-x64-16_apr_2014.tar.gz
JDK package name: jdk-7u60-ea-bin-b15-linux-x64-16_apr_2014.tar.gz

2. Decompress Hadoop and the JDK

tar xf hadoop-2.0.0-cdh4.7.0.tar.gz -C /usr/local/
cd /usr/local/
ln -s hadoop-2.0.0-cdh4.7.0 hadoop2.0-cdh
echo "export PATH=\$PATH:/usr/local/hadoop2.0-cdh/bin/" > /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
hadoop version
tar xf jdk-7u60-bin-linux-x64-16.tar.gz -C /usr/local/
cd /usr/local/jdk1.7.0_60
echo "export PATH=\$PATH:/usr/local/jdk1.7.0_60/bin" > /etc/profile.d/jdk.sh
source /etc/profile.d/jdk.sh
java -version


3. Create a hadoop account

Note:

1. Hadoop should run as an ordinary user rather than as root, so the owner and group of the installation directory are set to that ordinary user and the Hadoop services are started under it.

2. Even in pseudo-distributed mode, the start-all.sh script connects to localhost over SSH when it starts the services. This self-connection would otherwise require a password, so passwordless SSH trust must be configured for the ordinary user.


1. Create an ordinary user

useradd hduser
echo 'hduser' | passwd --stdin hduser > /dev/null
chown -R hduser.hduser /usr/local/hadoop2.0-cdh/

2. Configure passwordless SSH for the ordinary user

# su - hduser
$ ssh-keygen -t rsa -P ''
$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@localhost
$ ssh localhost    (if no password is requested, the SSH trust is configured correctly)

4. Configure Hadoop

Important files

hadoop-env.sh: sets Hadoop environment variables

core-site.xml: core configuration file

mapred-site.xml: MapReduce configuration file

hdfs-site.xml: HDFS configuration file

log4j.properties: logging configuration


1. Go to the configuration file directory of hadoop and configure it.

$ cd /usr/local/hadoop2.0-cdh/etc/hadoop/

2. Configure hadoop Java environment variables

$ vim hadoop-env.sh
Modify the following line:
export JAVA_HOME=/usr/local/jdk1.7.0_60

Test:
$ hadoop version
If the output looks like the following, the configuration succeeded:
Hadoop 2.0.0-cdh4.7.0
Subversion file:///var/lib/jenkins/workspace/cdh4.7.0-packaging-hadoop/build/cdh4/hadoop/2.0.0-cdh4.7.0/source/hadoop-common-project/hadoop-common -r
Compiled by jenkins on Wed May 28 09:41:14 PDT 2014
From source with checksum f60207d0daa9f943f253cc8932d598c8
This command was run using /usr/local/hadoop-2.0.0-cdh4.7.0/share/hadoop/common/hadoop-common-2.0.0-cdh4.7.0.jar

3. Edit core-site.xml

# mkdir -p /hadoop/temp    # temporary Hadoop data directory, used for intermediate data

# chown -R hduser.hduser /hadoop/temp

$ vim core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/temp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
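Here fs.default.name sets the default filesystem URI. Once HDFS is running (section 5), any path given without a scheme resolves against it; a small sketch of the equivalence:

# Once HDFS is up, these two commands list the same directory
hadoop fs -ls /
hadoop fs -ls hdfs://localhost:8020/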

4. Edit mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml

$ vim mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

5. Edit hdfs-site.xml

Note: dfs.replication defaults to 3. If it is not changed, errors will be reported when there are fewer than three datanodes; a quick way to check or change the replication of existing files is shown after the configuration below.

# vim hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
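As noted above, replication can also be checked or changed per file once HDFS is running. A hedged sketch (the path /user/hduser/test.txt is hypothetical):

# The second column of the listing is the file's replication factor
hadoop fs -ls /user/hduser/test.txt
# Change the replication factor of an existing file to 1 and wait for it to take effect
hadoop fs -setrep -w 1 /user/hduser/test.txt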

5. Start the Hadoop services


1. Format the namenode

Like an ordinary file system, HDFS must be formatted before it can be used, so that its metadata structures are created. Run the following command as hduser.

$ hadoop namenode -format

The output information is shown below.

14/08/09 14:34:32 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
14/08/09 14:34:32 INFO namenode.FSNamesystem: fsOwner=hduser,hduser
14/08/09 14:34:32 INFO namenode.FSNamesystem: supergroup=supergroup
14/08/09 14:34:32 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/08/09 14:34:33 INFO common.Storage: Image file of size 96 saved in 0 seconds.
14/08/09 14:34:33 INFO common.Storage: Storage directory /hadoop/temp/dfs/name has been successfully formatted.   # if this line appears, the formatting succeeded
14/08/09 14:34:33 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
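After formatting, the fsimage and edits files described earlier live under the namenode storage directory (here /hadoop/temp/dfs/name, derived from hadoop.tmp.dir). A quick look; exact file names vary by version:

# Inspect the freshly formatted namenode metadata directory
ls -l /hadoop/temp/dfs/name/current/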

2. Start the Hadoop services

$ ./start-all.sh

The output information is shown below.

This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
14/08/13 20:23:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.0.0-cdh4.7.0/logs/hadoop-hduser-namenode-localhost.out
localhost: starting datanode, logging to /usr/local/hadoop-2.0.0-cdh4.7.0/logs/hadoop-hduser-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 46:b9:7c:11:db:75:93:ad:f1:26:f0:a7:4d:00:40:20.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.0.0-cdh4.7.0/logs/hadoop-hduser-secondarynamenode-localhost.out
14/08/13 20:24:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.0.0-cdh4.7.0/logs/yarn-hduser-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /usr/local/hadoop-2.0.0-cdh4.7.0/logs/yarn-hduser-nodemanager-localhost.out

Run the jps command to view the running Hadoop processes.
$ jps | grep -iv "jps"
The output should look like this:
1628 NameNode
2027 ResourceManager
1892 SecondaryNameNode
1742 DataNode
2123 NodeManager
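With all five daemons up, a quick smoke test confirms that the namenode and datanode are working together. A minimal sketch; the directory and file names are arbitrary, and if this version's -mkdir does not accept -p, create /user and /user/hduser in two steps:

hadoop fs -mkdir -p /user/hduser        # create a home directory in HDFS
hadoop fs -put /etc/hosts /user/hduser/ # upload a small local file
hadoop fs -ls /user/hduser              # the file should be listed with replication 1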

3. Hadoop process listening addresses and ports

When Hadoop starts, each daemon runs two server processes: an RPC server used for communication between Hadoop processes, and an HTTP server that lets administrators view status pages for the processes in the cluster. Leaving the RPC server aside for now, the following properties define the address and port of each HTTP server:

mapred.job.tracker.http.address: HTTP server address and port of the jobtracker; default 0.0.0.0:50030
mapred.task.tracker.http.address: HTTP server address and port of the tasktracker; default 0.0.0.0:50060
dfs.http.address: HTTP server address and port of the namenode; default 0.0.0.0:50070
dfs.datanode.http.address: HTTP server address and port of the datanode; default 0.0.0.0:50075
dfs.secondary.http.address: HTTP server address and port of the secondarynamenode; default 0.0.0.0:50090

Each HTTP server can be opened directly in a browser to view information about the corresponding process, at http://server_ip:port.
Namenode status page: http://192.168.0.200:50070/dfshealth.jsp
Because this installation starts YARN rather than an MRv1 jobtracker, the job-related page is the ResourceManager's "All Applications" page at http://192.168.0.200:8088 (shown in section 6) instead of the jobtracker page on port 50030.
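The same pages can also be checked from the shell. A minimal sketch, assuming the default ports above and the 192.168.0.200 address used in this guide:

# Namenode web UI (default port 50070)
curl -s http://192.168.0.200:50070/dfshealth.jsp | head -n 5
# YARN ResourceManager web UI (default port 8088)
curl -s http://192.168.0.200:8088/cluster | head -n 5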

6. View the Hadoop node status pages

1. View the namenode status

Namenode

(Screenshots 1 and 2: namenode status page)


2. View secondarynamenode

Secondarynamenode

(Screenshot 3: secondarynamenode status page)

All applications

(Screenshot 4: the YARN All Applications page)



This article is from the "Zheng Yansheng" blog; please keep this source: http://467754239.blog.51cto.com/4878013/1539636
