Hadoop Quick Start (II): Building a Distributed Environment


Prerequisite: you need a working Linux environment.

See Linux Environment Preparation.

One. Install Hadoop

1. Upload Hadoop

The Hadoop version used here is hadoop-2.4.1.tar.gz. Upload it to the user's home directory, create an app directory under the home directory for easier management, and extract Hadoop into that directory.
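A rough sketch of the upload and extraction steps (the host address, user name, and paths here follow the examples used later in this article and are assumptions; adjust them to your environment):

# on your local machine: copy the archive to the server
scp hadoop-2.4.1.tar.gz fangxin@192.168.49.31:~/
# on the server: create the app directory and extract Hadoop into it
mkdir -p ~/app
tar -zxvf ~/hadoop-2.4.1.tar.gz -C ~/app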

2. Hadoop directory layout

Enter the app directory and you will see hadoop-2.4.1; inside it you will find:

bin: executable commands
sbin: system/administration scripts (start and stop scripts)
etc: configuration files
lib: native libraries for the local platform (Linux in this case)
share: core jar packages and documentation
Two. Modify the configuration files

The configuration files are under etc/hadoop in the Hadoop directory.

1. JDK setting in hadoop-env.sh

Normally this file does not need to be changed, but if your JDK was installed per-user rather than globally, Hadoop may fail to find it; in that case hard-code the JDK path in this file.
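For example (the JDK path below is the one used in the /etc/profile example later in this article; yours may differ), replace the JAVA_HOME line in etc/hadoop/hadoop-env.sh with an absolute path:

# hadoop-env.sh: use an absolute JDK path instead of ${JAVA_HOME}
export JAVA_HOME=/usr/java/jdk1.7.0_79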
2. core-site.xml

Specify the default file system
Specify the file storage root directory

The configuration element is empty by default; add the following:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- the IP address can also be replaced with a host name -->
    <value>hdfs://192.168.49.31:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/fangxin/app/hadoop-2.4.1/tmp/</value>
  </property>
</configuration>

fs.defaultFS specifies the default file system.
hdfs://192.168.49.31:9000 means the HDFS file system on the 192.168.49.31 server, listening on port 9000.

hadoop.tmp.dir specifies the file storage root directory; under it Hadoop creates the dfs directory, in which the NameNode creates its name folder and the DataNode creates its data folder.
If this parameter is not configured, Hadoop falls back to a directory under /tmp, which is emptied when the machine restarts.

3. hdfs-site.xml

Configure the number of replicas.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

There is currently only one node, so the replication factor is set to 1.

4. mapred-site.xml

The directory contains mapred-site.xml.template; rename it to mapred-site.xml:

mv mapred-site.xml.template mapred-site.xml

Set the default resource scheduling framework for MapReduce; YARN is used here.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

5. yarn-site.xml

The YARN framework is also a cluster, with a master node and slave nodes; its master node is called the ResourceManager.

Configure:
the ResourceManager host name
the intermediate data shuffle mechanism, here the mapreduce_shuffle mechanism

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <!-- fill in the ResourceManager host name -->
    <value></value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
6. Slave configuration: slaves (for cluster setups)

The hosts that will run as slave machines are listed in the slaves file. When you start the master, the configured slaves are started along with it.

If the SecondaryNameNode is to run on a host separate from the NameNode, you also need a masters configuration file; otherwise it is not required.

hadoop2
hadoop3
Other Configuration

You can map host names to server IP addresses (in /etc/hosts) so that you do not have to type IP addresses everywhere.
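A minimal sketch of such a mapping in /etc/hosts (the host names and the .32/.33 addresses are only examples matching the slaves sample above):

192.168.49.31  hadoop1
192.168.49.32  hadoop2
192.168.49.33  hadoop3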

Three. Configure Hadoop environment variables

If you want to use the hadoop command from anywhere, you first need to configure the environment variables.

Command:
vi /etc/profile

# modify/add the following content
export JAVA_HOME=/usr/java/jdk1.7.0_79
export HADOOP_HOME=/home/fangxin/app/hadoop-2.4.1
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
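To apply the changes to the current shell without logging out and back in, you can reload the profile:

source /etc/profile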
Format the NameNode

Command:
hadoop namenode -format

The formatting process asks you to confirm once; note that you must answer with an uppercase Y.

If it succeeds, the output reports that the storage directory has been successfully formatted.

If an error is reported, go back to step two and check the configuration again.

After success, the file storage directory tmp that we configured earlier in core-site.xml is created, and it contains a dfs subdirectory.
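You can verify this with a quick listing (the path is the one configured in core-site.xml above):

ls /home/fangxin/app/hadoop-2.4.1/tmp/dfs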
Start HDFS

In the sbin directory, execute the command:

start-dfs.sh

This starts the NameNode, DataNode, and SecondaryNameNode in turn; during the process you are asked to confirm host identities several times, just follow the prompts.

After startup is complete, check the running Java processes with the jps command (it lives in the JDK's bin directory and lists Java processes):

jps

You can also use:

netstat -nltp

to see which ports each process is listening on.

For example, the NameNode with process ID 25898 listens on port 9000. Some processes listen on more than one port because different kinds of traffic use different ports.

Start YARN

start-yarn.sh

After startup, the ResourceManager and NodeManager processes can be seen.

Common Hadoop command-line operations

List a file directory:
hadoop fs -ls hdfs://192.168.49.31:9000/ or hadoop fs -ls /

Note: if you see the following error:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/fangxin/app/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

it is probably because Hadoop's bundled native library was compiled for 32-bit. A temporary workaround is to add the following two lines at the end of hadoop-env.sh:

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.library.path=$HADOOP_PREFIX/lib"

but a warning may still remain:
17/02/08 09:03:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Put a file into HDFS

hadoop fs -put <file name> hdfs://192.168.49.31:9000/

Files stored in HDFS live deep inside the file storage directory tmp, under the data/current/finalized folder, and are split into multiple blocks if they are larger than 128 MB.
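If you want to see the underlying block files on the DataNode, a hedged example (the exact path includes a block-pool ID that varies per cluster, so a search is easier than spelling out the full path):

find /home/fangxin/app/hadoop-2.4.1/tmp/dfs/data -name "blk_*"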

Get a file

hadoop fs -get /<file name>

where / is the HDFS root directory; if the file is nested several directories deep, specify the path level by level.

Create a directory
hadoop fs -mkdir /<directory name>

View a file
hadoop fs -cat /<directory>/<file name>
MapReduce test run

Count the number of words (wordcount)
1. Prepare a file; it can be a plain text file containing some words, such as hello, good, and so on.
2. Upload the prepared file to HDFS.
3. Go to the mapreduce folder under share.
4. Run the wordcount program in the hadoop-mapreduce-examples jar package:

hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /<input directory> /<output directory>

If you follow these steps, you will get the word counts for your text in the output directory.
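As a concrete sketch of the whole sequence (the file name and the /input and /output directory names are arbitrary examples, not from the original article):

# prepare a small text file with some words
echo "hello good hello world" > words.txt
# upload it to HDFS
hadoop fs -mkdir /input
hadoop fs -put words.txt /input
# run the example from the mapreduce folder under share
cd ~/app/hadoop-2.4.1/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /input /output
# view the result
hadoop fs -cat /output/part-r-00000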
Calculate pi

hadoop jar hadoop-mapreduce-examples-2.4.1.jar pi 10 10
Four. SSH remote login

Under the hood, Hadoop authenticates remote operations over SSH. If we do not configure SSH keys, every remote operation requires typing a password, which is cumbersome; this is true even for local operations on a single-node Hadoop.

What is SSH: Secure Shell, a protocol for logging in to one Linux host from another Linux host.

See the key-based authentication and authorization mechanism of SSH remote login.

Linux SSH configuration steps:
1. Generate a key pair locally.
2. Copy the public key to the server with scp.
3. On the server, create the file in the .ssh directory:

$ touch authorized_keys

4. Append the public key information to the file created above:

$ cat ~/id_rsa.pub >> authorized_keys

5. The authorized_keys file must be readable and writable only by the owner; it will not take effect if other users or groups have access:

$ chmod 600 authorized_keys

6. Log in remotely from the client; if no password is asked for, the configuration succeeded:

$ ssh <server name>

By imitating the configuration above, even though the pseudo-distributed setup has only a single node, you can set up passwordless SSH to the local machine so that starting and stopping Hadoop no longer asks for a password.
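A minimal sketch for the single-node (localhost) case, following the same idea (assuming no key pair exists yet):

ssh-keygen -t rsa                                 # generate a key pair, press Enter to accept the defaults
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorize the local public key
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                     # should now log in without a password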
