Hadoop 2 Pseudo-Distributed Deployment



I. Introduction

II. Installation and Deployment

III. Run the Hadoop Example and Test the Deployment Environment

IV. Notes


I. Introduction

 

Hadoop is a distributed system infrastructure developed by the Apache Foundation. The core of the Hadoop framework consists of HDFS and MapReduce. HDFS provides storage for massive volumes of data: it is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access, which makes it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream. MapReduce provides computation over those massive data sets. Together they make Hadoop a platform on which it is easy to develop and run applications that process large-scale data.

 

HDFS is developed in Java, so a NameNode or DataNode can be deployed on any machine that supports Java. HDFS adopts a master/slave architecture: an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is a central server responsible for managing the file system namespace and client access to files. A DataNode, typically one per node in the cluster, manages the storage attached to that node. HDFS exposes a file system namespace and lets you store data in the form of files. Internally, a file is split into one or more data blocks, which are stored on a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files or directories, and it determines the mapping between data blocks and specific DataNodes. DataNodes serve read/write requests from file system clients, and they create, delete, and replicate data blocks under the coordination of the NameNode.
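As a rough illustration of how a client sees this architecture, the following sketch (not part of the original walkthrough) uses Hadoop's standard FileSystem API to write and read a file. The client only talks to the FileSystem abstraction; block placement on DataNodes is handled by the NameNode behind the scenes. The NameNode address and the /user/root/demo.txt path are assumptions chosen to match the configuration used later in this guide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // NameNode address; assumed to match the core-site.xml used in this guide.
    conf.set("fs.defaultFS", "hdfs://192.168.74.129:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits it into blocks and places them on DataNodes.
    Path file = new Path("/user/root/demo.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read it back; the NameNode resolves which DataNodes hold the blocks.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}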

 

Hadoop Map/Reduce is a simple software framework on which applications can be written to run on large clusters of thousands of commodity machines, processing terabyte-scale data sets in parallel in a reliable, fault-tolerant manner. A Map/Reduce job usually splits the input data set into several independent blocks that the map tasks process in fully parallel fashion. The framework sorts the map output and then feeds the result to the reduce tasks. The input and output of a job are normally stored in the file system, and the framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
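To make the programming model concrete, here is a minimal skeleton of the Mapper/Reducer contract (not from the original article; the class names and the emitted key are placeholders). The framework calls map() once per input record, sorts and groups the emitted pairs by key, and then calls reduce() once per distinct key. The complete, runnable WordCount program in Section III is the full example used in this guide.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MapReduceSkeleton {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit intermediate (key, value) pairs derived from one input record.
      context.write(new Text("some-key"), new LongWritable(1));
    }
  }

  public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      // Emit one aggregated result per key.
      context.write(key, new LongWritable(sum));
    }
  }
}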

 

Hadoop is a distributed computing platform that users can easily architect and use. You can easily develop and run applications that process massive amounts of data on Hadoop. It has the following advantages:

1. High reliability. Hadoop's bit-by-bit storage and processing of data can be trusted.

2. High scalability. Hadoop distributes data and computing tasks across clusters of available computers, and these clusters can easily be scaled out to thousands of nodes.

3. High efficiency. Hadoop can dynamically move data between nodes and keeps each node dynamically balanced, so processing is very fast.

4. High fault tolerance. Hadoop automatically keeps multiple copies of data and automatically re-executes failed tasks.

5. Low cost. Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

 

Hadoop's pseudo-distributed mode simulates a distributed Hadoop deployment on a single machine when resources are limited. Here we deploy the pseudo-distributed mode on a Linux virtual machine to simulate a Hadoop cluster.

 

II. Installation and Deployment

Installing and configuring the JDK on a Linux virtual machine is not covered here; it was introduced in my previous articles.

Step 1: Configure SSH password-free authentication

1. Check whether SSH is already installed.


2. Create a new SSH key with an empty passphrase and enable password-less login
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. Test whether you can log in without entering a password.

# ssh localhost

If this works, the configuration was successful. The first time you log in you will be asked whether to continue connecting; enter yes to proceed.

Note:

If password-free login is not configured during the Hadoop installation, you have to enter a password for the DataNode every time Hadoop is started. Since we usually operate a cluster, that would be very inconvenient.

Step 2: Deploy Hadoop

1. Download Hadoop from http://hadoop.apache.org/

Here we download hadoop-2.6.0.tar.gz

2. # mkdir /usr/local/hadoop

3. # tar -zxvf hadoop-2.6.0.tar.gz // unzip into /usr/local/hadoop

4. # vi /etc/profile // configure the hadoop environment variables

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0

export PATH=$HADOOP_HOME/bin:$PATH

export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# source /etc/profile // make the configuration take effect

5. Configure core-site.xml, hdfs-site.xml, and mapred-site.xml

# cd hadoop-2.6.0 // enter the extracted hadoop directory

etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.74.129:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/tmp</value>
  </property>
</configuration>

etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/hdfs/name</value>
    <description>storage path for the namenode</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/hdfs/data</value>
    <description>storage path for the datanode</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://192.168.74.129:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/mapred/local</value>
    <description>local path used by mapred</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/mapred/system</value>
    <description>system-level path for mapred, shared</description>
  </property>
</configuration>
6. Modify the JDK path in hadoop-env.sh

# vi etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_67

7. # hadoop namenode -format // format the HDFS file system, creating an empty file system

After the command completes successfully, the directories configured in core-site.xml, hdfs-site.xml, and mapred-site.xml are created.


Step 3: Start the Hadoop services

1. # sbin/start-all.sh // start; sbin/stop-all.sh // stop

2. # jps // verify that the Hadoop daemons are running
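If the services started correctly, jps typically lists the pseudo-distributed daemons roughly as follows (process IDs omitted here; the exact set can vary slightly between Hadoop 2.x releases):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps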

 

3. View Hadoop information in the browser

http://192.168.74.129:50070 // NameNode (HDFS) web UI

http://192.168.74.129:8088 // Hadoop cluster management page (YARN ResourceManager)

4. Logs can be viewed under /usr/local/hadoop/hadoop-2.6.0/logs


III. Run the Hadoop Example and Test the Deployment Environment


1. Run the following WordCount code from the official website to check whether the Hadoop environment we deployed works correctly:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Upload it to the directory where hadoop was extracted:


2. Check whether a directory exists.

# hadoop fs -ls // we have not created any directory since deployment, so nothing is listed:

3. Create a new text file, file0.txt, and enter some content, i.e. the words we want to count:


4. Create input and output directories

First create the folders on HDFS

# bin/hdfs dfs -mkdir -p /user/root/input

# bin/hdfs dfs -mkdir -p /user/root/output

5. Upload the text file to the HDFS input directory.

# bin/hdfs dfs -put /usr/local/hadoop/hadoop-2.6.0/test/* /user/root/input // upload the test/file0 file to hdfs /user/root/input

6. View the uploaded file

# bin/hdfs dfs -cat /user/root/input/file0

 

7. Compile the WordCount.java program

# bin/hadoop com.sun.tools.javac.Main WordCount.java


8. Package the compiled WordCount classes into a jar

# jar cf wc.jar WordCount*.class

9. Run the word count job

# bin/hadoop jar wc.jar WordCount /user/root/input /user/root/output/count


10. View the output

# bin/hdfs dfs -cat /user/root/output/count/part-r-00000
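Each output line is a word, a tab, and that word's count, sorted by word. As a purely hypothetical illustration (the real numbers depend on what was typed into file0), if file0 contained the line "hello hadoop hello world", the output would be:

hadoop	1
hello	2
world	1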



IV. Notes

1. This error is reported during # hadoop namenode -format when the hostname is incorrectly configured:

Analysis:

# hostname

The hostname it reports cannot be found in the /etc/hosts file.

Solution:

1) # vi /etc/hosts // add or correct the IP address for the configured hostname

2) # vi /etc/sysconfig/network // modify the hostname

3) # /etc/rc.d/init.d/network restart // restart the network service

If that does not work, restart the Linux virtual machine.


2. # hadoop fs -ls prints the following warning:

Java HotSpot(TM) Server VM warning: You have loaded library /usr/local/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.


Solution:

# vi /etc/profile

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# source /etc/profile

After this, running hadoop fs -ls no longer shows the warning.






