Hadoop tutorial (1)

Source: Internet
Author: User
Tags: hadoop, mapreduce, hadoop fs

Original: Cloudera; translation: ImportNew - Royce Wong

Hadoop starts here! Join me in learning the basics of using Hadoop. This tutorial describes how to analyze data with Hadoop.

This topic covers the most important things a user faces when working with the Hadoop MapReduce (hereinafter MR) framework. MapReduce consists of client APIs and a runtime environment: the client APIs are used to write MR programs, and the runtime environment executes them. The API comes in two versions, the old API and the new API, and the runtime likewise comes in two versions, MRv1 and MRv2. This tutorial is based on the old API and MRv1.

The old API lives in the org.apache.hadoop.mapred package, and the new API in org.apache.hadoop.mapreduce.
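For orientation, here is a minimal sketch (not part of the original tutorial; the class names OldApiMapper and NewApiMapper are invented for illustration) of how the same mapper is declared under each API:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): Mapper is an interface,
// usually implemented together with the MapReduceBase helper class.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                  org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    output.collect(value, new IntWritable(1));   // emit the whole input line with a count of 1
  }
}

// New API (org.apache.hadoop.mapreduce): Mapper is an abstract class
// and output is written through a Context object instead of an OutputCollector.
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));    // same emission, expressed new-API style
  }
}

This tutorial sticks to the old API, so the WordCount example below uses the MapReduceBase/OutputCollector style.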

Prerequisites

Make sure that you have installed and configured CDH correctly and that it is running properly.

MR Overview

Hadoop MapReduce is an open-source computing framework. Applications running on it can process massive datasets (petabyte-scale) in parallel on clusters of thousands of nodes.

An MR job usually splits the input dataset into independent chunks, which are processed by map tasks in parallel. The MR framework sorts the map outputs and then feeds them as input to the reduce tasks. Typically, both the input and the final output of a job are stored in the distributed file system (HDFS).

In a typical deployment the compute nodes are also the storage nodes, that is, the MR framework and HDFS run on the same cluster. This configuration lets the framework schedule tasks on the nodes where the data to be analyzed already resides, which yields very high aggregate bandwidth across the cluster (a property worth keeping in mind when planning and deploying a cluster).

The MapReduce framework consists of one JobTracker (JT) and several TaskTrackers (TT). (If the JobTracker HA feature of CDH4 is used, there are two JobTrackers, only one of which is active while the other stands by in inactive state.) The JobTracker is responsible for scheduling tasks on all TaskTrackers, monitoring them, and re-executing failed tasks; the TaskTrackers simply execute the tasks the JobTracker assigns to them.

An application must at least specify the input and output paths and provide map and reduce functions that implement the appropriate interfaces and/or abstract classes. These paths, functions, and other task parameters make up the job configuration object. The Hadoop job client submits the job (a jar package or executable) together with the configuration object to the JT; the JT distributes the code and configuration to the TTs it allocates, schedules and monitors the tasks, and reports status and diagnostic information back to the client.

Hadoop is implemented in Java™. You can develop MR applications in Java, in other JVM-based languages, or via the following:

  • Hadoop Streaming - lets you create and run MR jobs using any executable (e.g. shell scripts) as the mapper and/or the reducer.
  • Hadoop Pipes - a SWIG-compatible C++ API (not based on JNI™) for implementing MapReduce applications.
Input and Output

The MapReduce framework operates internally on key-value pairs: it views the input of a job as a set of key-value pairs and produces a set of key-value pairs as output. The types of the output pairs may differ from those of the input pairs.

The key and value types must be serializable by the framework, so they must implement the Writable interface. In addition, the key classes must implement WritableComparable so that the framework can sort the keys.

The typical input and output types of an MR job are as follows:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
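WordCount below only uses the built-in Text and IntWritable types, but the same contract applies to custom types. As an illustration only (a hypothetical YearTempKey composite key, not part of the original example), a key class implementing WritableComparable might look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (year + temperature) showing the serialization
// contract (write/readFields) and the compareTo method used for sorting.
public class YearTempKey implements WritableComparable<YearTempKey> {
  private int year;
  private int temperature;

  public YearTempKey() {}                          // no-arg constructor required by the framework

  public YearTempKey(int year, int temperature) {
    this.year = year;
    this.temperature = temperature;
  }

  public void write(DataOutput out) throws IOException {     // serialize the fields
    out.writeInt(year);
    out.writeInt(temperature);
  }

  public void readFields(DataInput in) throws IOException {  // deserialize in the same order
    year = in.readInt();
    temperature = in.readInt();
  }

  public int compareTo(YearTempKey other) {                   // defines the sort order of map output keys
    if (year != other.year) return year < other.year ? -1 : 1;
    if (temperature != other.temperature) return temperature < other.temperature ? -1 : 1;
    return 0;
  }

  @Override
  public int hashCode() {                                     // used by the default HashPartitioner
    return year * 163 + temperature;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearTempKey)) return false;
    YearTempKey k = (YearTempKey) o;
    return year == k.year && temperature == k.temperature;
  }
}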

Classic WordCount 1.0

When playing with Hadoop you have to mention WordCount. The original CDH text uses it, so of course we use it here as the example too :)

Simply put, WordCount counts the number of occurrences of each word in the input data. It is simple enough to have become the classic, a perfect match for Hello World!

Source code:

1.  package org.myorg;
2.
3.  import java.io.IOException;
4.  import java.util.*;
5.
6.  import org.apache.hadoop.fs.Path;
7.  import org.apache.hadoop.conf.*;
8.  import org.apache.hadoop.io.*;
9.  import org.apache.hadoop.mapred.*;
10. import org.apache.hadoop.util.*;
11.
12. public class WordCount {
13.
14.   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
15.     private final static IntWritable one = new IntWritable(1);
16.     private Text word = new Text();
17.
18.     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
19.       String line = value.toString();
20.       StringTokenizer tokenizer = new StringTokenizer(line);
21.       while (tokenizer.hasMoreTokens()) {
22.         word.set(tokenizer.nextToken());
23.         output.collect(word, one);
24.       }
25.     }
26.   }
27.
28.   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
29.     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
30.       int sum = 0;
31.       while (values.hasNext()) {
32.         sum += values.next().get();
33.       }
34.       output.collect(key, new IntWritable(sum));
35.     }
36.   }
37.
38.   public static void main(String[] args) throws Exception {
39.     JobConf conf = new JobConf(WordCount.class);
40.     conf.setJobName("wordcount");
41.
42.     conf.setOutputKeyClass(Text.class);
43.     conf.setOutputValueClass(IntWritable.class);
44.
45.     conf.setMapperClass(Map.class);
46.     conf.setCombinerClass(Reduce.class);
47.     conf.setReducerClass(Reduce.class);
48.
49.     conf.setInputFormat(TextInputFormat.class);
50.     conf.setOutputFormat(TextOutputFormat.class);
51.
52.     FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.     JobClient.runJob(conf);
56.   }
57. }

Compile WordCount.java:

$ mkdir wordcount_classes
$ javac -cp classpath -d wordcount_classes WordCount.java

Classpath:

  • CDH4: /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
  • CDH3: /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u4-core.jar

Package the classes into a jar:

$ jar -cvf wordcount.jar -C wordcount_classes/ .

Assume the following paths in HDFS:

  • /user/cloudera/wordcount/input - the input path
  • /user/cloudera/wordcount/output - the output path

Create data in text format and move it to HDFS:

$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
$ hadoop fs -put file* /user/cloudera/wordcount/input

Run WordCount:

$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output

View the output after running:

$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

An MR application can use the -files option to make one or more files (comma-separated) available in the current working directory of its tasks. The -libjars option adds jar packages to the classpaths of the map and reduce tasks. The -archives option passes ZIP or jar archives as arguments; each archive is unarchived and a link with the archive's name is created in the tasks' working directory. For more detailed command information, refer to the Hadoop command guide.

Run WordCount with the -libjars and -files options:

$ hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output
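These generic options (-files, -libjars, -archives) are recognized because they are parsed by Hadoop's GenericOptionsParser; in your own driver that usually means implementing the Tool interface and launching through ToolRunner. As a minimal sketch only (the class name WordCountDriver is our own, reusing the Map and Reduce classes from the WordCount listing above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: ToolRunner/GenericOptionsParser strips -files, -libjars
// and -archives from the command line before run() sees the remaining args.
public class WordCountDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), org.myorg.WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(org.myorg.WordCount.Map.class);
    conf.setCombinerClass(org.myorg.WordCount.Reduce.class);
    conf.setReducerClass(org.myorg.WordCount.Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // args[0]/args[1] are the leftover input/output paths
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

Such a driver could then be invoked in the same style as the command above, e.g. hadoop jar wordcount.jar WordCountDriver -files cachefile.txt -libjars mylib.jar input output.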

Now let's walk through the WordCount application in more detail.

Lines 14-26 implement the Mapper. Its map method (lines 18-25) processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace using StringTokenizer and emits a key-value pair of <word, 1> for each token.

For the input above, the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>

The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Later in this tutorial we will look more closely at the number of maps spawned for a job and how to control their output in a more fine-grained manner.

WordCount specifies a combiner at line 46, so the output of each map is run through the local combiner (the same class as the Reducer) for local aggregation after being sorted by key.

The final output of the first map is: <Bye, 1> <Hello, 1> <World, 2>

The final output of the second map is: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

The Reducer implementation (lines 28-36), via its reduce method (lines 29-35), simply sums the values, which gives the number of occurrences of each word.

Therefore, the final output of WordCount is: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

The main method specifies various facets of the job through the JobConf object, such as the input/output paths, key/value types, and input/output formats. It then submits the job and monitors its progress by calling JobClient.runJob (line 55).

Later we will learn more about JobConf, JobClient, and Tool.


Link: http://www.importnew.com/4248.html

 

