Writing scalable, distributed data-intensive programs
Understanding Hadoop and MapReduce fundamentals
Writing and running a basic MapReduce program
1. What is Hadoop
Hadoop is an open-source framework for writing and running distributed applications to handle large-scale data.
What makes Hadoop distinctive are the following points:
Accessible--Hadoop runs on large clusters of commodity machines or on cloud computing services;
Robust--because Hadoop is intended to run on commodity hardware, its architecture is built on the assumption that hardware will fail frequently;
Scalable--Hadoop scales linearly to handle larger datasets by adding nodes to the cluster;
Simple--Hadoop lets users quickly write efficient parallel code.
2. Understanding Distributed Systems and Hadoop
Understand the contrast between distributed systems (scaling out) and large single-machine servers (scaling up), taking into account the cost/performance of current I/O technologies.
Understand the difference between Hadoop and other distributed architectures such as SETI@home:
Hadoop's design philosophy is to move code to the data, whereas SETI@home's philosophy is to move the data to the code.
The program to be run is several orders of magnitude smaller than the data it processes, so it is easier to move the executable code to the machines where the data already resides than to move the data across the network.
3. Comparing SQL databases and Hadoop
SQL (Structured Query Language) is designed for structured data, whereas many of Hadoop's earliest applications targeted unstructured data such as text. Let's compare Hadoop with a typical SQL database in more detail from several perspectives:
Scale out instead of scale up--scaling up a commercial relational database is expensive;
Key/value pairs instead of relational tables--Hadoop uses key/value pairs as its basic data unit, which is flexible enough to handle less-structured data types;
Functional programming (MapReduce) instead of declarative queries (SQL)--in MapReduce you specify the actual data processing steps yourself, much like the execution plan of a SQL engine;
Offline batch processing instead of online transactions--Hadoop is designed for offline processing and analysis of large-scale data, not for online transaction processing with random reads and writes of a few records.
4. Understanding MapReduce
MapReduce is a data processing model whose greatest advantage is that it scales easily to process data across many compute nodes;
In the MapReduce model, the data processing primitives are called mappers and reducers;
Decomposing a data processing application into mappers and reducers is sometimes cumbersome, but once an application is written in MapReduce form, it can be scaled to run on hundreds, thousands, or even tens of thousands of machines in a cluster merely by changing the configuration.
[Scaling a simple program]
Processing a small number of documents: for each document, use a tokenizer to extract the words one by one; for each word, add 1 to the corresponding entry in a multiset wordCount; finally, a display() function prints all the entries in wordCount.
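For concreteness, the single-machine version is only a few lines of code. A minimal Java sketch of it, assuming the documents are simply in-memory strings (the class name and sample documents are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class NaiveWordCount {
    public static void main(String[] args) {
        // The multiset wordCount: word -> number of occurrences
        Map<String, Integer> wordCount = new HashMap<String, Integer>();

        // Hypothetical in-memory document set; in practice these would be read from files
        String[] documentSet = { "hello hadoop", "hello mapreduce" };

        for (String document : documentSet) {
            StringTokenizer itr = new StringTokenizer(document);    // split on whitespace
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                Integer count = wordCount.get(word);
                wordCount.put(word, count == null ? 1 : count + 1); // wordCount[word]++
            }
        }

        // display(): print every entry in wordCount
        for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}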
Processing a large number of documents: distribute the work over multiple machines, each of which processes a different portion of the documents; when all the machines have finished, a second processing phase merges their results.
Several details can keep this scheme from working as expected: reading all the documents from a central storage server may exceed the server's bandwidth, and merging the wordCount tables from many machines may exceed a single computer's memory. Moreover, if only one computer handles the wordCount merging in the second phase, that computer becomes a bottleneck, so the merging itself has to be distributed: wordCount must be partitioned at the end of the first phase in some way so that each computer in the second phase only has to process one partition, and the partitions can be processed independently.
To make it work on a distributed computer cluster, you need to add the following features:
Store files on many computers (first stage)
Write a disk-based hash table so that processing is not limited by memory capacity
Partition the intermediate data (i.e., wordCount) from the first stage (a hash-based scheme is sketched after this list)
Shuffle these partitions to the appropriate computers in the second stage
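A common way to partition the intermediate wordCount data is to hash each word into one of N buckets, so that every occurrence of a given word ends up on the same second-stage computer. A minimal sketch of that idea (the class and method names are illustrative; this is not Hadoop's own partitioning API, which the framework supplies for you):

public class WordPartitioner {
    // Assign a word to one of numPartitions buckets. Identical words always
    // hash to the same bucket, so one machine sees all counts for that word.
    public static int partitionFor(String word, int numPartitions) {
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same word always lands in the same partition.
        System.out.println(partitionFor("hadoop", 4));
        System.out.println(partitionFor("hadoop", 4));
    }
}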
The execution of a MapReduce program is divided into two main stages, mapping and reducing. Each stage is defined as a data processing function, called the mapper and the reducer respectively. In the mapping stage, MapReduce takes the input data and feeds each data element to the mapper; in the reducing stage, the reducer processes all the outputs from the mapper and arrives at a final result. In short, the mapper filters and transforms the input so that the reducer can complete the aggregation.
In addition, to scale the distributed word counting program, we also had to write the partitioning and shuffling functions.
Writing an application in the MapReduce framework is a matter of customizing the mapper and the reducer; the complete data flow is as follows:
The input of the app must be organized into a list of key/value pairs (<k1,v1>);
A list of key/value pairs is split, and each individual key/value pair <k1,v1> is processed by calling Mapper's map function;
The output of all the mappers is aggregated into one giant list of <k2,v2> pairs;
Pairs sharing the same k2 are grouped together into <k2,list(v2)>; each reducer processes one such group and outputs <k3,v3> (a small end-to-end sketch follows).
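To make the abstract flow concrete, here is a tiny in-memory simulation of the <k1,v1> -> <k2,v2> -> <k2,list(v2)> -> <k3,v3> pipeline, using word counting as the example. The class name and sample input are illustrative; in the real framework (see the WordCount listing later in this section) the splitting, grouping, and shuffling are done for you:

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Input: <k1, v1> = <line number, line of text>
        Map<Integer, String> input = new LinkedHashMap<Integer, String>();
        input.put(0, "hello hadoop");
        input.put(1, "hello mapreduce");

        // Map: each <k1, v1> produces a list of <k2, v2> = <word, 1>
        List<Map.Entry<String, Integer>> mapped = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<Integer, String> pair : input.entrySet()) {
            for (String word : pair.getValue().split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }

        // Group (shuffle): aggregate the mapper output into <k2, list(v2)>
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : mapped) {
            List<Integer> vals = grouped.get(pair.getKey());
            if (vals == null) {
                vals = new ArrayList<Integer>();
                grouped.put(pair.getKey(), vals);
            }
            vals.add(pair.getValue());
        }

        // Reduce: each <k2, list(v2)> produces <k3, v3> = <word, total count>
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) sum += v;
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}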
5. Using Hadoop to count words--running your first program
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format     format the DFS filesystem
secondarynamenode    run the DFS secondary namenode
namenode             run the DFS namenode
datanode             run a DFS datanode
dfsadmin             run a DFS admin client
fsck                 run a DFS filesystem checking utility
fs                   run a generic filesystem user client
balancer             run a cluster balancing utility
jobtracker           run the MapReduce jobtracker node
pipes                run a Pipes job
tasktracker          run a MapReduce tasktracker node
job                  manipulate MapReduce jobs
version              print the version
jar <jar>            run a jar file
distcp <srcurl> <desturl>   copy files or directories recursively
archive -archiveName NAME <src>* <dest>   create a Hadoop archive
daemonlog            get/set the log level for each daemon
CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked without parameters.
The command for running the word counting example program has the following format:
hadoop jar hadoop-*-examples.jar wordcount [-m <maps>] [-r <reduces>] <input> <output>
The commands for compiling and packaging the modified word counting program are as follows:
javac -classpath hadoop-*-core.jar -d playground/classes playground/src/WordCount.java
jar -cvf playground/wordcount.jar -C playground/classes/ .
The command for running the modified word counting program has the following format:
hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output
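A typical end-to-end run then looks roughly like the following sketch (the local directory docs and the HDFS paths input and output are illustrative; the part-r-00000 file name follows the convention of the newer mapreduce API used in the listing below):

hadoop fs -put docs input            # copy local text files into HDFS
hadoop jar playground/wordcount.jar org.apache.hadoop.examples.WordCount input output
hadoop fs -cat output/part-r-00000   # inspect the reducer output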
Code listing: WordCount.java
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString()); // (1) split on whitespace
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());                                 // (2) put the token into the Text object
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);                                  // (3) output the count for each token
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
At position (1), WordCount uses Java's StringTokenizer in its default configuration, which tokenizes only on whitespace. To ignore standard punctuation during tokenization, add the punctuation characters to StringTokenizer's delimiter list:
StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");
At position (2), because we want the word count to ignore capitalization, convert every token to lowercase before putting it into the Text object:
word.set(itr.nextToken().toLowerCase());
At position (3), to show only words that appear more than four times:
if (sum > 4) context.write(key, result);
6. Hadoop history
Founder: Doug Cutting
Around 2004--Google published two papers describing the Google File System (GFS) and the MapReduce framework.
January 2006--Yahoo hired Doug Cutting, who worked with a dedicated team on improving Hadoop as an open source project.
[Hadoop in Action] Chapter 1: Introduction to Hadoop