Using Cygwin to Simulate a Linux Environment: Installing, Configuring and Running Stand-alone Hadoop


In fact, using Cygwin to simulate a Linux environment and run Hadoop is quite easy; with a little configuration you can run Hadoop in stand-alone mode.

The critical part is the Cygwin installation: when selecting packages you must install OpenSSH, otherwise the setup will not work. The Cygwin installation and configuration are briefly described below.

Downloading and installing Cygwin

First download setup.exe from http://cygwin.com/setup.exe, saving it to the desktop for example, then double-click it to start the installation.

When choosing an installation type, it is best to select the first option, which downloads the packages from the network and installs them immediately, as shown in the figure:

Then choose the installation path, the directory for the downloaded package files, the connection type (choose Use IE5 Settings here) and a download site (mirror); the installer then builds the package list automatically. The next step is the important one: selecting the installation type. Clicking the circular-arrow icon cycles through the installation types; make sure the last word on the topmost All line reads Install, as shown in the figure:

In fact, if you choose the Install installation type at the All level, the OpenSSH package is already selected.

To check the OpenSSH package yourself, expand the Net category, which lists the network-related packages, as shown in the figure:

Scroll down until you see OpenSSH, as shown in the figure:

If a version number is displayed in the Current column, the package is selected for this installation; otherwise Skip is displayed, meaning the package will be skipped and not installed.

Finally, wait for the download and installation to finish; this may take a while.

Configuration of Cygwin

After the installation is complete, configure Cygwin as follows (my Cygwin is installed under the g:/cygwin/ directory):

Set the environment variables:

Create a new system variable named CYGWIN with the value ntsec tty. Then edit the Path variable and append g:/cygwin/bin, keeping the existing entries.

With the basic configuration done, you can move on to configuring Hadoop.

Hadoop currently has several releases, such as hadoop-0.16.4 and hadoop-0.18.0; download one from Apache and unpack it.

Put the unpacked Hadoop on the G drive, for example g:/hadoop-0.16.4 in my case.

Configuring Hadoop only requires modifying the hadoop-env.sh file in the g:/hadoop-0.16.4/conf directory. Open it and you will see:

# The Java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

Remove the comment symbol from the second line and set the value to the JAVA_HOME of your machine, for example:

# The Java implementation to use. Required.
export JAVA_HOME="D:/Program Files/Java/jdk1.6.0_07"

Note that if your JDK installation directory contains spaces, you need to wrap the path in double quotes, otherwise it will cause an error.

Start Cygwin; it opens in your /home/yourname directory, as shown in the figure:

Change into the g:/hadoop-0.16.4 directory (under Cygwin this is /cygdrive/g/hadoop-0.16.4) and create a data input directory named input-dir, as shown in the figure:

Next, open g:/hadoop-0.16.4/input-dir and create a few text files there with Notepad; I created three: input-a.txt, input-b.txt and input-c.txt. Their contents are as follows:

input-a.txt: as after append actor as as Apache as after add as

input-b.txt: bench be bench believe background bench is block

input-c.txt: cafe Cat Communications Connection cat cat Cat Cust Cafe
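
If you would rather create the input files from code than in Notepad, a small Java sketch such as the one below would do the same thing; the path and file names simply mirror the ones above, so adjust them to your own layout.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CreateSampleInput {
    public static void main(String[] args) throws IOException {
        // Adjust this path to wherever you unpacked Hadoop.
        Path dir = Paths.get("G:/hadoop-0.16.4/input-dir");
        Files.createDirectories(dir);
        // Three small text files used as word-count input.
        Files.write(dir.resolve("input-a.txt"),
                "as after append actor as as apache as after add as".getBytes());
        Files.write(dir.resolve("input-b.txt"),
                "bench be bench believe background bench is block".getBytes());
        Files.write(dir.resolve("input-c.txt"),
                "cafe cat communications connection cat cat cat cust cafe".getBytes());
    }
}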

Now you can run the word-frequency example that ships with Hadoop. Enter the command bin/hadoop jar hadoop-0.16.4-examples.jar wordcount input-dir output-dir, where hadoop-0.16.4-examples.jar is the examples jar under g:/hadoop-0.16.4, input-dir is the data input directory that already contains the three files we created, and output-dir is the directory where Hadoop writes the processed output. At this point it helps to have a basic understanding of Google's MapReduce model, which is essentially a pipeline for processing data; the following excerpt from an IBM technical article explains it:

Reference

The MapReduce computation, in short, decomposes a large dataset into hundreds of small datasets. Each small dataset is processed by a node in the cluster (typically an ordinary computer) and produces intermediate results; these intermediate results are then merged by many nodes to form the final result.

The core of the computational model is the two functions map and reduce, both implemented by the user. Their job is to convert an input <key, value> pair, according to some mapping rule, into one or a batch of output <key, value> pairs.


Table 1: The Map and Reduce functions

Function | Input | Output | Description
Map | <k1, v1> | List(<k2, v2>) | Each small dataset is further parsed into a batch of <key, value> pairs, which are fed to the map function; each input <k1, v1> produces a batch of <k2, v2> pairs, and these <k2, v2> pairs are the intermediate results of the computation.
Reduce | <k2, List(v2)> | <k3, v3> | In the input intermediate result <k2, List(v2)>, List(v2) is the batch of values that belong to the same key k2.

Writing a distributed parallel program on the MapReduce model is therefore very simple: the programmer's main coding work is to implement the map and reduce functions. The other hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance and network communication, are handled by the MapReduce framework (Hadoop in this case), so the programmer does not have to worry about them at all.
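
Before looking at the real Hadoop code further down, it may help to see the model stripped of any framework. The following plain-Java sketch (illustrative only, not the Hadoop API) runs the same word-count idea in memory: the map step emits a <word, 1> pair per token, the framework's grouping of intermediate results is imitated with a HashMap, and the reduce step sums the values for each key.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // "map": turn one line (the value) into a batch of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // "reduce": given one key and all its values, emit <word, sum>.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        String[] lines = {"as after append actor as", "bench be bench"};

        // The framework would normally group the intermediate pairs by key;
        // here we do it by hand with a HashMap.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> result = reduce(e.getKey(), e.getValue());
            System.out.println(result.getKey() + "\t" + result.getValue());
        }
    }
}

The real WordCount class shown later follows exactly this shape, only using Hadoop's Mapper and Reducer interfaces and running distributed.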

Local computation

The data is processed on the computer where it is stored, which reduces data transfer over the network and lowers the demand for bandwidth. In a cluster-based distributed parallel system such as Hadoop, compute nodes can be added easily, so it can provide nearly unlimited computing power; but because data has to flow between machines, network bandwidth becomes the bottleneck and is very precious. "Local computation" is the most effective way to save bandwidth, which the industry sums up as "moving computation is cheaper than moving data."

Task granularity

When the original large dataset is cut into small datasets, each small dataset is usually no larger than one HDFS block (64 MB by default), which ensures that a small dataset resides on a single computer and can be processed locally. If there are M small datasets to process, M map tasks are started; note that these M map tasks are distributed across N machines and run in parallel, while the number of reduce tasks can be specified by the user.
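
As a rough back-of-the-envelope illustration of that granularity (not Hadoop's actual split logic, which also respects file boundaries and configuration), the number of map tasks for a given input size can be estimated like this:

public class SplitEstimate {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // default HDFS block size (64 MB)
        long totalBytes = 10L * 1024 * 1024 * 1024;  // e.g. a 10 GB dataset
        // Ceiling division: roughly one map task per block of input data.
        long mapTasks = (totalBytes + blockSize - 1) / blockSize;
        System.out.println("estimated map tasks: " + mapTasks);  // prints 160
    }
}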

Partition

The intermediate results emitted by the map tasks are divided into R parts (R being the predefined number of reduce tasks) using a hash function such as hash(key) mod R. This guarantees that all keys within a certain range are handled by the same reduce task, which simplifies the reduce phase.
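
The following small Java sketch illustrates that hashing rule; it mirrors what a default hash partitioner does, but the class itself is only an illustration and not part of the Hadoop API.

public class PartitionSketch {
    // Pick which of the R reduce tasks handles this key: hash(key) mod R.
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 4;  // suppose 4 reduce tasks were configured
        for (String key : new String[] {"as", "bench", "cat"}) {
            System.out.println(key + " -> reduce task " + partitionFor(key, r));
        }
    }
}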

Combine

Before partitioning, the intermediate results can also be combined first: <key, value> pairs in the intermediate results that share the same key are merged into a single pair. The combine step is similar to reduce, and in many cases the reduce function is used for it directly, but combine runs as part of the map task, immediately after the map function finishes. Combining reduces the number of <key, value> pairs in the intermediate results and therefore reduces network traffic.
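
As a sketch of what combining buys you (again plain illustrative Java rather than the Hadoop API), a single map task that emitted the pairs <as, 1>, <as, 1>, <after, 1>, <as, 1> can pre-aggregate them locally before anything crosses the network. In the WordCount example below the same effect is achieved by reusing the Reduce class as the combiner via conf.setCombinerClass(Reduce.class).

import java.util.LinkedHashMap;
import java.util.Map;

public class CombineSketch {
    public static void main(String[] args) {
        // Intermediate pairs produced by one map task, in emission order.
        String[] keys = {"as", "as", "after", "as"};
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String k : keys) {
            // Same logic as reduce (summing counts), applied locally on the map side.
            combined.merge(k, 1, Integer::sum);
        }
        System.out.println(combined);  // {as=3, after=1}: 4 pairs shrunk to 2
    }
}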

Reduce tasks fetch intermediate results from the map task nodes

After combine and partition are done, the intermediate results of a map task are stored as files on the local disk. The locations of these intermediate result files are reported to the master JobTracker, which then tells each reduce task from which node to fetch its intermediate results. Note that the intermediate results produced by all map tasks are divided into R parts by the same hash function on their keys, and each of the R reduce tasks is responsible for one key interval. Each reduce task fetches, from many map task nodes, the intermediate results that fall into its key interval, and then executes the reduce function to produce one final result file.

Task Pipeline

With R reduce tasks there will be R final result files. In many cases these R files do not need to be merged into one, because they can serve directly as the input of another computation task, starting the next parallel job.

The execution process looks like this:

shiyanjun@cbbd2ce9428e48b /cygdrive/g/hadoop-0.16.4
$ bin/hadoop jar hadoop-0.16.4-examples.jar wordcount input-dir output-dir
cygpath: cannot create short name of G:/hadoop-0.16.4/logs
08/09/12 20:42:50 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/09/12 20:42:51 INFO mapred.FileInputFormat: Total input paths to process : 3
08/09/12 20:42:51 INFO mapred.JobClient: Running job: job_local_1
08/09/12 20:42:51 INFO mapred.MapTask: numReduceTasks: 1
08/09/12 20:42:52 INFO mapred.JobClient:  map 0% reduce 0%
08/09/12 20:42:52 INFO mapred.LocalJobRunner: file:/g:/hadoop-0.16.4/input-dir/input-a.txt:0+57
08/09/12 20:42:52 INFO mapred.TaskRunner: Task 'job_local_1_map_0000' done.
08/09/12 20:42:52 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0000' to file:/g:/hadoop-0.16.4/output-dir
08/09/12 20:42:53 INFO mapred.MapTask: numReduceTasks: 1
08/09/12 20:42:53 INFO mapred.LocalJobRunner: file:/g:/hadoop-0.16.4/input-dir/input-b.txt:0+48
08/09/12 20:42:53 INFO mapred.TaskRunner: Task 'job_local_1_map_0001' done.
08/09/12 20:42:53 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0001' to file:/g:/hadoop-0.16.4/output-dir
08/09/12 20:42:53 INFO mapred.MapTask: numReduceTasks: 1
08/09/12 20:42:53 INFO mapred.LocalJobRunner: file:/g:/hadoop-0.16.4/input-dir/input-c.txt:0+56
08/09/12 20:42:53 INFO mapred.TaskRunner: Task 'job_local_1_map_0002' done.
08/09/12 20:42:53 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0002' to file:/g:/hadoop-0.16.4/output-dir
08/09/12 20:42:53 INFO mapred.JobClient:  map 100% reduce 0%
08/09/12 20:42:54 INFO mapred.LocalJobRunner: reduce > reduce
08/09/12 20:42:54 INFO mapred.TaskRunner: Task 'reduce_z7f1uq' done.
08/09/12 20:42:54 INFO mapred.TaskRunner: Saved output of task 'reduce_z7f1uq' to file:/g:/hadoop-0.16.4/output-dir
08/09/12 20:42:54 INFO mapred.JobClient: Job complete: job_local_1
08/09/12 20:42:54 INFO mapred.JobClient: Counters: 9
08/09/12 20:42:54 INFO mapred.JobClient:   Map-Reduce Framework
08/09/12 20:42:54 INFO mapred.JobClient:     Map input records=3
08/09/12 20:42:54 INFO mapred.JobClient:     Map output records=30
08/09/12 20:42:54 INFO mapred.JobClient:     Map input bytes=161
08/09/12 20:42:54 INFO mapred.JobClient:     Map output bytes=284
08/09/12 20:42:54 INFO mapred.JobClient:     Combine input records=30
08/09/12 20:42:54 INFO mapred.JobClient:     Combine output records=16
08/09/12 20:42:54 INFO mapred.JobClient:     Reduce input groups=16
08/09/12 20:42:54 INFO mapred.JobClient:     Reduce input records=16
$

The above shows the job executing on the given input data. The results have been written to the output-dir directory: under g:/hadoop-0.16.4/output-dir you will find two generated files, .part-00000.crc and part-00000.

View the results of the processing in Cygwin, as shown in the figure:

Alternatively, open the part-00000 file directly under the g:/hadoop-0.16.4/output-dir directory; its contents look like this:

Actor 1
Add 2
After 2
Apache 1
Append 1
As 6
Background 1
Be 2
Believe 1
Bench 3
Block 1
Cafe 2
Cat 4
Communications 1
Connection 1
Cust 1

It's the same as above.

This is a very simple example, but it shows how Hadoop implements Google's MapReduce model to process data.

We can take a quick look at the implementation of the WordCount class; its source code is included as an example in the Hadoop release package, and WordCount.java looks like this:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads the input files, breaks each line into words and counts how often each word appears.
 * The output is the list of words and their frequencies.
 * Run it with: bin/hadoop jar build/hadoop-examples.jar wordcount
 *              [-m <i>maps</i>] [-r <i>reduces</i>] <i>in-dir</i> <i>out-dir</i>
 */
public class WordCount extends Configured implements Tool {

  /**
   * MapClass is a static inner class. It counts the words in each line of the data files.
   */
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  /**
   * Reduce is a static inner class. It sums the intermediate counts for each word;
   * because this example is simple, no further merging of intermediate results is needed.
   */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  static int printUsage() { // print how the command is used
    System.out.println("wordcount [-m <maps>] [-r <reduces>] <input> <output>");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  /**
   * The driver part of the Map/Reduce program: it builds and submits the Map/Reduce job.
   */
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " + args[i - 1]);
        return printUsage();
      }
    }
    // Make sure there are exactly 2 parameters left.
    if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " +
          other_args.size() + " instead of 2.");
      return printUsage();
    }
    conf.setInputPath(new Path(other_args.get(0)));
    conf.setOutputPath(new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}

This simple example should give you a general idea of the MapReduce model and how it is implemented.
