The original question comes from: http://bbs.hadoopor.com/viewthread.php?tid=542
In that forum we found two articles about writing MapReduce programs in C++:
http://bbs.hadoopor.com/thread-256-1-1.html
http://bbs.hadoopor.com/thread-420-1-2.html
1. In the streaming version, the reduce tasks apparently do not wait until all map tasks are completed before starting, which I don't really understand.
2. Comparing the two implementations feels a little strange. Under Linux, reading data from stdin is what we normally associate with pipes, while reading data through a socket is what we associate with streams, but in Hadoop the naming seems to be the opposite. I don't know why.
3. From the code you can see that in Hadoop, streaming uses stdin/stdout while pipes uses a socket. What are the pros and cons of these two approaches?
By:guxiangxi
Questions 1 and 2 I don't understand either, so I won't try to answer them. The third question I care more about: I used streaming before and didn't find it particularly useful, and at the moment I am still more familiar with C++ but write MapReduce in Java, so pipes is exactly what I wanted. Here are three articles to refer to:
1, http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
2, http://remonstrate.wordpress.com/2010/10/01/hadoop-on-the-c-routine/
3, http://blog.endlesscode.com/2010/06/16/simple-demo-of-streaming-and-pipes/
Summarized as follows:
1. Streaming is an API provided by Hadoop that lets MapReduce programs be written in other programming languages, since Hadoop itself is Java-based (the author was better at Java; Lucene and Nutch are also his projects). Hadoop streaming is not complicated: it simply uses UNIX standard input and output as the interface between Hadoop and programs in other languages. A program written in another language just reads its input from stdin and writes its output to stdout. In both directions, key and value are tab-delimited, and on the reduce side the Hadoop framework guarantees that the input data is sorted by key.
2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop streaming, which uses standard input and output (streaming can of course also be used from C++), Hadoop pipes uses a socket as the channel between the TaskTracker and the C++ map/reduce process, not stdin/stdout, and it does not use JNI. Hadoop pipes cannot run in standalone mode, so configure at least pseudo-distributed mode first: pipes relies on Hadoop's distributed cache, which is only available when HDFS is running. Unlike the Java interface, the keys and values in Hadoop pipes are STL strings, so developers need to convert data types manually.
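Because pipes hands everything to you as `std::string`, numeric work means converting by hand. As a hypothetical illustration (the helper name and shape are my own, not part of the pipes API), the core of a summing reducer would do something like:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: sum a batch of string-encoded counts, the kind of
// manual conversion a pipes reducer must do because values arrive as
// std::string rather than typed Writables.
std::string sum_counts(const std::vector<std::string>& values) {
    long total = 0;
    for (const std::string& v : values) {
        total += std::stol(v);        // manual string -> integer conversion
    }
    return std::to_string(total);     // manual integer -> string conversion
}
```

In a real pipes reducer the loop would instead pull values from the framework-provided context object, but the stol/to_string round trip is the same.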
3. In essence, Hadoop pipes and Hadoop streaming do almost the same thing; besides the difference in the communication channel, pipes can also take advantage of Hadoop's counter feature. Compared with native Java code, which can use any data type implementing the Writable interface as key/value, pipes and streaming must pass everything through strings, which adds communication and storage overhead. Perhaps for this reason, pipes may be removed from Hadoop in the future. Of course, if the computation is expensive and native Java code cannot match the execution efficiency of C++, it may still be worth writing streaming code. What pipes actually transfers are byte arrays, which happen to be encapsulated in std::string; the examples just convert them to and from text. This means the programmer can design any reasonable encoding of the data into key/value.
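Since a `std::string` is just a byte buffer, a pipes key or value need not be text at all. Here is one possible binary encoding (again my own sketch, assuming a fixed-width little-endian layout on both sides):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Pack a raw 32-bit integer into a std::string: pipes carries values as
// byte arrays inside std::string, so binary encodings work as well as text.
std::string encode_u32(std::uint32_t x) {
    std::string bytes(sizeof(x), '\0');
    std::memcpy(&bytes[0], &x, sizeof(x));
    return bytes;
}

std::uint32_t decode_u32(const std::string& bytes) {
    std::uint32_t x = 0;
    std::memcpy(&x, bytes.data(), sizeof(x));
    return x;
}
```

The four-byte encoding is denser than the decimal text a streaming job would ship, which is one way to cut the string-conversion overhead mentioned above.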
It has been confirmed: pipes has been removed from Hadoop. Running $ ~/hadoop-0.21.0/bin/hadoop no longer lists a pipes command.
References on usage:
1, http://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
2, http://code.google.com/p/hypertable/wiki/MapReduceWithHypertable
Reprinted from: http://hi.baidu.com/huaqing03/blog