Hadoop Streaming and Pipes

Source: Internet
Author: User
Tags: socket, stdin, hadoop, mapreduce

The original question comes from http://bbs.hadoopor.com/viewthread.php?tid=542 
In the forum we found two articles on writing MapReduce in C++:
http://bbs.hadoopor.com/thread-256-1-1.html 
http://bbs.hadoopor.com/thread-420-1-2.html  
1. When a MapReduce program is written with Streaming, the reduce tasks do not wait until all map tasks have completed before starting, which I do not quite understand.
2. The implementations of the two approaches feel a little strange. Under Linux, reading data from stdin is the pipe style, while reading data through a socket is the stream style, yet Hadoop's naming seems to be the opposite. I don't know why.
3. From the code you can see that in Hadoop, Streaming uses stdin while Pipes uses a socket. What are the pros and cons of these two approaches?
By:guxiangxi

I don't understand questions one and two either, so I won't try to answer them. The third question interests me more: I have used Streaming before and did not find it especially useful, and although I am still more comfortable with C++, I write MapReduce in Java. Pipes is exactly what I wanted. Here are three articles for reference:
1, http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
2, http://remonstrate.wordpress.com/2010/10/01/hadoop-on-the-c-routine/
3, http://blog.endlesscode.com/2010/06/16/simple-demo-of-streaming-and-pipes/

Summarized as follows:
1. Streaming is an API provided by Hadoop that lets MapReduce programs be written in other programming languages, since Hadoop itself is Java-based (the author of Hadoop, who also wrote Lucene and Nutch, is stronger in Java). Hadoop Streaming is not complicated: it simply uses UNIX standard input and output as the interface between Hadoop and other languages, so a program in any language need only read from standard input and write to standard output. On standard input and output, key and value are tab-delimited, and on the reducer's standard input the Hadoop framework guarantees that the data arrives sorted by key.
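As a minimal sketch of the Streaming contract just described (tab-delimited key/value on stdin/stdout, reducer input sorted by key), a hypothetical Python word-count mapper/reducer pair might look like this; the function names are my own, not part of any Hadoop API:

```python
import sys
from itertools import groupby

def map_line(line):
    """Emit (word, 1) pairs for one raw input line, Streaming-style."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Sum counts per key; assumes pairs arrive sorted by key,
    as the Hadoop framework guarantees on the reducer's stdin."""
    return [(key, sum(count for _, count in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Mapper side: read raw lines, write "key\tvalue" lines.
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")

def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Reducer side: parse "key\tvalue" lines (already sorted by key).
    pairs = [(k, int(v)) for k, v in
             (line.rstrip("\n").split("\t", 1) for line in stdin)]
    for key, total in reduce_pairs(pairs):
        stdout.write(f"{key}\t{total}\n")
```

Outside of Hadoop, the same pipeline can be simulated in a shell with `cat input.txt | mapper | sort | reducer`, since `sort` stands in for Hadoop's shuffle.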

2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which uses standard input and output (Streaming can of course also be used from C++), Hadoop Pipes uses a socket as the pipe between the TaskTracker and the C++ map/reduce process, rather than standard I/O, and it does not use JNI. Hadoop Pipes cannot run in standalone mode, so configure pseudo-distributed mode first: Pipes relies on Hadoop's distributed cache, which is only supported when HDFS is running. Unlike the Java interface, the keys and values in Hadoop Pipes are STL strings, so developers need to convert data types manually when processing them.

3. In essence, Hadoop Pipes and Hadoop Streaming do almost the same thing; besides the difference in how they communicate, Pipes can also take advantage of Hadoop's counter feature. Compared with native Java code, which can use any data type that implements the Writable interface as key/value, Pipes and Streaming must convert everything through strings (incurring communication and storage overhead). Perhaps for this reason, Pipes may be removed from Hadoop in the future. Of course, when the computation is expensive and native Java code cannot match the execution efficiency of C++, it may still be worth writing Streaming code. Pipes passes byte arrays, which can be wrapped in std::string, but the examples convert everything to strings for input and output; this requires the programmer to design a reasonable input/output scheme (how to split the data into key/value).
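The string-conversion cost mentioned above can be made concrete. Where a Java job could emit a composite Writable directly, a Streaming or Pipes job must pick its own text encoding for composite values and parse it back itself. A hypothetical sketch (the comma-separated scheme here is an arbitrary choice, not anything Hadoop prescribes):

```python
def encode_value(count, total_bytes):
    """Pack a composite value (two ints) into one tab-safe string
    field, as a Streaming/Pipes job must do by hand."""
    return f"{count},{total_bytes}"

def decode_value(text):
    """Parse the string back into native types on the other side;
    every record pays this serialize/parse round trip."""
    count, total_bytes = text.split(",")
    return int(count), int(total_bytes)
```

This round trip on every record is exactly the communication and storage overhead the paragraph above refers to.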

This has been confirmed: Pipes has been removed from Hadoop. Run $ ~/hadoop-0.21.0/bin/hadoop and the pipes command no longer appears.

References on usage:
1, http://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
2, http://code.google.com/p/hypertable/wiki/MapReduceWithHypertable

Reposted from: http://hi.baidu.com/huaqing03/blog
