The original question comes from: http://bbs.hadoopor.com/viewthread.php?tid=542
In that forum we found two articles about writing MapReduce programs in C++:
http://bbs.hadoopor.com/thread-256-1-1.html
http://bbs.hadoopor.com/thread-420-1-2.html
1. In the streaming version, the reduce tasks apparently do not wait until all map tasks are completed before starting, which I don't really understand.
2. Comparing the two implementations feels a little strange. Under Linux, reading data from stdin is what we normally associate with pipes, while reading data through a socket is what we associate with streams, but in Hadoop the naming seems to be the opposite. I don't know why.
3. From the code you can see that in Hadoop, streaming uses stdin/stdout while pipes uses a socket. What are the pros and cons of these two approaches?
By:guxiangxi
Questions 1 and 2 I don't understand either, so I won't try to answer them. The third question I care more about: I used streaming before and didn't find it particularly useful, and at the moment I am still more familiar with C++ but write MapReduce in Java, so pipes is exactly what I wanted. Here are three articles to refer to:
1, http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
2, http://remonstrate.wordpress.com/2010/10/01/hadoop-on-the-c-routine/
3, http://blog.endlesscode.com/2010/06/16/simple-demo-of-streaming-and-pipes/
Summarized as follows:
1. Streaming is an API provided by Hadoop that lets MapReduce programs be written in other programming languages, since Hadoop itself is Java-based (the author was better at Java; Lucene and Nutch are also his projects). Hadoop streaming is not complicated: it simply uses UNIX standard input and output as the interface between Hadoop and programs in other languages. A program written in another language just reads its input from stdin and writes its output to stdout. In both directions, key and value are tab-delimited, and on the reduce side the Hadoop framework guarantees that the input data is sorted by key.
2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop streaming, which uses standard input and output (streaming can of course also be used from C++), Hadoop pipes uses a socket as the channel between the TaskTracker and the C++ map/reduce process, not stdin/stdout, and it does not use JNI. Hadoop pipes cannot run in standalone mode, so configure at least pseudo-distributed mode first: pipes relies on Hadoop's distributed cache, which is only available when HDFS is running. Unlike the Java interface, the keys and values in Hadoop pipes are STL strings, so developers need to convert data types manually.
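Because pipes hands everything to you as `std::string`, numeric work means converting by hand. As a hypothetical illustration (the helper name and shape are my own, not part of the pipes API), the core of a summing reducer would do something like:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: sum a batch of string-encoded counts, the kind of
// manual conversion a pipes reducer must do because values arrive as
// std::string rather than typed Writables.
std::string sum_counts(const std::vector<std::string>& values) {
    long total = 0;
    for (const std::string& v : values) {
        total += std::stol(v);        // manual string -> integer conversion
    }
    return std::to_string(total);     // manual integer -> string conversion
}
```

In a real pipes reducer the loop would instead pull values from the framework-provided context object, but the stol/to_string round trip is the same.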
3. In essence, Hadoop pipes and Hadoop streaming do almost the same thing; besides the difference in the communication channel, pipes can also take advantage of Hadoop's counter feature. Compared with native Java code, which can use any data type implementing the Writable interface as key/value, pipes and streaming must pass everything through strings, which adds communication and storage overhead. Perhaps for this reason, pipes may be removed from Hadoop in the future. Of course, if the computation is expensive and native Java code cannot match the execution efficiency of C++, it may still be worth writing streaming code. What pipes actually transfers are byte arrays, which happen to be encapsulated in std::string; the examples just convert them to and from text. This means the programmer can design any reasonable encoding of the data into key/value.
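Since a `std::string` is just a byte buffer, a pipes key or value need not be text at all. Here is one possible binary encoding (again my own sketch, assuming a fixed-width little-endian layout on both sides):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Pack a raw 32-bit integer into a std::string: pipes carries values as
// byte arrays inside std::string, so binary encodings work as well as text.
std::string encode_u32(std::uint32_t x) {
    std::string bytes(sizeof(x), '\0');
    std::memcpy(&bytes[0], &x, sizeof(x));
    return bytes;
}

std::uint32_t decode_u32(const std::string& bytes) {
    std::uint32_t x = 0;
    std::memcpy(&x, bytes.data(), sizeof(x));
    return x;
}
```

The four-byte encoding is denser than the decimal text a streaming job would ship, which is one way to cut the string-conversion overhead mentioned above.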
It has been confirmed: pipes has been removed from Hadoop. Running $ ~/hadoop-0.21.0/bin/hadoop no longer lists a pipes command.
References on usage:
1, http://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
2, http://code.google.com/p/hypertable/wiki/MapReduceWithHypertable
Reprinted from: http://hi.baidu.com/huaqing03/blog