For the original question, see: http://bbs.hadoopor.com/viewthread.php?tid=542
I searched the forum and found two articles on writing MapReduce in C/C++:
http://bbs.hadoopor.com/thread-256-1-1.html
http://bbs.hadoopor.com/thread-420-1-2.html
1. I don't quite understand why writing MapReduce programs with Streaming requires the reduce tasks to run only after all map tasks have completed.
2. Looking at the two implementations, the naming feels a bit odd: in Linux, reading data from stdin is generally considered a pipe, while reading data through a socket is a stream. In Hadoop, however, the names seem to be the opposite of Linux's. I don't know why.
3. From the code we can see that in Hadoop, Streaming uses stdin while Pipes uses sockets. What are the advantages and disadvantages of each?
By: guxiangxi
I don't understand the first or second question either. The third one matters more to me: I used Streaming before and did not find it particularly useful. I am still more at home in C++, but for now I write MapReduce in Java. Pipes is exactly what I want. The following three articles are for reference:
1. http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
2. http://remonstrate.wordpress.com/2010/10/01/hadoop-%E4%B8%8A%E7%9A%84-c-%E4%BE%8B%E7%A8%8B/
3. http://blog.endlesscode.com/2010/06/16/simple-demo-of-streaming-and-pipes/
Summary:
1. Streaming is an API provided by Hadoop for writing MapReduce in other programming languages, since Hadoop itself is Java-based (its author is strongest in Java; Lucene and Nutch come from the same author). Hadoop Streaming is not complex: it simply uses Unix standard input and output as the interface between Hadoop and programs written in other languages. A program in any language therefore only needs to read from standard input and write to standard output. On both streams, key and value are separated by a tab. In the reducer's standard input, the Hadoop framework guarantees that records arrive sorted by key; a minimal sketch of such a pair of programs follows.
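For example, here is a minimal word-count mapper and reducer written for Streaming in C++ (a sketch under the conventions just described, not code from the referenced articles):

```cpp
// Streaming mapper: reads raw lines from stdin and emits
// "word<TAB>1" records on stdout. Everything before the first
// tab is treated by the framework as the key.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line, word;
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        while (words >> word)
            std::cout << word << '\t' << 1 << '\n';
    }
    return 0;
}
```

```cpp
// Streaming reducer: because the framework sorts by key, identical
// keys arrive on consecutive lines, so a running sum per key suffices.
#include <iostream>
#include <string>

int main() {
    std::string line, current;
    long sum = 0;
    while (std::getline(std::cin, line)) {
        std::size_t tab = line.find('\t');
        std::string key = line.substr(0, tab);
        long count = (tab == std::string::npos)
                         ? 0 : std::stol(line.substr(tab + 1));
        if (key != current) {
            if (!current.empty())
                std::cout << current << '\t' << sum << '\n';
            current = key;
            sum = 0;
        }
        sum += count;
    }
    if (!current.empty())
        std::cout << current << '\t' << sum << '\n';
    return 0;
}
```

The compiled binaries are submitted with the streaming jar (its exact path varies across Hadoop versions), roughly:

```
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -input in -output out \
    -mapper ./mapper -reducer ./reducer \
    -file mapper -file reducer
```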
2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which uses standard input and output (of course, C++ can also be used with Streaming), Hadoop Pipes uses a socket as the channel for communication between the TaskTracker and the C++ map/reduce process; it is neither standard I/O nor JNI. Hadoop Pipes cannot run in standalone mode, so at least a pseudo-distributed configuration is required: Pipes relies on Hadoop's distributed cache, and the distributed cache only works when HDFS is running. Unlike the Java interface, keys and values in Hadoop Pipes are both STL strings, so developers must convert data types manually during processing, as in the sketch below.
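For comparison, this is the classic Pipes word-count (essentially the example from the Yahoo tutorial linked below), compiled against the Pipes headers and libraries shipped with Hadoop; note how keys and values are plain strings that must be converted by hand:

```cpp
#include <string>
#include <vector>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class WordCountMapper : public HadoopPipes::Mapper {
public:
    WordCountMapper(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context) {
        // Keys/values arrive as strings; split the input line into words.
        std::vector<std::string> words =
            HadoopUtils::splitString(context.getInputValue(), " ");
        for (size_t i = 0; i < words.size(); ++i)
            context.emit(words[i], "1");
    }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
    WordCountReducer(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue())
            sum += HadoopUtils::toInt(context.getInputValue());
        // Manual conversion back to a string before emitting.
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
    }
};

int main(int argc, char* argv[]) {
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}
```

The binary is first uploaded to HDFS (which is why standalone mode cannot work), then launched with the pipes subcommand:

```
hadoop fs -put wordcount bin/wordcount
hadoop pipes -D hadoop.pipes.java.recordreader=true \
             -D hadoop.pipes.java.recordwriter=true \
             -input in -output out -program bin/wordcount
```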
3. In essence, Pipes and Hadoop Streaming do almost the same thing. Beyond the difference in how they communicate, Pipes can also use Hadoop's counter feature. Compared with native Java code, which can use any type implementing the Writable interface as key/value, Pipes and Streaming must funnel everything through strings (extra communication and storage overhead). Pipes may be removed from Hadoop later. Of course, when the computation itself is expensive, native Java code can be slower than C++, so Streaming code may still be worth writing in the future. Pipes uses byte arrays, which can be wrapped in std::string, but the examples convert everything to and from strings for input and output; the programmer therefore has to design a sensible input/output encoding (how to split data into key/value).
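As a sketch of the counter feature mentioned above (a mapper fragment to be combined with a reducer and runTask as in the previous example; in Streaming the rough equivalent is printing "reporter:counter:group,name,amount" lines to stderr):

```cpp
#include "hadoop/Pipes.hh"

// A Pipes mapper that tracks skipped records with a Hadoop counter;
// counter values are aggregated by the framework and shown in the
// job status. The group/name strings here are arbitrary examples.
class CountingMapper : public HadoopPipes::Mapper {
    HadoopPipes::TaskContext::Counter* emptyLines;
public:
    CountingMapper(HadoopPipes::TaskContext& context)
        : emptyLines(context.getCounter("Quality", "EMPTY_LINES")) {}

    void map(HadoopPipes::MapContext& context) {
        if (context.getInputValue().empty()) {
            context.incrementCounter(emptyLines, 1);  // count and skip
            return;
        }
        context.emit(context.getInputKey(), context.getInputValue());
    }
};
```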
Confirmed: Pipes has been removed from Hadoop. Running $ ~/hadoop-0.21.0/bin/hadoop, the pipes command no longer appears in the listing.
Usage reference:
1. http://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
2. http://code.google.com/p/hypertable/wiki/MapReduceWithHypertable