Hadoop Streaming and Pipes

Last Update:2018-12-05 Source: Internet

Author: User

Tags hadoop mapreduce

Original question: http://bbs.hadoopor.com/viewthread.php?tid=542
I searched the forum and found two articles on writing MapReduce in C/C++:
http://bbs.hadoopor.com/thread-256-1-1.html
http://bbs.hadoopor.com/thread-420-1-2.html
1. I don't quite understand why, when a MapReduce program is written with Streaming, the reduce tasks can only run after all map tasks have completed.
2. Judging by how the two methods are implemented, the naming feels a bit strange. On Linux, reading data from stdin is generally thought of as a pipe, while reading data through a socket is a stream. Hadoop seems to name them the other way round, and I don't know why.
3. The code shows that in Hadoop, Streaming uses stdin while Pipes uses a socket. What are the advantages and disadvantages of each?
By: guxiangxi

I can't answer the first or second question either. The third question matters more to me: I used Streaming before and did not find it particularly convenient. I am more familiar with C++, yet I still write MapReduce in Java, so Pipes is exactly what I want. The following three articles are for reference:
1. http://cxwangyi.blogspot.com/2010/01/writing-hadoop-programs-using-c.html
2. http://remonstrate.wordpress.com/2010/10/01/hadoop-%E4%B8%8A%E7%9A%84-c-%E4%BE%8B%E7%A8%8B/
3. http://blog.endlesscode.com/2010/06/16/simple-demo-of-streaming-and-pipes/

Summary:
1. Streaming is an API provided by Hadoop that lets you write MapReduce programs in languages other than Java. It is needed because Hadoop itself is written in Java (its author works in Java; Lucene and Nutch come from the same author). Hadoop Streaming is not complicated: it simply uses UNIX standard input and output as the interface between Hadoop and programs written in other languages. Such a program only has to read its input from standard input and write its output to standard output. On both, the key and the value are separated by a tab. For the standard input of the reduce program, the Hadoop framework guarantees that the data arrives sorted by key.
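The contract described above (tab-separated key and value, reducer input sorted by key) can be sketched in Python as a word count. This is only an illustration under stated assumptions: the names mapper, reducer, and run_locally are hypothetical, and the call to sorted() merely stands in for the sort that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process (assumption: no Hadoop)."""
    # sorted() is a stand-in for the framework's sort by key.
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_locally(["b a", "a c a"]):
        print(out)
```

In an actual job, each function would live in its own script that reads sys.stdin and prints to standard output, and the scripts would be passed to the streaming jar through its -mapper and -reducer options. The grouping step also suggests why reduce has to wait for the map phase: all values for a key are only adjacent once the complete map output has been sorted.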

2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which uses standard input and output (and can of course also be used with C++), Hadoop Pipes uses sockets as the channel between the TaskTracker and the map/reduce process; it uses neither standard input/output nor JNI. Hadoop Pipes cannot run in standalone mode, so you must first configure pseudo-distributed mode: Pipes relies on Hadoop's distributed cache, and the distributed cache is only supported when HDFS is running. Unlike the Java interface, the keys and values in Hadoop Pipes are both STL strings (std::string), so developers have to convert data types manually during processing.
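To make the socket-plus-strings idea more concrete, here is a small Python sketch that passes a key/value pair over a local socket as raw bytes and converts the value by hand on the receiving side. It does not reproduce the real Hadoop Pipes wire protocol or its C++ API: the 4-byte length prefix and the names send_record, recv_record, and roundtrip are assumptions made purely for illustration.

```python
import socket
import struct


def send_record(sock, key: bytes, value: bytes) -> None:
    """Send key and value, each preceded by a 4-byte big-endian length.

    Assumption: this framing is illustrative, not Hadoop's actual format.
    """
    for field in (key, value):
        sock.sendall(struct.pack(">I", len(field)) + field)


def _recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from the socket or raise if it closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed in the middle of a record")
        buf += chunk
    return buf


def recv_record(sock):
    """Receive one (key, value) pair written by send_record."""
    fields = []
    for _ in range(2):
        (length,) = struct.unpack(">I", _recv_exact(sock, 4))
        fields.append(_recv_exact(sock, length))
    return tuple(fields)


def roundtrip(key: bytes, value: bytes):
    """Send one small record through a connected socket pair and read it back."""
    parent, child = socket.socketpair()
    try:
        send_record(parent, key, value)
        return recv_record(child)
    finally:
        parent.close()
        child.close()


if __name__ == "__main__":
    # The count travels as text bytes and is converted manually afterwards.
    key, value = roundtrip(b"hadoop", str(42).encode())
    print(key.decode(), int(value.decode()) + 1)
```

Because each field carries its own length, a key may contain tabs or newlines without confusing the receiver, which is one practical difference from the line-oriented text used by Streaming.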

3. In essence, Pipes and Hadoop Streaming do almost the same thing. Apart from the communication mechanism, Pipes also gives direct access to Hadoop's counter feature through its API. Compared with either of them, native Java code can use any data type that implements the Writable interface as the key/value, whereas Pipes and Streaming must go through a conversion to and from strings, which adds communication and storage overhead. Pipes may be removed from Hadoop later. Of course, for computation-heavy jobs native Java code may be less efficient than C++, in which case Streaming code may be the way to go in the future. Pipes passes byte arrays, which can be wrapped in std::string, but the examples convert everything to string input and output. This requires the programmer to design a sensible input/output format (that is, how the key/value data is segmented).
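The string conversion and the need to design a key/value layout can be illustrated with a short Python sketch. The record layout used here (key, tab, then a comma-separated count and mean) and the function names encode_line, decode_line, and binary_size are hypothetical choices for this example, not something prescribed by Hadoop.

```python
import struct


def encode_line(key: str, count: int, mean: float) -> str:
    """Flatten a typed record into one text line: key<TAB>count,mean."""
    # repr() of a float gives text that converts back to the same float.
    return f"{key}\t{count},{mean!r}\n"


def decode_line(line: str):
    """Parse a line produced by encode_line back into typed fields."""
    key, value = line.rstrip("\n").split("\t", 1)
    count_text, mean_text = value.split(",")
    return key, int(count_text), float(mean_text)


def binary_size() -> int:
    """Bytes needed for the same value as a packed 64-bit int plus a double."""
    return struct.calcsize(">qd")


if __name__ == "__main__":
    line = encode_line("user42", 1234567890123, 0.1234567890123456)
    print(decode_line(line))
    # Compare the text value part with a fixed-width binary encoding.
    value_text = line.rstrip("\n").split("\t", 1)[1]
    print(len(value_text), "text bytes vs", binary_size(), "binary bytes")
```

Every record pays for formatting on the way out and parsing on the way in, and large numbers take more room as text than in a fixed-width binary form, which is the overhead the paragraph above refers to. The layout also only works if neither the key nor the fields can contain the chosen separators, so the segmentation has to be designed with the data in mind.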

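Confirmed: Pipes appears to have been removed from the hadoop command. Running ~/hadoop-0.21.0/bin/hadoop no longer lists a pipes entry. Note, however, that the 0.21 release split the launcher into separate hadoop, hdfs, and mapred scripts, so the entry may simply have moved to bin/mapred rather than been dropped.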

Usage reference:
1. http://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
2. http://code.google.com/p/hypertable/wiki/MapReduceWithHypertable
