Original post: http://blog.sina.com.cn/s/blog_6e273ebb0100pid0.html
For a long time, Hadoop has drawn criticism over the performance cost of its Java implementation, and many solutions have emerged to alleviate the problem.
Jeff Hammerbacher (Chief Scientist at Cloudera) wrote the following on Quora:
-------------------------------------------------------------------------------------------------------------------------------------
Doug's newest project, Avro [1], will allow for cross-language serialization and RPC. If you think individual components of Hadoop could be implemented more efficiently in another language, you'll be welcome to try your hand once the migration to Avro for RPC [2] is complete.
In my experience, distributed systems should focus on reliable performance under stress, horizontal scalability, and ease of debugging before optimizing for efficiency. Matt Welsh does a great job of highlighting this issue in his retrospective on SEDA [3]. Sean Quinlan of Google mentions a similar policy at Google, noting that "it's atypical of Google to put a lot of work into tuning any one particular binary." [4] Java has advantages and disadvantages along these dimensions, but I'll leave that for others to discuss.
For HDFS in particular, libhdfs [5] implements a C API to HDFS by communicating with Java over JNI. Using libhdfs and FUSE, one can mount HDFS just like any other file system [6]. Once Avro is in place, the client could be implemented in C and placed in the kernel to make this process even smoother and more efficient. Currently it's not the most pressing issue in Hadoop development.
For Hadoop MapReduce, you can use Hadoop Streaming to write your MapReduce logic in any language, or Hadoop Pipes [7] if you want a C++-specific API. If you can't wait for Avro, there's also the "Hadoop C++ Extension" [8] from Baidu, which implements the task execution environment in Hadoop in C++, and appears to provide moderate performance gains.
[1] http://avro.apache.org
[2] https://issues.apache.org/jira/browse/HADOOP-6659
[3] http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
[4] http://queue.acm.org/detail.cfm?id=1594206
[5] http://hadoop.apache.org/common/docs/current/libhdfs.html
[6] https://wiki.cloudera.com/display/DOC/Mountable+HDFS
[7] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
[8] https://issues.apache.org/jira/browse/MAPREDUCE-1270
-------------------------------------------------------------------------------------------------------------------------------------
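Hadoop Streaming, mentioned in the quote above, passes records to an external program over stdin/stdout as tab-separated key/value lines. A minimal word-count sketch of that contract (the script structure and logic are illustrative, not code from the post):

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper/reducer pair. Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar -mapper "wc.py map" -reducer "wc.py reduce" ...
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input arrives sorted by key, as Hadoop guarantees."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    step = mapper if (len(sys.argv) > 1 and sys.argv[1] == "map") else reducer
    for out in step(sys.stdin):
        print(out)
```

Note that the framework, not this script, performs the sort between the two phases; that is exactly the part Baidu's extension (discussed below) moves into C++.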
In its own use of Hadoop, Baidu likewise ran into the inefficiency of Hadoop's Java implementation and extended Hadoop to address it.
Before that, Baidu had tried Hadoop Pipes and Hadoop Streaming, but found the following problems:
- Neither solution can control the memory used by the child JVMs that run the map and reduce tasks; that memory is managed by the JVM itself, and all you can do is set an upper bound with -Xmx;
- Both solutions only hook into the Mapper and Reducer callbacks, while the sort and shuffle phases, which really dominate performance, are still executed by the Java-implemented TaskTracker;
- Data flow problems. In both solutions, data must flow from the TaskTracker to the Mapper or Reducer and then back again. Whether a pipe or a socket is used for transport, this data movement is hard to avoid, and at large scale its cost cannot be ignored.
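For reference, the one lever mentioned above, the child JVM's heap limit, is set in Hadoop 1.x via mapred-site.xml; the 512m value here is only an example:

```xml
<!-- mapred-site.xml: caps the heap of each child task JVM. -->
<!-- This limits, but does not manage, the tasks' memory use. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```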
The root cause is that the C++ module carries too little of the logic. Baidu therefore proposed a more thorough solution, the "Hadoop C++ Extension", in which C++ code intrudes further into Hadoop: all the data processing originally done in the TaskTracker is handed over to the C++ module, and the Java side is left responsible only for protocol communication and control. This solves the problems above:
- The TaskTracker JVM is responsible only for a small amount of communication, so its memory needs are small and predictable and therefore easy to control; something like -Xmx100m is enough;
- Both the sort and shuffle phases are implemented in the C++ module, improving performance;
- Data stays in the C++ module for its entire lifecycle, avoiding unnecessary data movement.
This amounts to pushing the front line of the C++ module forward. Many people may feel this is only an incremental difference from Pipes or Streaming, but that increment is exactly where the performance bottleneck lies.