What's new in Spark 1.2.0

Source: Internet
Author: User

Spark 1.2.0 was released on December 18, 2014.

Spark 1.0 was released on May 30, 2014, Spark 1.1 followed on September 11, and Spark 1.2 finally arrived on December 18. As the third release of the 1.x era, what are its important updates?

1. Spark Core: Improvements in performance and ease-of-use

Spark Core has two important updates that improve performance and stability for very large shuffles:

I) The communication manager is implemented with Netty

In 1.1 and earlier, shuffle results were returned in one of two ways: smaller results were sent directly through the Akka message-passing mechanism, while larger results went through the BlockManager. Using the BlockManager is a good design: it keeps the driver from consuming too much memory and running out of memory (OOM), and it reduces GC risk. However, the BlockManager's handling is inefficient. It reads the result from disk into a kernel buffer, copies it into a user-space buffer, and then copies it again into the kernel send buffer. Along the way there are multiple memory copies plus the cost of switching between kernel space and user space. This not only consumes JVM memory unnecessarily but also increases GC frequency.
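The copy path described above can be sketched in plain Java. This is only an illustration of the general pattern, not Spark's BlockManager code; the class name, method name, and the 8 KB buffer size are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class BufferedCopySketch {
    // Copies a file to `target` through a buffer on the JVM heap and
    // returns the number of bytes written.
    static long copyThroughUserSpace(Path source, WritableByteChannel target) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(8192); // user-space buffer (hypothetical size)
        long total = 0;
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            // read(): data moves from the kernel buffer into the user-space buffer
            while (in.read(buffer) != -1) {
                buffer.flip();
                // write(): data moves from the user-space buffer back into a kernel buffer
                while (buffer.hasRemaining()) {
                    total += target.write(buffer);
                }
                buffer.clear();
            }
        }
        return total;
    }
}
```

Every chunk crosses the kernel/user boundary twice and occupies heap memory while in flight, which is the overhead the paragraph above describes.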

In fact, FileChannel.transferTo makes zero-copy transfer possible. For details, see http://www.ibm.com/developerworks/library/j-zerocopy/
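A minimal sketch of what a FileChannel.transferTo call looks like, assuming a hypothetical helper named sendFile; it is not taken from Spark or Netty. Whether the transfer is truly zero-copy depends on the operating system and on the kind of target channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySketch {
    // Sends the whole file to `target` with transferTo and returns the bytes sent.
    // No user-space buffer is allocated by this code.
    static long sendFile(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                long n = in.transferTo(position, size - position, target);
                if (n <= 0) {
                    break; // nothing more could be transferred
                }
                position += n;
            }
            return position;
        }
    }
}
```

In a network setting the target would typically be a SocketChannel, so the file contents can go from the page cache to the socket without passing through the JVM heap.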

Netty is one implementation of this approach. In 1.2, the communication manager was rewritten with Netty. In fact, a Netty network module already existed in org.apache.spark.network.netty, but it was not enabled by default because it was incomplete.

Netty is now also the default: spark.shuffle.blockTransferService changed from nio in 1.1 to the new netty in 1.2.

Details about this PR are available at https://issues.apache.org/jira/browse/SPARK-2468

II) The default shuffle mechanism changes from hash-based to sort-based

One long-standing criticism of MapReduce is that it sorts its output whether or not sorting is needed. Before 1.1, Spark used only hash-based shuffle. However, hash-based shuffle can consume a lot of memory. When memory runs short it spills to disk and performs a merge at the end, and for larger datasets the disk I/O reduces performance. It is no exaggeration to say that shuffle performance directly affects the performance of the entire job. Sort-based shuffle was introduced in 1.1. In 1.2 it has matured and become the default option:

spark.shuffle.manager changed from hash to sort.

Tests by the author, Reynold Xin, show that sort beats hash in both speed and memory usage: "sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing."
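A rough illustration of the idea behind sort-based shuffle, not Spark's implementation: a hash-based writer keeps a separate bucket (and buffer) for every reduce partition, whereas a sort-based writer orders a map task's records by partition id so they can be written as one contiguous output with an index of partition offsets. The class name, method names, and the simple hash partitioner below are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortShuffleSketch {
    // Hypothetical partitioner: non-negative hash of the key modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Sorts `records` in place by partition id and returns an offset index:
    // partition p occupies records[offsets[p]] up to records[offsets[p + 1] - 1].
    static int[] sortByPartition(String[] records, int numPartitions) {
        Arrays.sort(records,
                Comparator.comparingInt((String k) -> partitionOf(k, numPartitions)));
        int[] offsets = new int[numPartitions + 1];
        // Count the records in each partition...
        for (String r : records) {
            offsets[partitionOf(r, numPartitions) + 1]++;
        }
        // ...then turn the counts into starting offsets with a prefix sum.
        for (int p = 0; p < numPartitions; p++) {
            offsets[p + 1] += offsets[p];
        }
        return offsets;
    }
}
```

With this layout a reducer only needs the offset index to locate its slice, and the writer does not have to hold one open buffer per reduce partition.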

2. MLlib: Expanded Python API

3. Spark Streaming: Write Ahead Log (WAL)-based HA is implemented to avoid data loss when the driver exits abnormally

4. GraphX: Performance and API improvements (Alpha)

Spark 1.2 is a major release with work from 172 contributors at more than 60 companies, schools, and other research institutions. Judging by the number of contributors, the Spark community remains one of the most active open source communities.

Spark's steady stream of updates shows that rapid iteration is king on the Internet.

Spark still has its share of problems today, but with continuous iteration, the support of major vendors, and the continued dedication of its contributors, I believe the community will keep developing rapidly.

Commercial software may have tackled these issues years ago, and it may be optimally implemented for a given scenario. But the gift of the Internet lies in seeking not the best, only the right fit. For every small and medium-sized Internet company, the scenario keeps changing. What such a company needs is an architecture it can control and can iterate on rapidly as it evolves.

And Spark may be that right architecture for everyone.

PostScript: Although this took only a few hours, I found myself completely worn out. I need to exercise more.

Copyright notice: This article is the blogger's original work and may not be reproduced without consent.
