Spark 1.2 New Features Overview

Source: Internet
Author: User
Keywords: big data, Spark

"Editor's note" in 2014, spark open source ecosystem has been greatly increased, has become one of the most active open source projects in the large data field, now active in Hortonworks, IBM, Cloudera, MAPR and Pivotal and many other well-known large data companies, More than Spark SQL, Spark streaming, Mllib, GRAPHX and other related projects. So much attention, spark will have what kind of change, the following we look at Spark contributor, Spark preacher Chen Summary of the spark 1.2 new features:

The original text follows.

Spark 1.2 was actually released in December, which slightly surprised me; I had assumed the release would slip and not ship until early January. The release drew contributions from 172 developers and more than 1,000 commits, which is a remarkable number. This version brings many new features along with plenty of performance optimizations. Let me highlight a few of the more important ones.

Spark Core:

1. For large amounts of data, the communication manager has finally been replaced with a Netty-based implementation. The old implementation was painfully slow because every transfer had to read data from disk into kernel space, copy it into user space, and then copy it back into kernel space before it reached the network card; the new one uses zero-copy. (Recall that Kafka uses the same technique.)
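If you want to toggle this yourself, the switch is just a configuration key. A minimal sketch, assuming the 1.2-era key (Netty is already the default in 1.2; the key may change in later versions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.2 makes the Netty-based block transfer service the default.
// This key selects between the old NIO path and the new Netty path.
// Zero-copy here means handing file data to the socket inside the kernel
// (the FileChannel.transferTo idea), skipping the user-space round trip.
val conf = new SparkConf()
  .setAppName("netty-transfer-demo")
  .set("spark.shuffle.blockTransferService", "netty") // "nio" falls back

val sc = new SparkContext(conf)
```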

2. The shuffle manager has been switched to a sort-based implementation, which improves performance when the shuffle data is large. Many people think this makes it the same as Hadoop's shuffle; the point was raised on Weibo as well, and just as I was about to reply I found that Liancheng had already answered it. In fact, Spark's sort currently orders records only by partition id; records within each partition are not sorted, although adding intra-partition sorting would be relatively easy. Hadoop, by contrast, sorts every key-value pair within each partition.
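Likewise, the shuffle implementation is selectable via configuration. A minimal sketch, assuming the 1.2-era key (sort is the new default):

```scala
import org.apache.spark.SparkConf

// In Spark 1.2 the default shuffle implementation changed from "hash"
// to "sort". The sort-based shuffle orders records by partition id only,
// not by key within a partition -- the difference from Hadoop noted above.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort") // "hash" restores the old behavior
```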

Spark Streaming:

Spark Streaming finally, "allegedly," supports a fully H/A mode. Previously, a small amount of data could be lost when the driver went down. Now a write-ahead log (WAL) layer has been added; this technique shows up in a lot of places. Remember HBase's write path? Every write to the MemStore is also written to a place called the HLog to prevent data loss. Back to Spark Streaming: every time a receiver receives data, it is also written to HDFS, so even if the driver crashes, it can pick up where it left off after restarting. Of course, the WAL implementation still has to recover the data after the driver restarts and clean out obsolete entries.
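A minimal sketch of turning the WAL on, assuming the 1.2-era configuration key and an HDFS checkpoint directory (the host/port and paths are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  // Ask receivers to journal every received block before acknowledging it
  // (configuration key introduced around Spark 1.2).
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// The WAL lives under the checkpoint directory, so it must sit on
// fault-tolerant storage such as HDFS.
ssc.checkpoint("hdfs:///checkpoints/wal-demo")

// With the WAL on, in-memory replication becomes redundant; a serialized
// storage level without replication is the usual pairing.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)
```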

Of course, I would specifically call attention to unreliable receivers versus reliable receivers: briefly, a reliable receiver acknowledges the source only after the received data has been stored reliably, while an unreliable one offers no such guarantee. Interested readers should dig into the details themselves; a skeleton follows below.
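For reference, a skeleton of a custom receiver built on the Receiver API. What makes it "reliable" is calling store on a whole batch, which blocks until Spark has stored (and, with the WAL enabled, journaled) the data, before acknowledging the source. fetchBatch and ackBatch are hypothetical stand-ins for a real source that supports acknowledgement, such as a message queue:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical reliable receiver: fetchBatch/ackBatch are stand-ins.
class MyReliableReceiver(storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {

  def onStart(): Unit = {
    new Thread("receiver-loop") {
      override def run(): Unit = {
        while (!isStopped) {
          val batch: ArrayBuffer[String] = fetchBatch()
          // store(batch) blocks until Spark has stored the whole batch;
          // only then do we acknowledge the source.
          store(batch)
          ackBatch()
        }
      }
    }.start()
  }

  def onStop(): Unit = {}

  // Hypothetical helpers standing in for a real acknowledging source.
  private def fetchBatch(): ArrayBuffer[String] = ArrayBuffer.empty
  private def ackBatch(): Unit = ()
}
```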

MLlib

The most significant change here should be pipelines, which will certainly interest many friends working on machine learning. MLlib's lead, Xiangrui, has already talked about this in Beijing, so I won't expand on it here. What does need pointing out is that MLlib now uses SchemaRDD to represent datasets; in other words, a channel between Spark SQL and MLlib has been opened up. Over a meal I grabbed him to chat about some Databricks Cloud things and didn't ask about MLlib; I only know that since he went back, the number of PRs has surged.
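To make the pipeline idea concrete, a minimal sketch against the alpha spark.ml API that shipped in 1.2; the column names and the training SchemaRDD are assumptions for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Assume `training` is a SchemaRDD with "label" and "text" columns.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)

// A Pipeline chains the stages; fit() runs them in order and returns a
// single PipelineModel that can then be applied to new data.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
```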

GraphX

The most noteworthy thing in this release is that the API is now stable, which means you no longer have to worry that the API you are writing against will change. And a quick plug: next week's Hangzhou Spark Meetup will feature an excellent GraphX talk.

Spark SQL

I saved Spark SQL for last because it is simply too hot right now: if you have a PR, mention it quickly, respond quickly, and merge quickly, or there is no guarantee you won't hit a conflict shortly. The most important feature of this release is undoubtedly the external data source API. To borrow a phrase from Liancheng's slides: push predicates down to the data source. What does that mean? Say you want to fetch data from HBase and filter it. Normally we pull the data out of HBase and then filter it in the Spark engine; now you can push that filtering step down to the data source, so the data is filtered as it is fetched. There are certainly big changes still to come here.
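A sketch of what a pushdown-capable data source looks like, assuming the 1.2-era org.apache.spark.sql.sources API (whose exact shape changed again in 1.3, where these scan classes became traits); the HBase-side helpers are hypothetical stand-ins, not a real library:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._          // in 1.2, Row/StructType live here
import org.apache.spark.sql.sources._

// Hypothetical relation over an external store (think HBase).
case class DemoRelation(table: String)(@transient val sqlContext: SQLContext)
  extends PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("key", StringType, nullable = false),
    StructField("value", IntegerType, nullable = true)))

  // Spark SQL hands over only the columns it needs plus whatever filters
  // it managed to push down (EqualTo, GreaterThan, ...); translating them
  // into source-side predicates means rows are dropped where the data
  // lives, instead of shipping everything into the engine first.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val predicates = filters.collect {
      case EqualTo(attr, v)     => s"$attr = $v"   // stand-in translation
      case GreaterThan(attr, v) => s"$attr > $v"
    }
    hbaseScan(table, requiredColumns, predicates)  // hypothetical helper
  }

  // Hypothetical: scan the external store, returning only the requested
  // columns with the predicates already applied at the source.
  private def hbaseScan(t: String, cols: Array[String],
                        preds: Seq[String]): RDD[Row] = ???
}
```

Paired with a RelationProvider, a relation like this can be exposed to SQL via CREATE TEMPORARY TABLE ... USING ....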

Another point worth noting: I have reminded people on many occasions that caching a table in Spark SQL must be done with cacheTable("tableName"), otherwise you will not enjoy the benefits of columnar storage. But many friends still cache with the native rdd.cache. The community realized this was a trap, so now cacheTable and a direct cache express the same semantics, and both get the benefits of columnar storage.
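Concretely, the two now-equivalent forms; this sketch assumes an existing SparkContext sc and a SchemaRDD called people:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Assume `people` is an existing SchemaRDD.
people.registerTempTable("people")

// Before 1.2, only this form used the in-memory columnar format:
sqlContext.cacheTable("people")

// As of 1.2, a direct .cache() on the SchemaRDD carries the same
// semantics, so both forms get the columnar storage benefits.
people.cache()
```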

Original link: Spark 1.2 New Features Overview (Editor: Zhonghao)

Welcome to subscribe to the "CSDN Big Data" WeChat account for more news from the big data ecosystem.

Follow @CSDN Cloud Computing on Weibo for the latest cloud computing and big data news.
