Understanding the features of Spark 1.3 and later versions

Source: Internet
Author: User
Tags: shuffle

New features of Spark 1.6.x
Spark 1.6 is the last release before Spark 2.0. It brings three major improvements: better performance, the new Dataset API, and new data science features. It is an important milestone in the community's development.
1. Performance improvements
According to the official 2015 Apache Spark survey, 91% of users want better Spark performance. Highlights in 1.6:
Improved Parquet performance
Automated (unified) memory management
Streaming state management up to 10x faster
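The "automated memory management" item refers to the unified memory manager in 1.6, which lets execution and storage memory borrow from each other instead of living in fixed-size regions. A sketch of the related settings as they might appear in spark-defaults.conf (the values shown are the 1.6 defaults; tune for your workload):

```
# Fraction of (heap - 300MB) shared by execution and storage
spark.memory.fraction          0.75
# Portion of that shared region protected for cached blocks
spark.memory.storageFraction   0.5
# Set to true to fall back to the pre-1.6 static memory behavior
spark.memory.useLegacyMode     false
```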

2. Dataset API
Building on DataFrames, the Spark team introduced the new Dataset API, a typed extension of the DataFrame API.

3. New data science features
Machine learning pipeline persistence
New algorithms and features:
Univariate and bivariate statistics
Survival analysis
Normal equation solver for least squares
Bisecting k-means clustering
Online hypothesis testing
Latent Dirichlet allocation (LDA) in ML pipelines
R-like statistics for GLMs
Feature interactions in R formulas
Instance weights for GLMs
Univariate and bivariate statistics in DataFrames
LIBSVM data source
Reading non-standard JSON data
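The LIBSVM data source reads the widely used sparse text format `label index:value index:value ...` into label/feature columns. As a rough illustration of the format itself (plain Python, not Spark's actual reader; indices are 1-based in the file and returned 0-based here, as Spark does):

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM record: '<label> <idx>:<val> <idx>:<val> ...'."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)  # convert 1-based index to 0-based
    return label, features

# A record with label 1.0 and non-zero features at positions 3 and 7
label, feats = parse_libsvm_line("1.0 3:0.5 7:2.0")
```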

New features of Spark 1.5.x
1. Low-level DataFrame performance optimizations (the first phase of Project Tungsten)
1.1 Spark manages memory itself instead of relying on the JVM, avoiding the performance overhead of JVM GC and reducing the risk of OOM errors.
1.2 Records are stored and computed on directly in an internal binary format rather than as Java objects, eliminating serialization and deserialization overhead and saving memory.
1.3 Improved the UnsafeShuffleManager used in the shuffle stage, adding many new capabilities and optimizing shuffle performance.
1.4 Code generation is now enabled by default, and cache-aware algorithms improve the performance of join, aggregation, shuffle, and sort, as well as window functions; performance is several times better than in 1.4.x.
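The idea in 1.2 above can be illustrated loosely in plain Python: pack fixed-width records back-to-back into one contiguous buffer instead of keeping one object per record, and deserialize a record only when it is needed. This only sketches the concept, not Spark's actual UnsafeRow layout:

```python
import struct

RECORD = struct.Struct("<qd")  # per record: one 64-bit int key, one float64 value

def pack_records(pairs):
    """Store (key, value) records back-to-back in a single bytearray."""
    buf = bytearray(RECORD.size * len(pairs))
    for i, (k, v) in enumerate(pairs):
        RECORD.pack_into(buf, i * RECORD.size, k, v)
    return buf

def read_record(buf, i):
    """Decode only the i-th record, on demand."""
    return RECORD.unpack_from(buf, i * RECORD.size)

# Three records occupy exactly 3 * 16 bytes, with no per-object overhead
buf = pack_records([(1, 0.5), (2, 1.5), (3, 2.5)])
```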

2. DataFrame
2.1 Implements a new aggregation function interface, AggregateFunction2, and provides seven new built-in aggregate functions.
2.2 Implements more than 100 new expression functions, such as unix_timestamp, and improves handling of NaN values.
2.3 Supports connecting to different versions of the Hive metastore.
2.4 Supports Parquet 1.7.
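An aggregation interface like the one in 2.1 follows the usual initialize/update/merge/evaluate contract, so partial aggregates computed per partition can be combined into a final result. A hypothetical pure-Python sketch of that contract (the names are illustrative, not Spark's actual API):

```python
class Average:
    """Toy aggregate following the initialize/update/merge/evaluate contract."""
    def initialize(self):
        return (0.0, 0)                      # buffer: (running sum, count)
    def update(self, buf, value):
        return (buf[0] + value, buf[1] + 1)
    def merge(self, b1, b2):
        # combine partial buffers produced by two partitions
        return (b1[0] + b2[0], b1[1] + b2[1])
    def evaluate(self, buf):
        return buf[0] / buf[1] if buf[1] else None

agg = Average()
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]   # data split across two partitions
partials = []
for part in partitions:
    buf = agg.initialize()
    for v in part:
        buf = agg.update(buf, v)
    partials.append(buf)
result = agg.evaluate(agg.merge(*partials))  # average of all five values
```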

3. Spark Streaming: more complete Python support, a no-longer-experimental Kafka direct API, and more.

New features of Spark 1.4.x
After four RC versions, Spark 1.4 was finally released just before Spark Summit. This article briefly covers the most important new features and improvements in the release.
SparkR will not be covered in detail here; eagerly awaited by data scientists, it has finally arrived. That clearly deserves an article of its own. :)

Spark Core:
What do users care about most? Performance and operations. What affects performance most? Shuffle, of course. And the first priority of operations has to be monitoring (preferably without triggering alerts). Spark 1.4 makes a serious effort on both fronts.

For operations, Spark now provides a REST API through which applications can retrieve all kinds of information (jobs, stages, tasks, storage), so building your own monitoring on top of it takes only minutes. Beyond that, the DAG can now be visualized: if it was never clear how Spark's DAGScheduler works, it is now easy to inspect the details of a DAG.

As for shuffle: since 1.2, sort-based shuffle has been the default strategy. It does not need to keep many files open at the same time and also reduces the number of intermediate files, but it leaves a large number of Java objects on the JVM heap. In 1.4, the output of the shuffle map phase is serialized, which brings two benefits: (1) files spilled to disk become smaller, and (2) GC efficiency improves. Some will object that serialization and deserialization add extra CPU overhead; in practice, shuffle is usually an IO-intensive operation, so the CPU overhead this introduces is acceptable.
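The monitoring REST API lives under `/api/v1` on the driver UI port (4040 by default). A minimal sketch that builds endpoint URLs and picks failed stages out of a sample response; the host, port, application id, and sample payload are placeholders, and only the URL layout follows Spark's documented scheme:

```python
import json

def api_url(app_id, resource, host="localhost", port=4040):
    """Build a Spark monitoring REST endpoint, e.g. .../applications/<id>/stages."""
    return "http://%s:%d/api/v1/applications/%s/%s" % (host, port, app_id, resource)

def failed_stages(stages_json):
    """Given the JSON body of a /stages response, return the failed stage ids."""
    return [s["stageId"] for s in json.loads(stages_json) if s["status"] == "FAILED"]

url = api_url("app-20150601", "stages")
sample = json.dumps([
    {"stageId": 0, "status": "COMPLETE"},
    {"stageId": 1, "status": "FAILED"},
])
bad = failed_stages(sample)
```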
The much-anticipated Project Tungsten also makes its first appearance in 1.4, introducing the new shuffle manager UnsafeShuffleManager, which provides a cache-friendly sorting algorithm among other improvements. The goal is to reduce the memory used during shuffle and to speed up sorting. Project Tungsten will certainly be a focus of the next two releases (1.5 and 1.6).

Spark Streaming:
Streaming gets a new UI in this release, which is simply a gospel for streaming users, rich in detail. At Spark Summit China, TD sat next to me reviewing this part of the code and quietly said, "This is awesome." By the way, this part was mainly done by Shixiong Zhu; even though he stood me up at the summit, he deserves thanks for bringing us such a good feature! In addition, this release supports Kafka 0.8.2.x.

Spark SQL (DataFrame)
Spark SQL adds support for the veteran ORCFile format (its Spark support is younger than Parquet's and still has some rough edges :)). 1.4 also provides window functions similar to Hive's, which are quite practical. Join optimization is notable in this release, especially for large joins, as you can experience for yourself. JDBC server users should be happy too, because there is finally a UI for it.
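Window functions compute a value for every row over a partition of related rows, without collapsing the rows the way GROUP BY does. A plain-Python sketch of what `row_number() OVER (PARTITION BY dept ORDER BY salary DESC)` produces (the sample data is made up):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"dept": "eng", "name": "a", "salary": 90},
    {"dept": "eng", "name": "b", "salary": 120},
    {"dept": "ops", "name": "c", "salary": 80},
]

def row_number(rows, partition_key, order_key):
    """Rank rows within each partition by descending order_key, keeping all rows."""
    out = []
    grouped = groupby(sorted(rows, key=itemgetter(partition_key)),
                      key=itemgetter(partition_key))
    for _, group in grouped:
        ranked = sorted(group, key=itemgetter(order_key), reverse=True)
        for n, row in enumerate(ranked, start=1):
            out.append(dict(row, row_number=n))
    return out

ranked = row_number(rows, "dept", "salary")
```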

Spark ML/MLlib
ML pipelines have graduated from alpha, and enthusiasm for them is genuinely high. Personally, I am interested in personalized PageRank in GraphX, and in recommendAll in the matrix factorization model. In practice, though, most companies still implement their own algorithms on Spark.

New features of Spark 1.3
Spark SQL graduates from alpha
In version 1.3, Spark SQL officially graduates from alpha, providing better SQL standard compatibility. The Spark SQL data source API now interoperates with the new DataFrame component, allowing users to create DataFrames directly from Hive tables, Parquet files, and other data sources, and to mix SQL with DataFrame operations on the same dataset. The new version can read and write tables over JDBC, supporting Postgres, MySQL, and other RDBMS systems more natively. The API also provides write support for generating output tables in JDBC (and other) connected data sources.

Built-in support for Spark Packages
At the end of 2014, we launched Spark Packages, a directory site for community Spark projects. Today, Spark Packages already contains 45 community projects that developers can use, including data source integrations, testing tools, and tutorials. To make life easier for Spark users, in Spark 1.3 a published package can be imported directly into the Spark shell (or into a standalone program, with a flag).
Spark Packages also provides an SBT plugin that simplifies publishing packages for developers and performs automatic compatibility checks on published packages.

Lower-level Kafka support in Spark Streaming
Over the past several releases, Kafka has become a very popular input source for Spark Streaming. Spark 1.3 introduces a new direct Kafka streaming source that leverages Kafka's replay capability to provide more reliable delivery semantics without a write-ahead log. For applications that require strong consistency, it also provides primitives for implementing exactly-once guarantees. Kafka support in 1.3 also gains a Python API.
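The direct approach works on explicit offset ranges: each batch knows exactly which offsets it covers, so a consumer that records processed offsets atomically with its output can skip replayed data after a failure. A toy sketch of that idea in plain Python (no real Kafka involved; the in-memory "log" and "committed" store are stand-ins):

```python
def process_batch(log, offset_range, committed, output):
    """Apply [start, end) from the log to output exactly once.
    'committed' tracks the highest offset already applied."""
    start, end = offset_range
    for offset in range(start, end):
        if offset <= committed["last"]:
            continue                 # replayed after a failure: skip the duplicate
        output.append(log[offset])
        committed["last"] = offset   # in reality, stored atomically with output
    return committed["last"]

log = ["m0", "m1", "m2", "m3"]
committed = {"last": -1}
out = []
process_batch(log, (0, 2), committed, out)
process_batch(log, (1, 4), committed, out)   # overlapping replay of offset 1
```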

New algorithms in MLlib
Spark 1.3 also provides a number of new algorithms. Among them, latent Dirichlet allocation (LDA) becomes the first topic modeling algorithm in MLlib. Spark's logistic regression now supports multiclass classification through multinomial logistic regression. Clustering is improved again with the introduction of Gaussian mixture models and power iteration clustering. Frequent itemset mining (FIM) is supported via FP-growth. Finally, MLlib introduces an efficient block matrix abstraction for distributed linear algebra.
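Power iteration clustering is built on power iteration: repeatedly multiplying a vector by an affinity matrix and normalizing converges to the dominant eigenvector, whose structure reveals the clusters. A minimal sketch of the core iteration on a tiny matrix (plain Python, not MLlib's implementation):

```python
def power_iteration(matrix, steps=50):
    """Approximate the dominant eigenvector of a small square matrix."""
    n = len(matrix)
    v = [1.0 / n] * n                    # start from a uniform vector
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)    # normalize so the iteration stays bounded
        v = [x / norm for x in w]
    return v

# Symmetric 2x2 matrix: eigenvector [1, 1] (eigenvalue 3) dominates [1, -1] (eigenvalue 1)
v = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```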
