Apache Spark 1.6 Announcement
CSDN Big Data | 2016-01-06 17:34

Today we are pleased to announce Apache Spark 1.6. With this release, Spark has reached an important milestone in community growth: the project now has more than 1,000 source-code contributors, up from about 500 at the end of 2014. So what is new in Spark 1.6? The release contains thousands of patches.
In this post we focus on three major development themes: performance improvements, the new Dataset API, and expanded data science functionality.
Performance improvements

According to our 2015 Spark Survey, 91% of users consider performance the most important aspect of Spark, so performance optimization has been a major focus of Spark development.
Parquet performance: Parquet has become one of the most widely used data formats in Spark, and Parquet scan performance has a large impact on many big-data applications. In the past, Spark's Parquet reader relied on parquet-mr to read and decode Parquet files. When profiling Spark applications, we found a lot of time being spent in "record assembly", the step that reassembles Parquet columns into data records. Spark 1.6 introduces a new Parquet reader that bypasses parquet-mr's record assembly and uses a more optimized code path for flat schemas.
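The fast path applies when flat-schema Parquet data is read through the DataFrame API, so no code changes are required. Below is a minimal, hedged sketch; the file path and column names are hypothetical placeholders, not part of the release.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-scan-sketch"))
    val sqlContext = new SQLContext(sc)

    // Reading a flat-schema Parquet file through the DataFrame API; in Spark 1.6
    // such scans are decoded by the new, faster Parquet reader automatically.
    val events = sqlContext.read.parquet("/data/events.parquet")   // hypothetical path
    val okCount = events.filter(events("status") === "ok").count() // hypothetical column
    println(s"ok events: $okCount")

    sc.stop()
  }
}
```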
In our benchmarks with a five-column scan, the new reader increases scan throughput from 2.9 million rows per second to 4.5 million rows per second, an improvement of nearly 50%.

Automatic memory management: Another source of performance gains in Spark 1.6 is better memory management. Before Spark 1.6, Spark statically divided the available memory into two regions: execution memory and cache memory. Execution memory is used for sorting, hashing, and shuffling, while cache memory is used to cache hot data.
Spark 1.6 introduces a new memory manager that automatically adjusts the sizes of these regions, growing or shrinking each one at runtime according to the needs of the running application. For many workloads this means a large increase in the memory available to operations such as joins and aggregations, without any manual tuning. Both improvements described above are transparent to users and require no code changes, while the improvements below are examples of new APIs that deliver better performance.
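As a minimal sketch, assuming the unified memory manager's configuration keys for this release (spark.memory.fraction, spark.memory.storageFraction, and spark.memory.useLegacyMode), tuning or reverting the new behavior would look roughly like this; most applications should not need to set any of them.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The unified memory manager is on by default in Spark 1.6; the keys below only
// illustrate how its behavior could be tuned or reverted if ever required.
val conf = new SparkConf()
  .setAppName("memory-config-sketch")
  // Fraction of the heap shared by execution and storage, which borrow from each other.
  .set("spark.memory.fraction", "0.75")
  // Portion of that region protected for cached blocks against eviction by execution.
  .set("spark.memory.storageFraction", "0.5")
  // Set to "true" to fall back to the pre-1.6 static split.
  .set("spark.memory.useLegacyMode", "false")

val sc = new SparkContext(conf)
```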
10x faster streaming state management: In streaming applications, state management is an important feature, often used to maintain aggregates or session information. Working with many users, we have redesigned the state management API in Spark Streaming and introduced a new mapWithState API. It scales linearly with the number of updates rather than with the total number of records, which means it is far more efficient to track "deltas" than to repeatedly rescan all of the data.
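Below is a minimal word-count sketch of the new API. The socket source, host, port, and checkpoint directory are placeholder assumptions; the key point is that only keys receiving new records in a batch are updated.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("mapWithState-sketch"), Seconds(1))
ssc.checkpoint("/tmp/streaming-checkpoint")  // state tracking requires a checkpoint directory

// Placeholder input source: (word, 1) pairs from a socket stream.
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// Update the running count for a key using only the delta arriving in this batch.
def updateCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val runningCounts = words.mapWithState(StateSpec.function(updateCount _))
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```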
For many workloads, this implementation delivers an order-of-magnitude performance gain. We have created a notebook that illustrates how to use the new feature, and in the near future we will publish a blog post explaining this part in more detail.

Dataset API

Earlier this year we introduced DataFrames, which provide high-level functions that let Spark better understand both the structure of the data and the computation being performed. This extra information lets the Catalyst optimizer and the Tungsten execution engine automatically accelerate real-world big data analyses. Since releasing DataFrames we have received a lot of feedback, and the lack of compile-time type safety has been one of the most important points. To address it, we are introducing a typed extension of the DataFrame API called Datasets. The Dataset API extends the DataFrame API with support for static typing and user functions that run directly on existing Scala and Java types. Compared with the classic RDD API, Datasets offer better memory management and better performance in long-running jobs.
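A minimal sketch of the typed API is shown below; the Person case class, the sample rows, and the JSON path are hypothetical and only illustrate how a DataFrame-style computation gains compile-time type checking.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Long)

val sc = new SparkContext(new SparkConf().setAppName("dataset-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Create a Dataset from local objects; field names and types are checked at compile time.
val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Typed lambdas instead of untyped Column expressions.
val adultNames = people.filter(_.age >= 30).map(_.name)
adultNames.collect().foreach(println)

// An existing DataFrame can be converted to a Dataset with .as[T].
val fromJson = sqlContext.read.json("/data/people.json").as[Person]  // hypothetical path
```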
For more details, please refer to the introductory blog post on Spark Datasets.
New data science features

Machine learning pipeline persistence: Many machine learning applications use Spark's ML pipeline feature to build learning pipelines. Previously, if a program wanted to persist a pipeline to external storage, the user had to write the persistence code themselves. In Spark 1.6, the pipeline API provides functions for saving a pipeline and loading it back in its previous state, so that previously built models can later be applied to new data. For example, a user can train a pipeline in a nightly job and then apply it to production data in a production job (a minimal sketch follows the feature list below).

New algorithms and capabilities: This release also adds a range of machine learning algorithms and capabilities, including:
- univariate and bivariate statistics
- survival analysis
- normal equation solver for least squares
- bisecting K-means clustering
- online hypothesis testing
- latent Dirichlet allocation (LDA) in ML pipelines
- R-like statistics for generalized linear models (GLMs)
- feature interactions in R formulas
- instance weights for GLMs
- univariate and bivariate statistics in DataFrames
- LIBSVM data source
- non-standard JSON data

This post only highlights the main features of the release; we have also compiled a more detailed set of release notes with runnable examples.
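As a hedged illustration of the pipeline persistence described above, the sketch below fits a small text-classification pipeline, saves it, and reloads it. The stages, column names, sample rows, and storage path are assumptions made for the example; persistence only applies to pipelines whose stages support saving and loading.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val sc = new SparkContext(new SparkConf().setAppName("pipeline-persistence-sketch"))
val sqlContext = new SQLContext(sc)

// Tiny illustrative training set; labels and text are placeholders.
val training = sqlContext.createDataFrame(Seq(
  (1.0, "spark makes big data simple"),
  (0.0, "slow nightly batch job")
)).toDF("label", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// Persist the fitted pipeline to external storage (path is a placeholder)...
model.save("/models/text-pipeline")

// ...and load it back later, for example in a separate production scoring job.
val sameModel = PipelineModel.load("/models/text-pipeline")
```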
Over the next few weeks we will publish more detailed blog posts about these new features. Follow the Databricks blog to learn more about the rest of Spark 1.6.
If you want to try out these new features, Databricks lets you use Spark 1.6 while still keeping older Spark versions available. Sign up for a free trial account. Spark could not be as successful as it is today without its 1,000+ source contributors, and we take this opportunity to thank everyone who has contributed to Spark.

Translator: Nuo Yajen; Reviewer: Zhu Zhengju; Editor: Zhong Hao. About the translator: Nuo Yajen is a graduate student specializing in computer information processing, focusing on big data technology and data mining.