For the past few months, we have been busy working on the next major release of the big data open source software we love: Apache Spark 2.0. Since Spark 1.0 came out two years ago, we have heard praises and complaints. Spark 2.0 builds on what we have learned in the past two years, doubling down on what users love and improving on what users lament. While this blog post summarizes the three major thrusts and themes (easier, faster, and smarter) that comprise Spark 2.0, the themes highlighted here deserve deep-dive discussions that we will follow up on with in-depth blog posts in the next few weeks.
Before we dive in, we are happy to announce the availability of the Apache Spark 2.0 technical preview in Databricks Community Edition today. This preview package is built using the upstream branch-2.0. Using the preview package is as simple as selecting the "2.0 (branch preview)" version when launching a cluster:
While the final Apache Spark 2.0 release is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase. This way, you can satisfy your curiosity to try the shiny new toy, while we get feedback and bug reports early before the final release.
Now let's take a look at the new developments.
Easier: SQL and Streamlined APIs
One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with focus on two areas: (1) standard SQL support and (2) unified DataFrame/Dataset APIs.
On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all the TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces to Spark applications, these extended SQL capabilities drastically reduce the effort of porting legacy applications over to Spark.
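To give a flavor of the SQL:2003 constructs this unlocks, here is a scalar subquery of the kind TPC-DS queries rely on. Since the syntax is standard SQL, this sketch runs it against Python's built-in sqlite3 rather than a Spark cluster; the table and column names are made up for illustration.

```python
import sqlite3

# Toy table standing in for a TPC-DS-style schema (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (store_id INTEGER, amount REAL);
    INSERT INTO store_sales VALUES (1, 10.0), (1, 30.0), (2, 5.0), (2, 7.0);
""")

# A scalar subquery in HAVING: stores whose average sale exceeds the
# overall average. This is the style of construct Spark 2.0's new
# ANSI SQL parser accepts.
rows = conn.execute("""
    SELECT store_id, AVG(amount) AS avg_amount
    FROM store_sales
    GROUP BY store_id
    HAVING AVG(amount) > (SELECT AVG(amount) FROM store_sales)
""").fetchall()
print(rows)  # -> [(1, 20.0)]
```

In Spark 2.0 the equivalent query would be submitted through `spark.sql(...)` and planned over distributed data; the SQL text itself is unchanged.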
On the programming API side, we have streamlined the APIs:

- Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming. Since compile-time type-safety is not a language feature in Python and R, the concept of Dataset does not apply to these languages' APIs. Instead, DataFrame remains the primary programming abstraction, which is analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook.
- SparkSession: a new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion in Spark is which "context" to use. Now you can use SparkSession, which subsumes both, as a single entry point, as demonstrated in this notebook. Note that the old SQLContext and HiveContext are still kept for backward compatibility.
- Simpler, more performant accumulator API: We have designed a new accumulator API that has a simpler type hierarchy and supports specialization for primitive types. The old accumulator API has been deprecated but is retained for backward compatibility.
- DataFrame-based machine learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its "pipeline" APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
- Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
- Distributed algorithms in R: Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.

Faster: Spark as a Compiler
According to our Spark Survey, 91% of users consider performance the most important aspect of Spark. As a result, performance optimization has always been a focus in our Spark development. Before we started planning for Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make Spark 10x faster?
This question led us to fundamentally rethink the way we build Spark's physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), a majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading and writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted on this useless work has been a long-time focus of modern compilers.
Spark 2.0 ships with the second-generation Tungsten engine. This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing. The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique "whole-stage code generation."
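To illustrate the idea in miniature (Spark really emits JVM bytecode; this sketch uses Python's compile/exec, and the operator names are ours, not Spark's), compare evaluating a filter-then-sum query operator by operator against generating one fused loop for the whole stage:

```python
# Interpreted style: each operator is an object, and every row pays for
# dynamic dispatch through .process() -- the "useless work" described above.
class Scan:
    def __init__(self, rows):
        self.rows = rows
    def process(self):
        return iter(self.rows)

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def process(self):
        return (row for row in self.child.process() if self.predicate(row))

def interpreted_sum(rows):
    plan = Filter(Scan(rows), lambda r: r % 2 == 0)
    return sum(plan.process())

# Whole-stage code generation: collapse the same query into a single
# generated function with no per-row virtual calls and a plain local
# variable (a stand-in for a CPU register) holding the running sum.
def compile_sum(predicate_src):
    src = f"""
def generated(rows):
    total = 0
    for r in rows:
        if {predicate_src}:   # filter inlined into the loop
            total += r
    return total
"""
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["generated"]

rows = list(range(10))
generated_sum = compile_sum("r % 2 == 0")
print(interpreted_sum(rows), generated_sum(rows))  # -> 20 20
```

Both paths compute the same answer; the generated version simply does less bookkeeping per row, which is where the order-of-magnitude gains in the table below come from.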
To give you a teaser, we have measured the amount of time (in nanoseconds) it takes to process a row on one core for some of the operators in Spark 1.6 vs. Spark 2.0, and the table below is a comparison that demonstrates the power of the new Tungsten engine. Spark 1.6 includes expression code generation, a technique that is also in use in some state-of-the-art commercial databases today. As you can see, many of the core operators become an order of magnitude faster with whole-stage code generation.
You can see the power of whole-stage code generation in action in this notebook, in which we perform aggregations and joins on 1 billion records on a single machine.

Cost per row (single thread):
| Primitive | Spark 1.6 | Spark 2.0 |
| --- | --- | --- |
| Sum w/o group | 14 ns | 0.9 ns |
| Sort (8-bit entropy) | 620 ns | 5.3 ns |
| Sort (64-bit entropy) | 620 ns | 40 ns |
How does the new engine work on end-to-end queries? We did some preliminary analysis using TPC-DS queries to compare Spark 1.6 and Spark 2.0:
Beyond whole-stage code generation to improve performance, a lot of work has also gone into improving the Catalyst optimizer for general query optimizations such as nullability propagation, as well as a new vectorized Parquet decoder that has improved Parquet scan throughput by 3X.

Smarter: Structured Streaming
Spark Streaming has long led the big data space as one of the first attempts at unifying batch and streaming computation. As the first streaming API, called DStream and introduced in Spark 0.7, it offered developers several powerful properties: exactly-once semantics, fault-tolerance at scale, and high throughput.
However, after working with hundreds of real-world deployments of Spark Streaming, we found that applications that need to make decisions in real-time often require more than just a streaming engine. They require deep integration of the batch stack and the streaming stack, integration with external storage systems, as well as the ability to cope with changes in business logic. As a result, enterprises want more than just a streaming engine; instead they need a full stack that enables them to develop end-to-end "continuous applications."
One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.
A number of problems exist with this single model. First, operating on data as it arrives can be very difficult and restrictive. Second, varying data distributions, changing business logic, and delayed data all add unique challenges. And third, most existing systems, such as MySQL or Amazon S3, do not behave like a stream, and many algorithms (including most off-the-shelf machine learning) do not work in a streaming setting.
Spark 2.0's Structured Streaming APIs are a novel way to approach streaming. They stem from the realization that the simplest way to compute answers on streams of data is to not have to reason about the fact that it is a stream. This realization came from our experience with programmers who already know how to program static data sets (aka batch) using Spark's powerful DataFrame/Dataset API. The vision of Structured Streaming is to utilize the Catalyst optimizer to discover when it is possible to transparently turn a static program into an incremental execution that works on dynamic, infinite data (aka a stream). When viewed through this structured lens of data, as a discrete table or an infinite table, you simplify streaming.
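As a toy illustration of that "static program, incremental execution" idea (pure Python, nothing Spark-specific; the function names are ours), the same aggregation logic can be run once over a complete data set or incrementally as micro-batches arrive, and it produces the same answer either way:

```python
from collections import Counter

# The "static program": an aggregation written with no notion of streams.
def count_by_key(rows):
    return Counter(key for key, _value in rows)

# An incremental execution of the same logic: fold each arriving
# micro-batch into running state instead of recomputing from scratch,
# yielding a continuously updated "result table" after every batch.
def run_incrementally(batches):
    state = Counter()
    for batch in batches:
        state.update(count_by_key(batch))
        yield dict(state)

static_input = [("a", 1), ("b", 2), ("a", 3)]
batches = [static_input[:2], static_input[2:]]  # the same rows, arriving over time

batch_answer = dict(count_by_key(static_input))
stream_answers = list(run_incrementally(batches))
print(batch_answer)        # -> {'a': 2, 'b': 1}
print(stream_answers[-1])  # -> {'a': 2, 'b': 1}
```

In Structured Streaming the programmer only writes the first function, in DataFrame/Dataset terms; deriving the incremental version is the optimizer's job, not the user's.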
As the first step towards realizing this vision, Spark 2.0 ships with an initial version of the Structured Streaming API, a (surprisingly small!) extension to the DataFrame/Dataset API. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of Spark's batch API to answer new questions in real-time. Key features here will include support for event-time based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
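To make "event-time processing of out-of-order data" concrete, here is a small pure-Python sketch (not Spark's API; the record layout and function name are invented for illustration) that buckets records by the timestamp they carry rather than the order in which they arrive:

```python
from collections import defaultdict

# Each record carries its own event time; records may arrive out of order.
# Format: (event_time_seconds, value). Note that time 7 arrives after 62.
arrivals = [(5, "a"), (62, "b"), (7, "c"), (61, "d")]

def window_counts(records, window_seconds=60):
    """Count records per event-time window, regardless of arrival order."""
    counts = defaultdict(int)
    for event_time, _value in records:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

print(window_counts(arrivals))  # -> {0: 2, 60: 2}
```

Because grouping is keyed on event time, the late-arriving record for time 7 still lands in the first window; in Structured Streaming this same grouping is expressed as a windowed aggregation over a streaming DataFrame.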
Streaming is clearly a broad topic, so stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap.

Conclusion
Spark users initially came to Spark for its ease-of-use and performance. Spark 2.0 doubles down on these while extending Spark to support an even wider range of workloads. We hope you will enjoy the work we have put into it, and we look forward to your feedback.
Of course, until the upstream Apache Spark 2.0 release is finalized, we do not recommend fully migrating any production workload onto this preview package. This new package should be available on Databricks Community Edition today, and we will be rolling it out to all Databricks customers over the next few days. To get access to Databricks Community Edition, join the waitlist.

Read More
If you missed our webinar for Spark 2.0: Easier, Faster, and Smarter, you can register to watch the recordings and download the slides and attached notebooks.
You can also import the following notebooks and try them out on your Databricks Community Edition with the Spark 2.0 Technical Preview:

- SparkSession: a new entry point
- Datasets: a more streamlined API
- Performance of whole-stage code generation
- Machine learning pipeline persistence