Apache Spark 2.3: Introduction to Important Features

To continue making Spark faster, easier, and smarter, Spark 2.3 delivers important updates across many modules: Structured Streaming introduces low-latency continuous processing and supports stream-to-stream joins; PySpark performance is boosted through Pandas UDFs; and Kubernetes clusters are supported as a fourth scheduler backend (the other three are Standalone, YARN, and Mesos). Beyond these important milestones, Spark 2.3 has several other important updates:
Introduction of the DataSource V2 API [SPARK-15689, SPARK-20928]
Vectorized ORC reader [SPARK-16060]
Spark History Server v2 with a key-value store [SPARK-18085]
Machine learning Pipeline API support for Structured Streaming [SPARK-13030, SPARK-22346, SPARK-23037]
MLlib enhancements [SPARK-21866, SPARK-3181, SPARK-21087, SPARK-20199]
Spark SQL enhancements [SPARK-21485, SPARK-21975, SPARK-20331, SPARK-22510, SPARK-20236]

This article briefly describes some of these advanced features and improvements; see the Spark 2.3 release notes (https://spark.apache.org/releases/spark-release-2-3-0.html) for the full list of features.

Continuous stream processing with millisecond latency

Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for two reasons. First, it made the APIs easier to learn: developers no longer have to reason about micro-batches. Second, it lets developers treat a stream as an infinite table and query streaming data as easily as they would query a static table.
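To make this concrete, here is a minimal runnable sketch (not from the original article) that treats a stream as an unbounded table; it uses the built-in rate source, which simply generates (timestamp, value) rows, so the query itself is purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-as-table").getOrCreate()

# The rate source emits (timestamp, value) rows; treat it as an unbounded table.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Query the stream exactly as you would a static DataFrame.
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

# Aggregations over a stream require complete (or update) output mode.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```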


However, to provide developers with a different stream processing mode, the community introduced a new millisecond-level low-latency mode: continuous mode.

Internally, the Structured Streaming engine computes queries incrementally in micro-batches, with an execution cycle determined by the trigger interval. This latency is tolerable for most real-world streaming applications.

In continuous mode, instead of reading a batch of data at each trigger interval, the stream reader continuously pulls source data and processes it. Because the source is queried and data processed continuously, new records are handled as soon as they arrive, shortening latency to milliseconds and satisfying the needs of low-latency applications.

Continuous mode currently supports map-like Dataset operations, including projections, selections, and most SQL functions, but it does not support current_timestamp(), current_date(), or aggregate functions. It supports Kafka as both a data source and a sink (data storage destination), as well as the console and memory sinks.

Developers can now choose continuous or micro-batch mode according to their latency requirements to build large-scale real-time streaming applications, while still enjoying the fault-tolerance and reliability guarantees that Structured Streaming provides. A sketch of enabling continuous mode follows the list below.

In a nutshell, continuous mode in Spark 2.3 is experimental and provides the following:

End-to-end millisecond-level latency
At-least-once semantic guarantees
Support for map-like Dataset operations
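As a rough illustration of choosing between the two modes, here is a minimal PySpark sketch; the Kafka broker address, topic names, and checkpoint path are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-mode-demo").getOrCreate()

# Kafka is a supported source and sink for continuous mode in Spark 2.3.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "events-in")                  # hypothetical topic
          .load())

# Only map-like operations (projections, selections, most SQL functions) are allowed.
parsed = events.selectExpr("CAST(key AS STRING) AS key",
                           "CAST(value AS STRING) AS value")

query = (parsed.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "events-out")
         .option("checkpointLocation", "/tmp/continuous-ckpt")  # hypothetical path
         # trigger(continuous=...) switches from micro-batching to continuous
         # execution; the interval is the checkpoint interval, not a batch size.
         .trigger(continuous="1 second")
         .start())
```

Dropping the trigger(continuous=...) line, or using trigger(processingTime="10 seconds") instead, runs the same query in the default micro-batch mode.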

Stream-to-stream joins

Structured Streaming in Spark 2.0 supported joins between a streaming DataFrame/Dataset and a static Dataset; Spark 2.3 brings the long-awaited join between two streams. It supports inner and outer joins and can be used in a large number of real-time scenarios.

Ad monetization is a typical use case for stream-to-stream joins. For example, an ad impression stream and an ad click stream share a common key (such as adId) along with related data you want to analyze together; based on that data, you can work out which ads are more likely to be clicked.

The use case sounds simple, but implementing stream-to-stream joins involves several technical challenges (a sketch of the ad example follows the list):

Late data must be buffered until a matching event is found in the other stream;
Buffer usage is bounded through the watermark mechanism;
Users can trade off resource usage against latency;
SQL join semantics remain consistent between static and streaming joins.
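Here is a minimal sketch of the ad example; the built-in rate source stands in for real impression and click feeds, and the column names follow the canonical Structured Streaming join example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

# Rate sources stand in for the real impression and click feeds.
impressions = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
               .selectExpr("value AS impressionAdId", "timestamp AS impressionTime"))
clicks = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .selectExpr("value AS clickAdId", "timestamp AS clickTime"))

# Watermarks bound how long each side's events are buffered in state.
impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

# Inner join on the ad id plus a time-range condition: a click must occur
# within one hour of the matching impression.
joined = impressionsWithWatermark.join(
    clicksWithWatermark,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """))

query = joined.writeStream.format("console").start()
```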

Apache Spark and Kubernetes

Apache Spark and Kubernetes combine their strengths to provide large-scale distributed data processing. In Spark 2.3, users can launch Spark jobs on a Kubernetes cluster with the new Kubernetes scheduler backend, which allows Spark jobs to share cluster resources with other workloads running on Kubernetes.

In addition, Spark can use Kubernetes administrative features such as resource quotas, pluggable authorization, and logging.
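For illustration, here is the spark-submit form documented for Spark 2.3 on Kubernetes; the API-server address, container image, and example jar are placeholders to adapt to your cluster:

```
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```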

Support for PySpark Pandas UDFs

Pandas UDFs, also known as vectorized UDFs, are a major driver of improved PySpark performance. Built on top of Apache Arrow, they offer the best of both worlds: low-overhead, high-performance UDFs written entirely in Python.

Spark 2.3 offers two types of Pandas UDFs: scalar and grouped map. A brief sketch of each is shown below.
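A minimal sketch of both types on a toy DataFrame (PyArrow must be installed; the column names and data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# Scalar Pandas UDF: receives a pandas.Series, returns a Series of equal length.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df.withColumn("v_plus_one", plus_one(df.v)).show()

# Grouped map Pandas UDF: each group arrives as a pandas.DataFrame and the
# returned DataFrame must match the declared schema.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```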

In benchmark runs, Pandas UDFs deliver much better performance than row-at-a-time UDFs.

MLlib improvements

Spark 2.3 includes many MLlib improvements covering algorithms, features, performance, scalability, and usability. Only three of them are described here.

First, to help move MLlib models and pipelines to production, fitted models and Pipelines can now be used within Structured Streaming jobs. Some existing pipelines may need to be modified before they can make predictions in a streaming job.
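A minimal sketch of scoring a stream with a previously fitted pipeline; the model path, input directory, and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("streaming-scoring").getOrCreate()

# A pipeline fitted and saved earlier, e.g. pipeline.fit(train).save(...).
model = PipelineModel.load("/models/fitted_pipeline")  # hypothetical path

# File-based streams require an explicit schema.
schema = StructType([StructField("feature1", DoubleType()),
                     StructField("feature2", DoubleType())])
incoming = spark.readStream.schema(schema).json("/data/incoming")  # hypothetical dir

# transform() applies the fitted pipeline to the stream just as to a static DataFrame.
scored = model.transform(incoming)
query = scored.writeStream.format("console").start()
```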

Second, to enable many deep learning image analysis use cases, Spark 2.3 introduces ImageSchema [SPARK-21866], a utility for representing images in a Spark DataFrame and loading images in common formats.
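A minimal sketch of loading images with it (the directory path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema

spark = SparkSession.builder.appName("image-demo").getOrCreate()

# Recursively load images under a directory into a DataFrame with one "image"
# struct column (origin, height, width, nChannels, mode, data).
image_df = ImageSchema.readImages("/data/images", recursive=True)  # hypothetical path
image_df.printSchema()
image_df.select("image.origin", "image.height", "image.width").show(truncate=False)
```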

Finally, for developers, Spark 2.3 introduces an improved Python API for writing custom algorithms.
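As one example, the UnaryTransformer base class plus the new DefaultParamsReadable/DefaultParamsWritable mixins make a persistable custom stage only a few lines of code; the PlusOne class below is made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml import UnaryTransformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class PlusOne(UnaryTransformer, DefaultParamsReadable, DefaultParamsWritable):
    """Toy custom transformer: adds 1.0 to a double input column."""

    def createTransformFunc(self):
        return lambda x: x + 1.0

    def outputDataType(self):
        return DoubleType()

    def validateInputType(self, inputType):
        if inputType != DoubleType():
            raise TypeError("expected DoubleType, got %s" % inputType)

spark = SparkSession.builder.appName("custom-algo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,)], ["v"])

stage = PlusOne().setInputCol("v").setOutputCol("v_plus_one")
stage.transform(df).show()

# The mixins let the stage be saved and loaded like any built-in ML stage.
stage.save("/tmp/plus_one_stage")  # hypothetical path
```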
