This release is an important milestone for Structured Streaming: the experimental tag has been removed and it can now formally be used in production. The release adds support for arbitrary stateful operations in streams, and for reading from and writing to Apache Kafka 0.10 with both the streaming and batch APIs. Beyond new features in SparkR, MLlib, and GraphX, this release also focuses on usability, stability, and polish, resolving more than 1,100 tickets.
This article describes these new features in detail, including:
- Production-ready Structured Streaming;
- Expanded SQL functionality;
- New distributed machine learning algorithms in R;
- New algorithms in MLlib and GraphX.
Structured Streaming
Structured Streaming, introduced in Spark 2.0, is a high-level API for building streaming applications. It is designed to make it easy to build end-to-end streaming applications while providing consistency guarantees and fault tolerance.
As of Spark 2.2.0, Structured Streaming is ready for production use. In addition to removing the experimental tag, this release includes a number of high-level changes:
- Kafka source and sink: support for reading from and writing to Apache Kafka 0.10 with both the streaming and batch APIs;
- Kafka improvements: cached Kafka producers for lower-latency Kafka-to-Kafka streaming queries;
- Additional stateful APIs: the [flat]MapGroupsWithState operations support complex stateful processing and timeout handling;
- Run-once triggers: process all available data in a single batch and then stop; see "Running Streaming Jobs Once a Day For 10x Cost Savings" for details. A Kafka source combined with a run-once trigger is sketched below.
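As a rough illustration of the Kafka source and the run-once trigger, the following Scala sketch reads a Kafka 0.10 topic and writes it out as Parquet in a single batch. It assumes the spark-sql-kafka-0-10 package is on the classpath; the broker address, topic name, and output paths are placeholders, not values from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("kafka-run-once-example")
  .getOrCreate()

// Read a Kafka 0.10 topic as a streaming source.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Trigger.Once() processes everything available at start-up in a single
// batch and then stops -- the "run-once trigger" mentioned above.
val query = events.writeStream
  .format("parquet")
  .option("path", "/tmp/events")                        // placeholder output path
  .option("checkpointLocation", "/tmp/events-checkpoint")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()
```

The same `format("kafka")` options can also be used with `spark.read` for batch queries, which is the batch API support mentioned in the list.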
SQL and Core APIs
Since Spark 2.0, Spark has become one of the most feature-rich and standards-compliant SQL query engines in the big data world. It can connect to a variety of data sources and run SQL:2003 queries, including analytic functions and subqueries, over them. Spark 2.2 also adds a number of new SQL features, including:
- API updates: unified CREATE TABLE syntax for data source and Hive SerDe tables; SQL queries support broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN (see the sketch after this list);
- Overall performance and stability:
  - cost-based optimizer cardinality estimation for filter, join, aggregate, project, and limit/sample operations;
  - TPC-DS performance improvements using star-schema heuristics;
  - file listing/IO performance improvements for CSV and JSON;
  - partial aggregation support in HiveUDAFFunction;
  - introduction of a JVM-object-based aggregate operator;
- Other notable changes:
  - support for parsing multi-line JSON and CSV files (see the sketch after this list);
  - the ANALYZE TABLE command on partitioned tables.
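The following sketch illustrates two of the items above: broadcast join hints (in SQL and through the DataFrame API) and multi-line JSON parsing. The table names (facts, dim), column names, and the JSON path are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("sql-features-example").getOrCreate()

// Broadcast hint in SQL: ask the planner to broadcast the (small) dim table.
val hinted = spark.sql("""
  SELECT /*+ BROADCAST(dim) */ f.id, dim.name
  FROM facts f
  JOIN dim ON f.dim_id = dim.id
""")

// The same hint expressed through the DataFrame API.
val facts  = spark.table("facts")
val dim    = spark.table("dim")
val joined = facts.join(broadcast(dim), facts("dim_id") === dim("id"))

// Multi-line JSON: a single record may span several lines of the input file.
val people = spark.read
  .option("multiLine", true)
  .json("/path/to/people.json")
```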
MLlib and SparkR
The final set of major changes in Spark 2.2.0 focuses on advanced analytics. MLlib and GraphX add the following new algorithms:
- Locality-sensitive hashing (LSH)
- Multiclass logistic regression
- Personalized PageRank
Spark 2.2.0 also adds the following distributed algorithms and API improvements to SparkR:
- Alternating least squares (ALS)
- Isotonic regression
- Multilayer perceptron classifier
- Random forest
- Gaussian mixture model
- Latent Dirichlet allocation (LDA)
- Multiclass logistic regression
- Gradient-boosted trees
- Structured Streaming API support for R
- The column functions to_json and from_json for R
- Multi-column approxQuantile (see the sketch after this list)
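The multi-column approxQuantile added to SparkR mirrors the multi-column overload in the DataFrame API. Below is a minimal Scala sketch of that overload; the column names and data are illustrative, and the R function takes an analogous vector of column names.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("approx-quantile-example").getOrCreate()
import spark.implicits._

// A small example frame with two numeric columns (illustrative data).
val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)).toDF("a", "b")

// One call computes approximate quantiles for several columns at once;
// the result is one Array[Double] of quantiles per requested column.
val quantiles: Array[Array[Double]] =
  df.stat.approxQuantile(Array("a", "b"), Array(0.25, 0.5, 0.75), 0.0)

quantiles.zip(Array("a", "b")).foreach { case (qs, col) =>
  println(s"$col -> ${qs.mkString(", ")}")
}
```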
With these additions, SparkR has become the most comprehensive distributed machine learning library available for R.
This article is reproduced from https://www.iteblog.com/archives/2194.html
English original: https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html