Introduction to the New Features in Apache Spark 2.2.0 (reprint)

This release is an important milestone for Structured Streaming: the experimental tag has been removed, and it is now considered ready for production use. The release also supports arbitrary stateful operations in a stream, and reading from and writing to Apache Kafka 0.10 through both the streaming and batch APIs. In addition to new functionality in SparkR, MLlib, and GraphX, the release focuses on usability, stability, and polish, and resolves more than 1100 tickets.

This article describes these new features in detail, including:

    • Structured Streaming is ready for production use;
    • Extended SQL functionality;
    • New distributed machine learning algorithms in R;
    • New algorithms in MLlib and GraphX.
Structured Streaming

Structured Streaming, introduced in Spark 2.0, is a high-level API for building end-to-end streaming applications in a simple way, with consistency guarantees and fault tolerance.

Starting with Spark 2.2.0, Structured Streaming is ready for production use. Besides the removal of the experimental tag, the release includes a number of high-level changes, such as:

    • Kafka Source and Sink: Apache Kafka 0.10 is supported for reads and writes through both the streaming and batch APIs;
    • Kafka Improvements: a cached producer gives lower latency for Kafka-to-Kafka streams;
    • Additional Stateful APIs: the [flat]MapGroupsWithState operations support complex stateful processing and timeouts;
    • Run Once Triggers: for details, see "Running Streaming Jobs Once a Day for 10x Cost Savings".
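
The idea behind the additional stateful APIs, per-key state that is updated batch by batch and dropped after a period of inactivity, can be illustrated with a small plain-Python sketch. This is not the Spark API: the function `update_sessions`, the `(user_id, event_time)` event shape, and the 30-second timeout are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical event shape for illustration: (user_id, event_time_in_seconds)
Event = Tuple[str, int]


def update_sessions(
    state: Dict[str, dict],
    batch: List[Event],
    now: int,
    timeout: int = 30,
) -> List[Tuple[str, int]]:
    """Fold one micro-batch into per-key state and expire idle keys.

    Returns (user_id, event_count) for every session that timed out.
    """
    # 1. Update the state of each key that received events in this batch.
    for user, ts in batch:
        session = state.setdefault(user, {"count": 0, "last_seen": ts})
        session["count"] += 1
        session["last_seen"] = max(session["last_seen"], ts)

    # 2. Time-out handling: emit and drop keys that have been idle too long.
    expired = []
    for user in sorted(state):
        if now - state[user]["last_seen"] >= timeout:
            expired.append((user, state.pop(user)["count"]))
    return expired


state: Dict[str, dict] = {}
first = update_sessions(state, [("alice", 0), ("bob", 5), ("alice", 10)], now=10)
second = update_sessions(state, [("bob", 40)], now=45)
print(first)   # []
print(second)  # [('alice', 2)]
```

In this sketch the state dictionary survives between calls, which is roughly the role the streaming engine plays when it keeps state across micro-batches; in a real job the engine also checkpoints that state for fault tolerance.
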
SQL and Core APIs

Since the release of Spark 2.0, Spark has been one of the most feature-rich and standards-compliant SQL query engines in the big data space. It can connect to a variety of data sources and run SQL:2003 features such as analytic functions and subqueries over them. Spark 2.2 adds a number of new SQL features, including:

    • API Updates: unifies the CREATE TABLE syntax for data source and Hive SerDe tables, and adds broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries;
    • Overall Performance and Stability:
      • Cost-based optimizer cardinality estimation for the filter, join, aggregate, project, and limit/sample operators;
      • Star-schema heuristics to improve TPC-DS performance;
      • Improved file listing and I/O performance for CSV and JSON;
      • Partial aggregation support for HiveUDAFFunction;
      • A new aggregate operator based on JVM objects.
    • Other notable changes:
      • Support for parsing multi-line JSON and CSV files;
      • ANALYZE TABLE command support for partitioned tables.
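
To see why multi-line support matters: a line-delimited reader assumes one complete JSON record per line, so a pretty-printed record that spans several lines breaks it. The plain-Python sketch below contrasts the two parsing modes. It uses only the standard `json` module rather than Spark's reader, and the sample data and helper names are made up for illustration.

```python
import json

# One record per line (JSON Lines): safe to parse line by line.
json_lines = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# A single pretty-printed document spanning several lines.
multi_line = '[\n  {"id": 1,\n   "name": "a"},\n  {"id": 2,\n   "name": "b"}\n]\n'


def parse_per_line(text):
    """Parse each non-empty line as an independent JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_whole(text):
    """Parse the entire text as one JSON document."""
    return json.loads(text)


records = parse_per_line(json_lines)

try:
    parse_per_line(multi_line)
    per_line_ok = True
except json.JSONDecodeError:
    # The first line is just "[", which is not a complete JSON value.
    per_line_ok = False

whole = parse_whole(multi_line)
print(records == whole, per_line_ok)  # True False
```

In Spark 2.2 the JSON and CSV readers expose the whole-document behaviour through the `multiLine` reader option, which is off by default.
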
MLlib and SparkR

The last major set of changes in Spark 2.2.0 focuses on advanced analytics. MLlib and GraphX add the following new algorithms:

    • Locality Sensitive Hashing
    • Multiclass Logistic Regression
    • Personalized PageRank
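
As an illustration of the last item, personalized PageRank differs from ordinary PageRank in that every random restart jumps back to one chosen source node instead of to a uniformly random node. The sketch below is a plain-Python power iteration, not the GraphX implementation; the toy graph, the damping factor of 0.85, the fixed iteration count, and the handling of dangling nodes are assumptions chosen for illustration.

```python
from typing import Dict, List


def personalized_pagerank(
    graph: Dict[str, List[str]],
    source: str,
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration where every random restart jumps back to `source`."""
    ranks = {node: 0.0 for node in graph}
    ranks[source] = 1.0

    for _ in range(iterations):
        new_ranks = {node: 0.0 for node in graph}
        for node, neighbours in graph.items():
            if neighbours:
                # Spread the damped rank evenly over the outgoing edges.
                share = damping * ranks[node] / len(neighbours)
                for dst in neighbours:
                    new_ranks[dst] += share
            else:
                # Dangling node: return its mass to the source.
                new_ranks[source] += damping * ranks[node]
        # Restart mass goes only to the source, not to every node uniformly.
        new_ranks[source] += 1.0 - damping
        ranks = new_ranks
    return ranks


# Toy directed graph (made up for illustration).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}
scores = personalized_pagerank(graph, source="a")
print(max(scores, key=scores.get))  # a
```

Node `d` ends up with a score of zero here because nothing reachable from the source links to it, which is the kind of source-relative ranking the personalized variant is meant to produce.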

Spark 2.2.0 also adds support for the following distributed algorithms and features in SparkR:

  • Alternating Least Squares (ALS)
  • Isotonic Regression
  • Multilayer Perceptron Classifier
  • Random Forest
  • Gaussian Mixture Model
  • Latent Dirichlet Allocation (LDA)
  • Multiclass Logistic Regression
  • Gradient Boosted Trees
  • Structured Streaming API for R
  • The to_json and from_json column functions in R
  • Multi-column approxQuantile in R

With these additions, SparkR has become the most comprehensive library for distributed machine learning in R.

This article is reproduced from https://www.iteblog.com/archives/2194.html

Original English article: https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
