Spark MLlib Learning Guide

Source: Internet
Author: User
Tags: deprecated

Translated from http://spark.apache.org/docs/latest/ml-guide.html: Machine Learning Library (MLlib) Guide

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It provides the following tools:

  • ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML pipelines
  • Persistence: saving and loading algorithms, models, and pipelines
  • Utilities: linear algebra, statistics, and data handling

Announcement: the DataFrame-based API is the primary API

The RDD-based API is now in maintenance mode.
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

What does this mean?

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
  • MLlib will not add new features to the RDD-based API.
  • In the Spark 2.x releases, MLlib will add features to the DataFrame-based API until it reaches feature parity with the RDD-based API.
  • After feature parity is reached (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
  • The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching from the RDD-based API to the DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. Their benefits include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and a uniform API across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames also facilitate practical ML pipelines, particularly feature transformations.

What is "Spark ML"?

"Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API. This is mainly because the DataFrame-based API lives in the org.apache.spark.ml package, and because the term "Spark ML Pipelines" was used initially to emphasize the pipeline concept.

Is MLlib deprecated?

No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode, but neither API is deprecated, nor is MLlib as a whole.

Dependencies

MLlib uses the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If the native libraries are not available at runtime, a warning is shown and a pure JVM implementation is used instead.
Because of licensing issues, the netlib-java native libraries are not included by default. They must be installed separately or added as a dependency of your own project.

To use MLlib from Python, you need NumPy version 1.4 or newer.

Migration Guide

MLlib is under active development. APIs marked Experimental or DeveloperApi may change in future releases. The changes from 2.0 to 2.1 are described below.

From 2.0 to 2.1

Breaking changes

Deprecated methods removed:

  • setLabelCol in feature.ChiSqSelectorModel
  • numTrees in classification.RandomForestClassificationModel
  • numTrees in regression.RandomForestRegressionModel
  • model in regression.LinearRegressionSummary
  • validateParams in PipelineStage
  • validateParams in Evaluator

Deprecations and changes of behavior

Deprecations

  • Deprecate all Param setter methods except for input/output column Params for DecisionTreeClassificationModel, GBTClassificationModel, RandomForestClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel and RandomForestRegressionModel.

Changes of behavior

  • SPARK-17870: Fix a bug in ChiSqSelector that will likely change its results. ChiSqSelector now uses the pValue rather than the raw statistic to select a fixed number of top features.
  • SPARK-3261: KMeans returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
  • SPARK-17389: KMeans reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
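
To see why the SPARK-17870 change can alter which features are selected, consider two features whose chi-square tests have different degrees of freedom. The sketch below is plain Python and does not call Spark. The feature names, statistics, and helper functions are hypothetical, and the p-value helper only covers 1 or 2 degrees of freedom, where closed forms exist. It shows that ranking by raw statistic and ranking by p-value can disagree.

```python
import math


def chi2_p_value(statistic, dof):
    """Upper-tail p-value of a chi-square statistic (this sketch: dof 1 or 2 only)."""
    if dof == 1:
        # A chi-square variable with 1 dof is the square of a standard normal.
        return math.erfc(math.sqrt(statistic / 2.0))
    if dof == 2:
        # A chi-square variable with 2 dof is exponential with mean 2.
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# Hypothetical test results: feature name -> (chi-square statistic, degrees of freedom)
features = {"A": (5.0, 1), "B": (6.0, 2)}


def top_by_statistic(features, n):
    """Pick the n features with the largest raw statistic."""
    return sorted(features, key=lambda name: features[name][0], reverse=True)[:n]


def top_by_p_value(features, n):
    """Pick the n features with the smallest p-value."""
    return sorted(features, key=lambda name: chi2_p_value(*features[name]))[:n]


print(top_by_statistic(features, 1))  # B has the larger raw statistic
print(top_by_p_value(features, 1))    # A has the smaller p-value
```

Feature B has the larger statistic, but because it has more degrees of freedom its p-value (about 0.05) is larger than that of feature A (about 0.025), so selecting a fixed number of top features by p-value picks A first. Code that depended on the earlier ranking may therefore see different selected features after upgrading.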
