Spark MLlib Learning Guide

Last Update:2018-07-23 Source: Internet

Author: User



Translation of http://spark.apache.org/docs/latest/ml-guide.html: Machine Learning Library (MLlib) Guide

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It provides the following features:

- ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML pipelines
- Persistence: saving and loading algorithms, models, and pipelines
- Utilities: linear algebra, statistics, and data handling

Announcement: the DataFrame-based API is the primary API

The RDD-based API is now in maintenance mode.
As of Spark 2.0, the RDD-based API in the spark.mllib package has entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

What this means:

- MLlib still supports the RDD-based API in the spark.mllib package and continues to fix bugs in it.
- MLlib will not add new features to the RDD-based API.
- In the Spark 2.x releases, MLlib will add features to the DataFrame-based API until it reaches feature parity with the RDD-based API.
- Once feature parity is reached (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
- The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching from the RDD-based API to the DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. Their benefits include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and a uniform API across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames also make it easier to build practical ML pipelines, particularly for feature transformations.

What is "Spark ML"?

"Spark ML" is not an official name, but it is occasionally used to refer to the DataFrame-based API. This is mainly because the package used by the DataFrame-based API is named org.apache.spark.ml, and because the term "Spark ML Pipelines" was used initially to emphasize the pipeline concept.

Is MLlib deprecated?

No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode, but neither API is deprecated, nor is MLlib as a whole.

Dependencies

MLlib uses the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If the native libraries are not available at runtime, a warning is logged and a pure JVM implementation is used instead.
For licensing reasons, the netlib-java native libraries are not bundled by default; they must be installed separately or added as a dependency of your own project.

To use MLlib from Python, you need NumPy version 1.4 or newer.

Migration Guide

MLlib is still under active development. APIs marked Experimental or DeveloperApi may change in future releases. The following describes the changes from 2.0 to 2.1.

From 2.0 to 2.1

Breaking changes

Deprecated methods removed:

- setLabelCol in feature.ChiSqSelectorModel
- numTrees in classification.RandomForestClassificationModel
- numTrees in regression.RandomForestRegressionModel
- model in regression.LinearRegressionSummary
- validateParams in PipelineStage
- validateParams in Evaluator

Deprecations and changes of behavior

Deprecations

- Deprecate all Param setter methods except for input/output column Params for DecisionTreeClassificationModel, GBTClassificationModel, RandomForestClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel and RandomForestRegressionModel.

Changes of behavior

- SPARK-17870: Fix a bug in ChiSqSelector, which will likely change its results. ChiSqSelector now uses the p-value rather than the raw statistic to select a fixed number of top features.
- SPARK-3261: KMeans may return fewer than k cluster centers when k distinct centroids aren't available or aren't selected.
- SPARK-17389: KMeans reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
