Translated from http://spark.apache.org/docs/latest/ml-guide.html
Machine Learning Library (MLlib) Guide
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It provides the following tools:
ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML pipelines
Persistence: saving and loading algorithms, models, and pipelines
Utilities: linear algebra, statistics, data handling
Announcement: the DataFrame-based API is the primary API
The RDD-based MLlib API is now in maintenance mode.
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.
What does this mean?
MLlib will still support the RDD-based API in spark.mllib with bug fixes.
MLlib will not add new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
Once feature parity is reached (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
Why is MLlib switching to the DataFrame-based API?
DataFrames provide a more user-friendly API than RDDs. Their benefits include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
DataFrames also facilitate practical ML pipelines, particularly feature transformations.
What is "Spark ML"?
"Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API. This is mainly because the DataFrame-based API uses the org.apache.spark.ml Scala package name, and because the term "Spark ML Pipelines" was used initially to emphasize the pipeline concept.
Is MLlib deprecated?
No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode, but neither API is deprecated, nor is MLlib as a whole.
Dependencies
MLlib uses the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If native libraries are not available at runtime, you will see a warning message and a pure JVM implementation will be used instead.
Due to licensing issues, netlib-java's native proxies are not included by default; they must be installed or added to your project as a dependency.
To use MLlib in Python, you need NumPy version 1.4 or newer.
Migration guide
MLlib is under active development. APIs marked Experimental/DeveloperApi may change in future releases. The changes from 2.0 to 2.1 are described below.
From 2.0 to 2.1
Breaking changes
Deprecated methods removed:
setLabelCol in feature.ChiSqSelectorModel
numTrees in classification.RandomForestClassificationModel
numTrees in regression.RandomForestRegressionModel
model in regression.LinearRegressionSummary
validateParams in PipelineStage
validateParams in Evaluator
Deprecations and changes of behavior
Deprecations
Deprecate all Param setter methods except for input/output column Params for DecisionTreeClassificationModel, GBTClassificationModel, RandomForestClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel and RandomForestRegressionModel.
Changes of behavior
SPARK-17870: Fix a bug in ChiSqSelector, which will likely change its results. ChiSqSelector now uses the pValue rather than the raw statistic to select a fixed number of top features.
SPARK-3261: KMeans returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
SPARK-17389: KMeans reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
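As a rough illustration of why the ChiSqSelector change (SPARK-17870) can alter which features are kept, the sketch below ranks three features once by raw chi-square statistic and once by p-value. It is plain Python, not Spark code: the feature names, statistics, and degrees of freedom are hypothetical, and the helper functions are made up for this example. It assumes the features have different degrees of freedom, which is one situation where the two rankings disagree.

```python
import math


def chi2_p_value(statistic, dof):
    """Upper-tail p-value of a chi-square statistic (dof == 1 or even dof only)."""
    half = statistic / 2.0
    if dof == 1:
        return math.erfc(math.sqrt(half))
    if dof > 0 and dof % 2 == 0:
        # Closed form for even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        terms = (half ** i / math.factorial(i) for i in range(dof // 2))
        return math.exp(-half) * sum(terms)
    raise ValueError("this sketch only handles dof == 1 or even dof")


# Hypothetical test results: (feature name, chi-square statistic, degrees of freedom)
results = [("a", 6.0, 1), ("b", 8.0, 4), ("c", 1.0, 2)]


def top_by_statistic(results, n):
    """Keep the n features with the largest raw statistic."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [name for name, _, _ in ranked[:n]]


def top_by_p_value(results, n):
    """Keep the n features with the smallest p-value."""
    ranked = sorted(results, key=lambda r: chi2_p_value(r[1], r[2]))
    return [name for name, _, _ in ranked[:n]]


print(top_by_statistic(results, 1))  # ['b']
print(top_by_p_value(results, 1))    # ['a']
```

Here feature b has the larger statistic, but with 4 degrees of freedom it is less significant than feature a with 1 degree of freedom, so the two selection rules pick different features.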