Offline Lightweight Big Data Platform Spark: MLlib Machine Learning Library Concepts

Source: Internet
Author: User
Tags: svm

MLlib Machine Learning Library

1.1 Machine Learning Concepts

Machine learning has many definitions; a common one is that machine learning is the study of computer algorithms that improve automatically through experience. Machine learning relies on data as its experience and uses it to evaluate and optimize the model that an algorithm produces. A machine learning algorithm tries to maximize a mathematical objective that describes its behavior on the training data, and then uses the learned model to make predictions or decisions. Machine learning problems fall into several categories, including classification, regression, and clustering. Most machine learning algorithms follow the same pipeline: extract features from the training data, train a model on the feature vectors, and evaluate the model to select the best one. Feature extraction converts the training data into numerical features suitable for mathematical modeling. Machine learning is generally categorized as follows:

1) Supervised Learning

Supervised learning learns a function (model) from a given training data set; when new data arrives, the function can be used to predict the result. The labels in the training set are assigned manually, so supervised learning involves human participation in building the model. Common algorithms include regression analysis and statistical classification. Supervised learning is often used for classification, teaching the computer to reproduce classifications that were labeled by hand. It is the standard technique for training neural networks and decision trees, both of which depend heavily on predetermined classification labels.

2) Unsupervised Learning

Unsupervised learning builds a model from a training set without any human labeling; the structure is discovered automatically by the computer. Put simply, when people do not know how to label the data, the computer is left to group it according to an algorithm. Common application scenarios include association rule learning and clustering; common algorithms include the Apriori algorithm and the K-means algorithm.

3) Semi-Supervised Learning

Semi-supervised learning is a machine learning mode that sits between supervised and unsupervised learning and is a key problem in pattern recognition and machine learning. It trains and classifies using a small number of labeled samples together with a large number of unlabeled samples. The main approaches include probability-based algorithms, modifications of existing supervised algorithms, and methods that rely directly on clustering assumptions. In semi-supervised learning, humans label only part of the data; common tasks include classification and regression, and common algorithms are extensions of supervised learning algorithms: they first try to model the unlabeled data and then use that model to make predictions for the labeled data, for example graph-based inference algorithms and Laplacian support vector machines.

4) Reinforcement Learning

Reinforcement learning learns by performing actions and observing their results: each action affects the environment, and the learner adjusts its behavior based on the feedback it observes from its surroundings. The input data serves as feedback to the model, which is then adjusted. Common scenarios include dynamic systems and robot control; common algorithms include Q-learning and temporal-difference learning.

Common machine learning algorithms are:

- Constructing conditional probability models: regression analysis and statistical classification;
- Artificial neural networks;
- Decision trees;
- Gaussian process regression;
- Linear discriminant analysis;
- Nearest neighbor methods;
- Perceptron;
- Radial basis function kernels;
- Support vector machines;
- Constructing probability density functions via generative models;
- Maximum expectation (EM) algorithm;
- Graphical models: including Bayesian networks and Markov random fields;
- Generative topographic mapping;
- Approximate inference techniques;
- Markov chain Monte Carlo methods;
- Variational methods;
- Optimization: most of the methods above use optimization algorithms directly or indirectly.

The main machine learning algorithms are:

1) Regression algorithm

A regression algorithm tries to explore the relationship between variables by using a measure of error. Regression is a powerful tool in statistical machine learning. Common regression algorithms include ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines (MARS), and locally estimated scatterplot smoothing (LOESS).

2) Instance-based algorithms

Instance-based algorithms are often used to model decision problems. Such models typically keep a set of sample data and compare new data with the samples using some similarity measure, finding the best match in this way. For this reason, instance-based algorithms are often called "winner-take-all" learning or "memory-based" learning. Common algorithms include K-nearest neighbors (KNN), learning vector quantization (LVQ), and self-organizing maps (SOM).

3) Regularization method

Regularization methods are extensions of other algorithms (usually regression algorithms) that adjust the algorithm according to model complexity. They typically reward simple models and penalize complex ones. Common algorithms include ridge regression, the least absolute shrinkage and selection operator (LASSO), and the elastic net.

4) Decision Tree algorithm

Decision tree algorithms build a decision model as a tree over the attributes of the data; decision tree models are often used to solve classification and regression problems. Common algorithms include classification and regression trees (CART), ID3 (Iterative Dichotomiser 3), C4.5, chi-squared automatic interaction detection (CHAID), decision stumps, random forests, multivariate adaptive regression splines (MARS), and gradient boosting machines (GBM).

5) Bayesian algorithm

Bayesian algorithms are based on Bayes' theorem and are mainly used to solve classification and regression problems. Common algorithms include naive Bayes, averaged one-dependence estimators (AODE), and Bayesian belief networks (BBN).

6) Kernel-based algorithms

The best-known kernel-based algorithm is the support vector machine (SVM). Kernel-based algorithms map input data into a higher-dimensional vector space, where some classification or regression problems become easier to solve. Common kernel-based algorithms include support vector machines (SVM), radial basis functions (RBF), and linear discriminant analysis (LDA).

7) Clustering algorithm

Like regression, "clustering" sometimes describes a class of problems and sometimes a class of algorithms. Clustering algorithms typically group input data around center points or in a hierarchical manner. All clustering algorithms attempt to find the intrinsic structure of the data in order to group it by its strongest commonalities. Common clustering algorithms include the K-means algorithm and the expectation maximization (EM) algorithm.

8) Association Rules Learning

Association rule learning finds useful association rules in large multivariate datasets by searching for the rules that best explain relationships between data variables. Common algorithms include the Apriori algorithm and the Eclat algorithm.

9) Artificial Neural network algorithm

Artificial neural network algorithms are pattern-matching algorithms inspired by biological neural networks, typically used to solve classification and regression problems. Artificial neural networks are a huge branch of machine learning, with hundreds of different algorithms (deep learning is one of them and is discussed separately below). Important artificial neural network algorithms include the perceptron, backpropagation, Hopfield networks, self-organizing maps (SOM), and learning vector quantization (LVQ).

10) Deep Learning algorithms

Deep learning algorithms are a development of artificial neural networks that have received a great deal of attention recently, especially after Baidu began investing heavily in deep learning, which also drew considerable attention in China. With computing power becoming increasingly inexpensive, deep learning attempts to build much larger and more complex neural networks. Many deep learning algorithms are semi-supervised, designed to handle large datasets in which only a small portion of the data is labeled. Common deep learning algorithms include restricted Boltzmann machines (RBM), deep belief networks (DBN), convolutional networks, and stacked auto-encoders.

11) Dimensionality reduction algorithms

Like clustering algorithms, dimensionality reduction algorithms try to analyze the internal structure of the data, but they are unsupervised methods that attempt to summarize or explain the data using less information. Such algorithms can be used to visualize high-dimensional data or to simplify data for supervised learning. Common algorithms include principal component analysis (PCA), partial least squares regression (PLS), Sammon mapping, multidimensional scaling (MDS), and projection pursuit.

12) Ensemble algorithms

Ensemble algorithms train several relatively weak models independently on the same samples and then combine their outputs to make an overall prediction. The main difficulty lies in choosing which independent weak models to combine and how to combine their results. This is a very powerful, and very popular, class of algorithms. Common algorithms include boosting, bootstrapped aggregation (bagging), AdaBoost, stacked generalization (blending), gradient boosting machines (GBM), and random forests.

1.2 Spark MLlib Introduction

Spark has a unique advantage in machine learning for the following reasons:

1) Machine learning algorithms usually involve many iterative computation steps: the computation stops only after multiple iterations, once the error is small enough or the model has sufficiently converged. With Hadoop's MapReduce computing framework, each iteration incurs disk reads/writes and task start-up costs, which leads to very high I/O and CPU consumption. Spark's memory-based computation model is inherently good at iterative computation: multiple computation steps are performed directly in memory, and the disk and network are touched only when necessary, so Spark is an ideal platform for machine learning.

2) From a communications point of view, with Hadoop's MapReduce framework the communication and data transfer via heartbeats between the JobTracker and TaskTrackers can make execution very slow, whereas Spark has excellent and efficient communication systems based on Akka and Netty, so its communication efficiency is very high.

MLlib (Machine Learning Library) is Spark's implementation of commonly used machine learning algorithms, along with related tests and data generators. Spark is designed to support iterative jobs, which fits many machine learning algorithms well.

MLlib currently supports four common classes of machine learning problems: classification, regression, clustering, and collaborative filtering, and it sits alongside the other libraries in Spark's ecosystem.


MLlib is built on the RDD, which lets it integrate seamlessly with Spark SQL, GraphX, and Spark Streaming; with the RDD as the cornerstone, the four sub-frameworks work together to form a big data processing platform.

MLlib is part of MLBase, which is divided into four parts: MLlib, MLI, ML Optimizer, and MLRuntime.

- ML Optimizer selects the machine learning algorithms and parameters it considers most appropriate, from those already implemented internally, to process the data entered by the user, and returns the model or other analysis results;
- MLI is an API or platform for algorithm implementation, feature extraction, and high-level ML programming abstractions;
- MLlib is Spark's implementation of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization; the set of algorithms can be extended;
- MLRuntime is based on the Spark computing framework and applies Spark's distributed computation to machine learning.

MLlib mainly includes:

1) Feature extraction: TF-IDF (a minimal sketch follows this list);
2) Statistics;
3) Classification and regression: supervised machine learning, where classification predicts discrete variables and regression predicts continuous variables; algorithms include linear regression, logistic regression, support vector machines, naive Bayes, decision trees, and random forests;
4) Clustering: unsupervised machine learning that groups objects into clusters of high similarity; unlike supervised tasks, the data is not labeled, so clustering can be applied to unlabeled data and is mainly used for data exploration and anomaly detection; K-means is available;
5) Collaborative filtering and recommendation: collaborative filtering is a recommender-system technique that recommends new products based on users' interactions with and ratings of products; alternating least squares is available;
6) Dimensionality reduction: principal component analysis and singular value decomposition.
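As a rough illustration of the feature-extraction entry above, here is a minimal TF-IDF sketch using the RDD-based spark.mllib feature API; the input path, whitespace tokenization, and application name are illustrative assumptions, not details from the original article.

```scala
// Minimal TF-IDF sketch with the RDD-based spark.mllib feature API.
// Path and tokenization are illustrative assumptions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfExample"))

    // Each document is a sequence of terms (whitespace-tokenized here).
    val documents: RDD[Seq[String]] =
      sc.textFile("data/docs.txt").map(_.split(" ").toSeq)

    // Term frequencies via feature hashing, then IDF weighting in a second pass.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()
    val idfModel = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idfModel.transform(tf)

    tfidf.take(3).foreach(println)
    sc.stop()
  }
}
```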

See: https://spark.apache.org/docs/latest/mllib-guide.html

1.3 Spark MLlib Architecture Analysis

As you can see from the architecture diagram, MLlib consists of three main parts:

- Underlying base: includes Spark's runtime and the matrix and vector libraries;
- Algorithm library: includes generalized linear models, recommendation systems, clustering, decision trees, and evaluation algorithms;
- Utilities: includes test data generation, loading of external data, and other functions.


1) The underlying foundation of MLlib

The underlying base part mainly comprises the vector interface and the matrix interface, both of which are implemented in Scala on top of the linear algebra library Breeze, which in turn relies on netlib-java and BLAS/LAPACK.

MLlib supports local dense vectors and sparse vectors, as well as labeled points (vectors paired with a label).

MLlib supports both local and distributed matrices; the supported distributed matrices include RowMatrix, IndexedRowMatrix, CoordinateMatrix, and so on.
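A minimal sketch of these building blocks (dense and sparse vectors, a labeled point, and a distributed RowMatrix) is shown below; the sample values and application name are illustrative assumptions.

```scala
// Sketch of MLlib's local vector, labeled point, and distributed matrix types.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

object LinalgExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LinalgExample"))

    // Dense vector: all values stored explicitly.
    val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
    // Sparse vector: size 3, non-zero entries at indices 0 and 2.
    val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // Labeled point: a vector paired with a label, used by supervised algorithms.
    val point = LabeledPoint(1.0, sparse)

    // Distributed RowMatrix built from an RDD of local vectors.
    val rows = sc.parallelize(Seq(dense, sparse))
    val mat = new RowMatrix(rows)
    println(s"rows=${mat.numRows()}, cols=${mat.numCols()}, label=${point.label}")

    sc.stop()
  }
}
```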

2) The MLlib algorithm library

The algorithm library is the core of MLlib; its main contents are outlined below.


Algorithms commonly used in Spark:

- Classification algorithms

Classification algorithms belong to supervised learning: they use samples with known class labels to build a classification function or model, which can then be applied to classify data whose class labels are unknown. Classification is an important data mining task and currently the most widely used commercially; typical application scenarios include churn prediction, precision marketing, customer acquisition, and personal preference analysis. MLlib currently supports the following classification algorithms: logistic regression, support vector machines, naive Bayes, and decision trees. A minimal sketch using a support vector machine follows.
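The following is a hedged sketch of binary classification with MLlib's SVMWithSGD; the LIBSVM file path, the 70/30 train/test split, and the iteration count are illustrative assumptions rather than details from the original article.

```scala
// Binary classification sketch with a linear SVM trained by SGD.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

object SvmExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SvmExample"))

    // Load labeled data in LIBSVM format and split into training/test sets.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Train a linear SVM with stochastic gradient descent.
    val model = SVMWithSGD.train(training.cache(), numIterations = 100)

    // Simple accuracy on the held-out set.
    val accuracy = test
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    println(s"Test accuracy: $accuracy")

    sc.stop()
  }
}
```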

- Regression algorithms

Regression algorithms belong to supervised learning: each entity has a real-valued label associated with it, and given the numeric features that represent these entities, we want the predicted label values to be as close as possible to the actual values. MLlib currently supports the following regression algorithms: linear regression, ridge regression, Lasso, and decision trees.

Case: import the training data set, parse it into an RDD of labeled points, use the LinearRegressionWithSGD algorithm to build a simple linear model that predicts the label value, and finally compute the mean squared error to evaluate how well the predicted values match the actual values. A sketch is shown below.
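A sketch of this case, under the assumption that each input line has the form "label,feature1 feature2 ...", might look as follows; the file path and iteration count are illustrative.

```scala
// Linear regression sketch with LinearRegressionWithSGD and MSE evaluation.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LinearRegressionExample"))

    // Parse each line "label,f1 f2 f3 ..." into a LabeledPoint.
    val data = sc.textFile("data/lpsa.data")
    val parsed = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build a simple linear model with stochastic gradient descent.
    val model = LinearRegressionWithSGD.train(parsed, numIterations = 100)

    // Evaluate the fit with the mean squared error on the training data.
    val mse = parsed.map { p =>
      val prediction = model.predict(p.features)
      math.pow(prediction - p.label, 2)
    }.mean()
    println(s"Training mean squared error: $mse")

    sc.stop()
  }
}
```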

- Clustering algorithms

Clustering algorithms are unsupervised learning, usually used for exploratory analysis. Based on the principle that "birds of a feather flock together", they group samples that have no predefined categories into different clusters; each such set of data objects is called a cluster, and the process also produces a description of each cluster. The goal is that samples in the same cluster should be similar to each other, while samples in different clusters should be sufficiently dissimilar. Typical application scenarios include customer segmentation, customer research, market segmentation, and value assessment. MLlib currently supports the widely used K-means clustering algorithm.

Case: import the training data set, use a KMeans object to cluster the data into two clusters (the desired number of clusters is passed to the algorithm), and then compute the within-set sum of squared errors (WSSSE); the error can be reduced by increasing the number of clusters k. In practice, the optimal k is usually the one at the "elbow" of the WSSSE curve. A sketch is shown below.
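A sketch of this case might look as follows; the data path, k = 2, and the iteration count are illustrative assumptions.

```scala
// K-means clustering sketch with WSSSE evaluation.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))

    // Each line holds space-separated numeric features.
    val data = sc.textFile("data/kmeans_data.txt")
    val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

    // Cluster the data into two groups (k = 2) with up to 20 iterations.
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsed, numClusters, numIterations)

    // WSSSE: within-set sum of squared errors; compare values for different k
    // and look for the "elbow" of the curve.
    val wssse = model.computeCost(parsed)
    println(s"Within Set Sum of Squared Errors = $wssse")

    sc.stop()
  }
}
```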

- Collaborative filtering

Collaborative filtering is commonly applied in recommender systems; it aims to fill in the missing entries of a user-item rating matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors, and these factors are then used to predict the missing entries.

Case: import a training dataset in which each row consists of a user, a product, and the corresponding rating. Assuming the ratings are explicit, use the default ALS.train() method and evaluate the recommendation model by computing the mean squared error of the predicted ratings. A sketch is shown below.
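A sketch of this case, assuming input lines of the form "user::product::rating" and illustrative ALS parameters (rank 10, 10 iterations, lambda 0.01), might look as follows.

```scala
// Collaborative filtering sketch with ALS on explicit ratings and MSE evaluation.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AlsExample"))

    // Each line: user::product::rating
    val ratings = sc.textFile("data/ratings.dat").map { line =>
      val fields = line.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }.cache()

    // Train a matrix factorization model (rank 10, 10 iterations, lambda 0.01).
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Predict ratings for the (user, product) pairs seen in training and compute MSE.
    val userProducts = ratings.map(r => (r.user, r.product))
    val predictions = model.predict(userProducts).map(r => ((r.user, r.product), r.rating))
    val ratesAndPreds = ratings.map(r => ((r.user, r.product), r.rating)).join(predictions)
    val mse = ratesAndPreds.map { case (_, (actual, predicted)) =>
      math.pow(actual - predicted, 2)
    }.mean()
    println(s"Mean squared error = $mse")

    sc.stop()
  }
}
```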

3) MLlib utilities

The utilities section includes data validators, analyzers for binary and multiclass labels, various data generators, and data loaders.
