PySpark Learning Notes (4) -- MLlib and ML Introduction


Spark MLlib is the library dedicated to machine learning tasks in Spark, but as of Spark 2.0 most machine-learning functionality has been moved to the Spark ML package. The difference is that MLlib works on RDDs as its source data, while ML is a higher-level API built on DataFrames that can express the whole range of machine learning tasks, from data cleaning through feature engineering to model training. Going forward, Spark ML will therefore be the main way to handle machine learning tasks in Spark.

Spark ML mainly provides three kinds of functionality:

(1) Data preparation: feature extraction, transformation, selection, and some natural language processing methods.

(2) Machine learning algorithms: common classification, clustering, and regression algorithms.

(3) Utilities: common statistical methods and model evaluation methods.

The Spark ML package is built around three abstract classes: transformers (Transformer), estimators (Estimator), and pipelines (Pipeline).

1. Transformers

In pyspark.ml, a transformer typically transforms the data by appending a new column to a DataFrame. When deriving from the transformer base class (pyspark.ml.Transformer), each new transformer class must implement the .transform() method, whose mandatory argument is the DataFrame to be transformed.
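As a minimal sketch of this contract, using the Binarizer transformer described in the list below (this assumes an existing SparkSession named spark; the column names are illustrative):

```python
from pyspark.ml.feature import Binarizer

df = spark.createDataFrame([(0.1,), (0.8,), (0.5,)], ["score"])

# transform() appends the new column rather than modifying the input:
# values strictly above the threshold become 1.0, the rest 0.0.
binarizer = Binarizer(threshold=0.5, inputCol="score", outputCol="binary_score")
binarizer.transform(df).show()
```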

The pyspark.ml.feature module provides many transformers; some common ones are briefly described below (a combined usage sketch follows the list):

(1) Binarizer: converts a continuous variable to the corresponding binary value according to a specified threshold.

(2) Bucketizer: converts a continuous variable to a multinomial value based on a list of thresholds (i.e., discretizes the continuous variable into the specified ranges).

(3) ChiSqSelector: uses the chi-square test to select quantitative features for a classification model.

(4) CountVectorizer: converts tokenized text into vectors of token counts.

(5) HashingTF: a hashing transformer that takes a list of tokens as input and returns a fixed-length vector of term counts.

(6) IDF: computes the inverse document frequency for a list of documents. (The documents must first be converted to vector representations with HashingTF or CountVectorizer.)

(7) OneHotEncoder: encodes a categorical column into a column of binary vectors.

(8) PCA: performs dimensionality reduction using principal component analysis.

(9) StopWordsRemover: removes stop words from tokenized text.

(10) Tokenizer: the default tokenizer converts the text to lowercase and then splits it on whitespace.

(11) VectorIndexer: generates an index vector for categorical columns.

(12) VectorSlicer: given a list of indices, extracts the corresponding values from a feature vector.

(13) Word2Vec: takes a string (a sentence) as input and converts it to a map of {string, vector} format.
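A hedged usage sketch chaining several of these transformers (again assuming an existing SparkSession spark; the data is illustrative). Note that IDF is strictly an estimator: its fit() returns an IDFModel whose transform() applies the weighting:

```python
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

docs = spark.createDataFrame(
    [(0, "Spark ML makes feature pipelines easy"),
     (1, "MLlib is the older RDD based API")],
    ["id", "text"])

# Tokenizer lowercases and splits on whitespace; StopWordsRemover drops
# stop words; HashingTF maps tokens to a fixed-length count vector.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
clean = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
tf = HashingTF(inputCol="filtered", outputCol="rawFeatures",
               numFeatures=1024).transform(clean)

# IDF reweights the raw counts by inverse document frequency.
tfidf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf).transform(tf)
tfidf.select("id", "features").show(truncate=False)
```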
2. Estimators

An estimator in pyspark.ml is a statistical model fitted to observed data in order to make predictions or classifications. When deriving from the abstract estimator class (pyspark.ml.Estimator), the new model must implement the .fit() method, which fits the model using the data in a DataFrame and some parameters (default or user-supplied).
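A minimal sketch of this contract, using logistic regression (described in section 2.1 below) on toy data, assuming an existing SparkSession spark:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0)],
    ["features", "label"])

# fit() consumes the DataFrame and returns a fitted model, which is
# itself a transformer that appends prediction columns.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients, model.intercept)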

The models in Spark ML mainly comprise classification, regression, clustering, and recommendation models.

2.1 Classification Models (pyspark.ml.classification)

(1) LogisticRegression: supports binary and multinomial (softmax) logistic regression.

Logistic regression uses a logistic function to relate the input variables to the class labels, and can solve both binary (sigmoid function) and multiclass classification problems.

(2) DecisionTreeClassifier: supports binary and multiclass labels.

This classifier predicts the class of an observation by building a decision tree.

(3) GBTClassifier: supports binary labels, and both continuous and categorical features.

The gradient-boosted decision tree (gradient boosting decision tree) is an ensemble classification method whose base learner is the classification and regression tree (CART); boosting combines multiple weak classifiers into one strong classifier.

(4) RandomForestClassifier: supports binary and multiclass labels (see the end-to-end sketch after this list).

The random forest is another ensemble classification method whose base learner is the decision tree; bagging (majority voting) combines the outputs of many weak classifiers into the final classification.

(5) NaiveBayes: supports binary and multiclass labels.

This model is based on Bayes' theorem and uses conditional probability to classify observations.

(6) MultilayerPerceptronClassifier:

The multilayer perceptron classifier contains at least three fully connected layers of artificial neurons (the layer sizes must be specified when creating the model): an input layer (with as many neurons as there are input features), at least one hidden layer (with a nonlinear activation function), and an output layer (with as many neurons as there are output classes).

(7) OneVsRest:

This model reduces a multiclass problem to a set of binary problems.
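A hedged end-to-end sketch with one of these classifiers; it assumes a DataFrame named data with "features" and "label" columns already exists:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = data.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(numTrees=50, maxDepth=5)
model = rf.fit(train)

# The fitted model appends "prediction" (and probability) columns.
pred = model.transform(test)
acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(pred)
print("test accuracy:", acc)
```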
2.2 Regression Models (pyspark.ml.regression)

(1) DecisionTreeRegressor: a decision tree for continuous labels.

(2) GBTRegressor: gradient-boosted trees for continuous labels.

(3) RandomForestRegressor: random forests for continuous labels.

(4) LinearRegression: a simple linear regression model, which assumes a linear relationship between the input features and the continuous output label (a short sketch follows this list).

(5) GeneralizedLinearRegression:

Depending on the chosen family (Gaussian, binomial, gamma, Poisson), the generalized linear regression model can solve different kinds of linear regression problems.
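A minimal LinearRegression sketch on toy data (assuming an existing SparkSession spark):

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([1.0]), 2.1),
     (Vectors.dense([2.0]), 3.9),
     (Vectors.dense([3.0]), 6.2)],
    ["features", "label"])

model = LinearRegression(maxIter=50, regParam=0.0).fit(df)
print(model.coefficients, model.intercept)
print(model.summary.rootMeanSquaredError)  # RMSE on the training data
```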
2.3 Clustering Models (pyspark.ml.clustering)

Clustering comprises a series of unsupervised models that look for hidden patterns in the data.

(1) KMeans: the k-means algorithm partitions the data into k clusters, iteratively updating the cluster centroids so as to minimize the sum of squared distances between each observation and the centroid of its cluster (a minimal sketch follows this list).

(2) BisectingKMeans: bisecting k-means combines the k-means algorithm with hierarchical clustering. The algorithm initially treats all observations as a single cluster and then iteratively splits the data into k clusters.

(3) GaussianMixture: the Gaussian mixture model describes the dataset with k Gaussian distributions of unknown parameters; using the expectation-maximization algorithm, the Gaussians are found by maximizing the log-likelihood function.

(4) LDA: Latent Dirichlet Allocation can be used to identify latent topic information in a large document collection or corpus.
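A minimal KMeans sketch on toy data (assuming an existing SparkSession spark):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())   # the two learned centroids
model.transform(df).show()      # appends a "prediction" (cluster id) column
```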

2.4 Recommendation Models (pyspark.ml.recommendation)

(1) ALS: Alternating least squares is a recommendation algorithm based on the principle of collaborative filtering. Given a known number of users m and a known number of items n, the user-item matrix A_{m×n} can be decomposed, in the spirit of singular value decomposition (SVD), into two low-rank matrices U and V, roughly as follows:

A_{m×n} ≈ U_{m×k} × V_{k×n}
ALS then optimizes the factorization by fixing one matrix and solving for the other, then fixing the second and re-solving for the first, alternating until convergence; from the result, a user's preference for a particular item (a rating or preference value) can be derived.
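A hedged ALS sketch on illustrative ratings (assuming an existing SparkSession spark); rank corresponds to the k in the factorization above:

```python
from pyspark.ml.recommendation import ALS

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "itemId", "rating"])

als = ALS(rank=10, maxIter=5, regParam=0.1,
          userCol="userId", itemCol="itemId", ratingCol="rating")
model = als.fit(ratings)
model.transform(ratings).show()  # predicted ratings for the known pairs
```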
3. Pipelines

A pipeline in pyspark.ml represents the end-to-end process from transformation to estimation: it performs the necessary transformations on the raw input DataFrame and finally fits the statistical model.
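A minimal Pipeline sketch tying the pieces together (assuming an existing SparkSession spark; the data and column names are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

train = spark.createDataFrame(
    [(0, "spark ml pipelines are convenient", 1.0),
     (1, "hadoop map reduce jobs", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# fit() runs the stages in order: transformers transform the data,
# estimators are fitted; the result is a single PipelineModel.
pipeline = Pipeline(stages=[tokenizer, tf, lr])
model = pipeline.fit(train)
model.transform(train).select("id", "prediction").show()
```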


