A probe into Scala spark machine learning


Transformer: an abstract class covering both feature transformers and fitted learning models. A Transformer implements a transform() method that typically appends one or more columns to a dataset, producing a new dataset. 1. A feature transformer takes a dataset, converts one of its columns into a new set of values, appends those values as a new column, and outputs the resulting dataset. 2. A learning-model transformer takes a dataset, reads the column containing feature vectors, predicts a label for each feature vector, appends the predicted labels as a new column, and outputs the result.

Estimator: the abstract class for machine learning algorithms. An Estimator implements a fit() method that processes a dataset and produces a Transformer. For example, LogisticRegression is an Estimator; calling its fit() method trains a LogisticRegressionModel, which is a Transformer. Transformers and Estimators are stateless, and each instance has a unique ID.

Pipeline: in machine learning it is common to run a sequence of algorithms to process and learn from data. For example, a simple text-document workflow includes steps such as converting the words of each document into numeric feature vectors, then using the feature vectors and labels to train a model.

Spark ML represents such workflows with a Pipeline. A Pipeline is an engineering construct, similar in spirit to the factory pattern: it assembles the whole workflow by chaining together the Transformer or Estimator for each step.
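As a minimal sketch of this assembly (assuming a spark-shell session where df is a DataFrame with a string "category" column and a numeric "label" column; the column names here are illustrative, not taken from the example below):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.classification.LogisticRegression

// Each stage is declared once; the Pipeline wires them together.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// A Pipeline is itself an Estimator: fit() runs each stage in order and
// returns a PipelineModel, which is a Transformer.
val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
val model = pipeline.fit(df)          // trains every stage
val predictions = model.transform(df) // applies every fitted stage
```

This is the practical payoff of the Transformer/Estimator abstractions: because every stage conforms to one of the two interfaces, the Pipeline can drive them uniformly.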

You can paste the following code directly into the spark-shell command line to execute it. StringIndexer maps the values of a string column to numeric indices. However, a logistic regression classifier would treat those raw indices as ordered numeric data, so the numbers produced by StringIndexer need further processing. Here we use OneHotEncoder. One-hot encoding, also known as one-of-n encoding, uses an n-bit status register to encode n states: each state gets its own register bit, and at any time only one bit is set.

In other words, for each feature with m possible values, one-hot encoding turns it into m binary features. These features are mutually exclusive: only one is active at a time. As a result, the data becomes sparse.

The main benefits of this are:

    1. It works around the fact that many classifiers do not handle categorical attribute data well.

    2. To some extent, it also expands the feature space.
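The mechanics can be illustrated without Spark. A minimal sketch in plain Scala (the oneHot helper and the index map are hypothetical, for illustration only; they are not part of the Spark API):

```scala
// Hypothetical one-hot helper: a feature with `size` possible values becomes
// `size` binary features, exactly one of which is 1.0 for a given value.
def oneHot(index: Int, size: Int): Array[Double] =
  Array.tabulate(size)(i => if (i == index) 1.0 else 0.0)

// StringIndexer assigns indices by descending frequency; for the sample
// data used below that gives: a -> 0, C -> 1, B -> 2, D -> 3.
val indexOf = Map("a" -> 0, "C" -> 1, "B" -> 2, "D" -> 3)

val vecForC = oneHot(indexOf("C"), 4) // Array(0.0, 1.0, 0.0, 0.0)
```

Note that Spark's OneHotEncoder drops the last category by default, so its vectors have length m - 1 rather than m; that is why, in the output below, four categories yield vectors of length 3 and the last category appears as all zeros.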

import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "B"),
  (2, "C"),
  (3, "a"),
  (4, "a"),
  (5, "C"),
  (6, "D"))).toDF("id", "category")

// Fit a StringIndexer to map each category to a numeric index.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.select("category", "categoryIndex").show()

// One-hot encode the indices so they are no longer treated as ordered values.
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

// Convert each row to a LabeledPoint (here the "id" column stands in as the label).
val data = encoded.rdd.map { x =>
  val featureVector = Vectors.dense(x.getAs[org.apache.spark.mllib.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}

val result = sqlContext.createDataFrame(data)

scala> result.show()

+-----+-------------+
|label|     features|
+-----+-------------+
|  0.0|[1.0,0.0,0.0]|
|  1.0|[0.0,0.0,1.0]|
|  2.0|[0.0,1.0,0.0]|
|  3.0|[1.0,0.0,0.0]|
|  4.0|[1.0,0.0,0.0]|
|  5.0|[0.0,1.0,0.0]|
|  6.0|[0.0,0.0,0.0]|
+-----+-------------+

The encoded features column is sparse: each row has at most a single 1.0.
