Spark MLlib feature extraction, feature transformation and feature selection

Feature Extraction

TF-IDF

TF-IDF is widely used in text mining to reflect how important a term is to a document in a corpus. Let t denote a term, d a document, and D the corpus. Term frequency TF(t, d) is the number of times term t appears in document d, and document frequency DF(t, D) is the number of documents in the corpus that contain t. Measuring importance with TF alone is misleading, because a term that occurs in many documents may carry very little information about any particular one. Inverse document frequency (IDF) is therefore used to measure how much information a term provides:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

where |D| is the total number of documents in the corpus. Clearly, if t appears in every document, its IDF value is 0. The TF-IDF weight is then the product of the two:

TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
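As a quick illustration with made-up numbers: in a corpus of |D| = 3 documents where a term appears in DF(t, D) = 1 document, IDF(t, D) = log((3 + 1) / (1 + 1)) = log 2 ≈ 0.69; if that term occurs twice in a given document, its TF-IDF weight there is 2 × 0.69 ≈ 1.39.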

Example:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Map the words of each sentence to a term-frequency vector of length 20.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// Alternatively, CountVectorizer can also be used to get term frequency vectors.

// Fit the IDF model and rescale the raw term frequencies.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
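The comment above notes that CountVectorizer can be used instead of HashingTF to obtain term-frequency vectors. A minimal sketch of that substitution is shown below; the vocabSize and minDF values are illustrative assumptions, not part of the original example.

import org.apache.spark.ml.feature.CountVectorizer

// CountVectorizer builds an explicit vocabulary from the corpus instead of hashing,
// so each index in the resulting vectors maps back to a concrete term.
val cvModel = new CountVectorizer()
  .setInputCol("words").setOutputCol("rawFeatures")
  .setVocabSize(20) // illustrative vocabulary size
  .setMinDF(1)      // keep terms that appear in at least one document
  .fit(wordsData)

val featurizedData = cvModel.transform(wordsData)

The IDF fitting and rescaling steps that follow remain unchanged.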
Word2Vec
Word2Vec is an Estimator that learns a fixed-size vector representation for each word in a corpus. A document is then represented by averaging the vectors of the words it contains, and these document vectors are often used to compute document similarity.
Example:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
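Besides transforming whole documents, the fitted Word2VecModel also exposes the learned word vectors directly. A minimal sketch continuing the example above (with such a tiny corpus the vectors and neighbors are not meaningful; this only shows the calls):

// DataFrame of the learned vocabulary: one row per word with its vector.
model.getVectors.show(false)

// The two words closest to "Spark" in the learned vector space, with cosine similarity.
model.findSynonyms("Spark", 2).show(false)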
Feature Transformation

N-gram
An n-gram is a contiguous sequence of n tokens (words or characters). N-gram models exploit collocation information between adjacent words in context: a classic application is converting an unsegmented input string of pinyin, strokes, or digits into the most probable Chinese sentence automatically, so the user does not have to choose among the many characters that share the same pinyin (or stroke or digit sequence). The model assumes that the occurrence of the nth word depends only on the preceding n-1 words and on no other word, so the probability of a whole sentence is the product of the conditional probabilities of its words.
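To make the independence assumption concrete (this illustration is not from the original text), under a bigram model (n = 2) the probability of a sentence w_1 w_2 ... w_m is approximated as

P(w_1 w_2 ... w_m) ≈ P(w_1) · P(w_2 | w_1) · P(w_3 | w_2) · ... · P(w_m | w_(m-1))

where each conditional probability is estimated from bigram counts in a training corpus.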
Example:
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

// Produce bigrams (n = 2) from each word sequence.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
Normalization
Spark MLlib provides the Normalizer class to normalize each feature vector to unit norm. You can specify the norm degree p with setP; the default value of p is 2.
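As a quick worked example using the first vector from the code below, [1.0, 0.5, -1.0]: its L^1 norm is |1.0| + |0.5| + |-1.0| = 2.5, so the L^1-normalized vector is [0.4, 0.2, -0.4]; with the default p = 2, the L^2 norm is sqrt(1.0^2 + 0.5^2 + 1.0^2) = 1.5, giving [0.667, 0.333, -0.667].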
Example:
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each Vector using the $L^1$ norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each Vector using the $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Feature Selection

Feature selection picks a subset of features out of a feature vector. Machine-learning datasets often contain many features, so we need to select the useful ones. Suppose we have a DataFrame with a vector column userFeatures:

userFeatures
------------------
 [0.0, 10.0, 0.5]

Its first column is all zeros, so we want to remove it and keep only the last two columns. Calling setIndices(1, 2) on a VectorSlicer selects those columns into a new features column:

userFeatures      | features
------------------|-----------------------------
 [0.0, 10.0, 0.5] | [10.0, 0.5]

Example:

import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
