Feature Extraction

TF-IDF
TF-IDF is widely used in text mining to measure the importance of a term. Let t be a term (feature item), d a document, and D the document collection. The term frequency TF(t, d) is the number of times t appears in document d, and the document frequency DF(t, D) is the number of documents in D that contain t. If TF alone were used to measure importance, a term that appears many times but carries little information (such as a stop word) would be overweighted. The inverse document frequency IDF is therefore used to measure how informative a term is, as follows:

IDF(t, D) = log(|D| / DF(t, D))

where |D| is the total number of documents; clearly, if t appears in every document, its IDF is 0. The TF-IDF weight is then:

TFIDF(t, d, D) = TF(t, d) × IDF(t, D)
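For example, in a collection of |D| = 3 documents, a term that appears twice in one document and occurs in no other document gets TFIDF = 2 × log(3/1) ≈ 2.20, while a term that appears in all three documents gets weight 0 no matter how frequent it is.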
Example:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each word into a fixed-size term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// Alternatively, CountVectorizer can also be used to get term frequency vectors.

// Rescale the term frequencies by inverse document frequency.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
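As the comment above notes, CountVectorizer can replace HashingTF when exact term counts (and an invertible vocabulary) are wanted. A minimal sketch reusing the tokenized wordsData; the names cvModel and countFeaturizedData are illustrative:

import org.apache.spark.ml.feature.CountVectorizer

// Fit a vocabulary over the tokenized words, then produce exact term-count vectors.
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val countFeaturizedData = cvModel.transform(wordsData)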
Word2vec
Word2Vec learns a fixed-size vector representation for each word in a corpus; a document is then represented by the average of the vectors of its words. It is often used to compute document similarity.
Example:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

// Each document becomes the average of its word vectors.
val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
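The fitted model can also be queried directly; for example, Word2VecModel.findSynonyms returns the words whose vectors are closest to that of a given word. A small sketch, using "Spark" from the training data as the query word:

// Show the 2 words whose vectors are closest to the vector for "Spark".
model.findSynonyms("Spark", 2).show()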
Feature Conversions

N-gram
An n-gram is a contiguous sequence of n tokens from a sentence. N-gram models exploit the collocation information between adjacent words in context: under the assumption that the occurrence of the nth word depends only on the preceding n-1 words and on no other word, the probability of a whole sentence is the product of the conditional probabilities of its words, P(w_1 ... w_m) ≈ ∏ P(w_i | w_{i-n+1} ... w_{i-1}). A classic application is Chinese input methods: given an unsegmented string of pinyin, strokes, or digits, the model finds the sentence with the maximum probability and converts the input to Chinese characters automatically, sparing the user from manually choosing among the many characters that share the same pinyin (or stroke or digit string).
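For example, under a bigram (n = 2) model the sentence "I heard about Spark" is scored as P(I) × P(heard | I) × P(about | heard) × P(Spark | about).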
Example:
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
Normalization

Spark MLlib provides the Normalizer class to normalize each sample vector to unit p-norm. You can specify the value of p; the default is 2.
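For instance, normalizing the vector (1.0, 0.5, -1.0) with p = 1 divides each component by the L^1 norm |1.0| + |0.5| + |-1.0| = 2.5, giving (0.4, 0.2, -0.4); this is what the first row of the example below produces.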
Example:
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each vector using the L^1 norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)
val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each vector using the L^infinity norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Feature Selection
Feature selection picks a subset of features out of the full feature set. Machine-learning problems often involve a great many features, so we need to select the useful ones. Suppose we have a DataFrame with the following column:
userFeatures
------------------
[0.0, 10.0, 0.5]
Suppose its first value is always 0 and carries no information, so we want to remove it. Using the setIndices(1, 2) method of the VectorSlicer class, we keep only the last two features:
userFeatures     | features
-----------------|-------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
Example:
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

// Give the three vector slots the names f1, f2, f3.
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
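Note that VectorSlicer selects the union of the given indices and names (indices first, then names), so the setting above keeps the value at slot 1 together with the slot named f3; the two lists must not refer to the same feature twice.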