Spark MLlib feature extraction, feature transformation and feature selection

Feature Extraction

TF-IDF

TF-IDF is widely used in text mining to reflect how important a term is to a document in a corpus. Let t denote a term, d a document, and D the corpus. Term frequency TF(t, d) is the number of times term t appears in document d, and document frequency DF(t, D) is the number of documents in the corpus that contain t. Measuring importance with TF alone is misleading, because a term that occurs in many documents may carry very little information about any particular one. Inverse document frequency (IDF) is therefore used to measure how much information a term provides:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

where |D| is the total number of documents in the corpus. Clearly, if t appears in every document, its IDF value is 0. The TF-IDF weight is then the product of the two:

TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
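As a quick illustration with made-up numbers: in a corpus of |D| = 3 documents where a term appears in DF(t, D) = 1 document, IDF(t, D) = log((3 + 1) / (1 + 1)) = log 2 ≈ 0.69; if that term occurs twice in a given document, its TF-IDF weight there is 2 × 0.69 ≈ 1.39.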

Example:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Map the words of each sentence to a term-frequency vector of length 20.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// Alternatively, CountVectorizer can also be used to get term frequency vectors.

// Fit the IDF model and rescale the raw term frequencies.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
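The comment above notes that CountVectorizer can be used instead of HashingTF to obtain term-frequency vectors. A minimal sketch of that substitution is shown below; the vocabSize and minDF values are illustrative assumptions, not part of the original example.

import org.apache.spark.ml.feature.CountVectorizer

// CountVectorizer builds an explicit vocabulary from the corpus instead of hashing,
// so each index in the resulting vectors maps back to a concrete term.
val cvModel = new CountVectorizer()
  .setInputCol("words").setOutputCol("rawFeatures")
  .setVocabSize(20) // illustrative vocabulary size
  .setMinDF(1)      // keep terms that appear in at least one document
  .fit(wordsData)

val featurizedData = cvModel.transform(wordsData)

The IDF fitting and rescaling steps that follow remain unchanged.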
Word2Vec
Word2Vec is an Estimator that learns a fixed-size vector representation for each word in a corpus. A document is then represented by averaging the vectors of the words it contains, and these document vectors are often used to compute document similarity.
Example:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
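Besides transforming whole documents, the fitted Word2VecModel also exposes the learned word vectors directly. A minimal sketch continuing the example above (with such a tiny corpus the vectors and neighbors are not meaningful; this only shows the calls):

// DataFrame of the learned vocabulary: one row per word with its vector.
model.getVectors.show(false)

// The two words closest to "Spark" in the learned vector space, with cosine similarity.
model.findSynonyms("Spark", 2).show(false)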
Feature Transformation

N-gram
An n-gram is a contiguous sequence of n tokens (words or characters). N-gram models exploit collocation information between adjacent words in context: a classic application is converting an unsegmented input string of pinyin, strokes, or digits into the most probable Chinese sentence automatically, so the user does not have to choose among the many characters that share the same pinyin (or stroke or digit sequence). The model assumes that the occurrence of the nth word depends only on the preceding n-1 words and on no other word, so the probability of a whole sentence is the product of the conditional probabilities of its words.
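To make the independence assumption concrete (this illustration is not from the original text), under a bigram model (n = 2) the probability of a sentence w_1 w_2 ... w_m is approximated as

P(w_1 w_2 ... w_m) ≈ P(w_1) · P(w_2 | w_1) · P(w_3 | w_2) · ... · P(w_m | w_(m-1))

where each conditional probability is estimated from bigram counts in a training corpus.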
Example:
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

// Produce bigrams (n = 2) from each word sequence.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
Normalization
Spark MLlib provides the Normalizer class to normalize each feature vector to unit norm. You can specify the norm degree p with setP; the default value of p is 2.
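As a quick worked example using the first vector from the code below, [1.0, 0.5, -1.0]: its L^1 norm is |1.0| + |0.5| + |-1.0| = 2.5, so the L^1-normalized vector is [0.4, 0.2, -0.4]; with the default p = 2, the L^2 norm is sqrt(1.0^2 + 0.5^2 + 1.0^2) = 1.5, giving [0.667, 0.333, -0.667].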
Example:
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each Vector using the $L^1$ norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each Vector using the $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Feature Selection

Feature selection picks a subset of features out of a feature vector. Machine-learning datasets often contain many features, so we need to select the useful ones. Suppose we have a DataFrame with a vector column userFeatures:

userFeatures
------------------
 [0.0, 10.0, 0.5]

Its first column is all zeros, so we want to remove it and keep only the last two columns. Calling setIndices(1, 2) on a VectorSlicer selects those columns into a new features column:

userFeatures      | features
------------------|-----------------------------
 [0.0, 10.0, 0.5] | [10.0, 0.5]

Example:

import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
