1. Feature Extractors
1.1 TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in a corpus. Notation: t denotes a term, d denotes a document, and D denotes the corpus (the set of documents). Term frequency TF(t, d) is the number of times term t appears in document d, while document frequency DF(t, D) is the number of documents in corpus D that contain term t. If we used only TF to measure importance, it would be easy to overemphasize terms that appear very often but carry little information about the document, such as "a", "the", and "of". If a term appears very frequently across the whole corpus, it carries little special information about any particular document; in other words, it matters little for distinguishing documents. Inverse document frequency is a numerical measure of how much information a term provides: the IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:
IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}
where |D| is the total number of documents in the corpus. Because the logarithm is used, if a term appears in every document its IDF value becomes 0. Note that 1 is added to the denominator to avoid dividing by zero for terms that do not appear in the corpus. TF-IDF is then defined as the product of TF and IDF:
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
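As a quick worked example (the numbers are purely illustrative): suppose the corpus has |D| = 4 documents, term t appears in exactly one of them, so DF(t, D) = 1, and it occurs 3 times in document d, so TF(t, d) = 3. Then

IDF(t, D) = \log \frac{4 + 1}{1 + 1} = \log 2.5 \approx 0.92, \qquad TFIDF(t, d, D) = 3 \cdot 0.92 \approx 2.75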
There are several variants of how term frequency TF and document frequency DF can be defined. In MLlib, TF and IDF are kept separate so that each can be used independently as needed.
TF (term frequency): both HashingTF and CountVectorizer can be used to generate term frequency (TF) vectors.
HashingTF is a Transformer that converts a set of terms into a fixed-length feature vector of term frequencies. In text processing, a "set of terms" is typically a bag of words. HashingTF uses the hashing trick: each raw feature is mapped to an index in a low-dimensional vector by a hash function (the hash function used is MurmurHash 3), and the term frequencies (TF) are computed over these mapped indices. This approach avoids building a global term-to-index map, which is expensive for a large corpus, but it is subject to hash collisions: different raw features may be mapped to the same index after hashing (f(x1) = f(x2)). To reduce the probability of collisions, we can increase the target feature dimension, i.e. the number of buckets in the hash table. Since a simple modulo of the hash value is used to determine the vector index, it is advisable to use a power of two as the feature dimension; otherwise the features will not be mapped evenly onto the indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls the term frequency counts: when set to true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that expect binary rather than integer counts.
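A minimal sketch of configuring HashingTF on its own, assuming a DataFrame with a "words" array column as produced by a Tokenizer; the column names, the dimension 2^10, and the binary toggle are illustrative choices:

import org.apache.spark.ml.feature.HashingTF

// choose a power of two as the feature dimension so terms map evenly onto the buckets
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 10)  // 2^10 = 1024 buckets
  .setBinary(true)          // optional: emit 0/1 indicators instead of raw counts

// val featurized = hashingTF.transform(wordsData)  // wordsData: DataFrame with a "words" column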
CountVectorizer converts a text document into a vector of term counts. See the CountVectorizer section below for more details.
IDF (inverse document frequency): IDF is an Estimator that is fit on a dataset and produces an IDFModel (different term frequencies correspond to different weights). The IDFModel takes feature vectors (generally produced by HashingTF or CountVectorizer) and rescales each column. Intuitively, the more documents a feature (term) appears in, the lower the weight it receives (it is down-weighted).
Note: spark.ml does not provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.
Example
In the following code snippet, we start from a set of sentences. We use Tokenizer to split each sentence into a sequence of words. For each sentence (bag of words), we use HashingTF to hash it into a feature vector. Finally, we use IDF to rescale the feature vectors; this generally improves performance when using text as features. The extracted feature vectors can then be passed to a learning algorithm as input.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").take(3).foreach(println)
The complete sample code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala" in the Spark repo.

1.2 Word2Vec
Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel then transforms each document into a vector by averaging the vectors of all the words in the document; this vector can be used as a feature for prediction, document similarity calculations, and so on. Please refer to the MLlib Word2Vec user guide for more details.
In the following code snippet, we start with a set of documents, each represented as a sequence of words. Each document is transformed by Word2Vec into a feature vector, which can then be passed to a machine learning algorithm as input.
import org.apache.spark.ml.feature.Word2Vec

// Input data: each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.select("result").take(3).foreach(println)
Running this in the spark-shell produces output like the following:
[[ -0.006959987431764603,-0.002663574367761612,0.030144984275102617]]
[[0.03422858566045761,0.026469426163073094,-0.02045729543481554]]
[[0.04996728524565697,0.0027822263538837435,0.04833737155422568]]
documentDF: org.apache.spark.sql.DataFrame = [text: array<string>]
word2Vec: org.apache.spark.ml.feature.Word2Vec = w2v_492d428f3aef
model: org.apache.spark.ml.feature.Word2VecModel = w2v_492d428f3aef
result: org.apache.spark.sql.DataFrame = [text: array<string>, result: vector]
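Besides transforming documents, the fitted Word2VecModel can also be inspected directly. A minimal sketch (getVectors and findSynonyms are part of the spark.ml Word2VecModel API; the query word and the neighbour count are illustrative):

// one row per vocabulary word with its learned vector
model.getVectors.show(false)

// the 2 words whose vectors are closest to that of "Spark"
model.findSynonyms("Spark", 2).show()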
The complete sample code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala" in the Spark repo.

1.3 CountVectorizer
CountVectorizer and CountVectorizerModel aim to convert a collection of text documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations of the documents over the vocabulary (sparse feature vectors), which can then be passed to other algorithms such as LDA.
During the fitting process, CountVectorizer selects the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number of documents a term must appear in (or the fraction of documents, if the value is < 1.0) to be included in the vocabulary. Another optional binary toggle parameter controls the output vector: if set to true, all non-zero counts are set to 1. This is especially useful for discrete probabilistic models that use binary rather than integer counts.
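For example, a binary bag-of-words encoding can be configured as in the following sketch (column names are illustrative; setBinary is the toggle described above):

import org.apache.spark.ml.feature.CountVectorizer

// keep only terms that occur in at least 2 documents, and emit 0/1 instead of counts
val binaryCV = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(2)
  .setBinary(true)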
Examples
Suppose we have the following DataFrame with columns id and texts:
 id | texts
----|---------------------------------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")
Each row in texts is an array of strings (a document). Fitting CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c); after transformation the output column is as follows:
 id | texts                           | vector
----|---------------------------------|---------------------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])
Each vector counts the occurrences of each vocabulary word in the corresponding document.
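The printed form is Spark's sparse vector notation (size, [indices], [values]). For instance, the vector in the second row above could be built by hand as in this sketch:

import org.apache.spark.ml.linalg.Vectors

// a sparse vector of size 3: index 0 -> 2.0, index 1 -> 2.0, index 2 -> 1.0
val v = Vectors.sparse(3, Array(0, 1, 2), Array(2.0, 2.0, 1.0))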
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define a CountVectorizerModel with an a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).select("features").show()
Please refer to the CountVectorizer Scala documentation and the CountVectorizerModel Scala documentation for more information about the relevant APIs.
The complete sample code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala" in the Spark repo.

2. Feature Transformers
2.1 Tokenizer
Tokenization is the process of splitting text (such as a sentence) into individual terms (usually words). In spark.ml, the Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (default regex: "\\s+") is used as the delimiter to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes the "tokens" themselves rather than the gaps between them, in which case all matching occurrences are returned as the tokenization result.
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("words", "label").take(3).foreach(println)
val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("words", "label").take(3).foreach(println)
[0,Hi I heard about Spark]
[1,I wish Java could use case classes]
[2,Logistic,regression,models,are,neat]
[WrappedArray(Hi, I, heard, about, Spark),0]
[WrappedArray(I, wish, Java, could, use, case, classes),1]
[WrappedArray(Logistic,regression,models,are,neat),2]
The complete sample code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/TokenizerExample.scala" in the Spark repo.

2.2 StopWordsRemover
Stop words are words that occur frequently (in a document) but carry little meaning; they typically should not be fed into downstream algorithms.
StopWordsRemover takes a sequence of strings as input (such as the output of a Tokenizer) and drops all the stop words from it. The list of stop words is specified by the stopWords parameter. Default stop word lists for several languages can be obtained by calling StopWordsRemover.loadDefaultStopWords(language), where the available options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish". A Boolean parameter caseSensitive indicates whether matching is case sensitive (false by default).
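A minimal sketch of using a non-default stop word list (the language choice and column names are illustrative):

import org.apache.spark.ml.feature.StopWordsRemover

// load the built-in French list and remove stop words with case-sensitive matching
val frenchStopWords = StopWordsRemover.loadDefaultStopWords("french")
val frenchRemover = new StopWordsRemover()
  .setStopWords(frenchStopWords)
  .setCaseSensitive(true)
  .setInputCol("raw")
  .setOutputCol("filtered")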
Examples
Suppose we have the following DataFrame with columns id and raw:
 id | raw
----|------------------------------
 0  | [I, saw, the, red, baloon]
 1  | [Mary, had, a, little, lamb]
Applying StopWordsRemover with raw as the input column and filtered as the output column, we obtain the following:
 id | raw                          | filtered
----|------------------------------|---------------------
 0  | [I, saw, the, red, baloon]   | [saw, red, baloon]
 1  | [Mary, had, a, little, lamb] | [Mary, little, lamb]
In filtered, the stop words "I", "the", "had", and "a" have been removed.
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "baloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

remover.transform(dataSet).show()
The complete sample code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala" in the Spark repo.

2.3 N-gram
An n-gram is a sequence of n consecutive tokens (typically words), for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (for example, the output of a Tokenizer). The parameter n determines the number of terms in each n-gram. The output is a sequence of n-grams, where each n-gram is a string of n consecutive words delimited by spaces. If the input sequence contains fewer than n strings, no output is produced.
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models",
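As a minimal sketch of how such word arrays are then turned into n-grams (assuming the spark.ml NGram API; the value n = 2 and the column names are illustrative):

// build bigrams from a "words" column of a DataFrame like wordDataFrame above
val ngram = new NGram()
  .setN(2)
  .setInputCol("words")
  .setOutputCol("ngrams")

// val ngramDataFrame = ngram.transform(wordDataFrame)
// ngramDataFrame.select("ngrams").show(false)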