fastText, a fast text classifier developed by Facebook, provides a simple and efficient way both to categorize text and to learn word representations, so the project really has two parts. For an introduction to the theory, see the earlier post: NLP︱Advanced word vector representations (II) -- fastText (brief learning notes).
The new fastrtext R package inherits both functions: word vector training + text classification model training.
Source:
https://github.com/pommedeterresautee/fastrtext
Documentation:
https://pommedeterresautee.github.io/fastrtext/index.html
Related posts: Text mining and deep learning -- a word2vec implementation in R; R + NLP: the text2vec package -- a new text analysis ecosystem, No.1 (I. Introduction).
The text2vec package also offers GloVe word vector training.
I. Installation

1. Installation
# from CRAN
install.packages("fastrtext")

# from GitHub
# install.packages("devtools")
devtools::install_github("pommedeterresautee/fastrtext")
2. Introduction to the main functions
The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [5]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [3]
  -maxn               max length of char ngram [6]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.05]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [ns]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
These are the parameters that can be passed to execute(); the most useful ones are summarized below, and a minimal sketch of such a call follows the list.
-dim: word vector dimension, default 100
-wordNgrams: max length of word n-grams; 2 (bigrams) is usually a good choice
-verbose: verbosity of the output, 0-2 (0 prints nothing)
-lr: learning rate [0.1]
-lrUpdateRate: rate of updates for the learning rate [100]
-dim: word vector size [100]
-ws: context window size [5]
-epoch: number of epochs [5]
-neg: number of negatives sampled [5]
-loss: loss function {ns, hs, softmax} [ns]
-thread: number of threads [12]
-pretrainedVectors: pretrained word vectors for supervised learning
-saveOutput: whether output params should be saved [0]
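As a minimal sketch (the file paths here are placeholders for illustration, not from the original post), these flags are passed to execute() as one character vector:

# hypothetical paths, for illustration only
execute(commands = c("supervised",
                     "-input", "my_train.txt",
                     "-output", "my_model",
                     "-dim", 100,
                     "-lr", 0.1,
                     "-epoch", 5,
                     "-wordNgrams", 2,
                     "-verbose", 1))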
II. Official case one -- training a text classification model

2.1 Loading data and training
library(fastrtext)

data("train_sentences")
data("test_sentences")

# prepare data
tmp_file_model <- tempfile()

train_labels <- paste0("__label__", train_sentences[, "class.text"])
train_texts <- tolower(train_sentences[, "text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[, "class.text"])
test_texts <- tolower(test_sentences[, "text"])
test_to_write <- paste(test_labels, test_texts)

# learn model
execute(commands = c("supervised", "-input", train_tmp_file_txt, "-output", tmp_file_model, "-dim", 20, "-lr", 1, "-epoch", 20, "-wordNgrams", 2, "-verbose", 1))
Note the contrast with the more familiar R modelling workflow: the model run happens inside execute(), and the trained model is saved to disk (at tmp_file_model) rather than returned as an R object.
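A quick check that the model file was written (a small sketch, assuming fastText appends ".bin" to the output path as usual; this snippet is not from the original post):

file.exists(paste0(tmp_file_model, ".bin"))  # TRUE if the trained model was saved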
To see what the input data looks like: the data is in character format; the __label__xxx prefix is the label of the text, followed by a space and then the text content itself.
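For example, the first few training lines can be inspected directly (a small illustration, not from the original post; output omitted):

head(train_to_write, 3)  # e.g. "__label__OWNX <lower-cased sentence text>"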
Run result:

## Read 0M words
## Number of words:  5060
## Number of labels: 15
## Progress: 100.0%  words/sec/thread: 1457520  lr: 0.000000  loss: 0.300770  eta: 0h0m
2.2 Running the model on the validation set
# load model
model <- load_model(tmp_file_model)

# predictions are returned as a list with words and probabilities
predictions <- predict(model, sentences = test_to_write)
load_model() takes the location of the model file; test_to_write is the validation text, formatted in the same way as the training set.
Show the result:

print(head(predictions, 5))

## [[1]]
## __label__OWNX
##     0.9980469
##
## [[2]]
## __label__MISC
##     0.9863281
##
## [[3]]
## __label__MISC
##     0.9921875
##
## [[4]]
## __label__OWNX
##     0.9082031
##
## [[5]]
## __label__AIMX
##      0.984375
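Each element of predictions is a named numeric vector: the name is the predicted label and the value is its probability, so the pieces can be pulled out with base R. A minimal sketch (not part of the original post):

# best label and its probability for every test sentence
predicted_labels <- sapply(predictions, names)
predicted_probs  <- unname(sapply(predictions, as.numeric))
head(predicted_labels)
head(predicted_probs)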
2.3 Model validation
Compute the accuracy:

# compute accuracy
mean(sapply(predictions, names) == test_labels)

Compute the Hamming loss:

# because there is only one category per observation, hamming loss will be the same
get_hamming_loss(as.list(test_labels), predictions)
## [1] 0.8316667
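To see where the classifier goes wrong, a simple confusion table of true versus predicted labels can be built with base R (a sketch, not from the original post):

table(truth = test_labels, predicted = sapply(predictions, names))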
2.4 Some small helper functions
To see which labels a supervised model contains, use the get_labels() function. If you trained a model some time ago and no longer remember which labels it has, this is how to find out. (In the snippet below, model_test_path is the path to a saved classification model; for the model trained above you could pass tmp_file_model instead.)
model <- load_model(model_test_path)
print(head(get_labels(model), 5))
#> [1] "__label__MISC" "__label__OWNX" "__label__AIMX" "__label__CONT"
#> [5] "__label__BASE"
Look at the parameters of the model with get_parameters():
model <- load_model(model_test_path)
print(head(get_parameters(model), 5))
#> $learning_rate
#> [1] 0.05
#>
#> $learning_rate_update
#> [1] 100
#>
#> $dim
#> [1] 20
#>
#> $context_window_size
#> [1] 5
#>
#> $epoch
#> [1] 20
III. Official case two -- computing word vectors

3.1 Loading data and training
library(fastrtext)

data("train_sentences")
data("test_sentences")
texts <- tolower(train_sentences[, "text"])
tmp_file_txt <- tempfile()
tmp_file_model <- tempfile()
writeLines(text = texts, con = tmp_file_txt)

execute(commands = c("skipgram", "-input", tmp_file_txt, "-output", tmp_file_model, "-verbose", 1))
The first element of commands is "skipgram", i.e. word vector training, just as in word2vec. The input is plain text with no label information.
3.2 Word vectors
model <- load_model(tmp_file_model)

load_model() loads the word vector file (the .bin file).
# test word extraction
dict <- get_dictionary(model)
print(head(dict, 5))
## [1] "the"  "</s>" "of"   "to"   "and"

dict is the dictionary (vocabulary) of the word vectors.
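A couple of quick checks on the vocabulary (a small sketch, not from the original post):

length(dict)          # vocabulary size
"time" %in% dict      # is a given word in the vocabulary?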
# print vector
print(get_word_vectors(model, c("time", "timing")))

This prints the word vectors themselves: one row per word, one column per dimension.
3.3 Computing word vector distances -- get_word_distance
# test word distance
get_word_distance(model, "time", "timing")
##            [,1]
## [1,] 0.02767485
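As a sanity check, the same distance can be recomputed from the raw vectors, assuming get_word_distance() returns 1 minus the cosine similarity (this interpretation is an assumption, not stated in the original post):

# recompute the distance from the raw word vectors
vecs <- get_word_vectors(model, c("time", "timing"))
cos_sim <- sum(vecs[1, ] * vecs[2, ]) / (sqrt(sum(vecs[1, ]^2)) * sqrt(sum(vecs[2, ]^2)))
1 - cos_sim  # should be close to the get_word_distance() value above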
3.4 Finding the nearest-neighbour words -- get_nn
get_nn() takes only three arguments; the last one is the number of nearest neighbours to return.
library(fastrtext)
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_nn(model, "time", 10)
#>      times       size   indicate     access    success   allowing   feelings
#>  0.6120564  0.4941387  0.4777856  0.4719051  0.4696053  0.4652924
#>   dictator      amino accuracies
#>  0.4595046  0.4582702
3.5 Word analogies -- get_analogies
library(fastrtext)
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_analogies(model, "experience", "experiences", "result")
#> results
The analogy relation is:

get_analogies(model, w1, w2, w3, k = 1)

which computes w1 - w2 + w3, i.e. in this case:

experience - experiences + result
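The k argument controls how many analogy candidates are returned, for example (a small sketch, not from the original post):

# return the top 3 candidate words for the analogy
get_analogies(model, "experience", "experiences", "result", k = 3)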