fastrtext︱Using Facebook's fastText fast text classification algorithm in R


fastText, a fast text classifier developed by Facebook, provides a simple and efficient way to represent and categorize text, and the project covers two tasks. For an introduction to the theory, see the blog post: NLP︱Advanced word vector representations (II): fastText (brief learning notes).

The new fastrtext package inherits both functions: word-vector training and text-classification model training. Source:

https://github.com/pommedeterresautee/fastrtext

Documentation:

https://pommedeterresautee.github.io/fastrtext/index.html

Related blog posts: "Text mining with deep learning: a word2vec implementation in R" and "R + NLP: the text2vec package, a new text-analysis ecosystem No.1 (I, introduction)".
The text2vec package includes a GloVe word-vector implementation.

I. Installation

1. Installation

# from CRAN
install.packages("fastrtext")

# from GitHub
# install.packages("devtools")
devtools::install_github("pommedeterresautee/fastrtext")

2. Introduction to the main parameters

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [5]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [3]
  -maxn               max length of char ngram [6]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.05]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [ns]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

These are the parameters that can be passed to execute(). A few worth highlighting:

-dim: word-vector dimension, default 100
-wordNgrams: max length of word n-grams; 2 (bigrams) is a common choice
-verbose: verbosity of the output, 0-2, with increasing detail (0 prints nothing)

The training parameters and their defaults:

-lr: learning rate [0.1]
-lrUpdateRate: rate of updates for the learning rate [100]
-dim: word-vector size [100]
-ws: context window size [5]
-epoch: number of epochs [5]
-neg: number of negatives sampled [5]
-loss: loss function {ns, hs, softmax} [ns]
-thread: number of threads [12]
-pretrainedVectors: pre-trained word vectors for supervised learning
-saveOutput: whether output parameters should be saved [0]

A minimal call using a few of these flags is sketched below.
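As an illustration, these flags plug straight into the execute() call; here is a minimal supervised-training sketch (the file paths are hypothetical placeholders, not part of the official example):

# Sketch: supervised training using a few of the flags listed above.
# "my_train.txt" and "my_model" are placeholder paths.
execute(commands = c("supervised",
                     "-input", "my_train.txt",   # labelled training file
                     "-output", "my_model",      # output model prefix
                     "-dim", 50,                 # word-vector size
                     "-lr", 0.1,                 # learning rate
                     "-epoch", 10,               # number of epochs
                     "-wordNgrams", 2,           # use bigrams
                     "-verbose", 1))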
II. Official example 1: training a text classification model

2.1 Loading data and training

library(fastrtext)

data("train_sentences")
data("test_sentences")

# Prepare data
tmp_file_model <- tempfile()

train_labels <- paste0("__label__", train_sentences[, "class.text"])
train_texts <- tolower(train_sentences[, "text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[, "class.text"])
test_texts <- tolower(test_sentences[, "text"])
test_to_write <- paste(test_labels, test_texts)

# Learn model
execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 2, "-verbose", 1))

Unlike the machine-learning models one usually fits in R, the model here is trained through execute() and saved directly to disk rather than returned as an R object.

To see what the input data look like:

Each element is a character string: the __label__xxx prefix is the label of the text, followed by a space and then the text content.
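For example, the first training lines can be inspected directly; a quick check (the exact sentences depend on the bundled train_sentences dataset):

# Peek at the training data: each element is "__label__XXXX <lowercased sentence>"
head(train_to_write, 2)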

Run Result:

## Read 0M words
## Number of words:  5060
## Number of labels: 15
## Progress: 100.0%  words/sec/thread: 1457520  lr: 0.000000  loss: 0.300770  eta: 0h0m

2.2 Running the model on the validation set

# Load model
model <- load_model(tmp_file_model)
# Predictions are returned as a list with words and probabilities
predictions <- predict(model, sentences = test_to_write)

load_model() takes the location of the model file; test_to_write is the validation text, built exactly the same way as the training set:

Show:

print(head(predictions, 5))
## [[1]]
## __label__OWNX 
##     0.9980469 
##
## [[2]]
## __label__MISC 
##     0.9863281 
##
## [[3]]
## __label__MISC 
##     0.9921875 
##
## [[4]]
## __label__OWNX 
##     0.9082031 
##
## [[5]]
## __label__AIMX 
##      0.984375

2.3 Model validation

Compute the accuracy rate:

# Compute accuracy
mean(sapply(predictions, names) == test_labels)
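Beyond overall accuracy, it can help to see which classes get confused with which. A small sketch, assuming (as above) that each prediction carries a single label:

# Sketch: cross-tabulate predicted vs. true labels (single-label case).
predicted_labels <- sapply(predictions, names)
table(predicted = predicted_labels, truth = test_labels)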

Compute the Hamming loss:

# because there is only one category per observation, hamming loss will be the same
get_hamming_loss(as.list(test_labels), predictions)
## [1] 0.8316667

2.4 Some small utility functions

To see which labels a supervised model contains, use the get_labels() function.
If you trained a model some time ago and no longer remember its labels, this is how to recover them.

model <- load_model(model_test_path)
print(head(get_labels(model), 5))
#> [1] "__label__MISC" "__label__OWNX" "__label__AIMX" "__label__CONT"
#> [5] "__label__BASE"

Look at the parameters of the model with get_parameters():

model <- load_model(model_test_path)
print(head(get_parameters(model), 5))
#> $learning_rate
#> [1] 0.05
#> 
#> $learning_rate_update
#> [1] 100
#> 
#> $dim
#> [1] 20
#> 
#> $context_window_size
#> [1] 5
#> 
#> $epoch
#> [1] 20

III. Official example 2: computing word vectors

3.1 Loading data and training

library(fastrtext)

data("train_sentences")
data("test_sentences")
texts <- tolower(train_sentences[, "text"])
tmp_file_txt <- tempfile()
tmp_file_model <- tempfile()
writeLines(text = texts, con = tmp_file_txt)
execute(commands = c("skipgram", "-input", tmp_file_txt, "-output", tmp_file_model, "-verbose", 1))

The first element of commands is "skipgram", i.e. word-vector training, consistent with word2vec.
The input is plain text with no label information (a cbow variant is sketched below).
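fastText also provides a cbow mode that can be trained the same way; a minimal sketch reusing the files prepared above (tmp_file_model_cbow is a new placeholder path introduced here):

# Sketch: train CBOW word vectors instead of skipgram, using the same input text.
tmp_file_model_cbow <- tempfile()
execute(commands = c("cbow", "-input", tmp_file_txt, "-output", tmp_file_model_cbow, "-verbose", 1))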

3.2 Word vectors

model <- load_model(tmp_file_model)

load_model() loads the trained word-vector model (the .bin file).

# test word extraction
dict <- get_dictionary(model)
print(head(dict, 5))
## [1] "the"  "</s>" "of"   "to"   "and"

dict is the dictionary (vocabulary) of the word vectors.
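The dictionary also allows quick sanity checks, e.g. vocabulary size and membership of a particular word (a small sketch):

# Sketch: vocabulary size and membership check.
length(dict)
"time" %in% dict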

# print vector
print(get_word_vectors(model, c("time", "timing")))

This displays the word vectors themselves, one row per word and one column per dimension.
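From these raw vectors a cosine similarity can also be computed by hand; a rough sketch for illustration (the built-in comparison, get_word_distance, is shown in the next section and may use a different definition):

# Sketch: cosine similarity between the two word vectors extracted above.
vecs <- get_word_vectors(model, c("time", "timing"))
cosine_sim <- sum(vecs[1, ] * vecs[2, ]) /
  (sqrt(sum(vecs[1, ]^2)) * sqrt(sum(vecs[2, ]^2)))
print(cosine_sim)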

3.3 Calculating word-vector distance: get_word_distance

# test word distance
get_word_distance(model, "time", "timing")
##            [,1]
## [1,] 0.02767485

3.4 Finding nearest-neighbour words: get_nn

get_nn() takes only three arguments; the last number is how many nearest neighbours to return.

library(fastrtext)
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_nn(model, "time", 10)
#>      times       size   indicate     access    success   allowing   feelings
#>  0.6120564  0.4941387  0.4777856  0.4719051  0.4696053  0.4652924
#>   dictator      amino accuracies
#>  0.4595046  0.4582702

3.5 Word analogies: get_analogies

library(fastrtext)
model_test_path <- system.file("extdata", "model_unsupervised_test.bin", package = "fastrtext")
model <- load_model(model_test_path)
get_analogies(model, "experience", "experiences", "result")
#>  results

The analogy relation is:
get_analogies(model, w1, w2, w3, k = 1)
which computes w1 - w2 + w3, that is, here:
experience - experiences + result
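The classic word2vec-style analogy can be tried the same way; a sketch, assuming the words exist in the small bundled test model's vocabulary (they may not, in which case the result will be empty or meaningless):

# Sketch: "king - man + woman ≈ queen" style analogy on the toy test model.
get_analogies(model, "king", "man", "woman")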
