fastText Text Classification: Usage Experience

Source: Internet
Author: User

I recently used fastText [1] in a project. It is a word vector and text classification tool that Facebook open-sourced this year. It brings no academic innovation, but its advantages are a simple model and very fast training. In my project it turned out to be really handy to use, and the results were good enough to be used online.

In fact, the model used by fastText has the same structure as the word2vec model. Take CBOW as an example: the goal of word2vec CBOW is to predict the current word from the n words surrounding it, and when hierarchical softmax is used, the leaf nodes of the Huffman tree correspond to all the words in the training corpus.

When fastText does text classification, by contrast, the leaf nodes of the Huffman tree correspond to the category labels. During training, every word in the training corpus also gets its own word vector. The input is the word vectors of the words in the window, the hidden layer is the linear sum of these vectors, and the result of the sum serves as the vector of the document. Hierarchical softmax then produces the predicted label, the loss is computed against the true label of the document, and the gradient is used to iteratively update the word vectors.
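The training loop described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: it uses a plain softmax over all labels instead of hierarchical softmax, it averages the word vectors rather than summing them (which only rescales the hidden vector), and all names and sizes (`forward`, `sgd_step`, 6 words, 3 labels) are made up for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(token_ids, embeddings, output_weights):
    """Average the input word vectors, then score every label."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for t in token_ids:
        for k in range(dim):
            hidden[k] += embeddings[t][k]
    hidden = [h / len(token_ids) for h in hidden]  # document vector
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return hidden, softmax(scores)

def sgd_step(token_ids, label, embeddings, output_weights, lr=0.1):
    """One gradient step on the cross-entropy loss; returns the loss before the update."""
    hidden, probs = forward(token_ids, embeddings, output_weights)
    dim = len(hidden)
    # d(loss)/d(score_j) = p_j - 1[j == label]
    errs = [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. the hidden vector, computed before the output weights change.
    grad_hidden = [
        sum(errs[j] * output_weights[j][k] for j in range(len(errs)))
        for k in range(dim)
    ]
    for j, err in enumerate(errs):
        for k in range(dim):
            output_weights[j][k] -= lr * err * hidden[k]
    # Each input word vector receives an equal share of the hidden-layer gradient.
    for t in token_ids:
        for k in range(dim):
            embeddings[t][k] -= lr * grad_hidden[k] / len(token_ids)
    return -math.log(probs[label])

def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
embeddings = init_matrix(6, 4, rng)      # 6 words, 4 dimensions (toy sizes)
output_weights = init_matrix(3, 4, rng)  # 3 labels
doc, label = [0, 2, 5], 1                # one document as word ids, plus its true label
losses = [sgd_step(doc, label, embeddings, output_weights) for _ in range(50)]
```

Repeating the step on the same toy document makes the loss shrink, which is all this sketch is meant to show; the real tool adds the Huffman tree, multithreading, and a learning-rate schedule on top.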

Another point where fastText differs from word2vec is the character n-gram trick: a long word is cut into several short pieces by n-grams, so the vector of an out-of-vocabulary word can still be assembled from the vectors of its n-grams. Because Chinese words are mostly short, this is more useful for English corpora than for Chinese corpora.
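To make the n-gram cutting concrete, here is a minimal sketch. It wraps the word in the boundary markers `<` and `>` before slicing, as fastText does; the function name `char_ngrams` and the 3 to 6 length range are illustrative choices for this example, not settings taken from the project described here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word`, including boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams only, to keep the output short.
print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word such as "wherever" shares pieces like `<wh` and `her` with "where", which is why a vector can still be built for it from pieces seen in training.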

In addition, compared with deep learning models, fastText has the advantage of very fast training. We currently use fastText to classify customer-entered order addresses down to the town level. We build one model per province; each model has more than 1,000 classes and about 2 million training examples, and with 12 threads training finishes in under 1 minute. The final classification accuracy and model robustness are both high (county-level accuracy is above 99.5%, town-level accuracy is above 98%). In particular, abbreviated place names, or addresses that omit the city-level or district-level administrative area, are also handled correctly.

Parameters

With HS (hierarchical softmax) as the loss function, training is much faster than with NS (negative sampling), and the accuracy is also higher.
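One reason hierarchical softmax stays cheap with more than 1,000 classes is that each label is reached through a short path of binary decisions in a Huffman tree, and frequent labels get the shortest paths. The sketch below only computes those path lengths for a hypothetical label-frequency table; it is not how fastText stores its tree, and the label names and counts are invented.

```python
import heapq
import itertools

def huffman_depths(freqs):
    """Return the Huffman-tree depth (number of binary decisions) of each label."""
    counter = itertools.count()  # tie-breaker so the heap never compares lists
    heap = [(f, next(counter), [label]) for label, f in freqs.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in freqs}
    while len(heap) > 1:
        f1, _, labels1 = heapq.heappop(heap)
        f2, _, labels2 = heapq.heappop(heap)
        # Merging two subtrees pushes every label inside them one level deeper.
        for label in labels1 + labels2:
            depths[label] += 1
        heapq.heappush(heap, (f1 + f2, next(counter), labels1 + labels2))
    return depths

# Hypothetical label counts: one very common town and three rarer ones.
print(huffman_depths({"a": 5, "b": 2, "c": 1, "d": 1}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

With a balanced tree over roughly 1,000 labels, a prediction needs on the order of 10 binary decisions instead of one score per label.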

wordNgrams defaults to 1; setting it to 2 or more can significantly improve accuracy.

If the vocabulary is not large, you can set bucket to a smaller value; otherwise too many buckets are reserved and the model becomes too large.
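Why bucket affects model size can be sketched as follows: word n-grams are not stored in a dictionary but hashed into a fixed number of rows that are appended to the embedding matrix, so those rows are allocated whether or not they are used. The code below is an assumption-laden illustration: it uses `zlib.crc32` as a stand-in for fastText's own hash function, and the function names and sizes are hypothetical.

```python
import zlib

def ngram_bucket_ids(tokens, n, vocab_size, bucket):
    """Map each word n-gram to a row index in the range reserved for buckets."""
    ids = []
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        h = zlib.crc32(gram.encode("utf-8"))  # stand-in hash, not fastText's own
        ids.append(vocab_size + h % bucket)   # bucket rows come after the word rows
    return ids

def embedding_megabytes(vocab_size, bucket, dim, bytes_per_float=4):
    """Approximate size of the input matrix: one row per word plus one per bucket."""
    return (vocab_size + bucket) * dim * bytes_per_float / 1e6

# 10,000 words, 100 dimensions: compare 2,000,000 buckets with 200,000.
print(embedding_megabytes(10_000, 2_000_000, 100))  # 804.0
print(embedding_megabytes(10_000, 200_000, 100))    # 84.0
```

A smaller bucket value means more n-grams collide in the same row, so the setting trades model size against accuracy.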

Because Facebook provided only the C++ version of the code, I originally planned to wrap a Python interface myself, but it turned out that GitHub already has a packaged Python interface [2]. It is particularly convenient to use, and if it does not meet your requirements, modifying the source code is also very easy.
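As a rough sketch of what training from Python can look like: the example below assumes the current official `fasttext` Python module and its `train_supervised` function, not the wrapper in [2], whose interface differs. The label names, tokens, file path, and the concrete bucket value are hypothetical; only `loss`, `wordNgrams`, `bucket`, and `thread` correspond to the settings discussed above. Training lines use fastText's default `__label__` prefix.

```python
def to_fasttext_line(label, tokens, prefix="__label__"):
    """Format one example as '<prefix><label> tok1 tok2 ...'."""
    return prefix + label + " " + " ".join(tokens)

def write_training_file(path, examples):
    """Write (label, tokens) pairs, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, tokens in examples:
            f.write(to_fasttext_line(label, tokens) + "\n")

def train(train_path):
    """Train a classifier; assumes the official module is installed (pip install fasttext)."""
    import fasttext  # imported lazily so the helpers above work without it
    return fasttext.train_supervised(
        input=train_path,
        loss="hs",       # hierarchical softmax
        wordNgrams=2,    # use word bigrams in addition to single words
        bucket=200000,   # illustrative value, smaller than the default
        thread=12,
    )

# Hypothetical example: an already tokenized address and its town label.
print(to_fasttext_line("town_001", ["north", "street", "12"]))
# __label__town_001 north street 12
```

Calling `train("train.txt")` on a file written by `write_training_file` returns a model object whose `predict` method takes a line of text.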

For the same text classification problem, I also tried a unidirectional LSTM, fed with pre-trained word embeddings that were fine-tuned during training. Compared with fastText, its training was still much slower even on a GTX 980 GPU, and its accuracy was about the same as fastText's.

So for text classification, fastText is very well suited to building a simple baseline first.

[1] https://github.com/facebookresearch/fasttext
[2] https://github.com/salestock/fasttext.py
