Fasttext Text Classification Usage Experience

Last Update:2018-07-26 Source: Internet

Author: User

I recently used fastText [1] in a project. It is a word-vector and text-classification tool that Facebook open-sourced this year. It offers no academic novelty, but its model is simple and training is very fast. It proved handy to use, and the results were good enough to deploy online.

Structurally, the model used by fastText is the same as the word2vec model. Take CBOW as an example: word2vec's CBOW predicts the current word from the n words around it, and when hierarchical softmax is used, the leaf nodes of the Huffman tree correspond to all the words in the training corpus.

In fastText text classification, by contrast, the leaf nodes of the Huffman tree correspond to the category labels. Every word in the training corpus still gets its own word vector during training. The input is the word vectors of the words in the window. The hidden layer combines these vectors linearly (fastText averages them), and the result serves as the document vector. Hierarchical softmax then produces the predicted label, the loss is computed against the document's true label, and the gradients are used to iteratively update the word vectors.
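A minimal sketch of the forward pass described above, assuming toy 2-dimensional vectors that are made up for illustration. For brevity it uses a plain softmax over the labels rather than a hierarchical softmax, and it averages the word vectors. All names here (`predict`, `word_vectors`, `label_vectors`) are hypothetical and are not part of fastText.

```python
import math


def average_vectors(vectors):
    """Hidden layer: element-wise mean of the input word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens, word_vectors, label_vectors):
    """Return (best_label, {label: probability}) for one tokenized document."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known tokens in document")
    doc = average_vectors(known)  # document vector
    labels = sorted(label_vectors)
    # One score per label: dot product of the document vector and label vector.
    scores = [sum(d * w for d, w in zip(doc, label_vectors[lab])) for lab in labels]
    probs = softmax(scores)
    best = max(zip(probs, labels))[1]
    return best, dict(zip(labels, probs))


# Toy vectors, invented for illustration only (assumption, not trained values).
word_vectors = {"west": [1.0, 0.0], "lake": [0.8, 0.2], "pudong": [0.0, 1.0]}
label_vectors = {"hangzhou": [2.0, 0.0], "shanghai": [0.0, 2.0]}

label, probs = predict(["west", "lake"], word_vectors, label_vectors)
print(label, probs)
```

During training, the loss would be the negative log of the probability assigned to the true label, and its gradient would update both the label vectors and the word vectors.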

Another difference from word2vec is the n-gram trick: a long word is cut into several shorter character n-grams, so the vector of an out-of-vocabulary word can still be assembled from the vectors of its n-grams. Because Chinese words are mostly short, this helps more on English corpora than on Chinese ones.
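As a rough illustration of how a word is cut into character n-grams, here is a simplified sketch. The function name `char_ngrams` and its parameters are hypothetical, and the `<` and `>` boundary markers follow the convention described in the fastText paper; the real implementation differs in details.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word that shares n-grams with known words can then be represented by combining the vectors of those shared n-grams.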

In addition, fastText trains far faster than deep learning models. We currently use fastText to classify customer-entered order addresses down to the town level. We build one model per province, each with more than 1,000 classes and about 2 million training examples. With 12 threads, training finishes in under 1 minute. The resulting accuracy and robustness are high: county-level accuracy is above 99.5% and town-level accuracy is above 98%. In particular, addresses that abbreviate place names, or that omit the city-level or district-level division, are still handled correctly.

Parameter notes

For the loss function, hs (hierarchical softmax) trains much faster than ns (negative sampling), and its accuracy is also higher.

wordNgrams defaults to 1; setting it to 2 or higher can significantly improve accuracy.

If the vocabulary is small, set bucket to a smaller value; otherwise too many buckets are reserved and the model becomes unnecessarily large.
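To see why bucket affects model size, the sketch below generates word bigrams, hashes them into a fixed number of buckets, and estimates the size of the input embedding matrix. It assumes one row per vocabulary word plus one row per bucket, stored as 4-byte floats. The `zlib.crc32` hash is a stand-in for fastText's own hash function, and the 50,000-word vocabulary and the tokens are hypothetical. The values 2,000,000 for bucket and 100 for dim are, to my knowledge, the command-line defaults.

```python
import zlib


def word_ngrams(tokens, n):
    """Return contiguous word n-grams (joined with a space) for one document."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bucket_index(ngram, bucket):
    """Map an n-gram to one of `bucket` slots (crc32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket


def input_matrix_bytes(vocab_size, bucket, dim, bytes_per_float=4):
    """Approximate input matrix size: one row per word plus one per bucket."""
    return (vocab_size + bucket) * dim * bytes_per_float


tokens = ["zhejiang", "hangzhou", "xihu", "district"]
bigrams = word_ngrams(tokens, 2)
print(bigrams)
print([bucket_index(g, 200_000) for g in bigrams])

# Hypothetical vocabulary of 50,000 words with 100-dimensional vectors.
print(input_matrix_bytes(50_000, 2_000_000, 100) / 1e6, "MB")  # 820.0 MB
print(input_matrix_bytes(50_000, 200_000, 100) / 1e6, "MB")    # 100.0 MB
```

Under these assumptions, shrinking bucket from 2,000,000 to 200,000 cuts the input matrix from roughly 820 MB to 100 MB, at the cost of more hash collisions between n-grams.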

Because Facebook provided only C++ code, I originally planned to wrap a Python interface myself, but it turned out that a packaged Python interface already exists on GitHub [2]. It is very convenient to use, and if it does not meet your requirements, modifying the source code is also easy.
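A minimal sketch of preparing training data and training a classifier from Python follows. Note the assumptions: it uses `train_supervised` from the official `fasttext` Python package, which may differ from the wrapper in [2]. The helper names, the label `xihu_district`, and the address tokens are hypothetical. The values `loss="hs"`, `wordNgrams=2`, and `thread=12` come from the settings discussed in this article, while `bucket=200_000` is only an illustrative value.

```python
def format_example(label, tokens):
    """Build one training line in fastText's `__label__<name> text` format."""
    return "__label__" + label + " " + " ".join(tokens)


def write_training_file(path, examples):
    """Write (label, tokens) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, tokens in examples:
            handle.write(format_example(label, tokens) + "\n")


def train(path):
    """Train a classifier; needs `pip install fasttext` (imported lazily)."""
    import fasttext

    return fasttext.train_supervised(
        input=path,
        loss="hs",       # hierarchical softmax
        wordNgrams=2,    # add word bigram features
        bucket=200_000,  # illustrative value, tune to the vocabulary size
        thread=12,
    )


print(format_example("xihu_district", ["zhejiang", "hangzhou", "wensan", "road"]))
# __label__xihu_district zhejiang hangzhou wensan road
```

After calling `write_training_file("train.txt", examples)`, `train("train.txt")` returns a model whose `predict` method takes a space-separated string of tokens. For Chinese addresses the text must be segmented into tokens first, since fastText splits on whitespace.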

I also solved the same text classification problem with a unidirectional LSTM, feeding in pre-trained word embeddings and fine-tuning them during training. Compared with fastText, training was still much slower even on a GTX 980 GPU, and the accuracy was about the same as fastText's.

So for text classification, fastText is well suited as a simple first baseline.

[1] https://github.com/facebookresearch/fasttext
[2] https://github.com/salestock/fasttext.py

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
