How the automatic text classification demo on the SCWS site might be implemented.

Source: Internet
Author: User
This is the URL. I tried some text input and the matching accuracy is quite high. What is the principle behind its implementation? Does it match the text against an existing database? I have been searching the Internet for a long time and have not found any information about this. Where can I download reference materials?


Solution:

SCWS is a Chinese word segmentation tool (it ships as PHP code with a dictionary). The automatic classification demo on its site is not open source, so I don't know exactly how it works.

The possible ideas are:

SCWS performs word segmentation on the text, and the analysis happens after segmentation. (Some noise removal may be needed first, for example stripping stopwords such as "of".)

The simplest and crudest method is to maintain a dictionary of categories: under each category there is a list of words, and whenever one of those words appears in the text, that category gets one point. Compute the score for each category and pick the category with the highest score.
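The dictionary-scoring idea above can be sketched in a few lines. The category names and word lists here are invented purely for illustration:

```python
# Hypothetical category dictionaries -- one point per matching word.
CATEGORY_WORDS = {
    "programming": {"code", "compiler", "python", "algorithm"},
    "sports": {"goal", "match", "team", "score"},
}

def classify_by_dictionary(words):
    """Score each category by counting how many of its words appear,
    then return the highest-scoring category and all the scores."""
    scores = {cat: sum(1 for w in words if w in vocab)
              for cat, vocab in CATEGORY_WORDS.items()}
    return max(scores, key=scores.get), scores

label, scores = classify_by_dictionary(["the", "compiler", "emits", "code"])
```

The input is already a word list, i.e. the output of a segmenter such as SCWS.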

If the above is too simple and crude, we can improve it:

Refine the rules. For example, within a category dictionary some words are more closely tied to the category than others, so give each word a weight. Words that appear at the beginning or end of an article, or in emphasized structural positions (titles, headings), can carry a higher weight. Conversely, if we are sure that certain words are unlikely to appear in texts of a given category, then whenever those words do appear, the score for that category should be lowered.

In short, word segmentation turns the text into a set of words, and then we analyze that set according to the rules.

The rules do not have to be written by hand one by one; we can let the machine "discover" them automatically. This is statistical analysis. In general, we start with a corpus of texts that have already been classified, use these as the gold standard, and let the machine analyze these classified templates to mine the relationship between words and classes. The machine then applies what it has learned to a new word set and computes which class it belongs to.

The whole process can be iterated: if the results feel inaccurate, we adjust the parameters; if they feel accurate, we can add the results back into the template set. The more templates, the more accurate the machine becomes.

The "mining the links between words and classes" mentioned above actually involves two aspects. First, we need to find the words most closely related to each class — the words that best distinguish the categories. These distinguishing words are the features, so this step is feature extraction.

As mentioned above, words that appear with high frequency in a text are more likely to characterize it, and counting term frequency alone already gives one algorithm. But a word that appears with high frequency in all kinds of text is unlikely to distinguish anything (for example "of", which appears everywhere and is useless). So what we really want are words with high frequency within a document but low frequency across all documents. This feature extraction algorithm is called TF-IDF (term frequency-inverse document frequency).
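A minimal TF-IDF computation over a toy corpus (the documents below are invented) shows both halves of the idea — note how the everywhere-word "the" scores exactly zero:

```python
import math

# Toy corpus: each document is a list of segmented words.
docs = [
    ["the", "compiler", "optimizes", "the", "code"],
    ["the", "team", "won", "the", "match"],
    ["the", "recipe", "needs", "flour"],
]

def tf_idf(word, doc, corpus):
    """TF = frequency in this document; IDF = log(N / document frequency).
    Assumes `word` occurs somewhere in the corpus (df > 0)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / df)

# "the" occurs in every document, so IDF = log(3/3) = 0 -- the
# "frequent everywhere, useless" case described above.
```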

In addition, there are the IG, MI, and CHI algorithms. The idea mentioned above — "if we are sure that texts under a certain category are unlikely to contain certain words" — is the intuition behind all three. IG (Information Gain) looks at the probabilities of a word appearing, and of it not appearing, in texts of a category. MI (Mutual Information) reasons from co-occurrence: if "machine" and "programming" appear together, the text is more likely a programming article; if "programming" and "software" do not appear at all, it is unlikely to be one. CHI (the chi-square test — the name comes from the chi-square distribution) builds on the same presence/absence counts, but measures the dependence between a word and a category with the chi-square statistic.
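For CHI, the statistic for one (word, category) pair can be computed from a 2x2 contingency table of document counts. This is a standard textbook formula, not anything specific to SCWS; the argument names are mine:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a word/category pair.
    n11: docs with the word, in the category
    n10: docs with the word, not in the category
    n01: docs without the word, in the category
    n00: docs without the word, not in the category"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den
```

A word evenly spread across categories scores 0; a word concentrated in one category scores high, marking it as a good feature.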

Well, the methods above each have trade-offs. A method that looks better on paper may perform worse in practice, and the more sophisticated methods need their parameters tuned — with the wrong parameters they may lose to a simple method. Also, none of the above considers word position, so there is room for improvement: as mentioned earlier, words at the beginning, the end, and in emphasized structural positions carry more weight, and this factor can be folded into the calculation. We won't go deeper here.

No matter which method is used, assume we have extracted the features; next we need to compare similarity. The simplest way is to find the n samples closest to the text to be classified, then weight each sample's vote by its similarity and pick the class with the highest total. This idea is the KNN (K-Nearest Neighbor) algorithm.
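A sketch of that similarity-weighted vote, under the simplifying assumptions that documents are word sets and similarity is Jaccard overlap (the source does not specify a similarity measure):

```python
from collections import defaultdict

def jaccard(a, b):
    """Similarity of two word sets: overlap divided by union size."""
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(doc, samples, k=3):
    """samples: list of (word_set, label) pairs. The k most similar
    samples vote for their label, weighted by similarity."""
    ranked = sorted(samples, key=lambda s: jaccard(doc, s[0]), reverse=True)
    votes = defaultdict(float)
    for words, label in ranked[:k]:
        votes[label] += jaccard(doc, words)
    return max(votes, key=votes.get)
```

Here the "template" set from earlier is simply the list of labeled samples; adding confirmed results back grows it.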

There is also the NB (Naive Bayes) algorithm, which is based on the Bayes formula, as well as genetic algorithms. Because the Bayes formula and genetic algorithms are so famous, we will not introduce them here. (The Science Squirrels Club seems to have popular-science articles on them.)
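For completeness, a minimal multinomial Naive Bayes sketch with Laplace smoothing — a standard construction, with the training data invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (word_list, label). Collect per-class word
    counts, class counts, and the shared vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict_nb(words, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum log P(word | class),
    with add-one (Laplace) smoothing so unseen words don't zero out."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```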

The above are some basic principles. For more information, see machine learning or data mining.
