Influence of the Feature Word Selection Algorithm on Text Classification Accuracy (5)


In the previous section (section 4), we discussed two methods for estimating P(t|Ci) under the classical probability framework, and concluded that the choice between them has no significant impact on the final accuracy. Next, I will present a novice's ("cainiao") naive probability framework.

This framework normalizes probabilities directly over the occurrences of all bag-of-words terms in the training document set. That is, P(t) is obtained by directly normalizing the counts stored in the bag-of-words model, P(C1) = P(C2) = 1/2 is assumed, and P(c|t) is likewise computed directly from the counts.

For example, suppose the word bag contains three words, each entry recording the word's term frequencies per class: {housework: class1: (3), (1); class2: (1)} {Russia: class2: (3), (1), (1)} {healthy: class1: (2), (4)}.

So P(t = housework) = (3 + 1 + 1) / [(3 + 1 + 1) + (3 + 1 + 1) + (2 + 4)] = 5/16.

Note: in the classical probability model of section (4), P(t) is computed from P(t|C); that is, P(t) = P(t|C1)P(C1) + P(t|C2)P(C2). The assumption there is that only P(t|C) can be estimated directly from the current corpus, and all other probabilities are derived from it.

Under our assumption, P(c|t) is also computed directly from the counts. For example, P(class = class1 | t = housework) = (3 + 1)/(3 + 1 + 1) = 4/5.
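To make this concrete, here is a minimal Python sketch of the naive estimation, using the example word bag above. The data structure and function names are my own illustration, not the author's code; I store only the per-class frequency lists.

```python
# Example word bag: per-class term-frequency lists for each word.
bag = {
    "housework": {"class1": [3, 1], "class2": [1]},
    "Russia":    {"class2": [3, 1, 1]},
    "healthy":   {"class1": [2, 4]},
}

def total(word):
    """Total frequency of a word across all classes."""
    return sum(sum(freqs) for freqs in bag[word].values())

grand_total = sum(total(w) for w in bag)  # 5 + 5 + 6 = 16

def p_t(word):
    """P(t): directly normalize the bag-of-words counts."""
    return total(word) / grand_total

def p_c_given_t(cls, word):
    """P(c|t): the class's share of the word's total frequency."""
    return sum(bag[word].get(cls, [])) / total(word)

print(p_t("housework"))                    # 5/16 = 0.3125
print(p_c_given_t("class1", "housework"))  # 4/5  = 0.8
```

Running it reproduces the two worked examples: P(t = housework) = 5/16 and P(class1 | housework) = 4/5.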

So, is the naive probability model I assumed reasonable, or will it lower accuracy? Let's look at the experimental results.

As in the previous section, there are two cases depending on whether a word's frequency within each article is counted.

Case 1: the word's in-article term frequency is accumulated; Case 2: each word is counted at most once per article. (The results above are for Case 1 and those below for Case 2; a small sketch contrasting the two cases follows the results.)

[Experimental result tables for Case 1 and Case 2]
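To illustrate the distinction between the two cases as I read them (an assumption on my part; the original formulas did not survive), here is a small Python sketch: Case 1 accumulates in-article term frequencies, while Case 2 counts each word at most once per article.

```python
from collections import Counter

docs = [["healthy", "housework", "healthy"],
        ["Russia", "Russia", "healthy"]]

# Case 1: accumulate each word's in-article term frequency.
case1 = Counter()
for doc in docs:
    case1.update(doc)

# Case 2: count each word at most once per article (document frequency).
case2 = Counter()
for doc in docs:
    case2.update(set(doc))

print(case1)  # Counter({'healthy': 3, 'Russia': 2, 'housework': 1})
print(case2)  # Counter({'healthy': 2, 'Russia': 1, 'housework': 1})
```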

Under this probability framework, judging from the final experimental results (average accuracy), there is no obvious deficiency compared with the classical framework found in papers and textbooks. When I built a word segmenter from a bigram word graph plus the Viterbi algorithm, I likewise tried two probability estimation modes, and the final results hardly differed (measured by F value, the change appeared only several decimal places out, e.g., from 91.001% to 91.002%). Incidentally, the blogger "Dongting Sanren" computes the conditional probability in his naive Bayes classifier as follows.


He calculates the probability without term-frequency weighting: in effect, P(X|C) = Nxc/Nc, where Nxc is the number of articles in class C that contain word X and Nc is the total number of articles in class C. Obviously, this "probability" is not computed correctly! With it, P(X1|C) + P(X2|C) + ... + P(XN|C) ≠ 1 (where N is the total number of words in the dictionary), because many words can appear in the same article at once. However, this calculation method does not affect the final classification accuracy; in my experiments I deliberately used it to see the effect. A small example shows the cause. Suppose A + B + C = 1. If you multiply A, B, and C by the same factor n, then A + B + C no longer equals 1, but the relative relationships among A, B, and C are exactly as before; every value is simply n times larger. Since the classifier only compares class scores, a common scale factor cannot change which class wins.
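As a quick numerical illustration of this scaling argument (my own sketch, not the blogger's code), the following compares naive Bayes log-scores before and after inflating every conditional estimate by a common factor. Because both classes score the same set of words, the factor shifts both scores equally and the winning class never changes.

```python
import math

# Toy conditional estimates P(x|C) for the words of one document, per class.
p_given_c1 = [0.20, 0.05, 0.10]
p_given_c2 = [0.10, 0.10, 0.05]

def score(probs, prior=0.5, scale=1.0):
    """Naive Bayes log-score; 'scale' simulates unnormalized estimates."""
    return math.log(prior) + sum(math.log(scale * p) for p in probs)

for scale in (1.0, 7.0):  # at 7x, the estimates no longer sum to 1
    s1 = score(p_given_c1, scale=scale)
    s2 = score(p_given_c2, scale=scale)
    print(scale, s1 > s2)  # the winner is class 1 at every scale
```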

Natural language processing is often divided into two camps: the statistical school and the rule-based school. The rule-based school attacks the statistical school for relying on "luck" and lacking a rigorous theoretical basis; perhaps the small example above hints at why.

I have consulted various references, and nearly all of them model the problem as in section (4). I kept wondering why that particular modeling is necessary, and whether other modeling choices could also make sense; hence the modeling in this section (5). Judging from the current test results, there is nothing wrong with the probability assumptions of (5).
