Influence of the Feature Word Selection Algorithm on Text Classification Accuracy (5)


In the previous section (section 4), we discussed two methods for estimating P(t|Ci) under the classical probability framework, and concluded that the choice between them has no significant impact on the final accuracy. Next, I will present a novice's ("cainiao") naive probability framework.

This framework normalizes probabilities directly over the occurrences of all bag-of-words terms in the training document set. That is, P(t) is obtained by directly normalizing the counts stored in the bag-of-words model, P(C1) = P(C2) = 1/2 is assumed, and P(c|t) is likewise computed directly from the counts.

For example, suppose the word bag contains three words, each entry recording the word's term frequencies per class: {housework: class1: (3), (1); class2: (1)} {Russia: class2: (3), (1), (1)} {healthy: class1: (2), (4)}.

So P(t = housework) = (3 + 1 + 1) / [(3 + 1 + 1) + (3 + 1 + 1) + (2 + 4)] = 5/16.

Note: in the classical probability model of section (4), P(t) is computed from P(t|C); that is, P(t) = P(t|C1)P(C1) + P(t|C2)P(C2). The assumption there is that only P(t|C) can be estimated directly from the current corpus, and all other probabilities are derived from it.

Under our assumption, P(c|t) is also computed directly from the counts. For example, P(class = class1 | t = housework) = (3 + 1)/(3 + 1 + 1) = 4/5.
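To make this concrete, here is a minimal Python sketch of the naive estimation, using the example word bag above. The data structure and function names are my own illustration, not the author's code; I store only the per-class frequency lists.

```python
# Example word bag: per-class term-frequency lists for each word.
bag = {
    "housework": {"class1": [3, 1], "class2": [1]},
    "Russia":    {"class2": [3, 1, 1]},
    "healthy":   {"class1": [2, 4]},
}

def total(word):
    """Total frequency of a word across all classes."""
    return sum(sum(freqs) for freqs in bag[word].values())

grand_total = sum(total(w) for w in bag)  # 5 + 5 + 6 = 16

def p_t(word):
    """P(t): directly normalize the bag-of-words counts."""
    return total(word) / grand_total

def p_c_given_t(cls, word):
    """P(c|t): the class's share of the word's total frequency."""
    return sum(bag[word].get(cls, [])) / total(word)

print(p_t("housework"))                    # 5/16 = 0.3125
print(p_c_given_t("class1", "housework"))  # 4/5  = 0.8
```

Running it reproduces the two worked examples: P(t = housework) = 5/16 and P(class1 | housework) = 4/5.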

So, is the naive probability model I assumed reasonable, or will it lower accuracy? Let's look at the experimental results.

As in the previous section, there are two cases depending on whether a word's frequency within each article is counted.

Case 1: the word's in-article term frequency is accumulated; Case 2: each word is counted at most once per article. (The results above are for Case 1 and those below for Case 2; a small sketch contrasting the two cases follows the results.)

[Experimental result tables for Case 1 and Case 2]
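To illustrate the distinction between the two cases as I read them (an assumption on my part; the original formulas did not survive), here is a small Python sketch: Case 1 accumulates in-article term frequencies, while Case 2 counts each word at most once per article.

```python
from collections import Counter

docs = [["healthy", "housework", "healthy"],
        ["Russia", "Russia", "healthy"]]

# Case 1: accumulate each word's in-article term frequency.
case1 = Counter()
for doc in docs:
    case1.update(doc)

# Case 2: count each word at most once per article (document frequency).
case2 = Counter()
for doc in docs:
    case2.update(set(doc))

print(case1)  # Counter({'healthy': 3, 'Russia': 2, 'housework': 1})
print(case2)  # Counter({'healthy': 2, 'Russia': 1, 'housework': 1})
```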

Under this probability framework, judging from the final experimental results (average accuracy), there is no obvious deficiency compared with the classical framework found in papers and textbooks. When I built a word segmenter from a bigram word graph plus the Viterbi algorithm, I likewise tried two probability estimation modes, and the final results hardly differed (measured by F value, the change appeared only several decimal places out, e.g., from 91.001% to 91.002%). Incidentally, the blogger "Dongting Sanren" computes the conditional probability in his naive Bayes classifier as follows.


He calculates the probability without term-frequency weighting: in effect, P(X|C) = Nxc/Nc, where Nxc is the number of articles in class C that contain word X and Nc is the total number of articles in class C. Obviously, this "probability" is not computed correctly! With it, P(X1|C) + P(X2|C) + ... + P(XN|C) ≠ 1 (where N is the total number of words in the dictionary), because many words can appear in the same article at once. However, this calculation method does not affect the final classification accuracy; in my experiments I deliberately used it to see the effect. A small example shows the cause. Suppose A + B + C = 1. If you multiply A, B, and C by the same factor n, then A + B + C no longer equals 1, but the relative relationships among A, B, and C are exactly as before; every value is simply n times larger. Since the classifier only compares class scores, a common scale factor cannot change which class wins.
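As a quick numerical illustration of this scaling argument (my own sketch, not the blogger's code), the following compares naive Bayes log-scores before and after inflating every conditional estimate by a common factor. Because both classes score the same set of words, the factor shifts both scores equally and the winning class never changes.

```python
import math

# Toy conditional estimates P(x|C) for the words of one document, per class.
p_given_c1 = [0.20, 0.05, 0.10]
p_given_c2 = [0.10, 0.10, 0.05]

def score(probs, prior=0.5, scale=1.0):
    """Naive Bayes log-score; 'scale' simulates unnormalized estimates."""
    return math.log(prior) + sum(math.log(scale * p) for p in probs)

for scale in (1.0, 7.0):  # at 7x, the estimates no longer sum to 1
    s1 = score(p_given_c1, scale=scale)
    s2 = score(p_given_c2, scale=scale)
    print(scale, s1 > s2)  # the winner is class 1 at every scale
```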

Natural language processing is often divided into two camps: the statistical school and the rule-based school. The rule-based school attacks the statistical school for relying on "luck" and lacking a rigorous theoretical basis; perhaps the small example above hints at why.

I have consulted various references, and nearly all of them model the problem as in section (4). I kept wondering why that particular modeling is necessary, and whether other modeling choices could also make sense; hence the modeling in this section (5). Judging from the current test results, there is nothing wrong with the probability assumptions of (5).
