Sentiment analysis based on naive Bayes: a detailed explanation and Python implementation

Source: Internet
Author: User

Compared with dictionary-based analysis, machine learning does not require a large annotated dictionary, but it does require a large amount of labeled data. Take the review quoted below, for instance, and suppose its label is:

Quality of service: medium (three levels in total: good, medium, and poor)

╮(╯-╰)╭ That is machine learning: train a model on a large amount of labeled data,

then enter a comment and let the model determine its label level.

Review by Ningxin: for the National Day promotion, with a 62 credit card you can pay 6.2 yuan for an ice cream stamped with the UnionPay card logo.

There are three flavors (vanilla, chocolate, and matcha); I chose vanilla, and it had a rich taste.

In addition, with any purchase you can buy two for 10 yuan. They are not very big, but they are delicious: not too sweet, and not greasy.

Label: quality of service: medium

  

Naive Bayes

1. Bayes' theorem

Suppose that, for a data set, the random variable C denotes the class of a sample and F1 denotes whether a certain feature occurs in a test sample. Applying the basic Bayes formula gives:

P(C | F1) = P(F1 | C) P(C) / P(F1)

The formula above gives the conditional probability that a sample belongs to class C given that feature F1 appears. So how do we use the formula to classify a test sample?

For example, if feature F1 appears in a test sample (F1=1), we compute the two probabilities P(C=0 | F1=1) and P(C=1 | F1=1). If the former is larger, the sample is assigned to class 0; if the latter is larger, it is assigned to class 1.

Several concepts in this formula need to be understood:

  Prior probability (prior). P(C) is the prior probability of class C. It can be estimated from the training set as the proportion of samples labeled C.

  Evidence (evidence). This is P(F1) in the formula above, the probability that feature F1 appears in a test sample. It can likewise be estimated from the training set as the proportion of all samples in which F1 appears.

  Likelihood (likelihood). This is P(F1 | C), the probability that feature F1 appears in a sample that is known to belong to class C.
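The three quantities can be estimated by simple counting. The following is a minimal sketch on a made-up toy data set (the `samples` list and the `posterior` helper are illustrative assumptions, not part of the original code):

```python
# Toy training set (made-up data): each sample is (f1, c),
# where f1 = 1 means the feature appears and c is the class (0 or 1).
samples = [(1, 0), (1, 0), (0, 0), (1, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

def posterior(samples, c, f1=1):
    """Return P(C=c | F1=f1) estimated from raw counts."""
    n = len(samples)
    n_c = sum(1 for f, k in samples if k == c)
    n_f = sum(1 for f, k in samples if f == f1)
    n_fc = sum(1 for f, k in samples if f == f1 and k == c)
    prior = n_c / n            # P(C=c)
    evidence = n_f / n         # P(F1=f1)
    likelihood = n_fc / n_c    # P(F1=f1 | C=c)
    return likelihood * prior / evidence

p0 = posterior(samples, c=0)
p1 = posterior(samples, c=1)
predicted = 0 if p0 > p1 else 1   # pick the class with the larger posterior
print(p0, p1, predicted)
```

With these counts, P(C=0 | F1=1) comes out at 2/3 and P(C=1 | F1=1) at 1/3, so the sample is assigned to class 0.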

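For multiple features, the Bayes formula can be extended as follows:

P(C | F1, F2, ..., Fn) = P(C) P(F1, F2, ..., Fn | C) / P(F1, F2, ..., Fn)

where, by the chain rule, the numerator expands to P(C) P(F1 | C) P(F2 | C, F1) ... P(Fn | C, F1, ..., Fn-1).

The numerator contains a long chain of likelihood values. Computing them becomes extremely painful when there are many features. What can be done?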

  

2. The meaning of "naive"

To simplify the calculation, the naive Bayes algorithm makes one assumption: it "naively" assumes that the features are independent of one another given the class. With this assumption, the numerator of the formula simplifies to:

P(C) P(F1 | C) P(F2 | C) ... P(Fn | C).

After this simplification, the calculation is much more convenient.

The assumption that the features are independent does look rather unscientific, because in many cases the features are closely related. In practice, however, naive Bayes turns out to work quite well.

Second, naive Bayes classifies by computing P(C=0 | F1 ... Fn) and P(C=1 | F1 ... Fn) and choosing the class with the larger value. Because the two share the same denominator, we can omit the denominator, which further simplifies the calculation.
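A short sketch of this comparison, using made-up probabilities (the numbers in `priors` and `likelihoods` are purely illustrative): only the simplified numerator is computed for each class, and normalizing afterwards does not change the winner.

```python
from math import prod

# Hypothetical probabilities for two classes and three features (made-up numbers).
priors = {0: 0.4, 1: 0.6}            # P(C)
likelihoods = {                      # P(Fi = 1 | C) for i = 1..3
    0: [0.8, 0.1, 0.5],
    1: [0.3, 0.6, 0.5],
}

def score(c):
    """Numerator only: P(C) * P(F1|C) * ... * P(Fn|C), with all features present."""
    return priors[c] * prod(likelihoods[c])

scores = {c: score(c) for c in priors}
best = max(scores, key=scores.get)

# Dividing every score by the same positive number rescales them equally,
# so the winning class stays the same.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(scores, best, posteriors)
```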

In addition, the derivation of the Bayes formula rests on an important premise: no evidence term may be 0. That is, for any feature Fx, P(Fx) cannot be 0. In practice, however, some features may well not appear in a given data set. Implementations therefore usually apply a small correction, for example adding 1 to every count (additive smoothing, also known as Laplace smoothing). If the smoothing instead adds an adjustable parameter alpha greater than 0, it is called Lidstone smoothing.
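As an illustration, here is one common way to write the smoothed estimate for word counts. This is a sketch under the assumption of a multinomial-style model; the function name and the counts are hypothetical:

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    alpha = 1 is Laplace smoothing; any other alpha > 0 is Lidstone smoothing.
    """
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    return (count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts: a class with 10 word occurrences and a vocabulary of 5 words.
unseen = smoothed_likelihood(0, 10, 5)                # (0 + 1) / (10 + 5)
seen = smoothed_likelihood(4, 10, 5)                  # (4 + 1) / (10 + 5)
lidstone = smoothed_likelihood(0, 10, 5, alpha=0.5)   # 0.5 / 12.5
print(unseen, seen, lidstone)
```

A word that never occurred in a class now gets a small non-zero probability instead of wiping out the whole product.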

  

Sentiment classification based on naive Bayes

Raw data set: only 10 records were sampled

  

Read data

Read the Excel file into a pandas DataFrame.
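A minimal sketch of this step might look as follows. The column names `comment` and `level` and both helper functions are assumptions for illustration, since the layout of the original file is not shown:

```python
import pandas as pd

def split_columns(df, text_col="comment", label_col="level"):
    """Drop rows with a missing text or label, then return (texts, labels)."""
    df = df.dropna(subset=[text_col, label_col])
    return df[text_col].astype(str).tolist(), df[label_col].tolist()

def load_reviews(path, text_col="comment", label_col="level"):
    """Read an Excel sheet and return (texts, labels) as plain lists.

    The column names are hypothetical; adjust them to the real file.
    """
    df = pd.read_excel(path)   # needs an Excel engine such as openpyxl installed
    return split_columns(df, text_col, label_col)
```

It could then be called as `texts, labels = load_reviews("reviews.xlsx")`, where the file name is again only a placeholder.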

  

Word segmentation

Segment each comment into words, removing stop words at the same time, to obtain the following word lists

Each list corresponds to one comment, in order
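The segmentation step can be sketched as below. This version splits English text with a regular expression and uses a tiny illustrative stop-word list; both are assumptions, and for Chinese comments a segmenter such as `jieba.lcut` could take the place of `segment`:

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "of", "and", "but", "very"}

def segment(text):
    """Split English text into lowercase word tokens.

    For Chinese text, a segmenter such as jieba.lcut(text) could be used instead.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def tokenize(comments, stop_words=STOP_WORDS):
    """Return one list of non-stop-word tokens per comment, in the same order."""
    return [[w for w in segment(c) if w not in stop_words] for c in comments]

word_lists = tokenize([
    "The ice cream is very delicious",
    "The service attitude of the cashier is poor",
])
print(word_lists)
```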

  

Statistics

What do we count here? Two kinds of data:

1. The number of comments at each level

There are three levels, with the following counts:

c0 → good: 2

c1 → medium: 3

c2 → poor: 5

2. The number of occurrences of each word in the comments of each level

This gives a dictionary:

Evaluation [2, 5, 3]

Half price [0, 5, 0]

Cost-effective [1, 1, 0]

Good [0, 2, 0]

·········

dissatisfaction [0, 1, 0]

Important [0, 1, 0]

Clear [0, 1, 0]

specific [0, 1, 0]

The list after each word (feature) is indexed 0, 1, 2, corresponding to good, medium, and poor respectively.

Once the above work is done, the model is trained; the more data there is, the more accurate the model becomes.
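The two kinds of statistics described above can be collected in a single pass. The sketch below uses made-up word lists and labels; the `train` function and its return shape are assumptions modeled on the dictionary shown in the text:

```python
from collections import defaultdict

LEVELS = ["good", "medium", "poor"]   # list positions 0, 1, 2

def train(word_lists, labels):
    """Return (class_counts, word_counts).

    class_counts[i] is the number of comments at level i;
    word_counts[word][i] is how often the word occurs in comments at level i.
    """
    class_counts = [0] * len(LEVELS)
    word_counts = defaultdict(lambda: [0] * len(LEVELS))
    for words, label in zip(word_lists, labels):
        i = LEVELS.index(label)
        class_counts[i] += 1
        for w in words:
            word_counts[w][i] += 1
    return class_counts, dict(word_counts)

# Made-up training data, already segmented.
word_lists = [["cheap", "tasty"], ["tasty"], ["slow", "rude"], ["rude"], ["average"]]
labels = ["good", "good", "poor", "poor", "medium"]
class_counts, word_counts = train(word_lists, labels)
print(class_counts)
print(word_counts)
```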

  

Test

For example, enter the following sentence

Review of Century Luen Wah (West Mall): for a reputed international metropolis, the cashier's service attitude is poor in the extreme. UnionPay promotion 30-10, can not even single.

The result is

c2-Poor
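How a label might be derived from the counts can be sketched as follows. The counts are made up, and the scoring rule (log prior plus Laplace-smoothed log likelihoods, with unknown words skipped) is an assumption, since the original scoring code is not shown here:

```python
from math import log

LEVELS = ["good", "medium", "poor"]

# Made-up counts in the same shape as the statistics described earlier.
class_counts = [2, 1, 2]
word_counts = {
    "tasty":   [2, 0, 0],
    "cheap":   [1, 0, 0],
    "average": [0, 1, 0],
    "slow":    [0, 0, 1],
    "rude":    [0, 0, 2],
}

def predict(words, class_counts, word_counts, alpha=1.0):
    """Return the index of the level with the highest log score.

    Words that never occurred in training are skipped (an assumption);
    every level is assumed to have at least one training comment.
    """
    vocab = len(word_counts)
    n_docs = sum(class_counts)
    scores = []
    for i in range(len(class_counts)):
        total = sum(counts[i] for counts in word_counts.values())
        s = log(class_counts[i] / n_docs)                      # log prior
        for w in words:
            if w in word_counts:
                c = word_counts[w][i]
                s += log((c + alpha) / (total + alpha * vocab))  # smoothed log likelihood
        scores.append(s)
    return max(range(len(scores)), key=scores.__getitem__)

result = predict(["cashier", "rude", "slow"], class_counts, word_counts)
print(f"c{result}-{LEVELS[result]}")
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when a comment contains many words.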

GitHub address for related code:

http://t.cn/RKfemBM
