Compared to "dictionary-based Analysis," machine learning "does not require a large number of annotated dictionaries, but requires a large number of tagged data, such as:
Or the following sentence, if its label is:
Quality of service-medium (total three levels, good, medium and poor)
╮ (╯-╰) ╭, which is machine learning, trains a model with a large number of tagged data,
Then you enter a comment to determine the label level
Review of Ningxin: For the National Day promotion, with a 62-prefixed (UnionPay) credit card you can buy, for 6.2 yuan, an ice cream bar stamped with the UnionPay logo.
There are three flavors, vanilla, chocolate, and matcha; I chose vanilla, and the flavor is rich.
In addition, with any purchase you can buy two for 10 yuan. They are not very big, but very tasty, not the overly sweet kind, and not greasy at all.

Label: quality of service - medium
Naive Bayes

1. Bayes' theorem
Suppose that for a data set, the random variable C denotes the class a sample belongs to, and F1 denotes whether a certain feature appears in a test sample. The basic Bayes formula then reads:

P(C|F1) = P(F1|C) · P(C) / P(F1)

The left-hand side is the conditional probability that a sample belongs to class C given that feature F1 appears. So how do we use this formula to classify a test sample?

For example, given a test sample in which feature F1 appears (F1 = 1), we compute the probability values P(C=0|F1=1) and P(C=1|F1=1). If the former is larger, the sample is assigned to class 0; if the latter is larger, it is assigned to class 1.
Several concepts in this formula need to be well understood (a small numeric sketch follows this list):

Prior probability (prior): P(C) is the prior probability of class C. It can be computed from the existing training set as the proportion of samples that belong to class C.

Evidence: P(F1) above is the probability that feature F1 appears in a test sample. It can likewise be estimated as the proportion of training samples in which feature F1 appears.

Likelihood: P(F1|C), the probability that feature F1 appears in a sample known to belong to class C.
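To make these three quantities concrete, here is a minimal Python sketch on a toy data set; the samples and variable names are made up for illustration, not taken from this article's data:

```python
# Toy labeled data: (F1 present?, class). Purely illustrative values.
samples = [(1, 1), (1, 1), (0, 1), (1, 0), (0, 0), (0, 0)]
n = len(samples)

prior_c1 = sum(1 for f1, c in samples if c == 1) / n       # P(C=1): 3/6
evidence_f1 = sum(1 for f1, c in samples if f1 == 1) / n   # P(F1=1): 3/6
likelihood = (sum(1 for f1, c in samples if f1 == 1 and c == 1)
              / sum(1 for f1, c in samples if c == 1))     # P(F1=1|C=1): 2/3

# Bayes' theorem: P(C=1|F1=1) = P(F1=1|C=1) * P(C=1) / P(F1=1)
posterior = likelihood * prior_c1 / evidence_f1
print(posterior)  # 0.666..., so with F1 present, class 1 is the likelier label
```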
For multiple features, the Bayes formula can be extended (via the chain rule) as follows:

P(C|F1, F2, …, Fn) = P(C) · P(F1|C) · P(F2|C, F1) ⋯ P(Fn|C, F1, …, Fn-1) / P(F1, F2, …, Fn)

The numerator now contains a long chain of likelihood values, and computing them is extremely painful when there are many features. What can be done?
2. The "naive" idea

To simplify the calculation, the naive Bayes algorithm makes an assumption: "naively assume that the features are mutually independent." With this, the numerator of the formula simplifies to:

P(C) · P(F1|C) · P(F2|C) ⋯ P(Fn|C)
After this simplification, the calculation is much more convenient.
Assuming that the features are independent of one another does look quite unscientific, because in many cases the features are closely related. In practice, however, naive Bayes has been shown to work remarkably well.

Second, since naive Bayes classifies by computing P(C=0|F1, …, Fn) and P(C=1|F1, …, Fn) and taking the class with the larger value, and since the denominators of the two are identical, we can omit the denominator altogether, which simplifies the computation further.

In addition, the Bayesian derivation rests on an important premise: no evidence term may be 0. That is, for any feature Fx, P(Fx) must not be 0, yet some features of a test sample may never appear in the training set. Implementations therefore usually apply a small fix, for example adding 1 to every count (additive smoothing, also known as Laplace smoothing). If the counts are instead smoothed with an adjustable parameter alpha greater than 0, it is called Lidstone smoothing.
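Putting the last three points together, here is a minimal sketch of the per-class score that naive Bayes actually computes: the simplified numerator P(C)·P(F1|C)⋯P(Fn|C) with additive smoothing and the common denominator omitted. The function and parameter names are assumptions for illustration:

```python
def class_score(features, class_prior, feature_counts, class_total, vocab_size,
                alpha=1.0):
    """Unnormalized P(C) * prod_i P(Fi|C) for one class.

    The denominator P(F1, ..., Fn) is omitted because it is identical for
    every class. alpha=1 gives Laplace smoothing; any other alpha > 0
    gives Lidstone smoothing.
    """
    score = class_prior
    for f in features:
        # Smoothed likelihood: an unseen feature gets a small non-zero
        # probability instead of zeroing out the whole product.
        score *= (feature_counts.get(f, 0) + alpha) / (class_total + alpha * vocab_size)
    return score
```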
Sentiment classification based on Naive Bayes

Raw data set; only 10 entries were sampled.
Read data
Read the Excel file into a DataFrame using the pandas library.
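A minimal sketch of this step; the file name and column names are assumptions, since the article does not show them:

```python
import pandas as pd

df = pd.read_excel("reviews.xlsx")    # hypothetical file name
comments = df["comment"].tolist()     # the review texts
labels = df["label"].tolist()         # e.g. "good" / "medium" / "poor"
```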
Word segmentation
Segment each comment into words, removing stop words at the same time, which gives the following word lists.

Each word list corresponds one-to-one with a comment.
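A minimal sketch of the segmentation step. The article does not name its segmenter; jieba, a common Chinese word-segmentation library, is assumed here, and the stop-word file path is hypothetical:

```python
import jieba  # assumed segmenter; any Chinese tokenizer would do

with open("stopwords.txt", encoding="utf-8") as f:  # hypothetical path
    stopwords = {line.strip() for line in f}

word_lists = [
    [w for w in jieba.cut(comment) if w.strip() and w not in stopwords]
    for comment in comments
]
# word_lists[i] is the word list for comments[i], one-to-one
```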
Statistics
What do we count here? Two kinds of statistics (a code sketch follows these counts):

1. The number of comments at each level

The three levels map to classes as follows:

c0 → good: 2
c1 → medium: 3
c2 → poor: 5

2. The number of times each word occurs under each level

This yields a dictionary:
Evaluation [2, 5, 3]
Half price [0, 5, 0]
Cost-effective [1, 1, 0]
Good [0, 2, 0]
·········
Dissatisfaction [0, 1, 0]
Important [0, 1, 0]
Clear [0, 1, 0]
Specific [0, 1, 0]
The list indices after each word (feature), namely 0, 1, and 2, stand for good, medium, and poor respectively.
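A minimal sketch of both counts, continuing from the variables above; the class order matches the indices 0, 1, 2 (good, medium, poor), and the variable names are illustrative:

```python
from collections import Counter, defaultdict

classes = ["good", "medium", "poor"]          # index 0, 1, 2 = c0, c1, c2
class_counts = Counter(labels)                # e.g. {"good": 2, "medium": 3, "poor": 5}

word_counts = defaultdict(lambda: [0, 0, 0])  # word -> occurrence count per class
for words, label in zip(word_lists, labels):
    idx = classes.index(label)
    for w in words:
        word_counts[w][idx] += 1
# e.g. word_counts["half price"] might be [0, 5, 0]
```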
Once the above work is done, the model is effectively trained; the more data there is, the more accurate it becomes.
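"Training" here just means having the counts; prediction multiplies the smoothed likelihoods. Below is a minimal sketch of the prediction step, using log probabilities to avoid underflow (the function name and structure are assumptions, not the article's exact code):

```python
import math

def predict(words):
    total = sum(class_counts.values())
    vocab_size = len(word_counts)
    best_class, best_score = None, float("-inf")
    for idx, c in enumerate(classes):
        # log P(C) + sum_i log P(Fi|C); the common denominator is omitted
        score = math.log(class_counts[c] / total)
        class_word_total = sum(counts[idx] for counts in word_counts.values())
        for w in words:
            count = word_counts.get(w, [0, 0, 0])[idx]
            score += math.log((count + 1) / (class_word_total + vocab_size))  # Laplace
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Feeding the segmented test comment from the next section into predict should reproduce the c2 (poor) label, provided the counts match the article's.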
Test
For example, entering the sentence:

Review of Century Luen Wah (West Mall): supposedly an international metropolis, yet the cashier's service attitude is poor in the extreme. A UnionPay promotion of 10 yuan off 30, and orders cannot even be combined.
we get the result:

c2 → poor
GitHub address for related code:
http://t.cn/RKfemBM