Text categorization based on Naive Bayes algorithm


Theory

What is the naive Bayes algorithm?

The naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem, and every naive Bayes classifier assumes that each feature of a sample is independent of all the other features. For example, if a fruit is red, round, and roughly 3 inches in diameter, it can be judged to be an apple. Even though these features may depend on each other, or some may be determined by others, the naive Bayes classifier treats each of them as contributing independently to the probability that the fruit is an apple.

The naive Bayes classifier is easy to build and particularly well suited to large datasets; despite its simplicity, it is known to be an efficient classification method that often outperforms far more sophisticated algorithms.

Bayes' theorem gives a way to compute the posterior probability P(c|x):

P(c|x) = P(x|c) × P(c) / P(x)

where

    • P(c|x) is the probability of class c (the target) given sample x (the attributes), called the posterior probability.

    • P(c) is the prior probability of class c.

    • P(x|c) is the likelihood: the probability of sample x given class c.

    • P(x) is the prior probability of sample x (the evidence).

Classification process of the naive Bayes algorithm

Consider an example. Take a training dataset of weather conditions and the corresponding target variable "play" (whether the person goes out to play). We need to classify a new weather condition to decide whether the person will go out to play. The steps are:

Step 1: Convert the dataset into a frequency table;

Step 2: Create a likelihood table by calculating the probability of each weather condition and of playing; for example, the probability of overcast is 4/14 ≈ 0.29;

Step 3: Use Bayes' formula to calculate the posterior probability of each class; the class with the highest posterior probability is the prediction.

Question: "If the weather is sunny, the person will go out to play." Is this statement correct?

P(yes | sunny) = P(sunny | yes) × P(yes) / P(sunny)

Here, P(sunny | yes) = 3/9 = 0.33, P(sunny) = 5/14 = 0.36, and P(yes) = 9/14 = 0.64.

Now, P(yes | sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60. Since this is the higher posterior probability, the statement is correct.
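As a quick sanity check, here is the same arithmetic in a few lines of Python; the counts are the ones quoted above (3 sunny days among the 9 "yes" days, and 5 sunny days and 9 "yes" days out of 14 records in total):

# Posterior P(yes | sunny) computed from the frequency counts quoted above
sunny_and_yes = 3   # sunny days on which the person played
yes_total = 9       # total "yes" days
sunny_total = 5     # total sunny days
n_days = 14         # total records

p_sunny_given_yes = sunny_and_yes / yes_total   # 3/9  ~ 0.33
p_yes = yes_total / n_days                      # 9/14 ~ 0.64
p_sunny = sunny_total / n_days                  # 5/14 ~ 0.36

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6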

Naive Bayes is well suited to predicting the probabilities of different classes from multiple attributes, which is why it is widely used in text classification.

The advantages and disadvantages of naive Bayes

Advantages:

    • Simple and fast, with good predictive performance;

    • If the variable-independence assumption holds, a naive Bayes classifier performs better than other classification methods such as logistic regression, and it needs very little training data;

    • It behaves better with categorical input variables than with numeric ones; for a numeric variable, a normal distribution has to be assumed.

Disadvantages:

    • If a category of a categorical variable appears in the test dataset but was never observed in the training data, the model assigns it zero probability and is unable to make a prediction. This is often referred to as the "zero frequency" problem. To solve it, we can apply a smoothing technique; Laplace estimation is one of the most basic (a short sketch follows this list).

    • Naive Bayes is also known to be a bad estimator, so the probabilities it outputs via predict_proba should not be taken too seriously.

    • Another limitation of naive Bayes is its assumption of independent predictors. In real life this almost never holds: there is always some degree of interaction between variables.
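To make the zero-frequency point concrete, here is a minimal sketch of Laplace (add-one) smoothing with hypothetical weather counts; in scikit-learn the same role is played by the alpha parameter of MultinomialNB:

# Laplace (add-one) smoothing: a category never seen in training ("snowy")
# would otherwise receive probability 0 and block the whole prediction.
from collections import Counter

train_weather = ["sunny", "rainy", "sunny", "overcast"]  # hypothetical training values
counts = Counter(train_weather)
categories = ["sunny", "rainy", "overcast", "snowy"]     # "snowy" is unseen in training

alpha = 1  # Laplace smoothing constant
total = sum(counts.values()) + alpha * len(categories)
for c in categories:
    p = (counts[c] + alpha) / total
    print(c, round(p, 3))  # "snowy" now gets a small but nonzero probability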

Four applications of naive Bayes

Real-time prediction: naive Bayes is, without doubt, fast enough for real-time use.

Multi-class prediction: the algorithm is known for its multi-class predictive ability, so it can be used to predict the probabilities of several classes of the target variable.

Text categorization / spam filtering / sentiment analysis: compared with other algorithms, naive Bayes is applied mainly to text classification (many features, treated as independent), where it achieves a high success rate. It is therefore widely used for spam filtering (spam detection) and sentiment analysis (identifying users with positive or negative sentiment on social media platforms).

Recommender systems: a naive Bayes classifier combined with collaborative filtering can filter content into what users do and do not want to see.

Some tips and tricks for the naive Bayes classifier

Here are some small ways to improve the performance of a naive Bayes classifier:

    • If a continuous feature is not normally distributed, use an appropriate transformation to bring it closer to a normal distribution (see the sketch after this list).

    • If the test dataset has a "zero frequency" problem, apply a smoothing technique such as Laplace estimation to correct the dataset.

    • Remove highly correlated features: they are effectively counted twice in the model, which inflates their importance.

    • Naive Bayes classifiers offer limited scope for parameter tuning. I suggest focusing on data preprocessing and feature selection instead.

    • You might want to apply ensemble techniques such as bagging and boosting, but these methods will not help: their purpose is to reduce variance, and naive Bayes has no variance to minimize.
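As an illustration of the first tip, here is a sketch that pushes a skewed continuous feature toward a normal distribution before Gaussian naive Bayes is applied; the exponential feature here is synthetic and purely hypothetical:

# Transform a right-skewed feature toward normality with a Box-Cox transform
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(500, 1)) + 0.01  # skewed, strictly positive

pt = PowerTransformer(method="box-cox")  # Box-Cox requires positive values
x_normal = pt.fit_transform(x)

# Skewness drops from roughly 2 toward 0 after the transform
print(round(skew(x.ravel()), 2), round(skew(x_normal.ravel()), 2))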

How to build a basic naive Bayes model (Python and R)

There are 3 types of naive Bayes models in scikit-learn:

Gaussian model: used for general classification problems, assuming that the features follow a Gaussian (normal) distribution.

Multinomial model: used for discrete counts. If a word appears several times in a sentence, each occurrence is treated as independent, so the word is counted multiple times and its probability is raised to the power of its count.

Bernoulli model: useful if the feature vectors are binary (i.e., 0s and 1s). Unlike the multinomial model, Bernoulli treats multiple occurrences of a word as a single occurrence, which is simpler and more convenient.
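A small sketch contrasting the multinomial and Bernoulli models on the same toy word counts (the vocabulary and documents are hypothetical); the Bernoulli model only sees whether a word occurs, not how often:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Rows are documents, columns are counts for a 3-word vocabulary
X = np.array([[3, 0, 1],
              [0, 2, 0],
              [2, 1, 0],
              [0, 0, 4]])
y = np.array([0, 1, 0, 1])

multi = MultinomialNB().fit(X, y)           # uses the counts themselves
bern = BernoulliNB(binarize=0.5).fit(X, y)  # thresholds the counts to 0/1 first

x_new = np.array([[1, 0, 2]])
print(multi.predict(x_new), bern.predict(x_new))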

You can select the appropriate model from the three above based on your specific dataset. Here we take the Gaussian model as an example of how to build one:

  • Python

# Import the Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Assign predictor and target variables
x = np.array([[-3, 7], [1, 5], [1, 2], [-2, 0], [2, 3], [-4, 0],
              [-1, 1], [1, 1], [-2, 2], [2, 7], [-4, 1], [-2, 7]])
y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])

# Create a Gaussian classifier
model = GaussianNB()

# Train the model using the training set
model.fit(x, y)

# Predict the output
predicted = model.predict([[1, 2], [3, 4]])
print(predicted)  # Output: [3 4]
  • R

require(e1071)  # provides the naive Bayes classifier
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())

# Make sure the target variable is a two-class classification problem only
levels(Train$Item_Fat_Content)

model <- naiveBayes(Item_Fat_Content ~ ., data = Train)
class(model)

pred <- predict(model, Test)
table(pred)

      

Main flow of text categorization

    1. Acquire the sample data
    2. Clean the sample data and segment it into words
    3. Build and evaluate the model
    4. Deploy and run (a minimal Python sketch of this flow follows the list)
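Before turning to R, here is a minimal Python sketch of the same four-step flow using scikit-learn; the documents and labels are hypothetical stand-ins, and for Chinese text a segmenter such as the jiebaR used below would replace the default tokenizer:

# Step 1: acquire sample data (toy documents, hypothetical labels)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting at noon", "win money now", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = normal

# Step 2: clean/tokenize and build a document-term matrix
X = CountVectorizer().fit_transform(docs)

# Step 3: model and evaluate
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, random_state=0)
model = MultinomialNB().fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))

# Step 4: deploy - score new documents with the same vectorizer and model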

R Implementation

For the complete code and report, see another blog post.

Word segmentation

A handy package: chinese.misc. Documentation: https://github.com/githubwwwjjj/chinese.misc

Example application:

library(jiebaRD)
library(NLP)
library(chinese.misc)
library(jiebaR)
library(tm)
library(MASS)
library(klaR)
library(e1071)
library(magrittr)  # provides the %>% pipe used below

# Sample data
textData <- teld.ml.rQuery("NewsInformation4BDP")
textData$NewsContent <- as.character(textData$NewsContent)

# Custom segmenter; register words that must be recognized as single terms
myCutter <- worker(write = FALSE)
myWord <- c("xxx", "xx", "x", "xxx", "xx")
new_user_word(myCutter, myWord)

# Strip English words and stray letters
slimTextF <- function(x) {
  x1 <- slim_text(x, mycutter = myCutter, rm_place = TRUE, rm_time = TRUE,
                  rm_eng = TRUE, rm_alpha = TRUE, paste = TRUE)
  return(x1)
}
textData$NewsContent <- sapply(as.list(textData$NewsContent), slimTextF, simplify = TRUE)

# Split into training and test sets at a 7:3 ratio
textData$tFlag <- sample(0:1, nrow(textData), replace = TRUE, prob = c(0.7, 0.3))
textData.train <- textData[textData$tFlag == 0, ]
textData.test <- textData[textData$tFlag == 1, ]
dim(textData.train)
dim(textData.test)
head(textData.train)

# Generate the document-term matrix, DTM
# stopPattern <- '(QUOT|APP|EV|KM|MODEL|SUV)'; stop_pattern = stopPattern
# %>% is the pipe operator: it passes its left-hand value as the first argument
# of the right-hand expression, so x <- y %>% f(z) is equivalent to x <- f(y, z)
dtm.train <- textData.train$NewsContent %>%
  corp_or_dtm(from = "v", type = "d", stop_word = "jiebar",
              mycutter = myCutter, control = list(wordLengths = c(2, 25))) %>%
  removeSparseTerms(0.99) %>%
  output_dtm(doc_name = textData.train$SN)
# The smaller the sparse argument of removeSparseTerms, the more sparse terms are dropped
dim(dtm.train)  # 737 382
# write.csv(dtm.train, 'dtm.train10.csv')

dtm.test <- textData.test$NewsContent %>%
  corp_or_dtm(from = "v", type = "d", stop_word = "jiebar",
              mycutter = myCutter, control = list(wordLengths = c(2, 25))) %>%
  removeSparseTerms(0.99) %>%
  output_dtm(doc_name = textData.test$SN)
dim(dtm.test)  # 334 386

  

Modeling

There are two packages available: e1071 and klaR.

e1071 Example:

# Train the naive Bayes model
model <- naiveBayes(dtm.train, as.factor(textData.train$astate), laplace = 1)
# Predict on the test set
pre <- predict(model, dtm.test)
# pre <- predict(model, dtm.test, type = "raw")  # outputs raw probabilities; not used
# here because the two class probabilities come out extreme and are not meaningful
# (see the disadvantages in the theory section); the default is type = c("class", "raw")

  

klaR example: modeling raises an error; analysis and explanation follow.

# Train the naive Bayes model
model <- NaiveBayes(dtm.train, as.factor(textData.train$astate), usekernel = FALSE, fL = 1)
# Predict on the test set
pre <- predict(model, dtm.test)

Training fails with:

Error: Zero variances for at least one class in variables: Porsche, BYD, Daimler, environmental, Hybrid, group, accelerate, develop ...

Cause of the error: variables whose variance is zero within at least one class. The message is raised by a stop() call in the klaR source at the point where zero within-class variances are detected.

Solution: if you must use this package, modify the source code to remove the stop() call so execution continues, then rebuild the package yourself.

Evaluation

Ten-fold cross-validation was performed in the application case, and the average over the 10 runs was taken.
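For reference, a sketch of that evaluation setup in scikit-learn: ten-fold cross-validation with the scores averaged. The random count matrix below is a hypothetical stand-in for a real document-term matrix and its labels:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 30))  # hypothetical document-term counts
y = rng.integers(0, 2, size=200)        # hypothetical binary labels

scores = cross_val_score(MultinomialNB(), X, y, cv=10)  # 10-fold cross-validation
print(scores.mean())  # average accuracy over the 10 folds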

For a description of the relevant indicators see https://www.cnblogs.com/xianhan/p/9277194.html

# Performance evaluation
performance <- prop.table(table(textData.test$astate, pre))
performance
# Accuracy: the proportion of 4s and 5s that are predicted correctly
accuracyRate <- performance[1, 1] + performance[2, 2]
accuracyRate  # 0.5174603 -> 0.5434783
# Sensitivity (recall): of the actual 5s, the proportion predicted correctly
sensitivity <- performance[2, 2] / (performance[2, 1] + performance[2, 2])
sensitivity  # 0.8791209 -> 0.816092
# Precision: of the predicted 5s, the proportion predicted correctly
precision <- performance[2, 2] / (performance[1, 2] + performance[2, 2])
precision  # 0.361991 -> 0.3514851
# Specificity: of the actual 4s, the proportion predicted correctly
specificity <- performance[1, 1] / (performance[1, 1] + performance[1, 2])
specificity  # 0.3705357 -> 0.4425532

  

References

chinese.misc: a handy R package for Chinese text analysis (documentation in Chinese)

https://github.com/githubwwwjjj/chinese.misc

R language: Bayesian classifier for text sentiment analysis

https://zhuanlan.zhihu.com/p/26735328

R in practice: sentiment analysis of Dianping reviews of the Halla Mountain restaurant (Shenguotou store)

https://mp.weixin.qq.com/s?__biz=MzUyNzU3MTgyOA==&mid=2247483690&idx=1&sn=9b8ea4b159e5885c0b99f3529a25c488&chksm=fa7ccf31cd0b4627d8f7321a392036686c1f59a0e365b3976c8ba4810a407f46b53cff4e63f0#rd

