Naive Bayes in Python: a small example

Source: Internet
Author: User

Naive Bayes
Pros: works even with small amounts of data; can handle multi-class problems
Cons: sensitive to how the input data is prepared
Applicable data type: nominal data
The core idea of Bayesian decision theory: choose the decision with the highest probability.
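A minimal sketch of that rule, with made-up probabilities for a single data point:

# Toy decision rule with invented numbers (not taken from the example below)
p_class1 = 0.6  # hypothetical P(class 1 | data point)
p_class0 = 0.4  # hypothetical P(class 0 | data point)
predicted = 1 if p_class1 > p_class0 else 0
print(predicted)  # -> 1, since class 1 has the higher probability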
The general process of naive Bayes
(1) Collect data: any method can be used.
(2) Prepare data: numeric or Boolean data is required.
(3) Analyze data: with a large number of features, plotting individual features is not very revealing; histograms work better.
(4) Train the algorithm: compute the conditional probabilities of the different independent features.
(5) Test the algorithm: calculate the error rate.
(6) Use the algorithm: a common application of naive Bayes is document classification. Naive Bayes classifiers can be used in any classification scenario, not just text.

from numpy import *

# Create some experimental samples. The first value returned is the collection of
# tokenized documents; the second is the list of class labels.
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec

# Create a list of the non-repeating words that appear in all documents
def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

# The inputs are the vocabulary and one document; the output is a document vector
# whose elements are 1 or 0, indicating whether each word of the vocabulary appears
# in the input document. The function first creates a vector of the same length as
# the vocabulary with all elements set to 0, then traverses every word in the
# document and, if the word appears in the vocabulary, sets the corresponding value
# in the output vector to 1. If everything goes well, the else branch that reports
# a word missing from vocabList should never be reached.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # vector of all zeros
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

"""
Pseudo-code for trainNB0:
    count the number of documents in each category
    for each training document:
        for each category:
            if a token appears in the document, increment the count for that token
            increment the count of all tokens
    for each category:
        for each token:
            divide the token count by the total token count to get the conditional probability
    return the conditional probabilities for each category
"""

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # initialize the probabilities
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)  # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                    # change to 2.0
    for i in range(numTrainDocs):
        # vector addition
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # element-wise division
    p1Vect = log(p1Num / p1Denom)  # change to log()
    p0Vect = log(p0Num / p0Denom)  # change to log()
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

# If a word appears in a document more than once, that may carry information which
# the mere presence or absence of the word cannot express; counting occurrences
# instead is called the bag-of-words model (bagOfWords2VecMN above).

mySent = 'This was the best book on Python or M.L. I have ever laid eyes upon.'
a = mySent.split()
print(a)
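For reference, a quick interactive run of the functions above could look like the sketch below. The vocabulary order, and therefore the vectors, will vary from run to run because createVocabList builds the vocabulary from a set:

# Assuming the functions above have been defined (for example, saved in a module)
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(setOfWords2Vec(myVocabList, listOPosts[0]))     # 0/1 presence vector for the first post
print(bagOfWords2VecMN(myVocabList, ['dog', 'dog']))  # counts occurrences: the 'dog' slot becomes 2
testingNB()  # trains on the six posts and classifies the two test entries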

Summary:

For classification, using probabilities is sometimes more effective than using hard rules. Bayesian probability and the Bayes decision criterion provide an effective way to estimate unknown probabilities from known values.
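As a quick worked example of that idea (all numbers below are invented purely for illustration): suppose 30% of documents are abusive, the word 'stupid' appears in 50% of abusive documents and in 5% of normal ones. Bayes' rule then gives the probability that a document containing 'stupid' is abusive:

# Bayes' rule with made-up numbers: P(abusive | word) = P(word | abusive) * P(abusive) / P(word)
p_abusive = 0.3                 # prior P(abusive), assumed
p_word_given_abusive = 0.5      # P('stupid' | abusive), assumed
p_word_given_normal = 0.05      # P('stupid' | normal), assumed
p_word = p_word_given_abusive * p_abusive + p_word_given_normal * (1 - p_abusive)
p_abusive_given_word = p_word_given_abusive * p_abusive / p_word
print(round(p_abusive_given_word, 3))  # -> 0.811, so "abusive" is the more probable class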

The amount of data required can be reduced by assuming conditional independence between features. The independence assumption means that the probability of a word occurring does not depend on the other words in the document. Of course, this assumption is too simplistic, which is exactly why the method is called naive Bayes. Although the conditional independence assumption is not strictly correct, naive Bayes is still an effective classifier.
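Concretely, the independence assumption lets the likelihood of a whole document factor into a product of per-word probabilities, which is why classifyNB above can simply sum per-word log probabilities. A minimal sketch with invented probabilities:

from numpy import array, log

# Per-word conditional probabilities P(w_i | class 1), assumed for illustration
p_words_given_c1 = array([0.2, 0.05, 0.1])
doc_vec = array([1, 0, 1])  # the document contains word 0 and word 2

# Independence assumption: P(doc | c1) is the product of P(w_i | c1) over the words
# present, which in log space becomes a sum, the same form used in classifyNB.
log_likelihood = sum(doc_vec * log(p_words_given_c1))
print(log_likelihood)  # log(0.2) + log(0.1)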
