A tutorial on machine learning: implementing a naive Bayes classifier in Python from scratch


The naive Bayes algorithm is simple and efficient, and it is one of the first methods to try on a classification problem.

In this tutorial, you'll learn the fundamentals of the naive Bayes algorithm and a step-by-step implementation in Python.

Update: For follow-up tips on using naive Bayes, see "Better Naive Bayes: 12 Tips to Get the Most from the Naive Bayes Algorithm".
Naive Bayes classifier (photo by Matt Buck, some rights reserved)
About naive Bayes

Naive Bayes is an intuitive method that uses the probability of each attribute belonging to each class to make predictions. It is the supervised learning approach you would use if you wanted to model a predictive modeling problem probabilistically.

Naive Bayes simplifies the calculation of probabilities by assuming that, given a class, the probability of each attribute belonging to that class is independent of all the other attributes. This strong assumption produces a fast and effective method.

The probability of a class given an attribute value is called the conditional probability. For a given class value, multiplying together the conditional probabilities of each attribute gives the probability that a data sample belongs to that class.
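In formula form, this is Bayes' theorem combined with the independence assumption (a standard result, not specific to this tutorial's code). For a sample with attribute values $x_1, \dots, x_n$ and a class $C$:

$$P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

The constant denominator $P(x_1, \dots, x_n)$ can be dropped because we only compare classes against each other; note that the implementation in this tutorial additionally omits the class prior $P(C)$ and multiplies only the per-attribute conditional probabilities.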

We can calculate the probability of a sample belonging to each class, and then select the class with the highest probability as the prediction.
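As a minimal sketch of this multiply-and-pick-the-maximum idea (the conditional probabilities below are invented for illustration, not taken from any dataset):

# Hypothetical per-attribute conditional probabilities P(attribute_i | class)
# for a single sample; the numbers are made up for illustration.
cond_probs = {0: [0.6, 0.3], 1: [0.2, 0.8]}

scores = {}
for label, probs in cond_probs.items():
  score = 1.0
  for p in probs:
    score *= p  # naive independence: multiply per-attribute probabilities
  scores[label] = score

# Choose the class with the highest combined probability.
prediction = max(scores, key=scores.get)
print(scores, prediction)  # scores are approximately {0: 0.18, 1: 0.16}; predicts class 0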

Naive Bayes is usually described using categorical data because the probabilities are easy to describe and compute as ratios. For our purposes the algorithm must support numeric attributes, so we assume each numeric attribute follows a normal distribution (a bell curve). This is a strong assumption, but it still gives robust results.
Predicting the onset of diabetes

The test problem used in this tutorial is the "Pima Indians Diabetes Problem".

The problem comprises 768 medical observations of Pima Indian patients; the records are instantaneous measurements such as the patient's age, the number of pregnancies, and blood test results. All patients are women aged 21 or older. All attributes are numeric, with varying units.

Each record is assigned a class value that indicates whether the patient developed diabetes within 5 years of the time the measurements were taken: 1 if yes, 0 otherwise.

This standard dataset has been studied extensively in the machine learning literature, with 70%-76% considered good prediction accuracy.

Below is a sample from the pima-indians-diabetes.data.csv file, to get a sense of the data we will be working with.

Note: Download the file and save it with a .csv extension (e.g., pima-indians-diabetes.data.csv). A description of all the attributes is available with the dataset.

 
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1

Naive Bayes Algorithm Tutorial

The tutorial is divided into the following steps:

1. Process data: load data from a CSV file and split it into a training set and a test set.

2. Extract data features: extract the attribute features of the training dataset so that we can calculate probabilities and make predictions.

3. Make a prediction: generate a single prediction using the features of the dataset.

4. Make multiple predictions: generate predictions for a given test dataset from a training dataset with extracted features.

5. Evaluate accuracy: evaluate the accuracy of predictions on the test dataset as a percentage correct.

6. Merge code: tie everything together into a complete, standalone implementation of the naive Bayes algorithm.

1. Processing data

First, load the data file. The data in CSV format has no header row and no quotes. We can open the file with Python's built-in open function and read the rows with the reader function from the csv module.

We also need to convert the attributes, which are loaded as strings, into numbers we can work with. Below is the loadcsv() function for loading the Pima Indians dataset.

 
import csv

def loadcsv(filename):
  # Read every row from the CSV file and convert each value to a float.
  lines = csv.reader(open(filename, "r"))
  dataset = list(lines)
  for i in range(len(dataset)):
    dataset[i] = [float(x) for x in dataset[i]]
  return dataset

We can test this function by loading the Pima Indians dataset and printing the number of data samples.

 
filename = 'pima-indians-diabetes.data.csv'
dataset = loadcsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))

Run the test and you will see the following results:

 
Loaded data file pima-indians-diabetes.data.csv with 768 rows

Next, we split the data into a training dataset for naive Bayes to make predictions from, and a test dataset for evaluating the accuracy of the model. We randomly split the dataset into a training set with 67% of the data and a test set with 33% (a common ratio for testing algorithms on this dataset).

Below is the splitdataset() function, which splits the dataset according to a given ratio.

 
import random

def splitdataset(dataset, splitratio):
  # Randomly move samples into the training set until it reaches the
  # desired size; whatever remains in the copy becomes the test set.
  trainsize = int(len(dataset) * splitratio)
  trainset = []
  copy = list(dataset)
  while len(trainset) < trainsize:
    index = random.randrange(len(copy))
    trainset.append(copy.pop(index))
  return [trainset, copy]

We can test this with a contrived dataset of 5 samples: split it into a training dataset and a test dataset, then print which dataset each sample ends up in.

 
dataset = [[1], [2], [3], [4], [5]]
splitratio = 0.67
train, test = splitdataset(dataset, splitratio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))

Run the test and you will see the following results:

 
Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]

2. Extracting data features

The naive Bayes model consists of features summarized from the training dataset, which are then used to make predictions.

The features collected from the training data are the mean and standard deviation of each attribute for each class value. For example, if there are 2 class values and 7 numeric attributes, we need a mean and standard deviation for each combination of attribute (7) and class (2): 14 attribute features.

These features are used when calculating the probability that a particular attribute value belongs to each class.
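Concretely, the extracted features are stored as a map from each class value to a list of (mean, standard deviation) pairs, one pair per attribute. A hypothetical example for 2 classes and 2 attributes (the numbers are invented for illustration):

# Hypothetical summaries: class value -> [(mean, stdev) for each attribute].
summaries = {
  0: [(3.0, 1.4), (21.5, 0.7)],
  1: [(2.0, 1.4), (21.0, 1.4)],
}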

We divide the extraction of data features into the following subtasks:

Separating data by class
Calculating the mean
Calculating the standard deviation
Summarizing the dataset
Summarizing attributes by class

Separating data by class

First, separate the samples in the training dataset by class, then compute statistics for each class. We can create a map from each class value to the list of samples that belong to it, and sort every sample in the dataset into the appropriate list.

The separatebyclass() function below does this:

 
def separatebyclass(dataset):
  # Map each class value (the last attribute of a sample) to its samples.
  separated = {}
  for i in range(len(dataset)):
    vector = dataset[i]
    if vector[-1] not in separated:
      separated[vector[-1]] = []
    separated[vector[-1]].append(vector)
  return separated

Note that the function assumes the last attribute (index -1) of each sample is the class value, and it returns a map from class values to lists of data samples.

We can test it with some sample data:

 
dataset = [[1,20,1], [2,21,0], [3,22,1]]
separated = separatebyclass(dataset)
print('Separated instances: {0}'.format(separated))

Run the test and you will see the following results:

 
Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}

Calculating the mean

We need to calculate the mean of each attribute within each class. The mean is the central tendency of the data, and we use it as the center of the Gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute within each class. The standard deviation describes the spread of the data, and we use it to characterize the expected spread of each attribute's Gaussian distribution when calculating probabilities.

The standard deviation is the square root of the variance. The variance is the average of the squared deviations of each attribute value from the mean. Note that we use the N-1 method (see unbiased estimation), i.e., the divisor when computing the variance is the number of attribute values minus 1.
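In formula form, these are the standard sample statistics computed by the code below:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$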

 
import math

def mean(numbers):
  return sum(numbers) / float(len(numbers))

def stdev(numbers):
  # Sample standard deviation using the N-1 (unbiased) variance.
  avg = mean(numbers)
  variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
  return math.sqrt(variance)

Test these functions by calculating the mean and standard deviation of the numbers 1 through 5.

 
numbers = [1,2,3,4,5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))

Run the test and you will see the following results:

 
Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008

Summarizing the dataset

Now we can summarize the dataset. For a given list of samples (all belonging to the same class), we can calculate the mean and standard deviation of each attribute.

The zip function groups the values of each attribute across the samples into lists, so that we can compute the mean and standard deviation for each attribute.

 
def summarize(dataset):
  # zip(*dataset) groups values by attribute (column); summarize each one.
  summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
  del summaries[-1]  # remove the summary of the class attribute
  return summaries
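If zip(*dataset) looks unfamiliar: the * unpacks the rows of the dataset as separate arguments to zip, which then groups the values column by column. A quick illustration with made-up rows:

rows = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]
print(list(zip(*rows)))  # [(1, 2, 3), (20, 21, 22), (0, 1, 0)]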

We can test the summarize() function with some test data that shows a clear difference in the mean and standard deviation between the first and second attributes.

 
dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))

Run the test and you will see the following results:

 
Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]

Summarizing attributes by class

Putting the pieces together, we first separate the training dataset by class, then compute the summaries for each attribute.

 
def summarizebyclass(dataset):
  # Summarize the attributes of each class's samples separately.
  separated = separatebyclass(dataset)
  summaries = {}
  for classvalue, instances in separated.items():
    summaries[classvalue] = summarize(instances)
  return summaries

Test the summarizebyclass() function with a small test dataset.

 
dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]
summary = summarizebyclass(dataset)
print('Summary by class value: {0}'.format(summary))

Run the test and you will see the following results:

 
Summary by class value:
{0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)],
1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)]}

3. Making predictions

We can now use the summaries prepared from the training data to make predictions. Making a prediction involves calculating the probability that a given data sample belongs to each class, then selecting the class with the largest probability as the prediction.

We can divide this part into the following tasks:

Calculating the Gaussian probability density function
Calculating class probabilities
Making a single prediction
Estimating accuracy

Calculating the Gaussian probability density function

Given the mean and standard deviation of an attribute estimated from the training data, we can use a Gaussian function to estimate the probability of a given attribute value.

With the attribute summaries prepared for each attribute and class value, this gives the conditional probability of an attribute value given a class value.

For more detail on the Gaussian probability density function, see a statistics reference. In short, we plug the known details (the attribute value, mean, and standard deviation) into the Gaussian function and get the likelihood that the attribute value belongs to the class.
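For reference, the Gaussian probability density function that the code below implements is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the attribute within the class.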

In the calculateprobability() function below, we first compute the exponent, then the main body of the equation. This organizes the computation neatly into 2 lines.

 
import math

def calculateprobability(x, mean, stdev):
  # Gaussian probability density of x for the given mean and stdev.
  exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
  return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

Test it with some simple data:

 
x = 71.5
mean = 73
stdev = 6.2
probability = calculateprobability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))

Run the test and you will see the following results:

Probability of belonging to this class: 0.0624896575937

Calculating class probabilities

Now that we can calculate the probability of an attribute value belonging to a class, we can merge the probabilities of all the attribute values of a data sample to obtain the probability that the whole sample belongs to a class.

Probabilities are merged by multiplication. In the calculateclassprobabilities() function below, the probability that a given sample belongs to each class is obtained by multiplying together its attribute probabilities. The result is a map from class values to probabilities.

 
def calculateclassprobabilities(summaries, inputvector):
  # For each class, multiply together the Gaussian probabilities of
  # every attribute value of the input sample.
  probabilities = {}
  for classvalue, classsummaries in summaries.items():
    probabilities[classvalue] = 1
    for i in range(len(classsummaries)):
      mean, stdev = classsummaries[i]
      x = inputvector[i]
      probabilities[classvalue] *= calculateprobability(x, mean, stdev)
  return probabilities

Test the calculateclassprobabilities() function.

 
summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
inputvector = [1.1, '?']
probabilities = calculateclassprobabilities(summaries, inputvector)
print('Probabilities for each class: {0}'.format(probabilities))

Run the test and you will see the following results:

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}

Making a single prediction

Now that we can calculate the probability of a data sample belonging to each class, we can find the largest probability and return the associated class.

The predict() function below does this.

 
def predict(summaries, inputvector):
  # Return the class with the highest combined probability.
  probabilities = calculateclassprobabilities(summaries, inputvector)
  bestlabel, bestprob = None, -1
  for classvalue, probability in probabilities.items():
    if bestlabel is None or probability > bestprob:
      bestprob = probability
      bestlabel = classvalue
  return bestlabel

Test the predict() function as follows:

summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
inputvector = [1.1, '?']
result = predict(summaries, inputvector)
print('Prediction: {0}'.format(result))

Run the test and you will get the following results:

Prediction: A

4. Making multiple predictions

Finally, we can evaluate the model by making a prediction for each data sample in the test dataset. The getpredictions() function does this, returning a list of predictions, one per test sample.

 
def getpredictions(summaries, testset):
  # Make a prediction for every sample in the test set.
  predictions = []
  for i in range(len(testset)):
    result = predict(summaries, testset[i])
    predictions.append(result)
  return predictions

Test the getpredictions() function as follows.

 
summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
testset = [[1.1, '?'], [19.1, '?']]
predictions = getpredictions(summaries, testset)
print('Predictions: {0}'.format(predictions))

Run the test and you will see the following results:

 
Predictions: ['A', 'B']

5. Calculating accuracy

The predictions are compared with the class values in the test dataset to compute a classification accuracy between 0% and 100%. The getaccuracy() function calculates this accuracy.

 
def getaccuracy(testset, predictions):
  # Percentage of test samples whose class value was predicted correctly.
  correct = 0
  for x in range(len(testset)):
    if testset[x][-1] == predictions[x]:
      correct += 1
  return (correct / float(len(testset))) * 100.0

We can test the getaccuracy() function with the following simple code.

 
testset = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getaccuracy(testset, predictions)
print('Accuracy: {0}'.format(accuracy))

Run the test and you will get the following results:

Accuracy: 66.6666666667

6. Merging code

Finally, we tie all the code together.

Below is the full code for the complete, step-by-step Python implementation of naive Bayes.

# Example of Naive Bayes implemented from scratch in Python
import csv
import random
import math

def loadcsv(filename):
  lines = csv.reader(open(filename, "r"))
  dataset = list(lines)
  for i in range(len(dataset)):
    dataset[i] = [float(x) for x in dataset[i]]
  return dataset

def splitdataset(dataset, splitratio):
  trainsize = int(len(dataset) * splitratio)
  trainset = []
  copy = list(dataset)
  while len(trainset) < trainsize:
    index = random.randrange(len(copy))
    trainset.append(copy.pop(index))
  return [trainset, copy]

def separatebyclass(dataset):
  separated = {}
  for i in range(len(dataset)):
    vector = dataset[i]
    if vector[-1] not in separated:
      separated[vector[-1]] = []
    separated[vector[-1]].append(vector)
  return separated

def mean(numbers):
  return sum(numbers) / float(len(numbers))

def stdev(numbers):
  avg = mean(numbers)
  variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
  return math.sqrt(variance)

def summarize(dataset):
  summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
  del summaries[-1]
  return summaries

def summarizebyclass(dataset):
  separated = separatebyclass(dataset)
  summaries = {}
  for classvalue, instances in separated.items():
    summaries[classvalue] = summarize(instances)
  return summaries

def calculateprobability(x, mean, stdev):
  exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
  return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateclassprobabilities(summaries, inputvector):
  probabilities = {}
  for classvalue, classsummaries in summaries.items():
    probabilities[classvalue] = 1
    for i in range(len(classsummaries)):
      mean, stdev = classsummaries[i]
      x = inputvector[i]
      probabilities[classvalue] *= calculateprobability(x, mean, stdev)
  return probabilities

def predict(summaries, inputvector):
  probabilities = calculateclassprobabilities(summaries, inputvector)
  bestlabel, bestprob = None, -1
  for classvalue, probability in probabilities.items():
    if bestlabel is None or probability > bestprob:
      bestprob = probability
      bestlabel = classvalue
  return bestlabel

def getpredictions(summaries, testset):
  predictions = []
  for i in range(len(testset)):
    result = predict(summaries, testset[i])
    predictions.append(result)
  return predictions

def getaccuracy(testset, predictions):
  correct = 0
  for i in range(len(testset)):
    if testset[i][-1] == predictions[i]:
      correct += 1
  return (correct / float(len(testset))) * 100.0

def main():
  filename = 'pima-indians-diabetes.data.csv'
  splitratio = 0.67
  dataset = loadcsv(filename)
  trainingset, testset = splitdataset(dataset, splitratio)
  print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingset), len(testset)))
  # Prepare model
  summaries = summarizebyclass(trainingset)
  # Test model
  predictions = getpredictions(summaries, testset)
  accuracy = getaccuracy(testset, predictions)
  print('Accuracy: {0}%'.format(accuracy))

main()

Run the example and get the following output:

Split 768 rows into train=514 and test=254 rows
Accuracy: 76.3779527559%
