A Machine Learning Tutorial: Implementing a Naive Bayes Classifier from Scratch in Python

The naive Bayes algorithm is simple and effective, and it should be one of the first methods you try on a classification problem.

In this tutorial, you will learn how the naive Bayes algorithm works and implement it step by step in Python.

Update: see the follow-up article "Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm".

(Photo: Naive Bayes classifier, some rights reserved by Matt Buck.)
Naive Bayes

Naive Bayes is an intuitive method that uses the probability of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would use if you wanted to model a predictive modeling problem probabilistically.

Given a class, Naive Bayes assumes that the probability of each attribute belonging to this class is independent of all other attributes, thus simplifying the probability calculation. This strong assumption produces a fast and effective method.

The probability of a class value given an attribute value is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we obtain the probability that a data sample belongs to that class.

We can calculate the probability that the sample belongs to each class, and then select the class with the highest probability for prediction.
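
For example (a toy illustration with made-up conditional probabilities, not taken from this tutorial's dataset), here is the multiply-and-pick-the-largest idea in a few lines of Python:

# Toy illustration: hypothetical conditional probabilities for one data
# sample with two attributes and two classes.
condProbabilities = {
    0: [0.3, 0.8],  # P(attr1 | class=0), P(attr2 | class=0)
    1: [0.6, 0.1],  # P(attr1 | class=1), P(attr2 | class=1)
}
scores = {}
for classValue, probs in condProbabilities.items():
    score = 1.0
    for p in probs:
        score *= p  # multiply the conditional probabilities together
    scores[classValue] = score
print(scores)                       # approximately {0: 0.24, 1: 0.06}
print(max(scores, key=scores.get))  # 0 -- the class with the highest score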

Naive Bayes is usually described using categorical data because it is easy to describe and compute with ratios. A version of the algorithm that is useful for our purposes must support numeric attributes. Assuming that each numeric attribute follows a normal distribution (a bell-shaped curve) is another strong assumption, but it still gives robust results.
Predict the occurrence of Diabetes

The test problem used in this article is the "Pima Indians Diabetes" problem.

The problem comprises 768 medical observations of Pima Indian patients. Each record describes instantaneous measurements taken from a patient, such as age, number of pregnancies, and blood test values. All patients are women aged 21 or older, all attributes are numeric, and the attributes have different units.

Each record belongs to a class that indicates whether the patient developed diabetes within 5 years of the measurements. If so, the class value is 1; otherwise, it is 0.

This standard dataset has been studied many times in the machine learning literature; a good prediction accuracy is 70%-76%.

The following is a sample from the pima-indians-diabetes.data.csv file, to get a sense of the data we will be working with.

Note: download the file and save it with the .csv extension (for example, pima-indians-diabetes.data.csv). See the dataset's documentation for a description of all the attributes.
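
For reference, the nine columns in each record are: number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, age, and the class value (1 or 0).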

 6,148,72,35,0,33.6,0.627,50,1
 1,85,66,29,0,26.6,0.351,31,0
 8,183,64,0,0,23.3,0.672,32,1
 1,89,66,23,94,28.1,0.167,21,0
 0,137,40,35,168,43.1,2.288,33,1

Naive Bayes algorithm tutorial

The tutorial consists of the following steps:

1. Process data: load the data from a CSV file and split it into training and test datasets.

2. Extract data features: extract the properties (features) of the training dataset so that we can calculate probabilities and make predictions.

3. Make a single prediction: use the features of the dataset to generate a single prediction.

4. Make multiple predictions: generate a prediction for each sample in a given test dataset, based on a trained dataset with extracted features.

5. Evaluate accuracy: evaluate the accuracy of the predictions made for a test dataset.

6. Merge code: tie all of the code together into a complete, standalone implementation of the Naive Bayes algorithm.

1. Process Data

Load the data file first. The data in CSV format has no header row or any quotation marks. We can open the file with the built-in open function and read the rows with the csv module's reader function.

We also need to convert the attributes, which are loaded as strings, into numbers we can work with. Below is the loadCsv() function for loading the Pima Indians dataset.

import csv

def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

We can test this function by loading the Pima Indian dataset and then printing the number of data samples.

filename = 'pima-indians-diabetes.data.csv'
dataset = loadCsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))

Run the test and you will see the following results:

 Loaded data file pima-indians-diabetes.data.csv with 768 rows

Next, we split the data into a training dataset for Naive Bayes to make predictions from, and a test dataset for evaluating the model's accuracy. We need to split the dataset randomly into a 67% training set and a 33% test set (this is a common ratio for testing an algorithm on a dataset).

The following is the splitDataset () function, which divides the dataset according to the given proportion.

import random

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

We can define a dataset with five samples for testing. First we split it into a training dataset and a test dataset, then print each to see which data samples ended up where.

dataset = [[1], [2], [3], [4], [5]]
splitRatio = 0.67
train, test = splitDataset(dataset, splitRatio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))

Run the test and you will see the following results:

 Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]

2. Extract Data Features

The naive Bayes model consists of a summary of the features of the data in the training dataset. These features are then used when making predictions.

The features collected from the training data are the mean and standard deviation of each attribute, by class value. For example, if there are two class values and seven numeric attributes, we need a mean and standard deviation for each combination of attribute (7) and class value (2), that is, 14 attribute features.

These features are used to calculate and predict the probability of a specific attribute belonging to each class.

We divide data feature acquisition into the following subtasks:

Divide data by category
Calculate Mean Value
Calculate Standard Deviation
Extract dataset features
Extract attribute features by category

Divide data by category

First, the samples in the training dataset are separated by class value, and then the statistics for each class are computed. We can create a mapping from each class value to the list of samples of that class, and sort every sample in the dataset into the appropriate list.

The following separateByClass() function completes this task:

def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

As you can see, the function assumes that the last attribute (index -1) of each sample is the class value, and it returns a mapping from class values to lists of data samples.

We can use some sample data for testing as follows:

dataset = [[1,20,1], [2,21,0], [3,22,1]]
separated = separateByClass(dataset)
print('Separated instances: {0}'.format(separated))

Run the test and you will see the following results:

 Separated instances: {1: [[1, 20, 1], [3, 22, 1]], 0: [[2, 21, 0]]}

Calculate Mean Value

We need to calculate the mean of each attribute for each class value. The mean is the central tendency or middle of the data, and we will use it as the mean of our Gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute for each class value. The standard deviation describes the spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The standard deviation is the square root of the variance. The variance is the average of the squared differences between each attribute value and the mean. Note that we are using the N-1 method (see unbiased estimation), which subtracts 1 from the number of attribute values when calculating the variance.
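
Written as formulas, the sample mean and standard deviation of $N$ values $x_1, \ldots, x_N$ that the code below implements are:

 $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$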

import math

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

Test the functions by calculating the mean and standard deviation of the numbers 1 to 5.

numbers = [1,2,3,4,5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))

Run the test and you will see the following results:

 Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.5811388300841898

Extract dataset features

Now we can extract dataset features. For a given sample list (corresponding to a class), we can calculate the mean and standard deviation of each attribute.

The zip function groups the values of each attribute across our data samples into their own lists, so that we can compute the mean and standard deviation for each attribute.

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # remove the summary computed for the class value
    return summaries

We can use some test data to test this summarize () function. The test data shows a significant difference between the mean and standard deviation of the first and second data attributes.

dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))

Run the test and you will see the following results:

 Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]

Extract attribute features by category

Putting these pieces together, we first separate the training dataset by class value and then calculate the summaries for each attribute.

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

Use a small test dataset to test the summarizeByClass () function.

dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]
summary = summarizeByClass(dataset)
print('Summary by class value: {0}'.format(summary))

Run the test and you will see the following results:

 Summary by class value: {1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)], 0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)]}

3. Prediction

We can now use the summary obtained from the training data for prediction. Prediction involves calculating the probability that a given data sample belongs to each class, and then selecting the class with the highest probability as the prediction result.

We can divide this part into the following tasks:

Calculate Gaussian probability density function
Calculate the probability of the corresponding class
Single Prediction
Evaluation Accuracy

Calculate Gaussian probability density function

Given the mean and standard deviation of known attributes from the training data, we can use Gaussian Functions to evaluate the probability of a given attribute value.

Given that the attribute summaries were prepared for each attribute and class value, the result is the conditional probability of a given attribute value for a given class value.

For more information about the Gaussian probability density function, see the references. In short, we plug our known details (attribute value, mean, and standard deviation) into the Gaussian function and get back the likelihood that the attribute value belongs to the class.
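
For reference, the Gaussian probability density function that the code below implements is:

 $f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$

where $\mu$ is the mean and $\sigma$ is the standard deviation estimated from the training data.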

In the calculateProbability() function, we first calculate the exponent, then the main division of the equation. This lets us organize the formula neatly into two lines.

import math

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

Some simple data tests are as follows:

x = 71.5
mean = 73
stdev = 6.2
probability = calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))

Run the test and you will see the following results:

 Probability of belonging to this class: 0.0624896575937

Calculate the probability of a class

Since we can calculate the probability that an attribute belongs to a class, we can combine the probability of all attributes in a data sample to get the probability that the entire data sample belongs to a class.

In the calculateClassProbabilities() function below, the probability that a given data sample belongs to each class is obtained by multiplying together the conditional probabilities of its attribute values. The result is a mapping from class values to probabilities.
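
For example, for a data sample with two attribute values X1 and X2, the (unnormalized) score for class 0 is:

 $P(\text{class}=0 \mid X_1, X_2) \propto P(X_1 \mid \text{class}=0) \times P(X_2 \mid \text{class}=0)$

Note that this implementation multiplies only the conditional probabilities; it does not include the class prior P(class=0). Multiplying in the prior as well is a common refinement.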

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

Test the calculateClassProbabilities () function.

summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}
inputVector = [1.1, '?']
probabilities = calculateClassProbabilities(summaries, inputVector)
print('Probabilities for each class: {0}'.format(probabilities))

Run the test and you will see the following results:

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}
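
One practical caveat, not covered further in this tutorial: multiplying many small probabilities can underflow to zero on datasets with many attributes. A common remedy is to sum log-probabilities instead of multiplying raw probabilities. Here is a minimal sketch of such a variant, assuming the calculateProbability() function above is in scope and never returns exactly zero:

import math

def calculateLogClassProbabilities(summaries, inputVector):
    # Same structure as calculateClassProbabilities, but sums logs to
    # avoid floating-point underflow; the largest sum still wins.
    logProbabilities = {}
    for classValue, classSummaries in summaries.items():
        logProbabilities[classValue] = 0.0
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            logProbabilities[classValue] += math.log(calculateProbability(x, mean, stdev))
    return logProbabilities

Because the logarithm is monotonic, picking the class with the highest log-probability gives the same prediction as picking the class with the highest raw probability.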

Single Prediction

Since we can calculate the probability that a data sample belongs to each class, we can find the maximum probability value and return the associated class.

The following predict () function can complete the preceding tasks.

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

Test the predict () function as follows:

summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
inputVector = [1.1, '?']
result = predict(summaries, inputVector)
print('Prediction: {0}'.format(result))

Run the test and you will get the following results:

Prediction: A

4. Multiple Predictions

Finally, we can estimate the accuracy of the model by making predictions for each data sample in the test dataset. The getPredictions() function does this and returns a list of predictions, one for each test sample.

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

Test the getPredictions () function as follows.

summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
testSet = [[1.1, '?'], [19.1, '?']]
predictions = getPredictions(summaries, testSet)
print('Predictions: {0}'.format(predictions))

Run the test and you will see the following results:

 Predictions: ['A', 'B']

5. Computing Accuracy

The predictions can be compared to the class values in the test dataset, and the classification accuracy can be calculated as a ratio between 0% and 100%. The getAccuracy() function calculates this accuracy.

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

We can use the following simple code to test the getAccuracy () function.

testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: {0}'.format(accuracy))

Run the test and you will get the following results:

Accuracy: 66.6666666667

6. Merge Code

Finally, we tie all of the code together.

The following is all the code for the Python version of Naive Bayes.

# Example of Naive Bayes implemented from scratch in Python
import csv
import random
import math

def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # remove the summary computed for the class value
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    filename = 'pima-indians-diabetes.data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()

Run the example and you will get output like the following:

 Split 768 rows into train=514 and test=254 rows
 Accuracy: 76.3779527559%
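
The exact accuracy will differ from run to run, because splitDataset() uses random.randrange to choose the training samples. If you want repeatable results while experimenting, one option is to seed the random number generator at the start of main():

import random
random.seed(7)  # any fixed integer makes the train/test split repeatable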
