Machine learning: Bayesian classifier (ii)--Gaussian naive Bayesian classifier code implementation

Implementation of a Gaussian naive Bayes classifier
    • Searching online turns up very few implementations of a naive Bayes classifier that do not call sklearn, and the ones that do exist are mostly the multinomial or Bernoulli variants combined with text classification. So I wrote a directly encapsulated Gaussian NB classifier. Compared with the real library source code it has far fewer attributes and methods; interested readers can add their own. The code is as follows (with detailed comments):
import numpy as np

class NaiveBayes():
    """Gaussian naive Bayes classifier."""
    def __init__(self):
        self._x_train = None
        self._y_train = None
        self._classes = None
        self._priorlist = None
        self._meanmat = None
        self._varmat = None

    def fit(self, X_train, y_train):
        self._x_train = X_train
        self._y_train = y_train
        self._classes = np.unique(self._y_train)         # the distinct class labels
        priorlist = []
        meanmat0 = np.zeros((1, X_train.shape[1]))        # placeholder first row, removed below
        varmat0 = np.zeros((1, X_train.shape[1]))
        for i, c in enumerate(self._classes):             # mean, variance and prior of each class
            x_index_c = self._x_train[np.where(self._y_train == c)]        # "matrix" of the samples belonging to class c
            priorlist.append(x_index_c.shape[0] / self._x_train.shape[0])  # prior probability of class c
            x_index_c_mean = np.mean(x_index_c, axis=0, keepdims=True)     # per-feature means, kept 2-D, e.g. [[3 4 6 2 1]]
            x_index_c_var = np.var(x_index_c, axis=0, keepdims=True)       # per-feature variances
            meanmat0 = np.append(meanmat0, x_index_c_mean, axis=0)         # stack the class means into one matrix, one row per class
            varmat0 = np.append(varmat0, x_index_c_var, axis=0)
        self._priorlist = priorlist
        self._meanmat = meanmat0[1:, :]                   # drop the redundant first row
        self._varmat = varmat0[1:, :]

    def predict(self, x_test):
        eps = 1e-10                                       # prevents a zero denominator
        classof_x_test = []                               # predicted class of each test instance
        for x_sample in x_test:
            matx_sample = np.tile(x_sample, (len(self._classes), 1))  # repeat the instance, one row per class
            mat_numerator = np.exp(-(matx_sample - self._meanmat) ** 2 / (2 * self._varmat + eps))
            mat_denominator = np.sqrt(2 * np.pi * self._varmat + eps)
            list_log = np.sum(np.log(mat_numerator / mat_denominator), axis=1)  # sum the log class-conditional densities within each class
            prior_class_x = list_log + np.log(self._priorlist)         # add the log of the class prior
            prior_class_x_index = np.argmax(prior_class_x)             # index of the largest log posterior
            classof_x = self._classes[prior_class_x_index]             # the corresponding class label
            classof_x_test.append(classof_x)
        return classof_x_test

    def score(self, X_test, y_test):
        preds = self.predict(X_test)                      # predict once, not once per element
        j = sum(1 for p, t in zip(preds, y_test) if p == t)
        return 'accuracy: {:.10%}'.format(j / len(y_test))
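Before moving on, here is a quick sanity check I added (not part of the original post): comparing the hand-written classifier's predictions with sklearn's GaussianNB, which fits the same one-Gaussian-per-feature-per-class model. The split parameters are arbitrary choices.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Sketch: both models fit a Gaussian per class and feature,
# so their predictions should agree almost everywhere.
iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

ours = NaiveBayes()
ours.fit(X_tr, y_tr)
theirs = GaussianNB().fit(X_tr, y_tr)

agree = np.mean(np.array(ours.predict(X_te)) == theirs.predict(X_te))
print('agreement with sklearn GaussianNB: {:.2%}'.format(agree))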
    • For this hand-written Gaussian NB classifier, testing on the iris data gives the same results as the sklearn library's classifier, with accuracy generally hovering around 93-96%. The spread comes from repeating the random 8:2 split, which amounts to repeated hold-out evaluation. For a more reliable accuracy estimate, cross-validation and several evaluation metrics could be used; these are not implemented here, though a small sketch follows the test code below.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Load the data set and make an 8:2 train/test split
iris = datasets.load_iris()
X = iris.data
y = iris.target
# print(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

nb = NaiveBayes()
nb.fit(X_train, y_train)
print(nb.predict(X_test))
print(nb.score(X_test, y_test))

# Output:
# [0, 2, 1, 1, 1, 2, 1, 0, 2, 0, 1, 1, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 2, 2, 2, 0, 1, 0, 1, 0]
# accuracy: 96.6666666667%
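As mentioned above, a more reliable estimate comes from cross-validation. Here is a minimal sketch using sklearn's KFold (my addition; the fold count and random_state are arbitrary):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold

# 5-fold cross-validation of the hand-written classifier defined above
X, y = datasets.load_iris(return_X_y=True)
accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    nb = NaiveBayes()
    nb.fit(X[train_idx], y[train_idx])
    preds = np.array(nb.predict(X[test_idx]))
    accs.append(np.mean(preds == y[test_idx]))      # accuracy on this fold
print('mean accuracy over 5 folds: {:.2%}'.format(np.mean(accs)))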
Two other notes
    • Naive Bayes is built on the attribute conditional independence assumption, which is often hard to satisfy in practice; this has given rise to "semi-naive Bayes classifiers". The basic idea is to take appropriate account of the interdependence among a subset of attributes, so that neither a full joint-probability computation is needed nor strong attribute dependencies are completely ignored. "One-dependent estimation" (ODE) is the most common strategy: it assumes each attribute depends on at most one other attribute besides the class. Methods of this family include SPODE, TAN and AODE; the formula sketch below makes the contrast with plain naive Bayes concrete.
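As a sketch (my notation, not from the original post): plain naive Bayes scores class c for a sample x = (x_1, ..., x_d) as P(c) * P(x_1|c) * ... * P(x_d|c), whereas a one-dependent estimator scores it as P(c) * P(x_1|c, pa_1) * ... * P(x_d|c, pa_d), where pa_i is the single parent attribute that x_i is allowed to depend on. SPODE fixes one shared "super-parent" attribute for all features, TAN arranges the dependencies in a tree, and AODE averages over a set of SPODE models.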
    • np.unique(): returns a new array of the distinct elements of the input array, sorted in ascending order.
y = np.array([1, 2, 9, 1, 2, 3])
classes = np.unique(y)        # new array of the distinct elements of y, i.e. array([1, 2, 3, 9])
print(classes)                # result is np.array([1, 2, 3, 9])
    • np.where(): conditional operations on arrays.
'''
1. np.where(condition, x, y)
   Where the condition holds, take x; where it does not, take y.
'''
a = np.array([[9, 7, 3], [4, 5, 2], [6, 3, 8]])
b = np.where(a > 5, 1, 0)     # elements of a greater than 5 become 1, the rest become 0
print(b)
# Output:
# [[1 1 0]
#  [0 0 0]
#  [1 0 1]]
'''
2. np.where(condition)
   With only a condition and no x or y, outputs the coordinates of the elements
   that satisfy the condition (equivalent to numpy.nonzero). The coordinates are
   given as a tuple: the tuple contains as many arrays as the input array has
   dimensions, one array of coordinates per dimension.
'''
c = np.array([[9, 7, 3], [4, 5, 2], [6, 3, 8]])
d = np.where(c > 5)           # condition: elements greater than 5
print(d)
# Output (a tuple):
# (array([0, 0, 2, 2], dtype=int64), array([0, 1, 0, 2], dtype=int64))
# i.e. the elements at indices (0,0), (0,1), (2,0) and (2,2) satisfy the condition.

a = np.array([1, 3, 6, 9, 0])
b = np.where(a > 5)
print(b)                      # (array([2, 3], dtype=int64),)
# The elements at indices 2 and 3 satisfy the condition. Note the trailing comma:
# even for a 1-D input the result is a tuple, consistent with the
# multi-dimensional case.

# The result can be used directly as an index into an array:
x = np.array([[1, 5, 8, 1], [2, 4, 6, 8], [3, 6, 7, 9], [6, 8, 3, 1]])
print(x[b])                   # rows 2 and 3 of x:
# [[3 6 7 9]
#  [6 8 3 1]]
# Equivalent to x[[2, 3]]; by contrast x[2, 3] gives the single element 9,
# and x[[2], [3]] gives [9].
