Implementing a Gaussian naive Bayes classifier
- Searching online turns up very few naive Bayes implementations that do not call sklearn, and the ones that exist are mostly the multinomial or Bernoulli variants tied to text classification. So I wrote my own Gaussian NB classifier, wrapped directly in a class. Compared with the real source code it has far fewer attributes and methods, of course; anyone interested can add more. The code is as follows (with detailed comments):
    import numpy as np


    class NaiveBayes:
        """Gaussian naive Bayes classifier."""

        def __init__(self):
            self._X_train = None
            self._y_train = None
            self._classes = None
            self._priorlist = None
            self._meanmat = None
            self._varmat = None

        def fit(self, X_train, y_train):
            self._X_train = X_train
            self._y_train = y_train
            self._classes = np.unique(self._y_train)  # the distinct class labels
            n_features = self._X_train.shape[1]
            priorlist = []
            # Placeholder first row, sized to the data rather than hard-coded to 4 features
            meanmat0 = np.zeros((1, n_features))
            varmat0 = np.zeros((1, n_features))
            for c in self._classes:
                # Compute the mean, variance and prior probability of each class
                X_index_c = self._X_train[np.where(self._y_train == c)]  # "matrix" of the samples belonging to class c
                priorlist.append(X_index_c.shape[0] / self._X_train.shape[0])  # prior probability of class c
                X_index_c_mean = np.mean(X_index_c, axis=0, keepdims=True)  # mean of each feature within the class; stays 2-D, e.g. [[3 4 6 2 1]]
                X_index_c_var = np.var(X_index_c, axis=0, keepdims=True)  # variance
                meanmat0 = np.append(meanmat0, X_index_c_mean, axis=0)  # stack the per-class feature means; each row is one class
                varmat0 = np.append(varmat0, X_index_c_var, axis=0)
            self._priorlist = priorlist
            self._meanmat = meanmat0[1:, :]  # drop the extra first row
            self._varmat = varmat0[1:, :]

        def predict(self, X_test):
            eps = 1e-10  # keeps the denominator from being 0
            classof_X_test = []  # holds the predicted class of each test instance
            for x_sample in X_test:
                matx_sample = np.tile(x_sample, (len(self._classes), 1))  # repeat the instance row-wise; number of rows = number of classes
                mat_numerator = np.exp(-(matx_sample - self._meanmat) ** 2 / (2 * self._varmat + eps))
                mat_denominator = np.sqrt(2 * np.pi * self._varmat + eps)
                list_log = np.sum(np.log(mat_numerator / mat_denominator), axis=1)  # sum of the log class-conditional probabilities, per class
                prior_class_x = list_log + np.log(self._priorlist)  # add the log prior of each class
                prior_class_x_index = np.argmax(prior_class_x)  # index of the largest log probability
                classof_x = self._classes[prior_class_x_index]  # class label for this instance
                classof_X_test.append(classof_x)
            return classof_X_test

        def score(self, X_test, y_test):
            y_pred = self.predict(X_test)  # predict once instead of on every loop iteration
            j = 0
            for i in range(len(y_pred)):
                if y_pred[i] == y_test[i]:
                    j += 1
            return 'accuracy: {:.10%}'.format(j / len(y_test))
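One numerical caveat about `predict`: it exponentiates first and takes the logarithm afterwards, so for a sample far from a class mean the `exp` can underflow to 0 and the log-likelihood becomes `-inf`. The sketch below compares that formulation with one where the logarithm is expanded by hand. The helper names `log_likelihood_via_exp` and `log_likelihood_direct` and the numbers are made up for illustration; they are not part of the class above.

```python
import numpy as np

EPS = 1e-10  # same role as eps in predict: keeps the variance terms away from 0


def log_likelihood_via_exp(x, mean, var):
    """Per-class log-likelihood computed as log(exp(...) / sqrt(...))."""
    numerator = np.exp(-(x - mean) ** 2 / (2 * var + EPS))
    denominator = np.sqrt(2 * np.pi * var + EPS)
    with np.errstate(divide="ignore"):  # silence the log(0) warning in this demo
        return np.sum(np.log(numerator / denominator), axis=1)


def log_likelihood_direct(x, mean, var):
    """Same quantity with the logarithm expanded analytically, so exp() is never called."""
    return np.sum(
        -((x - mean) ** 2) / (2 * var + EPS) - 0.5 * np.log(2 * np.pi * var + EPS),
        axis=1,
    )


# Two classes, two features (illustrative values only)
mean = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.array([[0.01, 0.01], [1.0, 1.0]])

near = np.array([0.1, -0.1])  # close to class 0: both versions agree
far = np.array([100.0, 0.0])  # far from both classes: exp() underflows to 0

print(log_likelihood_via_exp(near, mean, var), log_likelihood_direct(near, mean, var))
print(log_likelihood_via_exp(far, mean, var), log_likelihood_direct(far, mean, var))
```

For the distant sample the first version returns `-inf` for every class, so `argmax` can no longer tell the classes apart, while the direct version still returns finite values.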
- Tested on the iris data, the hand-written Gaussian NB classifier gives the same results as the sklearn library's classifier, with accuracy hovering around 93-96%. The spread comes from repeating the 8:2 split several times, which amounts to running the hold-out method multiple times. For a more reliable accuracy estimate, one could use cross-validation and several evaluation metrics; that is not implemented here.
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing

    # Load the data set and split it 8:2
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    # print(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    nb = NaiveBayes()
    nb.fit(X_train, y_train)
    print(nb.predict(X_test))
    print(nb.score(X_test, y_test))

    # Output:
    # [0, 2, 1, 1, 1, 2, 1, 0, 2, 0, 1, 1, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 2, 2, 2, 0, 1, 0, 1, 0]
    # accuracy: 96.6666666667%
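The cross-validation mentioned above could look roughly like the following NumPy-only sketch. Everything in it is an assumption made for illustration: `k_fold_accuracy` is a hypothetical helper, `nearest_mean` is a simple stand-in classifier rather than the Gaussian NB class, and the data is synthetic rather than iris.

```python
import numpy as np


def k_fold_accuracy(fit_predict, X, y, k=5, seed=0):
    """Return the k per-fold accuracies of fit_predict(X_train, y_train, X_test) -> y_pred."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # k disjoint index sets
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(np.asarray(y_pred) == y[test_idx]))
    return np.array(scores)


def nearest_mean(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: assign each sample to the closest class mean."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]


# Synthetic, well-separated data (not iris): 3 classes, 50 samples each, 4 features
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=4.0 * c, scale=1.0, size=(50, 4)) for c in range(3)])
y = np.repeat(np.arange(3), 50)

scores = k_fold_accuracy(nearest_mean, X, y, k=5)
print(scores, scores.mean())
```

To evaluate the classifier from this article instead, one would pass a small wrapper that creates a `NaiveBayes()`, calls `fit` on the training fold and returns `predict` on the test fold. Averaging over the folds gives a steadier estimate than a single 8:2 split.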
Part 2: Other notes
- Naive Bayes rests on the attribute conditional independence assumption, which rarely holds in practice; this gave rise to the "semi-naive Bayes classifier". The basic idea is to take some of the dependencies among attributes into account, so that a full joint probability computation is not needed, yet strong attribute dependencies are not ignored entirely. The "one-dependent estimator" (ODE) is the most common strategy: it assumes that each attribute depends on at most one other attribute besides the class. Examples include the SPODE, TAN and AODE methods.
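Written as formulas, the contrast with plain naive Bayes looks as follows. The notation is an assumption chosen for this sketch and does not come from the text above: $d$ is the number of attributes, $c$ the class, and $pa_i$ the single attribute that $x_i$ is allowed to depend on.

```latex
% Naive Bayes: every attribute depends only on the class c
P(c \mid \mathbf{x}) \propto P(c) \prod_{i=1}^{d} P(x_i \mid c)

% One-dependent estimator: each attribute x_i may also depend on one parent attribute pa_i
P(c \mid \mathbf{x}) \propto P(c) \prod_{i=1}^{d} P(x_i \mid c, pa_i)
```

The methods listed above differ mainly in how the parent attribute $pa_i$ is chosen for each $x_i$.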
- np.unique(): returns a new array holding the unique elements of the original array, sorted from smallest to largest.
    y = np.array([1, 2, 9, 1, 2, 3])
    classes = np.unique(y)  # new array of the unique elements of y: array([1, 2, 3, 9])
    print(classes)  # the result is np.array([1, 2, 3, 9])
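As a small aside, `np.unique` can also report how often each value occurs through its `return_counts` parameter, which gives the class priors in one step. This is a sketch of an alternative, not what the `fit` method above does.

```python
import numpy as np

y = np.array([1, 2, 9, 1, 2, 3])

# return_counts=True also returns how many times each unique value occurs
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)  # relative frequency of each class

print(classes)
print(counts)
print(priors)
```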
- np.where(): operates on an array and has two forms.
1. np.where(condition, x, y): where the condition holds, the result takes x; where it does not, the result takes y.

    a = np.array([[9, 7, 3], [4, 5, 2], [6, 3, 8]])
    b = np.where(a > 5, 1, 0)  # elements of a greater than 5 become 1, all others become 0
    print(b)

Output:

    [[1 1 0]
     [0 0 0]
     [1 0 1]]
"2. Np.where (condition) only condition (condition), without x and y, the output satisfies the conditional element's coordinates (equivalent to Numpy.nonzero). Here the coordinates are given in the form of a tuple, usually the original array has how many dimensions, the output tuple contains several arrays, respectively, corresponding to the dimension coordinates of the conditional element. "C = Np.array ([[9, 7, 3], [4, 5, 2], [6, 3, 8]]) d = Np.where (C > 5) #条 Pieces for elements greater than 5print (d) output as follows (tuple): (Array ([0, 0, 2, 2], Dtype=int64), array ([0, 1, 0, 2], Dtype=int64)) indicates that the following table is 00, and 01 20,22 elements meet the criteria. A = Np.array ([1,3,6,9,0]) b = Np.where (a > 5) print (b) output (Array ([2, 3], dtype=int64), the element that coordinates 2 and 3 satisfies, note the comma at the end, Indicates that the one-dimensional real output tuple is two-dimensional, 2_,3_ is nothing but back, a dimension of greater than or equal to 2 o'clock, the same tuple and a-dimensional number. The result of the output is that it can be directly labeled as an array. x = Np.array ([[1, 5, 8, 1], [2, 4, 6, 8], [3, 6, 7, 9], [6, 8, 3, 1]] print (x[b]) The result is an array of the 2nd, 3 rows of x [[3] 6 7] [9 6 8 3]] , equivalent to x[[2,3]],x[2,3] output as an element 9,x[[2],[3]] [9].