(i) Naive Bayes multinomial event model
In the previous note, the most basic NB model was the multivariate Bernoulli event model (hereinafter NB-MBEM). That model has several extensions, two of which were mentioned in the previous note: letting each component take multiple values, which generalizes $p(x_i \mid y)$ from a Bernoulli distribution to a multinomial distribution, and discretizing continuous feature values. Here we introduce an NB model that differs substantially from the multivariate Bernoulli event model: the multinomial event model (hereinafter NB-MEM). First, NB-MEM changes the representation of the feature vector. In NB-MBEM, each component of the feature vector indicates whether the word at that index of the dictionary appears in the text; its range is $\{0, 1\}$, and the length of the feature vector is the size of the dictionary. In NB-MEM, the value of each component is the dictionary index of the word occupying that position in the text; its range is $\{1, \ldots, |V|\}$, where $|V|$ is the size of the dictionary, and the length of the feature vector equals the number of words in the corresponding sample text. Formally, the $m$ training samples are written as $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$ with $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)})$,
where the $i$-th sample contains $n_i$ words in total and $x_j^{(i)} \in \{1, \ldots, |V|\}$ is the dictionary index of its $j$-th word. The same document thus yields quite different feature vectors under NB-MBEM and NB-MEM, as the sketch below illustrates.
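A concrete sketch, using a hypothetical six-word dictionary (any small vocabulary would do):

```python
# Hypothetical 6-word dictionary; indices are 1-based, as in the note.
vocab = ["a", "buy", "cheap", "now", "offer", "the"]

doc = ["buy", "cheap", "buy", "now"]  # a toy document with n_i = 4 words

# NB-MBEM: one 0/1 component per dictionary word (length |V| = 6)
x_mbem = [1 if w in doc else 0 for w in vocab]   # -> [0, 1, 1, 1, 0, 0]

# NB-MEM: one component per document position (length n_i = 4),
# each holding that word's dictionary index in {1, ..., |V|}
x_mem = [vocab.index(w) + 1 for w in doc]        # -> [2, 3, 2, 4]
```

Note that the repeated word "buy" stays visible in the NB-MEM vector but collapses to a single 1 in the NB-MBEM vector.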
In NB-MEM, the text is assumed to be generated as follows:
1. Determine the category of the text, e.g. whether it is spam, or a finance or education document.
2. Traverse the positions of the text, generating the word at each position from the same multinomial distribution, independently of the other positions.

From this generative process, NB-MEM assumes that the text category follows a multinomial distribution (a Bernoulli distribution in the two-class case), while the words of the dictionary follow a single multinomial distribution per class. The process can also be read as: first pick a category from the distribution over classes, then traverse the text, at each position drawing a word from that class's multinomial distribution over the dictionary and placing it there. Thus, the parameters of NB-MEM are the following.
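In the standard notation of the lecture, these are the class prior and one shared distribution over words per class:

$$\phi_y = p(y = 1), \qquad \phi_{k|y=1} = p(x_j = k \mid y = 1), \qquad \phi_{k|y=0} = p(x_j = k \mid y = 0),$$

where $\phi_{k|y}$ is the same for every position $j$.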
From this we can write down the likelihood of the parameters on the training set.
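For $m$ independent samples it is:

$$\mathcal{L}(\phi_y, \phi_{k|y=0}, \phi_{k|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}) = \prod_{i=1}^{m} \left( \prod_{j=1}^{n_i} p(x_j^{(i)} \mid y^{(i)}) \right) p(y^{(i)}).$$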
Maximizing the likelihood with respect to each parameter yields the maximum likelihood estimates.
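These are counting estimates:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m},$$

with $\phi_{k|y=0}$ obtained analogously by replacing $y^{(i)} = 1$ with $y^{(i)} = 0$.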
Applying Laplace smoothing to $\phi_{k|y=1}$ and $\phi_{k|y=0}$ changes the formulas as follows.
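Add 1 to every word count in the numerator and $|V|$ to the denominator:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|},$$

and analogously for $\phi_{k|y=0}$.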
Here $|V|$ is the size of the dictionary. Compared with the corresponding NB-MBEM formulas, the denominator gains the document lengths $n_i$, and the indicator in the numerator tests the word index $k$ rather than a 0/1 component value. Consider the formula for $\phi_{k|y=1}$. The numerator sums over all messages labeled 1, i.e. it considers only spam, and then sums over all the words in each spam message, counting how many times word $k$ appears; in other words, the numerator is the total number of occurrences of word $k$ across all spam in the training set. The denominator sums over the training samples and, whenever a sample is spam ($y = 1$), adds its length, so it equals the total length of all spam in the training set. The ratio is therefore the fraction that word $k$ makes up of all words in spam messages. As an example:
Suppose the messages contain only the three words a, b, c, whose dictionary positions are 1, 2, 3 respectively; the first two messages contain two words each, the last two contain three words each; and y = 1 marks spam. Then the parameter estimates follow by counting as above.
If a new e-mail contains the words b and c, its feature vector is $x = (2, 3)$, and substituting the estimates into Bayes' rule gives $p(y = 1 \mid x) = 0.6$: the e-mail is spam with probability 0.6.
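The whole computation fits in a short sketch. The four training e-mails below are a hypothetical assignment consistent with the setup above (three words, two short messages and two long ones), so the exact probability it prints differs slightly from 0.6:

```python
from collections import Counter

V = 3  # dictionary: a=1, b=2, c=3

# Hypothetical stand-in training set: (word indices, label), y=1 is spam.
train = [
    ([1, 2],    1),   # two-word e-mail
    ([2, 3],    1),   # two-word e-mail
    ([1, 2, 3], 0),   # three-word e-mail
    ([1, 1, 3], 0),   # three-word e-mail
]

def fit(data, V):
    counts = {0: Counter(), 1: Counter()}  # word counts per class
    lengths = {0: 0, 1: 0}                 # total words per class
    docs = {0: 0, 1: 0}                    # documents per class
    for words, y in data:
        counts[y].update(words)
        lengths[y] += len(words)
        docs[y] += 1
    # Laplace-smoothed phi_{k|y} and the class prior phi_y
    phi = {y: {k: (counts[y][k] + 1) / (lengths[y] + V)
               for k in range(1, V + 1)} for y in (0, 1)}
    return phi, docs[1] / len(data)

def p_spam(words, phi, phi_y):
    # p(y=1 | x) by Bayes' rule; plain products are fine at this scale
    num, den0 = phi_y, 1 - phi_y
    for k in words:
        num *= phi[1][k]
        den0 *= phi[0][k]
    return num / (num + den0)

phi, phi_y = fit(train, V)
print(p_spam([2, 3], phi, phi_y))  # new e-mail "b c" -> about 0.62 here
```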
Note how this differs from NB-MBEM: here a single $\phi_{k|y=1}$ is shared across all positions of all samples, whereas NB-MBEM has a separate $\phi_{x_j = 1|y=1}$ for each component $j$; moreover, the feature vectors in NB-MEM vary in length from sample to sample.
(ii) Neural networks

Each of the models introduced so far, whether the perceptron algorithm, logistic regression, or the naive Bayes model just presented (whose posterior $p(y \mid x)$ also takes the logistic form), ultimately draws a straight line or a hyperplane through the data. If the data is not linearly separable, the performance of these models degrades. Many algorithms exist for classifying non-linear data, and the neural network is one of the earliest. A logistic regression model can be drawn as shown:
where the $x_i$ are the components of the input feature vector, the sigmoid node is the computational unit, and output is the function's output. The sigmoid unit has a parameter vector $\theta$ and computes the following.
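It is the logistic function of a linear combination of the inputs:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}.$$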
The neural network then combines several such computational units, as shown below:
where $a_1, a_2, a_3$ are the outputs of the intermediate units. As the figure shows, the network has four parameter vectors, one for each of its four sigmoid units. The relationships between them are the following.
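With $\theta_1, \ldots, \theta_4$ denoting the parameters of the four sigmoid units (bias terms omitted for brevity), the figure corresponds to:

$$a_1 = g(\theta_1^T x), \qquad a_2 = g(\theta_2^T x), \qquad a_3 = g(\theta_3^T x), \qquad h_\Theta(x) = g(\theta_4^T a), \quad a = (a_1, a_2, a_3)^T.$$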
Learning these parameters requires a cost function.
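One natural choice is the squared error over the $m$ training samples:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2.$$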
This is the quadratic cost function mentioned in the video. Gradient descent can be used to minimize it and obtain the parameters; in neural networks, the gradient descent algorithm goes by a special name, the backpropagation algorithm. In the sample network above, the layer the input connects to directly is called the hidden layer, and the layer connected directly to the output is called the output layer. One characteristic of neural network algorithms is that we do not know what the hidden layer is computing; another is that neural networks have many local optima, so one can try several random initial values, run gradient descent from each, and keep the best result. The video then shows two applications of neural networks. One is a handwritten digit recognition application by Yann LeCun, who is known for his work on character recognition and convolutional neural networks. The other is NETtalk by Terry J. Sejnowski, a neural network that learns to read text aloud.
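A minimal sketch of the whole procedure, assuming a network shaped like the figure (three sigmoid hidden units and one sigmoid output, with constant bias inputs added by hand) and trained by backpropagation on the quadratic cost; XOR serves as a small non-linearly-separable problem, and several random initializations are tried as suggested above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # prepend the constant feature x0 = 1 to every row
    return np.hstack([np.ones((A.shape[0], 1)), A])

def forward(X, W1, W2):
    a = sigmoid(add_bias(X) @ W1)   # hidden unit outputs a1, a2, a3
    h = sigmoid(add_bias(a) @ W2)   # output unit
    return a, h

def train(X, y, seed, lr=0.5, steps=5000):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(X.shape[1] + 1, 3))  # input -> 3 hidden units
    W2 = rng.normal(size=(3 + 1, 1))           # hidden -> 1 output unit
    for _ in range(steps):
        a, h = forward(X, W1, W2)
        # backpropagate the quadratic cost J = 1/2 * sum (h - y)^2
        d_out = (h - y) * h * (1 - h)
        d_hid = (d_out @ W2[1:].T) * a * (1 - a)  # skip the bias weight
        W2 -= lr * add_bias(a).T @ d_out
        W1 -= lr * add_bias(X).T @ d_hid
    _, h = forward(X, W1, W2)
    return 0.5 * np.sum((h - y) ** 2)

# XOR: not separable by a single line, so a linear model fails on it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# several random initial values, keeping the best local optimum found
print(min(train(X, y, seed) for seed in range(5)))
```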
(iii) Support vector machines: functional and geometric margins

To understand support vector machines (SVMs), one must first understand the functional margin and the geometric margin. Assume the dataset is linearly separable. First, change notation: the label $y$ now takes values in $\{-1, 1\}$ instead of $\{0, 1\}$, and $g$ becomes a hard threshold rather than a sigmoid.
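That is, $g$ maps directly onto the two labels:

$$g(z) = \begin{cases} 1, & z \ge 0, \\ -1, & z < 0. \end{cases}$$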
The hypothesis $h$ changes correspondingly from

$$h_\theta(x) = g(\theta^T x) \tag{15}$$

to

$$h_{w,b}(x) = g(w^T x + b). \tag{16}$$

In Equation 15, $x, \theta \in \mathbb{R}^{n+1}$ with $x_0 = 1$; in Equation 16, $x, w \in \mathbb{R}^n$, and the intercept $b$ takes over the role that $x_0$ played in Equation 15.
From Equation 16, we see that $(w, b)$ uniquely determines a hyperplane. The functional margin of a point $(x^{(i)}, y^{(i)})$ with respect to the hyperplane determined by $(w, b)$ is:
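$$\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right). \tag{17}$$

A large positive value means the point is classified both correctly and confidently.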
The functional margin of the hyperplane with respect to the entire training set is:
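$$\hat{\gamma} = \min_{i=1,\ldots,m} \hat{\gamma}^{(i)},$$

i.e. the worst case over all training samples.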
Equation 17 has a useful property: for correctly classified data points, the functional margin is nonnegative.
The problem with the functional margin is that it can be made arbitrarily large simply by scaling $(w, b)$, without changing the hyperplane. To remove this freedom, we define the geometric margin as follows:
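$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right), \qquad \gamma = \min_{i=1,\ldots,m} \gamma^{(i)}.$$

For a correctly classified point, $\gamma^{(i)}$ is the Euclidean distance from the point to the hyperplane.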
That is, the geometric margin equals the functional margin under the normalization $\|w\| = 1$. The point of the functional and geometric margins is to add a further criterion to the model on the training set: the model should not only classify correctly, but also classify with confidence.