Yesterday I settled logistic regression; today there is another interesting model. Logistic regression is powerful, but it has a weakness: it can only tell you yes or no, not that X *is* Y. That is only a small step toward artificial intelligence. After solving the yes/no problem, we still need to solve the "what is it?" problem. Some clever computer scientist (perhaps a mathematician) devised a new model that can tell us *what*: the log-linear model.
-- Still far from real AI; perhaps that is a scientific problem rather than an engineering one. Notes from my second day studying Machine Learning.
1. The Log-linear Model
If we want the computer to tell us *what* something is, following the earlier idea, we should still start from probability. We need a model whose input is some information about an object and whose output is the probability of each label given that information. The model has parameters, and different labels correspond to different parameters. For a test example, we compare the probabilities the model (with its parameters) assigns to each label and infer the most likely one. The set of labels is finite, so the remaining task is to find the parameters corresponding to each label.
This set of parameters should maximize P(y|x; w), where y is the "what" (washing machine, lamp, refrigerator, ...) and x is a sample (color, shape, size, weight, price, material, ...).
Now that we have the concept, only one question remains: what does the model look like? Since it is a probabilistic model, two features seem important:
1. It should have more than one term, ideally one per (label, dimension) pair: if a sample has d dimensions, the equation should have d terms per label. After all, we would rather the sample dimensions stay decoupled during the computation.
2. Its value should lie in the range 0~1, because it is a probability.
Next we design a model with good mathematical properties, since we will eventually have to compute gradients with respect to the parameters... complicated enough already.
First, this time each term is no longer a function of x alone, because y is also diverse. A single term should therefore be written as F_ij(x_i, y_j).
Second, these terms should combine additively, and we use weights w_ij to control each term's contribution to P(y_j|x). The "combined effect" of the terms for a given label y_j is then:

score(y_j) = ∑_{i=1}^{d} w_ij · F_ij(x_i, y_j)
Next, this sum is not necessarily positive, so to guarantee a value greater than 0 we apply the old trick of exponentiation: exp(∑_i w_ij · F_ij(x_i, y_j)).
Finally, if we can make the whole thing less than 1, it basically meets our requirements. How? Divide by the same exponentiated score summed over all labels; that sum is always at least as large as any single label's score:

P(y_j|x; w) = exp(∑_i w_ij F_ij(x_i, y_j)) / ∑_{j'} exp(∑_i w_ij' F_ij'(x_i, y_j'))
What we have pieced together is, in fact, the log-linear model.
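The two steps above (exponentiate to get positivity, normalize over all labels to get values in 0~1) can be sketched in a few lines of Python. All names here (`log_linear_prob`, the feature function `F`) are illustrative, not from the original post:

```python
import numpy as np

def log_linear_prob(x, w, F, labels):
    """p(y|x; w) for every label under a log-linear model.

    x: sample vector of length d; w: weight matrix of shape (c, d);
    F: a feature function F(x_i, y) -> float; labels: list of c labels.
    """
    # Score each label: sum_i w[j, i] * F(x[i], y_j)
    scores = np.array([
        sum(w[j, i] * F(x[i], y) for i in range(len(x)))
        for j, y in enumerate(labels)
    ])
    # exp() makes every score positive...
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    # ...and dividing by the sum over all labels maps each into (0, 1]
    return exp_scores / exp_scores.sum()
```

Subtracting `scores.max()` before exponentiating does not change the result (it cancels in the ratio) but avoids overflow for large scores.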
So, given a training set, to make the labeling as accurate as possible we should make each p as large as possible; essentially, make each numerator as large as possible relative to its denominator. This gives us the training objective: maximize the product of P(y^(n)|x^(n); w) over the training examples, or equivalently the sum of their log-probabilities.
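For the common special case F_ij(x_i, y_j) = x_i · 1[y = y_j] (i.e. softmax regression), the objective and its gradient can be written compactly. This is a sketch under that assumption, not the post's own derivation; the gradient is the model's expected features minus the observed features:

```python
import numpy as np

def nll_and_grad(X, Y, w):
    """Average negative log-likelihood of a log-linear model and its gradient.

    Assumes features F_ij(x_i, y_j) = x_i * 1[y == y_j] (softmax regression).
    Shapes: X is (n, d), Y is (n,) integer labels, w is (c, d).
    """
    scores = X @ w.T                          # (n, c) label scores
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)         # p(y | x; w)
    n = X.shape[0]
    nll = -np.log(p[np.arange(n), Y]).mean()
    # Gradient = E_model[features] - observed features, averaged over examples
    p[np.arange(n), Y] -= 1.0
    grad = p.T @ X / n                        # (c, d), same shape as w
    return nll, grad
```

Stepping against this gradient (gradient descent on the negative log-likelihood) is exactly "make each numerator as large as possible" in practice.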
So the last question remains: how do we determine F?
2. The Feature Functions F
F is called a feature function; there are c·d of them in total, where c is the number of labels and d is the sample dimension. In other words, each (label, dimension) pair gets its own feature.
i = 1~d, j = 1~c
That is, the F_ij cover all the labels, and each label has its own d feature functions, all different. We can generate all the required F_ij automatically (the washing machine gets features numbered 1~d, the hair dryer gets d+1~2d, ...). This is a naive setup: each feature is used only when y takes one particular label, and when y takes any other label, that block of weights is masked. For example, when we discuss whether an object is a washing machine, the weights trained for the hair dryer are blocked out. This is essentially many logistic regressions running in parallel.
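The block numbering described above (label 0 owns features 1~d, label 1 owns d+1~2d, ...) can be made concrete with a small helper. Both function names here are hypothetical, chosen just for illustration:

```python
def feature_index(label_idx, dim_idx, d):
    """Naive block layout: label j owns features j*d .. j*d + d - 1,
    so another label's weights are masked out whenever this label is scored."""
    return label_idx * d + dim_idx

def joint_feature(x, label_idx, c):
    """F(x, y) flattened into one vector of length c*d: the block belonging
    to the active label holds a copy of x, every other block stays zero."""
    d = len(x)
    f = [0.0] * (c * d)
    for i, xi in enumerate(x):
        f[feature_index(label_idx, i, d)] = xi
    return f
```

With this layout, a single weight vector of length c·d dotted with `joint_feature(x, j, c)` reproduces the per-label score from section 1, which is why the setup behaves like c logistic regressions in parallel.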
For ease of use, we make the following design choice: F maps to 0/1. This is easy to interpret: a label and a sample dimension are either related or not, and *how much* the relation matters is left to the weight to adjust.
Here is a part-of-speech tagging example. Features A1~A4 relate to the noun discriminant: when y = noun, A1~A4 should all be active (taking the value 1 when they fire), while the other features are masked. Features A5~A8 may relate to the verb discriminant: if y = verb is being scored, the features related to nouns, adjectives, and so on are masked instead.
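As a sketch of that masking, here are eight hypothetical 0/1 feature functions (the specific suffix checks are my own illustrative choices, not from the original post):

```python
def pos_features(word, y):
    """Eight binary features for a toy part-of-speech model.
    A1..A4 fire only when y == "noun"; A5..A8 fire only when y == "verb"."""
    noun_checks = [
        word[0].isupper(),       # A1: capitalized word
        word.endswith("tion"),   # A2: noun-like suffix
        word.endswith("ness"),   # A3: noun-like suffix
        word.endswith("ity"),    # A4: noun-like suffix
    ]
    verb_checks = [
        word.endswith("ing"),    # A5: verb-like suffix
        word.endswith("ed"),     # A6: verb-like suffix
        word.endswith("ize"),    # A7: verb-like suffix
        word.endswith("ate"),    # A8: verb-like suffix
    ]
    # A block is masked (all zeros) unless its label is the one being scored
    noun_block = [int(b) if y == "noun" else 0 for b in noun_checks]
    verb_block = [int(b) if y == "verb" else 0 for b in verb_checks]
    return noun_block + verb_block
```

Scoring `pos_features("running", "verb")` activates only A5, while scoring the same word against y = "noun" leaves the entire vector zero: the verb block is blocked exactly as described above.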
Machine Learning: Log-linear Models & Conditional Random Fields