In Fisher Vector (1) the linear kernel was introduced. To meet different requirements, practical applications use a variety of kernel functions, and the Fisher kernel is one of them.

Fisher Kernel
Consider again a binary classification problem with labels y ∈ {1, −1}. We want to learn the joint distribution P(x, y) = p(x|y) p(y).
Because each class-conditional density is modeled by a parametric density with parameter θ_y,
P(x, y) = p(x|y) p(y) = p(x|θ_y) p(y)
The posterior is then

P(y|x) = p(x|θ_y) p(y) / (p(x|θ_y) p(y) + p(x|θ_ȳ) p(ȳ))

If p(y) = p(ȳ), this can be written as

P(y|x) = σ(ln p(x|θ_y) − ln p(x|θ_ȳ))

where σ is the sigmoid function.
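This step is just Bayes' rule rewritten through the sigmoid, using σ(ln a − ln b) = a/(a + b). A quick numeric check, with made-up likelihood values:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical class-conditional likelihoods at a single point x
p_pos = 0.30   # p(x|θ_1)
p_neg = 0.05   # p(x|θ_-1)

# Posterior from Bayes' rule with equal priors...
posterior_bayes = p_pos / (p_pos + p_neg)
# ...equals the sigmoid of the log-likelihood difference
posterior_sigmoid = sigmoid(math.log(p_pos) - math.log(p_neg))
print(posterior_bayes, posterior_sigmoid)  # both 6/7 ≈ 0.857
```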
A first-order Taylor expansion of ln p(x|θ_y) around θ gives

ln p(x|θ_y) ≈ ln p(x|θ) + (θ_y − θ) ln′p(x|θ)

and correspondingly

ln p(x|θ_ȳ) ≈ ln p(x|θ) + (θ_ȳ − θ) ln′p(x|θ)

Substituting the two expansions into the expression above:

P(y|x) = σ((θ_y − θ_ȳ) ln′p(x|θ))
Let U_x = ln′p(x|θ) = ∂/∂θ ln p(x|θ); this is the Fisher score. So

P(y|x) = σ((θ_y − θ_ȳ) U_x) = σ((θ_1 − θ_−1) U_x)
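To make the Fisher score concrete: for a univariate Gaussian with mean parameter μ, the score ∂/∂μ ln p(x|μ, σ) has the closed form (x − μ)/σ². A small sketch (all values invented) checking this against a finite-difference derivative:

```python
import math

def log_gauss(x, mu, sigma):
    # ln p(x|μ, σ) for a univariate Gaussian
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def fisher_score_mu(x, mu, sigma, eps=1e-6):
    # U_x = ∂/∂μ ln p(x|θ), via central finite differences
    return (log_gauss(x, mu + eps, sigma) - log_gauss(x, mu - eps, sigma)) / (2 * eps)

x, mu, sigma = 2.0, 0.5, 1.5
numeric = fisher_score_mu(x, mu, sigma)
analytic = (x - mu) / sigma**2   # closed-form score w.r.t. μ
print(numeric, analytic)
```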
This introduces the Kullback–Leibler divergence, also known as relative entropy, which is usually used to measure the distance between two distributions, for example between θ_1 and θ_−1:

D_KL(θ_1 ‖ θ_−1) = ∫ p(x|θ_1) ln [p(x|θ_1) / p(x|θ_−1)] dx
Note first that the expected value of the Fisher score is zero:

∫ p(x|θ) ∂/∂θ ln p(x|θ) dx = ∫ p(x|θ) [∂p(x|θ)/∂θ] / p(x|θ) dx = ∂/∂θ ∫ p(x|θ) dx = 0
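This zero-mean property of the score can be verified numerically; the sketch below integrates p(x|θ) · ∂/∂μ ln p(x|θ) on a wide grid for a univariate Gaussian (parameters are arbitrary):

```python
import math

mu, sigma = 0.7, 1.2

def p(x):
    # Gaussian density p(x|μ, σ)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def score(x):
    # ∂/∂μ ln p(x|μ, σ) = (x − μ)/σ²
    return (x - mu) / sigma**2

# Riemann sum of ∫ p(x|θ) ∂/∂θ ln p(x|θ) dx over [μ − 10σ, μ + 10σ]
a, b, n = mu - 10 * sigma, mu + 10 * sigma, 200001
h = (b - a) / (n - 1)
integral = sum(p(a + i * h) * score(a + i * h) for i in range(n)) * h
print(integral)  # ≈ 0
```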
Because of this identity, a first-order expansion makes the relative entropy come out as 0, so we expand ln p(x|θ_1) around θ_−1 to second order:

ln p(x|θ_1) ≈ ln p(x|θ_−1) + (θ_1 − θ_−1) ln′p(x|θ_−1) + ½ (θ_1 − θ_−1)² ∂² ln p(x|θ_−1) / ∂θ²
So

D_KL(θ_1 ‖ θ_−1) = ∫ p(x|θ_1) ln [p(x|θ_1) / p(x|θ_−1)] dx
= ∫ [p(x|θ_1) ln p(x|θ_1) − p(x|θ_1) ln p(x|θ_−1)] dx
≈ ∫ p(x|θ_1) [(θ_1 − θ_−1) ln′p(x|θ_−1) + ½ (θ_1 − θ_−1)² ∂² ln p(x|θ_−1) / ∂θ²] dx
Assuming θ_1 and θ_−1 are close, so that p(x|θ_1) ≈ p(x|θ_−1), and using the standard identity ∫ p(x|θ) ∂² ln p(x|θ)/∂θ² dx = −∫ p(x|θ) (∂ ln p(x|θ)/∂θ)² dx together with the zero-mean property of the score, the expansion reduces to

D_KL(θ_1 ‖ θ_−1) ≈ ½ (θ_1 − θ_−1)² ∫ p(x|θ_1) (∂ ln p(x|θ_−1)/∂θ)² dx
The Fisher information matrix

I(θ) = ∫ p(x|θ) (∂ ln p(x|θ)/∂θ)² dx

is introduced here, so

D_KL(θ_1 ‖ θ_−1) ≈ ½ (θ_1 − θ_−1)² I(θ_1)
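For a Gaussian family with fixed σ this quadratic approximation is in fact exact: the KL divergence between N(μ_1, σ²) and N(μ_2, σ²) is (μ_1 − μ_2)²/(2σ²), and the Fisher information with respect to μ is I(μ) = 1/σ². A minimal check with made-up numbers:

```python
# Two univariate Gaussians differing only in mean (made-up values)
mu1, mu2, sigma = 1.0, 0.8, 2.0

# Closed-form KL divergence between N(mu1, σ²) and N(mu2, σ²)
kl_closed_form = (mu1 - mu2)**2 / (2 * sigma**2)

# Quadratic form ½ (μ1 − μ2)² I(μ) with I(μ) = 1/σ²
fisher_info = 1.0 / sigma**2
kl_quadratic = 0.5 * (mu1 - mu2)**2 * fisher_info
print(kl_closed_form, kl_quadratic)
```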
I'm not quite sure about this part, π__π. Based on the relative entropy we can specify a prior distribution over θ: when θ_1 and θ_−1 are close, the relative entropy is smaller and the probability should be greater (this is what Yumbo said; I did not fully understand it). One can define

p(θ) = e^{−(θ_1 − θ_−1)² I(θ)}
We can then estimate θ by maximum a posteriori probability, and, similarly to the linear-kernel case, we finally get

P(y|x) = σ(∑_{i=1..n} λ_i U_{x_i}ᵀ I^{−1} U_x)
Analogous to the linear kernel, let K(x_i, x) = U_{x_i}ᵀ I^{−1} U_x; this is called the Fisher kernel.
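In the simplest case of a univariate Gaussian model with parameter μ, the kernel reduces to K(x_i, x) = (x_i − μ)(x − μ)/σ², since U_x = (x − μ)/σ² and I^{−1} = σ². A sketch with invented numbers:

```python
def fisher_kernel(x_i, x, mu, sigma):
    # Fisher scores w.r.t. μ for a univariate Gaussian model
    u_i = (x_i - mu) / sigma**2
    u_x = (x - mu) / sigma**2
    # Fisher information w.r.t. μ is 1/σ², so its inverse is σ²
    inv_info = sigma**2
    return u_i * inv_info * u_x

# Invented model parameters and inputs
k = fisher_kernel(1.0, 3.0, 0.5, 2.0)
print(k)  # (1.0 − 0.5)(3.0 − 0.5)/2.0² = 0.3125
```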
Having said all this, we finally come to the point. The Fisher kernel maps the original feature x to U_x = ∂/∂θ ln p(x|θ) in another space, and this Fisher score is the Fisher vector. Much of the literature introduces the Fisher vector directly by its formula and then states that the gradient of the likelihood with respect to the parameters is very discriminative and can therefore be used as a feature. Personally, I think it is simpler to understand the Fisher vector as a mapping or transformation.
Don't forget: as mentioned earlier, the Fisher vector is commonly used in combination with the Gaussian mixture model. The general process is:
Given a training set of features x_1, x_2, …, x_m ∈ R^d, first train a Gaussian mixture model with parameters θ = (μ_k, σ_k, π_k), k = 1, …, K:

p(x|θ) = ∑_{k=1..K} π_k p(x|μ_k, σ_k)
The next step is to use U_x = ∂/∂θ ln p(x|θ): compute the derivative with respect to each of the three parameters and concatenate the results:

(∂/∂μ_k ln p(x|θ), ∂/∂σ_k ln p(x|θ), ∂/∂π_k ln p(x|θ)), k = 1, …, K
This (2d+1)K-dimensional vector is the final feature; sometimes ∂/∂π_k ln p(x|θ) is not computed, and then the final feature is 2Kd-dimensional.
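The pipeline above can be sketched for a diagonal-covariance GMM. Everything below is a minimal illustration: the parameters are made up (in practice they come from EM training on x_1, …, x_m), the π_k gradient is skipped so the result is 2Kd-dimensional, and the I^{−1/2} whitening and normalization used in practical Fisher vectors are omitted:

```python
import numpy as np

# A tiny diagonal GMM with made-up parameters: K = 2 components, d = 3 dims
mu = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])      # (K, d) means μ_k
sigma = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])   # (K, d) std devs σ_k
pi_k = np.array([0.6, 0.4])                            # mixing weights π_k

def fisher_vector(x):
    # Component posteriors γ_k(x) ∝ π_k N(x|μ_k, σ_k)
    log_comp = (-0.5 * np.sum(np.log(2 * np.pi * sigma**2), axis=1)
                - 0.5 * np.sum(((x - mu) / sigma)**2, axis=1))
    w = pi_k * np.exp(log_comp)
    gamma = w / w.sum()
    # Standard GMM score formulas: gradients of ln p(x|θ) w.r.t. μ_k and σ_k
    d_mu = gamma[:, None] * (x - mu) / sigma**2                          # (K, d)
    d_sigma = gamma[:, None] * ((x - mu)**2 / sigma**3 - 1.0 / sigma)    # (K, d)
    return np.concatenate([d_mu.ravel(), d_sigma.ravel()])               # 2Kd dims

fv = fisher_vector(np.array([0.5, 1.0, 1.5]))
print(fv.shape)  # (12,) = 2·K·d
```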
from:http://bucktoothsir.github.io/blog/2014/11/27/10-theblog/