Summary of probability knowledge in MLaPP (Machine Learning: A Probabilistic Perspective)


The "Machine learning" course uses the original English version of Kevin P. Murphy's "Machine learning A Probabilistic Perspective", a book that uniquely describes all of the problems of machines learning from the mathematical perspective of probability theory. Requires a strong mathematical foundation. Because it is an English textbook, special Open a topic here to record their own learning process and various problems, for the use of memo and extrapolate.

After an overview of machine learning, the second chapter introduces probability theory. Working through it, I found that part of this material was already covered in the undergraduate probability course, but much of it is not touched at the undergraduate or even the postgraduate stage, so I summarize it here.

1. Schools of Probability

Frequentist school: probability is the long-run frequency with which an event occurs over n repetitions of a trial. This requires that the trial be repeatable, so it is best suited to experiments that can actually be repeated; it is an empirical notion of probability.

Bayesian school: probability quantifies the uncertainty about an unknown event, with no requirement that the event be repeatable. For any unknown event, a probability can be used to express our degree of belief about it.

Comparing the two, the Bayesian interpretation is more reasonable for events that cannot be repeated or for which repetition is impractical (for example, the average service life of the bulbs produced by a lamp factory, where exhaustive testing is infeasible). For this reason, the book adopts the Bayesian view throughout.

2. Basic Knowledge

Probability: a mapping from the event space Ω to the real numbers R. Each event A is assigned a real number P(A) satisfying: (1) non-negativity: P(A) ≥ 0; (2) normalization: P(Ω) = 1; (3) countable additivity: P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + … whenever A1, A2, … are pairwise disjoint (mutually exclusive) events.

Basic probability formulas (union, product rule, and conditional probability):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

$$P(A, B) = P(A \mid B)\,P(B), \qquad P(A \mid B) = \frac{P(A, B)}{P(B)} \ \ (P(B) > 0)$$
Law of total probability and Bayes' formula: for a partition $B_1, \dots, B_n$ of Ω,

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)}$$
Generic Bayesian classifier: the posterior over class labels is

$$p(y = c \mid x, \theta) \propto p(x \mid y = c, \theta)\,p(y = c \mid \theta)$$

(θ denotes the parameters of the model)
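
As a minimal numeric sketch of this rule (the priors and likelihoods below are made up for illustration), the posterior over classes is just likelihood × prior, normalized by the law of total probability:

```python
import numpy as np

# Hypothetical two-class problem: class priors p(y=c) and the likelihood
# p(x | y=c, θ) of one observed x under each class (illustrative numbers).
prior = np.array([0.6, 0.4])         # p(y=0), p(y=1)
likelihood = np.array([0.05, 0.20])  # p(x | y=0, θ), p(x | y=1, θ)

# Bayes' rule: posterior ∝ likelihood × prior; the normalizer is the
# law of total probability, p(x) = Σ_c p(x | y=c) p(y=c).
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print(posterior)           # [0.2727... 0.7272...]
print(posterior.argmax())  # MAP class label: 1
```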

3. Discrete Distributions

(1) Binomial distribution

k denotes the outcome of each trial and n the number of trials. A Bernoulli trial is the case k ∈ {0, 1} with n = 1; for n > 1 independent trials, the number of successes follows a binomial distribution, whose pmf is:

$$\mathrm{Bin}(k \mid n, \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$$
mean = nθ, variance = nθ(1 − θ). The classic experiment described by the binomial distribution is coin tossing, where each toss has exactly two possible outcomes. In machine learning classification this is used to model binary features: each data point either has a property or not (typically encoded as 1 and 0), so the distribution of such a feature can be described by a binomial (Bernoulli) distribution.
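
A quick check of these formulas with scipy.stats (n = 10 and θ = 0.3 are arbitrary illustrative values):

```python
from scipy.stats import binom

# Binomial with n = 10 trials and success probability θ = 0.3.
n, theta = 10, 0.3
dist = binom(n, theta)

print(dist.pmf(3))       # P(k = 3) ≈ 0.267
print(dist.mean())       # nθ = 3.0
print(dist.var())        # nθ(1 - θ) = 2.1
print(dist.rvs(size=5, random_state=0))  # five simulated success counts
```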

(2) Multinomial distribution

When each trial can have K (K > 2) possible outcomes, that is, when a feature is not merely present or absent but takes one of several specific values, the feature can be described with a multinomial distribution. For a count vector $x = (x_1, \dots, x_K)$ with $\sum_k x_k = n$:

$$\mathrm{Mu}(x \mid n, \theta) = \frac{n!}{x_1!\,x_2!\cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k}$$
Here, when K = 2 there are only two states and the multinomial degenerates to the binomial: set $x_1 = k$ and $x_2 = n - k$, so that the condition $x_1 + x_2 = n$ is satisfied. When n = 1, i.e., only a single trial is performed, the distribution is called the multinoulli (multivariate Bernoulli) distribution; because there are K (K > 2) possible states, it is also called the discrete distribution or categorical distribution, written Cat(x|θ):

$$\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{\,\mathbb{I}(x_k = 1)}$$
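
A short sketch of both cases with scipy.stats (the probability vector θ and the counts are illustrative):

```python
import numpy as np
from scipy.stats import multinomial

# K = 3 possible outcomes with probabilities θ, over n = 10 trials.
theta = np.array([0.2, 0.5, 0.3])
mult = multinomial(10, theta)
print(mult.pmf([2, 5, 3]))   # probability of observing these exact counts

# The categorical (multinoulli) distribution is the n = 1 special case:
cat = multinomial(1, theta)
print(cat.pmf([0, 1, 0]))    # equals θ_2 = 0.5
```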


(3) Poisson distribution

The variable x ∈ {0, 1, 2, …} and λ > 0; the pmf is:

$$\mathrm{Poi}(x \mid \lambda) = e^{-\lambda}\,\frac{\lambda^{x}}{x!}$$
The Poisson distribution can be used to model counts of events occurring over time in a memoryless process, such as the number of arrivals per time interval.
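
A small sketch with scipy.stats (the rate λ = 4 is an arbitrary illustrative value):

```python
from scipy.stats import poisson

# Poisson with rate λ = 4, e.g., an average of 4 arrivals per interval.
lam = 4.0
dist = poisson(lam)

print(dist.pmf(2))                       # P(x = 2) ≈ 0.147
print(dist.mean(), dist.var())           # both equal λ for a Poisson
print(dist.rvs(size=5, random_state=0))  # simulated counts for 5 intervals
```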

4. Continuous Distributions

(1) Normal (Gaussian) distribution

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
mean = μ, mode = μ, variance = σ². It is used very widely in statistics. First, its two parameters are easy to interpret: the mean and the standard deviation. Second, by the central limit theorem, the sum of independent random variables is approximately Gaussian regardless of their individual distributions, so it is a natural model for noise. Third, the Gaussian makes the fewest assumptions: among all distributions with a given mean and variance, it has maximum entropy. Finally, its mathematical form is relatively simple, which makes it easy to work with in implementations.
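
A quick empirical sketch of the central limit theorem claim, using sums of i.i.d. uniform variables (the sample sizes are arbitrary):

```python
import numpy as np

# Empirical CLT check: sums of 30 i.i.d. Uniform(0,1) variables, which are
# individually non-Gaussian, have approximately Gaussian statistics.
rng = np.random.default_rng(0)
sums = rng.uniform(0.0, 1.0, size=(100_000, 30)).sum(axis=1)

# Gaussian approximation predicted by the CLT: mean = 30*1/2, var = 30*1/12.
print(sums.mean(), 30 * 0.5)  # ~15.0
print(sums.var(), 30 / 12)    # ~2.5
```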

(2) Student t distribution

$$\mathcal{T}(x \mid \mu, \sigma^2, \nu) \propto \left[1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]^{-\frac{\nu+1}{2}}$$
mean = μ, mode = μ, variance = νσ²/(ν − 2), where ν > 0 is the degrees of freedom; the variance is defined for ν > 2 and the mean for ν > 1. Its shape resembles the Gaussian, but it fixes one of the Gaussian's weaknesses: the Gaussian is very sensitive to outliers, whereas the Student t distribution is more robust thanks to its heavier tails. A common default is ν = 4, which performs well in many practical problems; for ν ≳ 5 the robustness advantage fades and the distribution rapidly converges to a Gaussian.

In particular, when ν = 1 it is called the Cauchy distribution.
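
A sketch of the robustness claim: compared at the same location and scale, the Student t assigns far more density to an extreme point than the Gaussian does (the outlier value 6 is arbitrary):

```python
from scipy.stats import norm, t

# Density at an extreme point x = 6 under matched location/scale.
x_outlier = 6.0
print(norm(loc=0, scale=1).pdf(x_outlier))     # ~6.1e-09: Gaussian all but rules it out
print(t(df=4, loc=0, scale=1).pdf(x_outlier))  # ~1.2e-03: the t still finds it plausible

# df = 1 gives the Cauchy distribution, with even heavier tails:
print(t(df=1).pdf(x_outlier))                  # ~8.6e-03
```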

(3) Laplace distribution

$$\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b}\,\exp\!\left(-\frac{|x-\mu|}{b}\right)$$
mean = μ, mode = μ, variance = 2b². Also known as the double-sided exponential distribution: the exponent contains the absolute value |x − μ|, so the density is not differentiable at x = μ. b (b > 0) is a scale parameter that controls the spread of the data. Like the Student t, the Laplace distribution is robust to outliers. At the same time it places more probability density at x = μ than the Gaussian does, a property that can be used to encourage sparsity in a model.
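
A sketch comparing a Laplace with a Gaussian matched to the same mean and variance (σ² = 2b², with b = 1 chosen for illustration):

```python
import numpy as np
from scipy.stats import laplace, norm

# Laplace and Gaussian matched to the same mean (0) and variance (2b²).
b = 1.0
lap = laplace(loc=0, scale=b)
gau = norm(loc=0, scale=b * np.sqrt(2))

print(lap.pdf(0), gau.pdf(0))  # at the mode: 0.5 vs ~0.28, a spikier peak
print(lap.pdf(6), gau.pdf(6))  # tails: ~1.2e-03 vs ~3.5e-05, heavier tails
print(lap.var(), gau.var())    # both equal 2b² = 2.0
```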

(4) Gamma distribution

$$\mathrm{Ga}(T \mid a, b) = \frac{b^{a}}{\Gamma(a)}\,T^{a-1}\,e^{-Tb}$$
mean = a/b, mode = (a − 1)/b (defined for a ≥ 1; for a < 1 the density peaks at 0), variance = a/b². The variable T ranges over T > 0; a > 0 is called the shape parameter and b > 0 the rate parameter. Several well-known distributions are special cases (checked in the sketch after this list):

    • Exponential distribution: a = 1, b = λ, so Expon(x|λ) = Ga(x|1, λ). It describes the waiting time between events in a continuous-time Poisson process; the gamma family is also the conjugate prior for the rate of a (discrete) Poisson likelihood.
    • Erlang distribution: Erlang(x|λ) = Ga(x|2, λ), i.e., a gamma with the shape parameter fixed at 2.
    • Chi-squared distribution: χ²(x|ν) = Ga(x|ν/2, 1/2), the distribution of the sum of squares of ν standard Gaussian random variables.
Substituting 1/x for the variable in the gamma distribution gives the inverse gamma distribution:

$$\mathrm{IG}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\,x^{-(a+1)}\,e^{-b/x}$$

mean = b/(a − 1), mode = b/(a + 1), variance = b²/((a − 1)²(a − 2)), where the mean is defined for a > 1 and the variance for a > 2.
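
A sketch checking the moment formulas and the special cases listed above; note that scipy.stats parameterizes the gamma by shape and scale, where scale = 1/b for the book's rate b (the numbers are illustrative):

```python
from scipy.stats import gamma, expon, chi2, invgamma

# Ga(x | a, b) with shape a and rate b corresponds to gamma(a, scale=1/b).
a, b = 3.0, 2.0
g = gamma(a, scale=1 / b)
print(g.mean(), a / b)    # a/b = 1.5
print(g.var(), a / b**2)  # a/b² = 0.75

# Special cases: exponential is Ga(1, λ); chi-squared(ν) is Ga(ν/2, 1/2).
lam, nu, x = 2.0, 4, 1.3
print(expon(scale=1 / lam).pdf(x), gamma(1, scale=1 / lam).pdf(x))  # equal
print(chi2(nu).pdf(x), gamma(nu / 2, scale=2).pdf(x))               # equal

# Inverse gamma: mean = b/(a - 1) for a > 1.
ig = invgamma(a, scale=b)
print(ig.mean(), b / (a - 1))  # 1.0
```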

(5) Beta distribution

$$\mathrm{Beta}(x \mid a, b) = \frac{1}{B(a, b)}\,x^{a-1}(1-x)^{b-1}, \qquad B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$
Defined on the interval [0, 1] and requires a > 0, b > 0; when a = b = 1 it reduces to the uniform distribution on [0, 1]. mean = a/(a + b), mode = (a − 1)/(a + b − 2), variance = ab/((a + b)²(a + b + 1)). This distribution is conjugate to the (discrete) binomial distribution: in a naive Bayes application, when the likelihood is binomial, choosing a beta prior makes the posterior also a beta distribution, which is very convenient for both derivation and computation.
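
A minimal sketch of the beta-binomial conjugacy described above (the prior pseudo-counts and the data are made up):

```python
from scipy.stats import beta

# Beta-binomial conjugacy: a Beta(a, b) prior on θ plus k successes in
# n trials gives the posterior Beta(a + k, b + n - k) in closed form.
a, b = 2.0, 2.0  # prior pseudo-counts (assumed for illustration)
k, n = 7, 10     # observed data: e.g., 7 heads in 10 coin tosses

posterior = beta(a + k, b + (n - k))
print(posterior.mean())          # (a + k)/(a + b + n) = 9/14 ≈ 0.643
print(posterior.interval(0.95))  # central 95% credible interval for θ
```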

(6) Pareto distribution

$$\mathrm{Pareto}(x \mid k, m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$$
mean = km/(k − 1) (for k > 1), mode = m, variance = m²k/((k − 1)²(k − 2)) (for k > 2). This distribution corresponds to Zipf's law, which describes the relationship between the rank of a word and the frequency with which it appears: x must exceed some constant m but not by too much, where k controls what counts as "too much". As k → ∞, the distribution tends to δ(x − m). Such power-law behavior is useful in information retrieval, for example when estimating word frequencies during index construction.
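
A sketch of the Pareto's moments and heavy tail; scipy.stats.pareto(b, scale=m) matches the k and m above (the values are illustrative):

```python
import numpy as np
from scipy.stats import pareto

# Pareto with shape k = 3 and support x >= m = 1.
k, m = 3.0, 1.0
p = pareto(k, scale=m)

print(p.mean(), k * m / (k - 1))                   # km/(k-1) = 1.5
print(p.var(), m**2 * k / ((k - 1)**2 * (k - 2)))  # = 0.75

# Heavy tail: a handful of very large draws dominate, as with word frequencies.
samples = p.rvs(size=10_000, random_state=0)
print(np.median(samples), samples.max())
```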

(7) Dirichlet distribution

$$\mathrm{Dir}(x \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}\,\mathbb{I}(x \in S_K), \qquad B(\alpha) = \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma(\alpha_0)},\ \ \alpha_0 = \sum_{k=1}^{K}\alpha_k$$

where $S_K$ is the probability simplex.
mean(x_k) = α_k/α_0, mode(x_k) = (α_k − 1)/(α_0 − K), variance(x_k) = α_k(α_0 − α_k)/(α_0²(α_0 + 1)). This is the multidimensional generalization of the beta distribution; the parameter and the variable are both vectors. It is conjugate to the (discrete) multinomial distribution: in a naive Bayes application, when the likelihood is multinomial, choosing a Dirichlet prior makes the posterior also a Dirichlet distribution.
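
A minimal sketch of the Dirichlet-multinomial conjugacy, mirroring the beta-binomial case above (the prior and the counts are made up):

```python
import numpy as np
from scipy.stats import dirichlet

# Dirichlet-multinomial conjugacy: a Dir(α) prior over the K outcome
# probabilities plus observed counts x gives the posterior Dir(α + x).
alpha = np.array([1.0, 1.0, 1.0])  # uniform prior over the simplex (assumed)
counts = np.array([2, 5, 3])       # e.g., observed counts for K = 3 outcomes

posterior = dirichlet(alpha + counts)
print(posterior.mean())  # (α_k + x_k)/(α_0 + n) = [3, 6, 4]/13
```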

The above summarizes the probability distributions used in machine learning, as preparation and review material for the studies that follow.
