Pattern Recognition and Machine Learning (PRML) Notes (1)


by Yunduan Cui

This is my own PRML study note, which is still being updated.

Chapter 2: Probability Distributions

This chapter introduces the probability distribution models used throughout the book; they are the basis of the later chapters. Given a finite observation set \(\{x_{1}, x_{2}, \ldots, x_{N}\}\), the goal is to model the probability distribution \(p(x)\). This problem is known as density estimation.

Main content
1. Bernoulli, binomial, and multinomial distributions for discrete random variables
2. Gaussian distribution for continuous random variables
3. Parameter estimation for the Gaussian distribution: frequentist vs. Bayesian approaches
4. Conjugate priors, and a unified view of the probability distributions
5. Parametric vs. non-parametric methods

2.1 Binary Variables
    • Bernoulli distribution

Define a binary random variable \(x \in \{0, 1\}\); the Bernoulli distribution satisfies:

\(\text{Bern}(x|\mu) = \mu^{x}(1-\mu)^{1-x}\)

where \(\mu\) is the parameter that controls the distribution, satisfying:

\(p(x=1|\mu) = \mu\).

The expectation and variance of the Bernoulli distribution satisfy:

\(\mathbb{E}[x] = \mu\)
\(\text{var}[x] = \mu(1-\mu)\)
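
As a quick sanity check, here is a minimal Python sketch (the helper name `bern` is my own, not from the book) that evaluates the Bernoulli mass function and recovers the two moments by enumerating \(x \in \{0, 1\}\):

```python
def bern(x: int, mu: float) -> float:
    """Bern(x|mu) = mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu ** x * (1.0 - mu) ** (1 - x)

mu = 0.3
# Expectation and variance by summing over the two possible outcomes.
mean = sum(x * bern(x, mu) for x in (0, 1))
var = sum((x - mean) ** 2 * bern(x, mu) for x in (0, 1))
print(mean, var)  # 0.3 and 0.21, matching E[x] = mu and var[x] = mu(1 - mu)
```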

When we have an observation set \(\mathcal{D}=\{x_{1}, x_{2}, \ldots, x_{N}\}\) and assume that the observations are independent of each other, we obtain a likelihood function for \(\mu\):

\(p(\mathcal{D}|\mu) = \displaystyle{\prod_{n=1}^{N}} p(x_{n}|\mu) = \displaystyle{\prod_{n=1}^{N}} \mu^{x_{n}}(1-\mu)^{1-x_{n}}\)

This product form is inconvenient to maximize directly, so we take the logarithm of \(p(\mathcal{D}|\mu)\) (turning the product into a sum):

\(\ln p(\mathcal{D}|\mu) = \displaystyle{\sum_{n=1}^{N}} \ln p(x_{n}|\mu) = \displaystyle{\sum_{n=1}^{N}} \{x_{n}\ln\mu + (1-x_{n})\ln(1-\mu)\}\)

Setting the derivative with respect to \(\mu\) to zero gives \(\mu_{ML}=\frac{1}{N}\displaystyle{\sum_{n=1}^{N}}x_{n}\). This is the maximum likelihood estimate of the Bernoulli distribution on the observation set, which is equivalent to minimizing the empirical risk.
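
A small Python sketch of this estimate, using a made-up observation set (function names are mine): the closed-form mean of the data is compared against a brute-force grid search over the log-likelihood:

```python
import math

def bernoulli_log_likelihood(data, mu):
    """ln p(D|mu) = sum_n [x_n ln(mu) + (1 - x_n) ln(1 - mu)]."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in data)

data = [1, 0, 1, 1, 0, 1, 0, 1]   # a made-up observation set D
mu_ml = sum(data) / len(data)      # closed-form MLE: the sample mean
print(mu_ml)                       # 0.625

# The closed form indeed maximizes the log-likelihood on a grid:
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda mu: bernoulli_log_likelihood(data, mu))
print(best)                        # close to 0.625, up to grid resolution
```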

The maximum likelihood estimate has a flaw: if the observation set is too small, overfitting occurs very easily (for example, if a coin is tossed three times and lands heads up each time, the maximum likelihood estimate judges the probability of heads to be \(100\%\), which is obviously not correct). We can avoid this by introducing a prior over \(\mu\); the estimate then becomes the maximum a posteriori (MAP) estimate, i.e. structural risk minimization -- see the beta distribution below.
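
To make the coin example concrete, a short sketch: the Beta(2, 2) prior here is an illustrative choice of mine, and the posterior-mode formula it uses is the standard beta-Bernoulli result, anticipating the beta distribution introduced below:

```python
# Three tosses, all heads: the MLE declares heads certain.
data = [1, 1, 1]
mu_ml = sum(data) / len(data)
print(mu_ml)  # 1.0 -- overfit

# With a Beta(a, b) prior, the posterior mode (MAP estimate) is
# (m + a - 1) / (N + a + b - 2); a = b = 2 encodes a mild belief
# that the coin is roughly fair. (Prior values are illustrative.)
a, b = 2, 2
m, N = sum(data), len(data)
mu_map = (m + a - 1) / (N + a + b - 2)
print(mu_map)  # 0.8 -- pulled back toward 0.5 by the prior
```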

    • Binomial distribution

Given the observation set \(\mathcal{D}\) from the Bernoulli distribution, when we only know that the number of observations with \(x=1\) is \(m\), we can derive the binomial distribution:

\(\text{Bin}(m|N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m} = \frac{N!}{(N-m)!\,m!}\mu^{m}(1-\mu)^{N-m}\)

This is the probability that an event occurs \(m\) times in \(N\) trials. The expectation and variance of the binomial distribution satisfy:

\(\mathbb{E}[m] = \displaystyle{\sum_{m=0}^{N}} m\,\text{Bin}(m|N,\mu) = N\mu\)
\(\text{var}[m] = \displaystyle{\sum_{m=0}^{N}} (m-\mathbb{E}[m])^{2}\,\text{Bin}(m|N,\mu) = N\mu(1-\mu)\)
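
Again as a sanity check, a minimal Python sketch (the helper name `binom_pmf` is mine) that sums over \(m = 0, \ldots, N\) to recover the binomial mean and variance:

```python
from math import comb

def binom_pmf(m: int, N: int, mu: float) -> float:
    """Bin(m|N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return comb(N, m) * mu ** m * (1.0 - mu) ** (N - m)

N, mu = 10, 0.3
mean = sum(m * binom_pmf(m, N, mu) for m in range(N + 1))
var = sum((m - mean) ** 2 * binom_pmf(m, N, mu) for m in range(N + 1))
print(mean, var)  # 3.0 and 2.1, matching N*mu and N*mu*(1 - mu)
```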

    • Beta distribution

This section considers how to introduce prior information into the distributions over binary variables, introducing the conjugate prior (conjugacy).

The beta distribution is introduced as the prior probability distribution; it is controlled by two hyperparameters \(a, b\):

\(\text{Beta}(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}\)

\(\Gamma(x) \equiv \int_{0}^{\infty} u^{x-1}e^{-u}\,du\)

The coefficient guarantees the normalization of the beta distribution: \(\int_{0}^{1}\text{Beta}(\mu|a,b)\,d\mu=1\). The expectation and variance of the beta distribution satisfy:

\(\mathbb{E}[\mu] = \frac{a}{a+b}\)
\(\text{var}[\mu] = \frac{ab}{(a+b)^{2}(a+b+1)}\)
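
A minimal sketch, assuming Python's standard `math.gamma` for \(\Gamma(x)\), that checks the normalization over \([0, 1]\) and the mean numerically with a crude midpoint rule (the function name `beta_pdf` is mine):

```python
from math import gamma

def beta_pdf(mu: float, a: float, b: float) -> float:
    """Beta(mu|a, b) with the gamma-function normalizer."""
    coef = gamma(a + b) / (gamma(a) * gamma(b))
    return coef * mu ** (a - 1) * (1.0 - mu) ** (b - 1)

a, b = 2.0, 3.0
# Midpoint-rule integration over [0, 1]; a finer grid or a proper
# quadrature routine would be more accurate.
steps = 10_000
xs = [(i + 0.5) / steps for i in range(steps)]
total = sum(beta_pdf(x, a, b) for x in xs) / steps
mean = sum(x * beta_pdf(x, a, b) for x in xs) / steps
print(total)  # ~1.0, the normalization
print(mean)   # ~0.4 = a / (a + b)
```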

(To be continued.)
