CCJ PRML Study Note - Chapter 1-1: Introduction


Chapter 1-1: Introduction

Christopher M. Bishop, PRML

  • Chapter 1-1: Introduction
      • 1. Basic Terminology
      • 2. Different Applications:
      • 3. Linear Supervised Learning: Linear Prediction/Regression
        • 3.1 Workflow:
        • 3.2 Linear Prediction
        • 3.3 Optimization Approach
      • 4. A Regression Problem: Polynomial Curve Fitting
        • 4.1 Training Data:
        • 4.2 Synthetically Generated Data:
        • 4.3 Why Is It Called a Linear Model?
          • 1) Polynomial Function
          • 2) Error Function
        • 4.4 Remaining Problems Related to Polynomial Curve Fitting
        • 4.5 Bayesian Perspective
          • Least Squares (i.e., Linear Regression) Estimate vs. Maximum Likelihood Estimate:
          • How to formulate the likelihood for linear regression? (To be discussed in later sections.)
        • 4.6 Regularization / Regularizer: To Control Over-fitting
1. Basic Terminology
  • A training set: $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, where each $\mathbf{x}_n$ is a $d$-dimensional column vector, i.e., $\mathbf{x}_n \in \mathbb{R}^d$.
  • Target vector: $\mathbf{t} = (t_1, \dots, t_N)^{T}$, resulting in a pair $(\mathbf{x}_n, t_n)$ for each example in supervised learning.
  • Generalization: the ability to correctly categorize new examples that differ from those used for training is known as generalization.
  • Pre-processing stage, aka feature extraction. Why pre-processing? Reasons: 1) the transformation makes the pattern recognition problem easier to solve; 2) pre-processing might also be performed in order to speed up computation, e.g., through dimensionality reduction.
  • Reinforcement Learning: is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Typically there is a sequence of states and actions in which the learning algorithm interacts with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results. (A small sketch of this trade-off follows this list.)
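As an illustration of the exploration-exploitation trade-off, here is a minimal epsilon-greedy sketch on a hypothetical multi-armed bandit (the reward probabilities and parameter values are made up for illustration; this is not an example from PRML):

```python
import random

# Hypothetical 3-armed bandit: each arm pays 1 with the given probability (unknown to the agent).
TRUE_REWARD_PROBS = [0.2, 0.5, 0.8]

def epsilon_greedy(n_steps=10_000, epsilon=0.1, seed=0):
    """Balance exploration (random arm) and exploitation (best arm estimated so far)."""
    rng = random.Random(seed)
    counts = [0] * len(TRUE_REWARD_PROBS)    # pulls per arm
    values = [0.0] * len(TRUE_REWARD_PROBS)  # running mean reward per arm
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:                       # explore: try a random arm
            arm = rng.randrange(len(TRUE_REWARD_PROBS))
        else:                                            # exploit: use the best-known arm
            arm = max(range(len(values)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < TRUE_REWARD_PROBS[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
        total_reward += reward
    return values, total_reward

if __name__ == "__main__":
    values, total = epsilon_greedy()
    print("estimated arm values:", [round(v, 3) for v in values])
    print("total reward:", total)
```

Setting epsilon to 0 (pure exploitation) or 1 (pure exploration) typically collects noticeably less reward than an intermediate value, which is the point made in the bullet above.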
2. Different Applications:
  • 1) Classification in supervised learning: training data $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, to learn a model $y(\mathbf{x})$, where the target $t_n$ takes one of a finite number of discrete categories;
  • 2) Regression in supervised learning: training data $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, to learn a model $y(\mathbf{x})$, where the output consists of one or more continuous variables.
  • 3) Unsupervised learning: training data $\{\mathbf{x}_n\}_{n=1}^{N}$, without a corresponding target vector $\mathbf{t}$, including:
    • Clustering, to discover groups of similar examples within the data;
    • Density estimation, to determine the distribution of data within the input space;
    • Visualization, to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
3. Linear Supervised Learning: Linear Prediction/Regression

3.1 Workflow:

Here the model is represented by parameters $\mathbf{w}$; for an unseen input $\mathbf{x}$, we make a prediction $\hat{t} = y(\mathbf{x}, \mathbf{w})$.

3.2 Linear Prediction
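A minimal statement of the standard linear prediction form (assuming the bias term is absorbed into $\mathbf{w}$ by appending a constant feature 1 to each input $\mathbf{x}$):

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{T}\mathbf{x} = \sum_{i=1}^{d} w_i x_i$$

The prediction is linear both in the input $\mathbf{x}$ and, more importantly for what follows, in the parameters $\mathbf{w}$.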

3.3 Optimization Approach

Error function (sum-of-squares):

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^{T}\mathbf{x}_n - t_n\right)^2 = \frac{1}{2}\left\|X\mathbf{w} - \mathbf{t}\right\|^2,$$

where $X$ is the $N \times d$ design matrix whose $n$-th row is $\mathbf{x}_n^{T}$.

Finding the solution by differentiation: set the gradient of $E(\mathbf{w})$ with respect to $\mathbf{w}$ to zero,

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = X^{T}\left(X\mathbf{w} - \mathbf{t}\right) = \mathbf{0}.$$

Note: matrix differentiation, $\frac{\partial}{\partial \mathbf{w}}\left(\mathbf{a}^{T}\mathbf{w}\right) = \mathbf{a}$, and $\frac{\partial}{\partial \mathbf{w}}\left(\mathbf{w}^{T}A\mathbf{w}\right) = \left(A + A^{T}\right)\mathbf{w}$.

We get the normal equations $X^{T}X\mathbf{w} = X^{T}\mathbf{t}$.

The optimal parameter is $\mathbf{w}^{\star} = \left(X^{T}X\right)^{-1}X^{T}\mathbf{t}$ (assuming $X^{T}X$ is invertible).
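A minimal NumPy sketch of this closed-form solution (function and variable names are illustrative; in practice `np.linalg.lstsq` is numerically safer than forming the normal equations explicitly):

```python
import numpy as np

def fit_linear_least_squares(X, t):
    """Solve w* = (X^T X)^{-1} X^T t via the normal equations."""
    # np.linalg.solve avoids explicitly inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ t)

def predict(X, w):
    """Linear prediction y = X w."""
    return X @ w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    w_true = np.array([1.0, -2.0, 0.5])
    t = X @ w_true + 0.1 * rng.normal(size=N)  # noisy linear targets
    w_hat = fit_linear_least_squares(X, t)
    print("estimated w:", np.round(w_hat, 3))
```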

4. A Regression Problem: Polynomial Curve Fitting

4.1 Training Data:

Given a training data set comprising $N$ observations of $x$, written $\mathbf{x} = (x_1, \dots, x_N)^{T}$, together with corresponding observations of the values of $t$, denoted $\mathbf{t} = (t_1, \dots, t_N)^{T}$.

4.2 Synthetically Generated Data:

Method:
i.e., function value $y(x)$ (e.g., $\sin(2\pi x)$) plus Gaussian noise.
The input data set $\mathbf{x}$ in Figure 1.2 was generated by choosing values of $x_n$, for $n = 1, \dots, N$, spaced uniformly in the range $[0, 1]$, and the target data set $\mathbf{t}$ was obtained by first computing the corresponding values of the function $\sin(2\pi x)$ and then adding a small level of random noise having a Gaussian distribution to each such point, in order to obtain the corresponding value $t_n$.
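A small NumPy sketch of this generation procedure (the noise standard deviation of 0.3 is an illustrative choice, not a value taken from the book):

```python
import numpy as np

def make_curve_data(N=10, noise_std=0.3, seed=0):
    """Generate (x, t) pairs: x spaced uniformly on [0, 1], t = sin(2*pi*x) + Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=noise_std, size=N)
    return x, t

x, t = make_curve_data()
print(np.round(x, 2))
print(np.round(t, 2))
```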

Discussion:
By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay, but more typically is due to there being sources of variability that are themselves unobserved.

4.3 Why Is It Called a Linear Model?

1) Polynomial Function


$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0, \dots, w_M$ are collectively denoted by the vector $\mathbf{w}$.

Question: Why is this called a linear model or linear prediction? Why "linear"?
Answer: Note that, although the polynomial function $y(x, \mathbf{w})$ is a nonlinear function of $x$, it is a linear function of the coefficients $\mathbf{w}$. Functions, such as this polynomial, which are linear in the unknown parameters have important properties, are called linear models, and will be discussed extensively in Chapters 3 and 4.
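One standard way to make the linearity in $\mathbf{w}$ explicit is to collect the powers of $x$ into a feature vector (introduced here for illustration; PRML develops this basis-function view in Chapter 3):

$$y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^{T}\boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = \left(1, x, x^2, \dots, x^M\right)^{T}$$

For fixed inputs, the prediction depends linearly on $\mathbf{w}$, even though it is a nonlinear (polynomial) function of $x$.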

2) Error Function


$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n\right\}^2$$

where the factor of $\frac{1}{2}$ is included for later convenience. Because the error function is a quadratic function of the coefficients $\mathbf{w}$, minimizing it yields a unique optimal solution in closed form, denoted by $\mathbf{w}^{\star}$.
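A short NumPy sketch of this closed-form fit (a least-squares solve on the polynomial design matrix; `np.linalg.lstsq` is used rather than an explicit matrix inverse for numerical stability, and the data settings are illustrative):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Rows are phi(x_n) = (1, x_n, x_n^2, ..., x_n^M)."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Minimize E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 in closed form."""
    Phi = polynomial_design_matrix(x, M)
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_star

# Example on synthetic sin(2*pi*x) data, as in Section 4.2
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
w_star = fit_polynomial(x, t, M=3)
print("fitted coefficients:", np.round(w_star, 3))
```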

4.4 Remaining Problems Related to Polynomial Curve Fitting
    • Model comparison or model selection: choosing the order $M$ of the polynomial. A dilemma: a large $M$ causes over-fitting, while a small $M$ gives a rather poor fit to the training data.
    • Over-fitting: in fact, for a high enough order $M$ the polynomial passes exactly through each data point and $E(\mathbf{w}^{\star}) = 0$. However, the fitted curve oscillates wildly and gives a very poor representation of the underlying function $\sin(2\pi x)$. This latter behavior is known as over-fitting.
    • Model complexity: roughly reflected by the number of parameters in the model.
4.5 Bayesian Perspective

Least Squares (i.e., Linear Regression) Estimate vs. Maximum Likelihood Estimate:

We shall see that this least squares approach (i.e., linear regression) to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

How to formulate the likelihood for linear regression? (To be discussed in later sections.)

4.6 Regularization / Regularizer: To Control Over-fitting
    • Regularization: involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values.
    • Form of regularizers: e.g., a quadratic regularizer, in which case the technique is called ridge regression. In the context of neural networks, this approach is known as weight decay.


$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n\right\}^2 + \frac{\lambda}{2}\left\|\mathbf{w}\right\|^2, \qquad \left\|\mathbf{w}\right\|^2 \equiv \mathbf{w}^{T}\mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$$

The modified error function includes two terms:

    • The first term: the sum-of-squares error;
    • The second term: the regularizer, which has the desired effect of reducing the magnitude of the coefficients $\mathbf{w}$; its coefficient $\lambda$ governs its relative importance (see the sketch after this list).
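A minimal NumPy sketch of the regularized closed-form solution corresponding to the quadratic regularizer above (the standard ridge solution $\mathbf{w}^{\star} = (\Phi^{T}\Phi + \lambda I)^{-1}\Phi^{T}\mathbf{t}$; the value of `lam` is illustrative):

```python
import numpy as np

def fit_polynomial_ridge(x, t, M, lam=1e-3):
    """Minimize 0.5*sum_n (y(x_n, w) - t_n)^2 + 0.5*lam*||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix (1, x, ..., x^M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Example: a high-order polynomial on few points, tamed by the regularizer
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
print("M=9, lam=1e-3:", np.round(fit_polynomial_ridge(x, t, M=9, lam=1e-3), 2))
print("M=9, lam=0:   ", np.round(fit_polynomial_ridge(x, t, M=9, lam=0.0), 2))
```

Comparing the two printed coefficient vectors shows the intended effect: with the regularizer the coefficients stay small, while the unregularized fit produces very large, wildly oscillating coefficients.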
