Lin Hsuan-Tien (NTU) · Machine Learning Foundations (Cornerstone) — study notes


Yesterday I started watching Lin Hsuan-Tien's Machine Learning Foundations (Cornerstone) course; from today on I'll keep more careful notes.

First, the introductory comparisons. I already understood some of these concepts, so I didn't take notes on them. What stuck is roughly the relationship and differences among ML, DL and AI: ML mainly wants to learn from data an approximation of an ideal target function f(x).

The second lecture covers the PLA, the Perceptron Learning Algorithm, and proves the convergence theorem for the linearly separable case (the same proof as in Haykin's Neural Networks and Learning Machines; in addition, the learning rate need not be 1, any positive constant works, though I still have some questions here, which I marked in the Haykin PDF; in particular, Mitchell's Machine Learning says the learning rate must not be too large). It also covers the pocket algorithm for the linearly non-separable case, which is a greedy algorithm.
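To fix the algorithm in my head, here is a minimal PLA sketch of my own in Python/numpy (a toy version, not the course's code; note the configurable positive learning rate eta):

    import numpy as np

    def pla(X, y, eta=1.0, max_iters=10000):
        """Perceptron Learning Algorithm on (hopefully) linearly separable data.
        X: (N, d) inputs, y: (N,) labels in {-1, +1}, eta: any positive learning rate."""
        X = np.hstack([np.ones((len(X), 1)), X])   # add x0 = 1 so w[0] is the bias
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            mistakes = [(x, t) for x, t in zip(X, y) if np.sign(x @ w) != t]
            if not mistakes:                        # no mistakes left: converged
                return w
            x, t = mistakes[0]                      # correct one mistake
            w = w + eta * t * x
        return w                                    # may not converge if data is not separable

The pocket version would simply keep the best-so-far w (by in-sample error) while running the same corrections.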

The third lecture surveys the variety of machine learning problems: binary classification (yes/no questions), multi-class classification, regression (where much existing statistical knowledge can be reused), and structured learning (which can be understood as a complicated multi-class problem, e.g. part-of-speech tagging in NLP, where sentences have no fixed length so the number of possible "classes" is essentially unbounded). I haven't finished the third lecture; the rest covers:

Batch learning vs. online learning vs. active learning (useful when labels are expensive)
RL is often done online

5/19/2016 11:46:44 PM

Lecture IV

Learning is doomed if any f can happen. In other words, if you want the machine to gain generalization beyond the training set, you must impose constraints on f.
These ideas are pretty good. This transitions into PAC learning theory.
Using a sample to infer something about the whole leads to the Hoeffding inequality (how is it proved? It seems to be a very basic inequality in statistics, more basic than the law of large numbers).
Then he connects the bin-of-marbles problem with learning — classic! In short, based on the Hoeffding inequality and the i.i.d. assumption, the unseen can be inferred from the seen.

so we have:

Formal Description:
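For my own record, the Hoeffding bound being described (my restatement, in the bin notation: ν is the sample frequency of orange marbles, μ the bin's true frequency, N the sample size), and its learning translation for a single fixed h:

    \[ \mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \;\le\; 2\exp\!\left(-2\epsilon^{2}N\right),
       \qquad
       \mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \;\le\; 2\exp\!\left(-2\epsilon^{2}N\right). \]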

Note that the above discussion is restricted to a single fixed hypothesis (see the slide).

At the end of the slide it says something like: perceptron learning makes a choice — feed it different data and you get different lines, not the same line every time ... I don't fully understand this sentence yet. So what we are doing here is not yet learning; it is verification (of one fixed h).

A topic:

Still only a vague understanding here ...

Connection to Real Learning
This lecture is about the connection to real learning problems (the last lecture linked the bin of marbles to a single hypothesis, which is at best verification; here it is connected to multiple hypotheses).
The picture below asks: if there is an h_M that is perfect on the samples, do you want to choose it?

In fact, when the hypothesis set gets bigger, bad samples (i.e. the odds of being fooled, I think) get more likely. The class example: the chance that at least one of 150 fair coins comes up heads 5 times in a row is greater than 99%. That is, the probability of mistaking an ordinary coin for a "lucky" coin gets worse. This is only an intuition and should be checked against the mathematical formulation.
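A quick check of that class number (my own arithmetic): one fair coin gives 5 heads in a row with probability 1/32, so with 150 independent coins

    \[ \mathbb{P}[\text{some coin shows 5 heads in a row}]
       = 1 - \left(\tfrac{31}{32}\right)^{150} \approx 0.9915 > 99\%. \]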

Note that this is about bad samples in the sense that the more experiments you do, the more likely you are to hit a bad sample — not my earlier understanding that only one sample is drawn. But in practice we only draw one sample (the training set is fixed) while the hypothesis set is varied, so at first glance discussing bad samples seems to matter little??

No, it really does make sense; the paragraph above missed the essence. Referring to the picture above, the point is: indeed the sample is drawn only once, but if there are many hypotheses, it is quite likely that some hypothesis happens to perform perfectly on exactly this sample (note: really just this one sample). It is just like tossing coins — the coin example is really good! Admittedly the coin example is a bit narrow, because multiple hypotheses need not have the same Eout; even with different Eout, bad samples still appear. The coin example corresponds to all hypotheses having the same Eout, but as an example it is enough to refute the strategy "with many hypotheses, blindly choose the one with the best in-sample performance"!

So

When there are more choices, the probability of a bad sample becomes bigger!!!

Then look at bad data for many hypotheses:

It really should use the union bound (not the 1 − P(...)^M I had in mind), because, as mentioned earlier, only one sample is drawn, and the probability that the data is bad for some hypothesis should be computed with the union-bound rationale.
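The union-bound step, as I understand it (M = number of hypotheses):

    \[ \mathbb{P}[\text{BAD } \mathcal{D}]
       = \mathbb{P}[\text{BAD for } h_1 \text{ or } \dots \text{ or BAD for } h_M]
       \;\le\; \sum_{m=1}^{M}\mathbb{P}[\text{BAD for } h_m]
       \;\le\; 2M\exp\!\left(-2\epsilon^{2}N\right). \]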

(Seeing this, I have to exclaim: the learning-theory part of this course is so good — more meticulous and in-depth than CS229! Expecting more surprises. I should find time later to prove the Hoeffding inequality!)
So here we can say: when the hypothesis set is finite (of size M), the machine can learn!

But what to do when M is infinite (for the perceptron, for example, the number of separating hyperplanes is obviously infinite)? It will take another two or three lectures to deal with the infinite hypothesis set.
A good question:

The last sentence: next time we will use these properties (the overlapping of bad events) to explore the infinite-hypothesis-set situation.

A summary:

In Lin's words: first be "sensational" and say that the machine cannot learn; then say that when the sample and the overall data satisfy certain conditions (i.i.d.) and constraints are imposed on the hypothesis, the machine can learn in the PAC sense, explained with the Hoeffding inequality for a single fixed hypothesis; then show that when the number of hypotheses is finite it is still learnable; and finally leave the infinite hypothesis set to be discussed later.
Classic!!!

To summarize briefly: the fourth lecture uses the bin-of-marbles analogy combined with the Hoeffding inequality to argue that a single hypothesis and a finite hypothesis set can learn!

5/20/2016 11:44:51 PM

Start proving the Hoeffding inequality! Refer to the pluskid blog post (Zhang Chiyuan):
http://freemind.pluskid.org/slt/vc-theory-hoeffding-inequality/
The blog first argues that a finite VC dimension is what makes learning possible (following the Caltech ML course), rather than the finite number of hypotheses mentioned in Lin's course; Lin should get to that later.
I gave it one pass; the details haven't sunk in yet, but I feel I can basically follow it. It is written very well, especially with the background from Lin's course — nice, I can read it on the train tomorrow! When Lin gets to learning with an infinite hypothesis set, I can come back to it, and if it's not too much trouble also watch that part of the Caltech course.

5/21/2016 12:20:10 AM Goodnight Future

Then I read the pluskid post again; I now understand the proof (absorbing its idea: roughly speaking, it is a chain of bounding steps). Remaining questions:

    1. The relationship between the Chebyshev and Hoeffding inequalities
    2. Hoeffding's original paper
    3. The learning material from CMU recommended in the blog comments (oh my God, this material is great!!!) Link
    4. After finishing Lin's course, read it again.

Started reading Hoeffding's original paper — too hard to read ... Switched to the CMU statistics notes (36-705 CMU Intermediate Statistics, Notes 2) — awesome! After the first two parts, I still haven't read the third part on the bounded difference inequality. The derivation from Markov to Chebyshev to Hoeffding is clean and smooth.

5/21/2016 11:20:08 PM

36-705 CMU Intermediate Statistics

Course description

This course will cover the fundamentals of theoretical statistics. We will cover Chapters 1–12 from the text plus some supplementary material. This course is excellent preparation for advanced work in statistics and machine learning.

Oh my God, look at this course description — as a high-end introduction to ML, a MUST SEE AH!!!!!
Consider studying 36-705 in parallel with Lin's course!!

Back to Lin Hsuan-Tien's course, after the fourth lecture.

Lecture Five (recap: the fourth lecture used the Hoeffding inequality and the bin-of-marbles idea to show that if the sample and the overall data are i.i.d. and the hypothesis is constrained — a single hypothesis, or a finite hypothesis set — then the problem is learnable). Today is about the infinite hypothesis set.

The thread of the past four lectures is briefly reviewed.

A small or a big hypothesis set each has pros and cons:

The following may take two or three lectures (3 hours) to make clear (demonstrating that PLA can indeed learn):

When M tends to infinity, computing with the union bound obviously fails, because in fact the bad-data events of many hypotheses overlap:

We can try to group the infinitely many hypotheses into classes, by the dichotomies they produce on the input points:
With one point, there are only 2 kinds of lines in this world.

With two points, there are only 4 kinds of lines.

With three points, 8 kinds of lines.

But there are not necessarily 8 (the last sentence): when the three points are collinear, there are only 6.

With four points, 14 kinds of lines (in the general case, not considering collinear points).
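My own tally of these effective numbers of lines, compared with 2^N:

    \[ \begin{aligned}
       N=1 &: 2 = 2^{1}\\
       N=2 &: 4 = 2^{2}\\
       N=3 &: 8 \text{ in general position (only 6 if collinear)} \le 2^{3}\\
       N=4 &: 14 < 2^{4} = 16
       \end{aligned} \]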

So

That's good! It feels like this is slowly connecting with the VC dimension! One question:

He says a future lecture will formally prove why it is 22 — a little teaser ... looking forward to it.

A new concept (i.e. hypotheses considered after merging the overlaps): consider using the number of dichotomies to replace M in the Hoeffding inequality.

Since the number of dichotomies depends on the particular inputs, a growth function is defined:

First consider growth functions simpler than the perceptron's (the perceptron is more troublesome):

or this

The order of magnitude of the above two growth functions is satisfactory, because polynomial growth is beaten by the exponential decay in the Hoeffding inequality.

Consider the following case, where a hypothesis labels as +1 exactly the points inside some convex set:

One finds that it can shatter any N points (placed on a circle), i.e. all 2^N dichotomies are achievable.

So far there are four kinds of growth functions in total; whether the perceptron's is polynomial or exponential will be rigorously proved next lecture.
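The four growth functions so far, as I noted them (the perceptron's is the open one):

    \[ \begin{aligned}
       \text{positive rays:}\quad & m_{\mathcal{H}}(N) = N + 1\\
       \text{positive intervals:}\quad & m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1 = \tfrac12 N^{2} + \tfrac12 N + 1\\
       \text{convex sets:}\quad & m_{\mathcal{H}}(N) = 2^{N} \quad (\text{points on a circle can be shattered})\\
       \text{2D perceptron:}\quad & m_{\mathcal{H}}(N) < 2^{N} \text{ for } N \ge 4 \text{ (polynomial or exponential? next lecture)}
       \end{aligned} \]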

Then a simple little definition related to the growth function (the break point):

Then comes a conjecture about the break point and the rate at which the growth function grows (a conjecture for now); the next lecture will prove it.

?? Doesn't the answer to the following question contradict the conclusion of the chart above — the growth rate of the growth function is not quadratic ...

A summary of Lecture 5:

Lecture Six. Talk roadmap:

If k is a break point, then k+1, k+2, ... are all break points. So does a break point put a stronger limit on how many dichotomies can be produced?

If the break point is 2, then N=3 can produce at most 4 dichotomies. Wait, there seems to be some mathematical regularity in there — worth thinking about ...?

So

Topic

bounding function

It's easy to fill out half of the table, but the rest is the hard part because it involves N > k.

Whoa — the purpose of the next question is to show that the bounding function is quite loose; the equality does not necessarily hold, and one still has to analyze the specific problem. I originally thought m_H(4) was 15 ...

Now fill in the rest of the bounding-function table — oh my God, this reminds me of dynamic programming ... :


The proof that follows is quite exciting, and it is indeed using the idea of dynamic programming.

For the example of B(4,3), first classify the dichotomies:



So

Generalized:

So we get the bound. Finally he says the ≤ can actually be an equality, which we can prove ourselves after class. I'll prove it when I have time.
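A tiny dynamic-programming sketch of the bounding-function table (my own toy code). I use the lecture's recursion B(N,k) ≤ B(N−1,k) + B(N−1,k−1) as an equality, which is what the "≤ can be =" remark says can be proven:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def B(N, k):
        """Bounding function: max number of dichotomies on N points with break point k."""
        if k == 1:
            return 1              # break point 1: a single dichotomy at most
        if N < k:
            return 2 ** N         # break point not reached yet: all dichotomies possible
        if N == k:
            return 2 ** N - 1     # exactly one dichotomy must be missing
        return B(N - 1, k) + B(N - 1, k - 1)

    print(B(4, 3))   # 11, matching the table filled in the lecture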

The growth function m_H(N) of the third case, the perceptron, is not written out, but it doesn't matter, because m_H(N) can always be bounded via a break point.

For the next question: although option 4 is not given a rigorous mathematical proof, I believe it is correct.


......

The final video says that, with the foundation above, it is not that easy to simply replace M with m_H(N); when N is large, the bound actually becomes:

The proof is very technical, so he only explains where the various constants (2, 2, 1/16, etc.) come from. If interested, we can prove it in detail ourselves after class??

But I don't even fully follow the sketch of the proof. Andrew Ng said he spent a week on the proof of the VC dimension — this must be the place. (A bit of digging found that 36-705 Lecture 3 says the proof is given in 36-702, and 36-705 is the prerequisite course for 36-702; it seems that fully understanding this really is not a simple thing) ... Here are the three steps explaining the origins of the 2, the 2, and the 1/16, respectively:



At this point, it is proved that the two-dimensional perceptron can learn!
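The final form of the bound as I copied it down (the 4, the 2N and the 1/8 come from combining the three steps above, 2 · 2 · m_H(2N) · exp(−2 · (1/16) ε²N)):

    \[ \mathbb{P}\Big[\exists\, h\in\mathcal{H}:\ |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\Big]
       \;\le\; 4\, m_{\mathcal{H}}(2N)\, \exp\!\left(-\tfrac{1}{8}\epsilon^{2}N\right). \]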

This lecture ends with a problem from which you can see that this bound is not very tight — that is, the bounded "bad" probability is not that small — because we used a lot of approximations when deriving the VC bound. Next lecture we will discuss why, given that the VC bound is not so tight, we still spend so much effort deriving it. Looking forward to the next lecture.

Summary (summed up very well): bound m_H(N) by B(N,k) via the break point, then prove B(N,k) is polynomial, then with a pictorial argument (the full theory is too hard, so it can only be stated non-rigorously) replace M with m_H(N); this introduces the classic VC bound and proves that 2D PLA can learn. Finally, a problem shows that the VC bound is not very tight; the significance of the VC bound is left in suspense for the next lecture:

After finishing this lecture I found a lot of supplementary materials:

    1. 36-705
    2. 36-702
    3. Statistical Inference (the 36-705 textbook)
    4. Vapnik's Statistical Learning Theory (SLT), recommended by the teacher

The seventh talk about VC dimension



The last inequality uses the conclusion from the figure above.


Whatever algorithm, distribution, or target function is chosen, with high probability Eout is guaranteed to be close to Ein (even if Ein is large).

Problem: it is enough that there exist N points that can be shattered.

The following section proves that the VC dimension of the d-dimensional linear classifier is d+1. The proof is ingenious: show d_vc ≥ d+1 and d_vc ≤ d+1 by constructing matrices (this was discussed at the advisor's group meeting).
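The structure of the argument, as I remember it (a sketch, not the full proof):

    \[ \begin{aligned}
       d_{\text{vc}} \ge d+1:\;& \text{take } d+1 \text{ special inputs forming an invertible } (d{+}1)\times(d{+}1) \text{ matrix } X;\\
       & \text{for any labels } y \text{ choose } w = X^{-1}y, \text{ so } \operatorname{sign}(Xw)=y:\ \text{these points are shattered.}\\
       d_{\text{vc}} \le d+1:\;& \text{any } d+2 \text{ inputs in } \mathbb{R}^{d+1} \text{ are linearly dependent, } x_{d+2} = \textstyle\sum_i a_i x_i;\\
       & \text{the labeling } y_i=\operatorname{sign}(a_i)\ (a_i\ne 0),\ y_{d+2}=-1 \text{ can never be realized, so no } d+2 \text{ points shatter.}
       \end{aligned} \]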








The following is the physical meaning of VC dimension




More in-depth understanding of VC dimension

The cost of the model complexity:
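The penalty form of the bound, roughly as given in the lecture (with probability at least 1 − δ, using m_H(2N) ≤ (2N)^{d_vc}):

    \[ E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\ln\!\frac{4\,(2N)^{d_{\text{vc}}}}{\delta}}
       \;=\; E_{\text{in}}(g) + \Omega(N,\mathcal{H},\delta). \]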

A powerful H is not always good:

Sample complexity: in theory you need about 10,000 times the VC dimension, but in practice about 10 times is enough — the VC bound is very loose.

The reasons it is so loose:

However, the VC bound is similarly loose for all models, so comparisons across models are still meaningful; moreover, the philosophical message behind the VC bound is significant for ML (do not blindly pursue a high VC dimension).

Title: Congratulations on the initial mastery of VC bound

Summary:

At the end of the slides: the next lecture extends the VC bound to a wider class of learning problems, beyond noiseless binary classification.

Eighth Talk noise and error




He says that even with noise — i.e. for a fixed x the corresponding y may differ — we can reuse the earlier bin-of-marbles idea (just that here the marble colors change over time, like chameleons) and replay the derivation of the VC bound; the VC bound still holds under this noisy model (probabilistic marbles). I think so too!

Fell asleep here (no nap at noon). Continue next time.
With a noisy model (a probabilistic target), the purpose of learning is no longer just to approximate f(x); the target becomes:

The VC bound still works, so the pocket algorithm also still works — meaning that as long as the learning algorithm can make Ein as small as possible, Eout will also be small.

Problem:

Error measure




Because of different error measures, the best g obtained by the algorithm is also different, so the error measure is added into the learning flow:

The VC bound still works for different problems (classification or regression) and different error measures; the specific mathematical proof is more complicated, so he does not elaborate.

Here is a question:

Then comes how to choose the error measure, starting from fingerprint identification:

In specific application scenarios — say, supermarket shopping discounts — a false reject and a false accept have very different severities of impact on the supermarket, so as an improvement the two are given different penalties:

For the CIA, instead, it becomes:
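A sketch of the weighted error being described (my own notation: the actual penalty numbers on the slides are replaced by placeholders c_FA and c_FR):

    \[ \text{err}(g(\mathbf{x}), y) =
       \begin{cases}
       0 & g(\mathbf{x}) = y,\\
       c_{\text{FA}} & \text{false accept } (y=-1,\ g(\mathbf{x})=+1),\\
       c_{\text{FR}} & \text{false reject } (y=+1,\ g(\mathbf{x})=-1),
       \end{cases}
       \qquad
       \text{supermarket: } c_{\text{FR}} \gg c_{\text{FA}},\quad
       \text{CIA: } c_{\text{FA}} \gg c_{\text{FR}}. \]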

But in practice it is almost impossible for the user (the supermarket, the CIA) to quantify the true error measure (because the supermarket and the CIA don't know exactly which numbers express the relative severity). So in practice a plausible or friendly (easy for the algorithm to optimize) measure is used instead; this err-hat, not err itself, is an estimate or approximation of err. See the slide.

Therefore, when evaluating the similarity of g and f, err is used, but what the algorithm actually works with is err-hat.

One question:

Does pocket still work (with a weighted error)?

Virtual replication 1000 times (which actually corresponds to visiting that example 1000 times more often, i.e. with 1000 times the probability).
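A small sketch of that equivalence (my own toy code): weighting an example by 1000 in the error is the same as virtually copying it 1000 times, or visiting it 1000 times more often in pocket's random checks.

    import numpy as np

    def weighted_01_error(y_true, y_pred, weights):
        """Weighted 0/1 error: each mistake costs its example's weight."""
        mistakes = (y_true != y_pred).astype(float)
        return np.sum(weights * mistakes) / np.sum(weights)

    y_true = np.array([+1, -1, -1])
    y_pred = np.array([+1, +1, -1])
    # giving the second example weight 1000 == virtually replicating it 1000 times
    print(weighted_01_error(y_true, y_pred, np.array([1, 1000, 1])))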



Summary

More ML algorithms will be learned in the future

Ninth Lecture Linear regression

If Eout is good (small), then learning has happened.

A better explanation of Eout than the VC-dimension one:

How is the last trace below proved?? (I think: trace(H) = trace(X(XᵀX)⁻¹Xᵀ) = trace((XᵀX)⁻¹XᵀX) = trace(I_{d+1}) = d+1, by the cyclic property of the trace.)

This part I haven't fully understood?

The learning curve of linear regression is similar in spirit to the VC bound.

Topic:

I feel Lin's treatment of linear regression is more in-depth; I should review it again later.
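A minimal numpy sketch of the closed-form (pseudo-inverse) solution, my own toy version:

    import numpy as np

    def linear_regression(X, y):
        """w_lin = pseudo-inverse(X) @ y; X should already contain the x0 = 1 column."""
        return np.linalg.pinv(X) @ y

    X = np.hstack([np.ones((4, 1)), np.arange(4.0).reshape(-1, 1)])  # bias + one feature
    y = np.array([1.0, 3.1, 4.9, 7.2])
    w = linear_regression(X, y)
    print(w, X @ w)   # fitted weights, and the in-sample predictions ("hat" of y)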

Linear regression can be used to do binary classification; this is discussed below:



From the VC bound point of view, if we reduce the regression Ein, the classification Eout will also become smaller, so regression can be used for classification.

Problem:

Some of the error measures above will be used later.
Summary:

Tenth talk about logistic regression

The Epiphany is:

    1. The difference between logistic regression and binary classification: logistic regression wants to know not whether "there is cancer" but "what is the probability of getting cancer".
    2. The purpose of the logistic function is very simple: map the original wᵀx into (0, 1) so it represents a probability.
    3. Another very important thread: the theoretical support for logistic regression being "learnable" is the earlier VC bound (the VC bound extended to the noisy-target, probabilistic case, explained at the time with probabilistic marbles).

In fact, the first two videos of the eighth lecture are exactly the two generalizations of the VC bound. The first video is what point 3 above refers to: with noise, under a probabilistic target, the VC bound still works. The second video is about error measures — we are not limited to the 0/1 error measure (PLA); the VC bound still works with many broader hypotheses and error measures.

So, with the same linear model, how is the error of logistic regression defined?



After some rearranging, we get:
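The resulting (cross-entropy) error, as I wrote it down:

    \[ \text{err}(\mathbf{w},\mathbf{x},y) = \ln\!\big(1+\exp(-y\,\mathbf{w}^{\mathsf T}\mathbf{x})\big),
       \qquad
       E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\ln\!\big(1+\exp(-y_n\,\mathbf{w}^{\mathsf T}\mathbf{x}_n)\big). \]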

Topic:

The following is the minimization of the objective function:

Derivation

Unlike linear regression, there is no closed-form solution.

So

Choosing different η and v gives different iterative optimization algorithms.
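A minimal fixed-η gradient-descent sketch for this (my own toy implementation):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_regression_gd(X, y, eta=0.1, iters=1000):
        """Minimize the cross-entropy E_in by fixed-step gradient descent.
        X: (N, d+1) with the x0 = 1 column included, y: (N,) labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            # gradient of E_in: (1/N) * sum_n sigmoid(-y_n w.x_n) * (-y_n x_n)
            grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
            w = w - eta * grad
        return w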

Topic:

Using Taylor approximation


How to choose η

After improvement:


Topic:

Summary:

All three of the preceding methods are linear models (linear classification, linear regression, logistic regression); next time they are combined for more complex classification problems (multi-class classification).

11th talk Linear models for classification

This lecture discusses using the linear models mentioned earlier — PLA, linear regression, logistic regression — for classification. PLA needs no further comment; the main point is that for linear regression and logistic regression, the VC bound can be used to show that the resulting classifiers are learnable. Because their errors upper-bound the 0/1 error, the bounds they give are looser than the PLA's. So ... this lecture is relatively simple; I just skimmed the slides once.





Topic:

Using a stochastic single-example gradient to replace the average gradient (stochastic gradient descent):



Topic:

Multi-class classification:
Using the one-versus-all (OVA) method, there may be regions that cannot be distinguished.

The workaround is to use the soft method: use probabilities (logistic regression) instead of the hard zero-one binary classification.

There can be a problem with unbalanced data. There is also a dedicated multinomial logistic regression (its difference from the multi-class classifier built from logistic regression via one-versus-all is that the latter's probabilities do not necessarily sum to 1).
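A small one-versus-all sketch with "soft" scores (my own toy code; train_binary could be, say, the logistic_regression_gd sketch above):

    import numpy as np

    def ova_train(X, y, classes, train_binary):
        """One classifier per class: relabel class k as +1, everything else as -1."""
        return {k: train_binary(X, np.where(y == k, +1, -1)) for k in classes}

    def ova_predict(X, weights):
        """Predict the class whose classifier gives the largest score w_k . x."""
        keys = list(weights.keys())
        scores = np.column_stack([X @ weights[k] for k in keys])
        return np.array([keys[i] for i in np.argmax(scores, axis=1)])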

Topic:

Continuing the discussion of OVA's imbalance problem: you can use OVO (one-versus-one) and then vote:


Problem, easy:

Summary:

Talk about nonlinearity later

12th Lecture Nonlinear transformation

Broadly speaking this is feature transformation; here it is used in the narrow sense of turning a linear classifier into a nonlinear one — essentially the idea behind kernels. Directly:








This is in fact the feature-transformation idea from the beginning of the course. Such a powerful method — does it really work for free? (The next video tells the price.)
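A minimal sketch of a 2nd-order transform Φ2 followed by any linear model (my own toy version for 2-D inputs):

    import numpy as np

    def phi2(X):
        """2nd-order polynomial transform: (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

    # a circle-like concept becomes linearly separable in Z-space:
    X = np.random.randn(200, 2)
    y = np.where(np.sum(X**2, axis=1) < 1.0, +1, -1)
    Z = phi2(X)   # now run PLA / pocket / linear or logistic regression on (Z, y)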

Problem:

Cost of calculation and storage:

The cost of model complexity

The generalization problem: can we rely on "humans eyeballing it"?

Human eyeballing is in fact "cheating" (the human brain's subconscious has a huge VC dimension).

Topic:

As the order of the nonlinear transformation increases, the VC dimension increases, and the hypothesis spaces are nested, each included in the next (somewhat like the increasing measurable spaces Zhang Zhihua mentioned in "Random Variables 1", the second lecture of his Statistical Machine Learning course ...).

With this structured hypothesis set, the following conclusion is natural: a high-order transform (high VC dimension) is not necessarily good — even if Ein is small, the VC bound is loose and Eout can instead be large.

Model selection should start from the linear model; this also embodies Occam's razor. The course's treatment of this, in terms of VC dimension, Ein and Eout, is very good.

Topic:

Summary:

The next talk goes to the dark side of the Force (also known as powerful nonlinear transformations).

13th Lecture Hazard of Overfitting

"To retreat" even if the data was previously known to have been generated by the 10-time polynomial +noise

Explanation: when the amount of data is in the gray region, H10's Eout is worse than H2's.

Even in the case of data generated by a 50th-order polynomial with no noise, H2 is still better, explained by the fact that model complexity itself plays the role of noise (deterministic noise — not fully understood; come back to this later).
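A rough sketch of the kind of experiment being described (my own simplified version: a random 10th-order polynomial target plus noise, compare 2nd- vs 10th-order fits with few data points):

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.polynomial.Polynomial(rng.standard_normal(11))   # a random 10th-order target

    def experiment(N=15, sigma=1.0, trials=200):
        e2, e10 = [], []
        x_test = np.linspace(-1, 1, 1000)
        for _ in range(trials):
            x = rng.uniform(-1, 1, N)
            y = f(x) + sigma * rng.standard_normal(N)        # noisy samples of the target
            for deg, out in [(2, e2), (10, e10)]:
                g = np.polynomial.Polynomial.fit(x, y, deg)
                out.append(np.mean((g(x_test) - f(x_test)) ** 2))
        return np.mean(e2), np.mean(e10)                     # average E_out of H2 vs H10

    print(experiment())   # with small N, H10's E_out is typically much worse than H2's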

Topic:

So when does overfitting happen? He does some detailed experiments.


The influence of noise and target complexity on overfitting — which leads to the course logo.



Topic:

How to deal with overfitting: a vivid metaphor.

Data cleaning/pruning

Data hinting

Topic:

Summary:
Overfitting: noise, not enough data, too much power; and if the target is too complex, the excess complexity can itself be regarded as noise (deterministic noise). Then there are a few ways to deal with it: data cleaning/pruning, data hinting; the next lecture will talk about regularization.

14th Lecture Regularization

Simple and direct:







Topic:

Good explanation.

An analytic solution can be obtained (in the linear regression case), called ridge regression.
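A numpy sketch of the regularized solution. I write the standard ridge form w_reg = (ZᵀZ + λI)⁻¹Zᵀy; the exact placement of constants such as N may differ from the slides:

    import numpy as np

    def ridge_regression(Z, y, lam):
        """Regularized linear regression: w_reg = (Z^T Z + lam * I)^{-1} Z^T y."""
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)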

The following way of deriving the regularization term is very novel — a highlight!


It leads to Legendre polynomials; I don't quite understand the reason??

Topic:

The VC-dimension question after regularization:

Eaug is a better proxy (for Eout):

My feeling: through regularization, Eaug plays a positive role — the process of minimizing Eaug appears to consider the whole d_vc(H), but in fact only considers d_eff(H, A), the effective VC dimension.

Topic:

Regularizers in the broader sense: ideas for choosing a good regularizer; similar ideas also appeared when choosing the error measure:

The emphasis is on explaining why L1 gives sparsity: the gradient of Ein and the normal direction of the regularizer are generally not parallel, so the optimal solution gets "pulled" to a corner (vertex), where some components are exactly zero.

From the two figures below: the larger the noise, the larger the best λ — as if the bumpier the road, the more you have to brake (regularize). Also, from the right figure, the ideal target function is of order 15, so at order 15 there is no deterministic noise. But the noise is not known in advance, so how do we choose the best λ? He keeps us in suspense until the next lecture (my guess: validation).

Topic:

Summary:

15th Lecture Validation

So many models to learn

Model selection is important — perhaps the most important issue in ML. You cannot choose "by eye", because (1) high-dimensional data cannot be "seen", and (2) the human brain's subconscious has a very high VC dimension.

You cannot select with the best Ein.

Selecting with the best Etest is cheating.

But you can use Eval — set aside some of the training data as a validation set: legal cheating :-)

Topic:

Eout(g_m*) is for the model retrained on the whole D after g_m*⁻ has been chosen (on the validation set).

The dilemma of choosing K; usually K = N/5 is chosen for the validation set.

A question — it seems validation may not be that time-consuming after all (not the 25N² option):

Leave-one-out cross-validation:

Theoretical guarantee: didn't fully follow it??

Topic:

Disadvantages of leave-one-out cross-validation: expensive to compute, unstable.

V-fold cross-validation
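A small V-fold cross-validation sketch (my own toy version; `train` fits a model, `error` scores it):

    import numpy as np

    def v_fold_cv_error(X, y, train, error, V=10, seed=0):
        """Average validation error over V folds (E_cv)."""
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, V)
        errs = []
        for i in range(V):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(V) if j != i])
            model = train(X[tr], y[tr])
            errs.append(error(model, X[val], y[val]))
        return np.mean(errs)   # pick the model with the smallest E_cv, then retrain on all data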


Topic:

Summary:

16th Lecture: Three Learning Principles


Occam's Razor



The probabilistic interpretation below is a novel idea:

Topic:

Sampling bias:




The bank loan problem from the beginning of the course: the data there is actually biased.

Topic:

Data snooping should be avoided as far as possible.

Sometimes it is not limited to literally looking at the data with your eyes; indirectly "peeking" through statistics of the data also counts — for example, scaling/shrinking statistics should not be computed using the validation data.

The longer you torture the data (the prisoner), the more easily it confesses.

Dealing with data snooping (vigilance, beware of peeking)

Topic:

Cornerstone Summary: Power of three



Validation is like a second-round competition among the models.


The next three directions

Topic:

Summary:

End of Cornerstone

6/1/2016 7:37:45 PM
