I previously took National Taiwan University's Machine Learning Foundations course, which uses "Learning from Data" as its textbook. I recently looked through the book, liked it, and plan to read it in depth. Its content is about the same as the NTU course, but some points are covered more deeply. Readers who want to know what the course covers can see my earlier summaries of its first chapters:
- Machine learning definition and PLA algorithm
- Classification of machine learning
- The possibility of machine learning
I plan to read the book while reviewing the lecture handouts, and then summarize in my own words the points I understood in each chapter. Corrections are welcome. I will try to use plain, conversational language, so some rigor will be lost; for the precise details, consult the book itself. Chapter 1 of "Learning from Data" mainly addresses two questions:
- It introduces a machine learning algorithm through an example.
- It gives a rough argument for why machine learning is feasible.
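As a concrete reference for the first point, here is a minimal sketch of the example algorithm, PLA, in plain Python. It assumes labels in {-1, +1} and linearly separable data; the function names, the `max_updates` cap, and the four toy points are illustrative choices, not taken from the book.

```python
def sign(v):
    """Return +1, -1, or 0 according to the sign of v."""
    return (v > 0) - (v < 0)


def pla(points, labels, max_updates=10_000):
    """Perceptron learning algorithm.

    points: list of feature tuples; labels: list of +1 / -1.
    Returns (w, number_of_updates), where w[0] is the bias weight.
    """
    w = [0.0] * (len(points[0]) + 1)
    for updates in range(max_updates):
        for x, y in zip(points, labels):
            xb = (1.0, *x)  # prepend a constant 1 so the bias lives in w[0]
            if sign(sum(wi * xi for wi, xi in zip(w, xb))) != y:
                # Misclassified point: rotate w toward x (y = +1) or away from x (y = -1).
                w = [wi + y * xi for wi, xi in zip(w, xb)]
                break
        else:
            # The inner loop found no misclassified point: w separates the data.
            return w, updates
    return w, max_updates  # cap reached; the data may not be separable


# Hypothetical, linearly separable toy data.
toy_points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
toy_labels = [1, 1, -1, -1]
weights, n_updates = pla(toy_points, toy_labels)
```

The cap on updates is only a safeguard: on data that is not linearly separable the loop would otherwise never terminate.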
On the first point: the example, PLA (the perceptron learning algorithm), is quite simple and has little practical use, but it works well as a first machine learning algorithm that gives beginners a basic impression of the field. The key to understanding it is the iterative update rule, w ← w + y·x, applied to a misclassified point (x, y): why does repeatedly applying this correction eventually converge? The slides of the Coursera course give a very intuitive explanation. When y = +1, the angle between x and w (the normal vector of the line, taken to point toward the positive side) should be less than 90°. If a point with y = +1 makes an angle greater than 90° with w, that point is misclassified, and adding the vector x to w rotates w toward x, which reduces the angle. The case y = -1 is symmetric: subtracting x rotates w away from x.

With the direction of each update settled, the remaining question is why the algorithm converges. The idea of the proof is to assume a final perfect weight vector w_f (that is, a line that separates the data set completely, which requires the data to be linearly separable) and then show that each update moves w closer in angle to w_f: after T updates the cosine of the angle between w and w_f is at least proportional to √T, and since a cosine cannot exceed 1, the number of updates must be finite. The detailed proof can be seen here.

On the second point: the PLA example is there to give beginners an intuitive picture of what a machine learning algorithm is. Personally, I think the most essential part of Chapter 1 is the second point, the rough argument that machine learning is feasible.

First, what is machine learning? The book's title gives a simple answer: a machine learning from data. This is critical, because sometimes what we do is not learning but memorization. For example, when people study a mathematical formula, the formula is already an established theoretical result that has been tested in practice; that is memorizing and then understanding, not learning. Learning is the reverse process: asking the person or machine to derive such a formula. So the prerequisite for machine learning is data; the algorithm and everything else come after.

With data in hand, we can start learning. Suppose the complete set of data is D, and the data we obtain for training is a sample d drawn from D. Learning, in essence, means obtaining a hypothesis function g whose error rate on the sample d is very small and, more importantly, whose error rate on the complete D is also very low. Then we can use g to predict future cases, which is what it means to have learned g. Strictly speaking, machine learning involves two processes: learning, which produces g, and verification. In other words, there are two questions we need to address:
- How do we obtain this hypothesis g? -- Learning
- How do we ensure that g has a low error rate on the complete D? The complete D is infinitely large, so how can we know? -- Verification
The first question is the one a machine learning algorithm has to answer: g is chosen from a set of hypotheses. How do we choose? Naturally, we pick the hypothesis that does best on our sample d and call it g. In fact the problem is not that simple, because of the second question: how do we verify that the resulting g has a low error rate on the complete D? Does the error rate on d have anything to do with the error rate on D? This is called the generalization ability of g. Here we can only resort to probability and statistics, and the magic formula is the Hoeffding inequality. For a hypothesis h, write E_in(h) for its error rate on the sample d of N points and E_out(h) for its error rate on the complete D. Then for any ε > 0:

$$P\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2e^{-2\epsilon^{2}N}$$

The formula bounds the probability that the error rates of h on d and on D differ by more than ε. Equivalently, the two error rates are within ε of each other with probability at least 1 - 2e^{-2ε²N}, and we want that probability to be as large as possible. From the formula, a larger N guarantees this, which shows how much the amount of data matters. There are two implicit conditions:
- h is fixed: the formula only holds for an h that was chosen before the data was seen;
- d must be drawn so that it reflects the probability distribution of D. This is natural: if d represents only part of D, we cannot learn the whole, and many cases will never have been seen, so how could we learn them? The sample is therefore generally drawn at random rather than hand-picked according to which data look appealing. This matches intuition, and ordinary sampling follows the same principle.
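The Hoeffding bound for a single fixed hypothesis can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the hypothesis is modeled as a biased coin whose true error rate on the full population is `mu`, the sample is drawn at random, and the function names and the values of `mu`, `eps`, `n`, and `trials` are hypothetical choices.

```python
import math
import random


def hoeffding_bound(eps, n):
    """Upper bound on P[|sample error - true error| > eps] for one fixed hypothesis."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * n)


def deviation_rate(mu, n, eps, trials=2000, seed=0):
    """Fraction of random samples of size n whose error rate misses mu by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Each sampled point is an error with probability mu (the true error rate).
        errors = sum(rng.random() < mu for _ in range(n))
        if abs(errors / n - mu) > eps:
            bad += 1
    return bad / trials


for n in (50, 200):
    print(n, deviation_rate(0.3, n, 0.1), hoeffding_bound(0.1, n))
```

In each printed row the observed deviation rate should sit below the bound, and both shrink as the sample size grows, which is the sense in which more data helps.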
Here the problem comes back. In machine learning training, the data is fixed first, and then the best hypothesis is chosen from a large number of candidates (say M of them for now). So h is not fixed in advance, and the Hoeffding inequality cannot be used directly? It cannot, but it is not useless either; it would be miserable to spend so much effort introducing the magic formula only to find that it cannot be used.

First, be clear about the problem. We want to prove that there is a bound on the probability of a large deviation, so that g has a guaranteed error rate on the complete set D. What do we know? We have M hypotheses, fixed in advance because the learning algorithm is determined before learning starts, and our chosen g is one of those M. So if the error rates of g on the sample and on the complete set differ by more than ε, the same must be true of at least one of the M hypotheses. In probability, if event A implies event B, then the probability of A is less than or equal to that of B, and the probability of a union of events is at most the sum of their probabilities. Writing E_in for the error rate on the sample of N points and E_out for the error rate on the complete set, summing the M individual bounds gives:

$$P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le \sum_{m=1}^{M} P\big[\,|E_{in}(h_m) - E_{out}(h_m)| > \epsilon\,\big] \le 2Me^{-2\epsilon^{2}N}$$

This bound is very loose, and the hypothesis sets of many machine learning algorithms appear to be infinite; for PLA, it is the set of all straight lines. That question is left to Chapter 2, whose conclusion is that there is a tighter upper bound ensuring that the error rates of g on the sample and on the complete set are close. With this, we can safely choose the hypothesis with the lowest error rate on the sample as g, because that error rate generalizes to the complete set.

One more caveat. Should we then simply look for a hypothesis with an error rate of 0 on the sample? Consider how many candidates had to be searched to find it: the more candidates, the larger M and the weaker the generalization guarantee, so the hypothesis with the smallest error on the sample will not necessarily perform well on the complete D (for an example, see the coin flip experiment here). Clearly, everything is a balance.
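The trade-off with M can be made concrete in two ways. The sketch below first inverts the finite-M bound 2M·exp(-2ε²N) ≤ δ to get a sample size, and then runs a coin-flip experiment in which the "best" of many fair coins looks heavily biased on its sample. The function names, the tolerance δ, and the numbers (1000 coins, 10 flips each) are illustrative assumptions.

```python
import math
import random


def union_bound(m, eps, n):
    """Finite-hypothesis-set bound 2 * M * exp(-2 * eps^2 * N)."""
    return 2.0 * m * math.exp(-2.0 * eps ** 2 * n)


def samples_needed(m, eps, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 * m / delta) / (2.0 * eps ** 2))


def min_heads_fraction(coins, flips, rng):
    """Flip `coins` fair coins `flips` times each; return the lowest heads fraction seen."""
    fewest = min(
        sum(rng.random() < 0.5 for _ in range(flips)) for _ in range(coins)
    )
    return fewest / flips


# More candidate hypotheses demand more data for the same guarantee.
print(samples_needed(1, 0.1, 0.05), samples_needed(1000, 0.1, 0.05))

# Every coin has a true heads rate of 0.5, yet the selected one looks far from fair.
print(min_heads_fraction(1000, 10, random.Random(0)))
```

Picking the coin with the fewest heads plays the role of picking the hypothesis with the smallest sample error: the selection itself is what makes the sample statistic unrepresentative.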
Learning from Data: Chapter 1 Summary