large "machine learning Cornerstone" course experience and summary---Part 1 (EXT)

Source: Internet
Author: User



The final exam is finally over. Here is someone else's summary for reference: http://blog.sina.com.cn/s/blog_641289eb0101dynu.html



I have been in contact with machine learning for a few years now, but I am still only a rookie. When I first encountered it my English was poor, I could not follow the lectures, and I only half understood everything. After working through some open courses and books, I gradually started to understand the concepts. They say experience has to be written down to be remembered, and that it should be shared with others. Below are some of my own rather superficial impressions and the problems I ran into; they are not necessarily correct, so please bring your own filter. If any experts spot mistakes, please point them out; criticism and guidance are welcome.


First, a few words about this NTU course: overall it leans toward theory. Originally my mindset was just to learn enough for practical application, but later I felt I could not settle for knowing the "what" without the "why". After finishing it, I really do feel a sense of having gotten over a threshold. For example ... before this I was not even clear what the lambda inside regularization meant, or what its relationship to C was; now it is clear, though perhaps still not perfectly. The teacher lectures very well and is very responsible, personally answering students' questions; I recommend everyone listen to it. Unfortunately, I did not complete two of the assignments, so I cannot get the certificate (because ... I got married in the middle of the course; you did not read that wrong, this is a flag I dare to plant). But learning the material is what matters most; at the very least I am now familiar with the professional vocabulary, so I am no longer lost when listening to other courses or reading books. Below is a summary of some of the things learned from this course (somewhat miscellaneous). Problems where machine learning can be applied generally have several characteristics:
    1. There is an underlying pattern, for example how a bank decides whether to issue a credit card to a user.
    2. The pattern is not easy to pin down: there is no explicit formula or procedure that can be used directly, so there is something to be learned.
    3. There is data; without data, there is nothing to learn from.
What machine learning does is this: using a learning algorithm, based on the training data (which comes from the target function), search the hypothesis set H for the most suitable hypothesis g, and use g to approximate the target function. The target function is the magical thing that generates all of the data. So a machine learning model = learning algorithm + hypothesis set.
============================================================================================
The first half of the class is about whether learning is possible at all, and how exactly; I will not write that part out (of course it must be possible, otherwise what would the rest of the course be about). The take-away message: for learning to be possible, the gap between E_out (test error) and E_in (training error) must not be too large. There is a result called the Hoeffding inequality which, together with the VC bound, describes the relationship between E_out, E_in and the complexity of the hypothesis set H (the bounds are sketched right after the next two bullets):
    • The more complex H is, the smaller E_in can be made, but E_out is then likely to be large (overfitting).
    • Conversely, if H is not complex enough, E_in may be large, but the gap between E_in and E_out will tend to be small (underfitting).
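For reference, here is the shape of these bounds as I remember them from the course (written from memory, so treat the exact constants as a sketch rather than a quotation of the slides): Hoeffding controls the gap between E_in and E_out for one fixed hypothesis, and the VC bound extends this to the whole hypothesis set through the growth function m_H, which is where d_VC enters.

```latex
% Hoeffding inequality, for a single fixed hypothesis h and N samples:
\[
  \Pr\bigl[\,\lvert E_{\text{in}}(h) - E_{\text{out}}(h)\rvert > \epsilon\,\bigr]
  \;\le\; 2\exp\!\bigl(-2\epsilon^{2}N\bigr)
\]
% VC bound, for the hypothesis g picked out of the whole set H;
% the growth function m_H(N) is bounded by roughly N^{d_VC}:
\[
  \Pr\bigl[\,\lvert E_{\text{in}}(g) - E_{\text{out}}(g)\rvert > \epsilon\,\bigr]
  \;\le\; 4\,m_{\mathcal{H}}(2N)\,\exp\!\bigl(-\tfrac{1}{8}\epsilon^{2}N\bigr)
\]
```

The second bound is where the complexity of H shows up: the more expressive H is (larger d_VC, larger m_H), the weaker the guarantee for a fixed N, which is exactly the overfitting direction in the first bullet above.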
From the point of view of the amount of data N: for learning to be possible, in theory N needs to be about 10,000 times d_VC, but in practice about 10 times is usually enough. d_VC (the VC dimension) is a measure of the complexity of the hypothesis set H: the larger it is, the greater the ability to fit complex data. In practice the value of d_VC is roughly the number of degrees of freedom (but not always); for example, with 100 features you have on the order of 100 degrees of freedom, so you want at least around 1000 data points.
For features, there are several types:
    • concrete: has a clear physical meaning and is directly related to the problem (domain knowledge);
    • raw: has only a simpler, more primitive physical meaning;
    • abstract: has no physical meaning at all.
================================================================================
PLA, linear regression, logistic regression: all three can actually be used for classification, and each has its advantages and disadvantages; see Lecture 11 of the courseware for details. Before running PLA or logistic regression, you can first use linear regression to find a weight vector w and use it as the initial value. For linearly inseparable data, between pocket PLA and linear regression, in practice we lean toward the latter. Logistic regression outputs a value between 0 and 1 (a probability estimate rather than a hard class label), and it still belongs to the linear classifiers.
================================================================================
Sometimes our model performs poorly because it underfits, and we need to add more features. These can be other features, or transformations of the existing ones, such as adding polynomial terms. But if we add too aggressively, we may overfit. For example, if the target function is generated by a 50th-order polynomial and you model it with a 10th-order polynomial, you may overfit. In fact, even if the target function really is generated by a 10th-order polynomial and you model it with a 10th-order polynomial, you can still overfit; in the courseware's example, a 2nd-order polynomial actually does better (a small numerical demo follows the list of conditions below). This almost feels like a philosophical question: "The way of heaven is to take from what has excess and give to what is lacking; thus the insufficient overcomes the surplus." --- the Nine Yin Manual. As the saying goes, more is not always better; the same truth applies here. So when you try models, start with the simplest linear model, and even if you want to add polynomial terms, start from the low orders. There are several conditions under which overfitting appears:
    1. The data set is too small; with only a few points there is not much to learn from.
    2. The noise in the data is too large (stochastic noise, i.e. random noise).
    3. The model used is too complex (this also behaves like a kind of noise, called deterministic noise).
    4. The model is too complex relative to the data (so-called excessive power; this can be merged with point 3).
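To make the "10th-order versus 2nd-order" point concrete, here is a minimal numerical sketch (my own toy example in plain numpy, not the courseware's exact experiment): the data comes from a 10th-order polynomial target plus noise, and we compare a 2nd-order and a 10th-order least-squares fit on a small sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a fixed 10th-order polynomial (coefficients chosen arbitrarily).
true_coef = rng.normal(size=11)          # a degree-10 polynomial has 11 coefficients
f = np.polynomial.Polynomial(true_coef)

def sample(n, noise=0.5):
    """Draw n noisy points (x, y) with y = f(x) + Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    return x, f(x) + rng.normal(scale=noise, size=n)

x_train, y_train = sample(15)       # small, noisy training set
x_test, y_test = sample(2000)       # large held-out set to estimate E_out

for degree in (2, 10):
    fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    e_in = np.mean((fit(x_train) - y_train) ** 2)
    e_out = np.mean((fit(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}:  E_in = {e_in:.3f}   E_out = {e_out:.3f}")

# Typical outcome with a sample this small: the degree-10 fit has the smaller
# E_in but a much larger E_out (it overfits), while degree 2 is the safer
# choice -- matching the 'start from low order' advice above.
```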
============================================================================================
Of course, if we still want to use a more complex model, that is fine too, but then we need to add regularization. That way we can both enjoy the power of the complex model and rein in its capacity. The role of regularization is to step back from a complex hypothesis set toward a simpler one (for example from H10 back toward H2). The relationship between H10 and H2 (counting coefficients from w0):
    1. H2 is H10 with every coefficient from w3 onward constrained to be 0 (so a whole batch of higher-order terms disappears at once);
    2. Relax the condition: any 3 of H10's coefficients may be nonzero, and the rest must be 0 (it no longer matters which 3);
    3. Relax further: only require that the sum of squares of H10's coefficients be small, w^T w <= C (so we can enjoy H10's power/complexity without letting it run wild; a small sketch follows this list).
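The soft constraint w^T w <= C in point 3 is equivalent (via a Lagrange multiplier) to minimizing the original error plus a penalty term lambda * w^T w, with a larger lambda corresponding to a smaller C. A minimal sketch of this in plain numpy (regularized linear regression, i.e. ridge regression; the data and names are mine, purely illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized linear regression: minimize ||Xw - y||^2 + lam * ||w||^2.
    Closed form: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: 20 noisy points expanded into 10th-order polynomial features,
# so the unregularized fit has plenty of room to overfit.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=20)
X = np.vander(x, N=11, increasing=True)   # columns: 1, x, x^2, ..., x^10

for lam in (0.0, 1e-6, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:g}:  ||w|| = {np.linalg.norm(w):.2f}")

# As lambda grows, the weights shrink: the hypothesis is pulled back from the
# full H10 toward something effectively simpler, which is exactly the
# 'step back from complex to simple' role of regularization described above.
```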
So even though we chose the more complex model H10, we can still control it by adding regularization and tuning lambda. Remember to start from a very small value: a little regularization goes a long way, and adding too much may cause underfitting (see the courseware for the specifics).
============================================================================================
Very often we have tried not just one model but many kinds. So how do we choose? What is needed here is validation. Remember, what we are doing is model selection, and model = algorithm + the corresponding hypothesis set. To do validation:
    1. Divide the data into three parts: training, validation, and test sets. Each candidate model is trained on the training data, and from its own hypothesis set it picks the best hypothesis g as the representative of that model.
    2. Then each representative is evaluated on the validation data, and we choose the model M whose representative g performs best there.
    3. The training and validation data are then merged, and M is retrained on the combined data to obtain a final hypothesis g*, our final approximation to the target function.
    4. To judge how well this g* actually does, we measure it on the held-out test data (a sketch of the whole procedure follows this list).
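Here is a minimal sketch of this train/validation/test procedure (my own illustration; scikit-learn's LogisticRegression with different regularization strengths stands in for the candidate models, but any learners would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary-classification data, purely illustrative.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=600) > 0.5).astype(int)

# Step 1: split once into train / validation / test; the test set is not
# touched again until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Each candidate model trains on X_tr and yields its representative hypothesis g_m.
candidates = {C: LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
              for C in (0.01, 0.1, 1.0, 10.0)}

# Step 2: choose the model whose representative does best on the validation data.
best_C = max(candidates, key=lambda C: candidates[C].score(X_val, y_val))

# Step 3: retrain the chosen model on train + validation to obtain g*.
g_star = LogisticRegression(C=best_C, max_iter=1000).fit(X_rest, y_rest)

# Step 4: only now look at the test set, as the final estimate of g*'s quality.
print(f"chosen C = {best_C},  test accuracy of g* = {g_star.score(X_test, y_test):.3f}")
```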
Here is a question that had me tangled up for a while. In step 2 we select the best-performing representative g and its corresponding model M, but in step 3 we merge the training and validation data and retrain M to get g*. The question: could one of the other models perform better on the merged data? This is actually a trap. To know whether another model is better on the merged data, we would need yet another validation set to measure it on, and we do not have one (the test data absolutely must not be touched for this!), nor can we judge by performance on the training data (that estimate is overly optimistic). If we could find new validation data, the process could go on indefinitely, since we could always ask whether some other model retrained on the ever-larger merged set would be better. So there is no way to answer the question. Theoretically, though, g* should be better than g, because it is learned from more data.
============================================================================================
Finally, three points of guiding ideology:
    1. Simple is good. The great way is simple.
    2. Beware of sampling bias; the class should match the exam, i.e. your training data and your test data should come from the same distribution. But think about it: is that really true? Take the bank credit card problem. The bank's historical data covers only the people it actually gave cards to, recording whether they then spent recklessly; we do not know whether the people who were rejected would have spent recklessly had they been given cards. So that data has already been filtered. At application time there is no such filtering, so there is a bias (how to fix it is not covered; check the literature). In other words, we also have to ask which regions the data ought to cover and whether the data we have is biased.
    3. Do not data-snoop, or the complexity in your own head gets added to the model. More subtly, people habitually take the whole data set and run an exploratory analysis first to look at its statistics. But if the data used in that analysis includes your test set, then indirect data snooping has already happened. In short, whatever you plan to do, first split the data set into train and test and only work on train; once the scheme is fixed (for example which features to select or which transformations to apply, decided on train), apply exactly the same scheme to test (see the P.S. at the end).
============================================================================================
That's it for Part 1.
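P.S. A minimal sketch of the "fit on train only" discipline from point 3 (assuming scikit-learn, with standardization as the example transformation; names and data are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative.
rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Split FIRST; the test set plays no part in any fitting below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the transformation scheme (here: standardization) on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then merely APPLY the already-fixed scheme to both sets.
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))

# The wrong way would be to call scaler.fit on the full data before splitting:
# the test set's statistics would then leak into training, which is a form of
# indirect data snooping.
```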



