Stanford Machine Learning, Seventh Lecture: Machine Learning System Design

Source: Internet
Author: User

Original: http://blog.csdn.net/abcjennifer/article/details/7834256

This column (Machine Learning) covers single-variable linear regression, multi-variable linear regression, the Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (support vector machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. All of the content comes from Andrew Ng's explanations in the Stanford public course Machine Learning (https://class.coursera.org/ml/class/index).

Seventh Lecture: Machine Learning System Design

===============================

(i) Determining the basic strategy

(ii) Error analysis

(iii) Error metrics for skewed classes

(iv) Trading off precision and recall

(v) Data for machine learning


===============================

(i) Determining the basic strategy

In this chapter we use a practical example, classifying spam email, to illustrate how to design a machine learning system.

First, look at two emails: the one on the left is spam, the one on the right is non-spam:

Looking at their styles, we can see that spam has many characteristic features. To build a spam classifier with supervised learning, we need to extract those features and hope they distinguish spam from non-spam well.

As shown, we extract words such as deal, buy, discount, and now as features, and build a feature vector like this:

Note: in practice, rather than hand-picking 100 words that look spam-like as features, the spam classifier chooses the 100 words that occur most frequently in the spam corpus.
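The vocabulary-building step can be sketched as follows. This is a minimal illustration in Python (the course itself uses Octave), with a tiny made-up corpus and k = 3 instead of 100:

```python
from collections import Counter

def top_words(emails, k=3):
    """Return the k most frequent words across a corpus of emails."""
    counts = Counter(word for email in emails for word in email.lower().split())
    return [word for word, _ in counts.most_common(k)]

def feature_vector(email, vocabulary):
    """Binary vector: x_j = 1 if vocabulary word j appears in the email."""
    words = set(email.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

# Made-up spam corpus; in practice this would be thousands of messages
corpus = ["buy now discount deal", "discount deal now", "meeting now"]
vocab = top_words(corpus, k=3)                    # most frequent words
x = feature_vector("big discount deal today", vocab)
```

With a real corpus the vocabulary would be the top 100 words and the feature vector correspondingly longer.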

Here is what this section focuses on: how to determine the basic strategy, along with some approaches that might improve the classifier:

    • Collect large amounts of data, e.g. the "honeypot" project
    • Build more sophisticated features from the email routing information (the email header)
    • Build sophisticated, precise features from the message text, e.g. whether discount and discounts should be treated as the same word
    • Build algorithms to detect deliberate misspellings as features, e.g. "med1cine"

Of course, not all of these strategies help, as the following exercise shows:

===============================

(ii) Error analysis

In the initial phase of designing an ML system we are often unsure: what kind of system should we build? How should we model it, which features should we extract, and so on.

Here is a recommended way to build an ML system:

    • Spend at most one day (24 hours) implementing a simple algorithm, e.g. logistic regression or linear regression, with simple features rather than carefully exploring which features are most effective; then test it on the cross-validation set.
    • Plot learning curves to explore whether more data or more features would help the system.
    • Error analysis: having measured the system's performance on the cross-validation set, manually examine which examples cause the largest errors. Can systematic trends in them be exploited to reduce the error?

Taking the spam classifier as an example again, the steps of error analysis are:

    • After the simple system is built and tested on the CV set, perform error analysis: divide all misclassified spam into four classes, e.g. pharma, replica/fake, steal-password, and other.
    • Look for features that might help improve the classification of the largest class.
As shown in the following:
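The categorization step above amounts to tallying error counts per category so effort goes where the errors are. A minimal sketch in Python (the categories and counts here are made up for illustration):

```python
def tally_errors(misclassified):
    """Count how many misclassified emails fall into each spam category."""
    counts = {}
    for category in misclassified:
        counts[category] = counts.get(category, 0) + 1
    return counts

# Hypothetical categories of 100 misclassified emails from the CV set
errors = (["pharma"] * 12 + ["replica/fake"] * 4
          + ["steal-password"] * 53 + ["other"] * 31)
tally = tally_errors(errors)   # steal-password dominates, so work there first
```

Whichever category dominates the tally is where new features are most likely to pay off.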

Here we should rely not on gut feeling but on numbers. For example, when deciding whether discount/discounts/discounted/discounting should all count as the single feature discount, do not judge subjectively; measure instead: if treating them as one feature gives 3% error while treating them separately gives 5%, the numbers tell us which approach is better.

Aside: there is a piece of software called the Porter Stemmer (easy to find via Google) that treats discount/discounts/discounted/discounting as the same stem.

The same numeric comparison applies when deciding whether upper and lower case should be treated as the same feature.
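To make the stemming idea concrete, here is a toy suffix-stripper in Python. It is not the real Porter algorithm (which has many more rules), just a stand-in showing how word variants collapse into one feature:

```python
def crude_stem(word):
    """Toy stand-in for the Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # only strip if a reasonable stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

variants = ["discount", "discounts", "discounted", "discounting"]
stems = {crude_stem(w) for w in variants}   # all map to "discount"
```

After stemming, all four variants hit the same entry in the feature vector; whether that helps is then decided by the 3% vs 5% style error comparison above.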

===============================

(iii) Error metrics for skewed classes

In some cases classification accuracy and classification error cannot characterize the system well, for example with the following skewed classes.

What are skewed classes? In a classification problem whose result has only two classes, y = 0 and y = 1, if one class has very many samples and the other very few, we say the problem has skewed classes.

For example, consider the following problem:

We use logistic regression to predict whether a sample is a cancer patient. Testing the model on the cross-validation set shows 1% error, i.e. a 99% correct diagnosis rate. In fact, only 0.5% of the samples are real cancer patients. Knowing this, we build another "algorithm" to predict:

function y = predictCancer(x)
  y = 0;   % ignore the features in x entirely
return

This algorithm predicts every sample to be a non-cancer patient, so it has only 0.5% error. Viewed purely through classification error, it beats our earlier logistic regression; but we know this is just a cheap trick that cannot be used in practice. Therefore we introduce the concept of error metrics.

Consider a binary classification problem in which instances are divided into a positive class and a negative class. There are four possible outcomes. If an instance is positive and is predicted positive, it is a true positive; if an instance is negative but predicted positive, it is a false positive. Correspondingly, if an instance is negative and predicted negative, it is a true negative; if an instance is positive but predicted negative, it is a false negative.

TP: the number of positives correctly predicted positive; FN: positives incorrectly predicted negative; FP: negatives incorrectly predicted positive; TN: negatives correctly predicted negative.

With these we can build an error-metrics table (left) and define precision and recall, as shown:

You can also refer to my earlier article on ROC curves.

Precision = (positive samples predicted correctly) / (all samples I predicted positive) = TP / (TP + FP);

Recall = (positive samples predicted correctly) / (all samples whose real value is positive) = TP / (TP + FN);

Only when precision and recall are both high can we be sure the prediction algorithm works well!

OK, back to the algorithm that predicted every sample as non-cancer. Here TP = 0, FP = 0, FN = 1, TN = 199 (assuming 200 samples in total).

Because TP = 0, precision = recall = 0, which proves the algorithm is useless.

Therefore, whether or not the classes are skewed, as long as precision and recall are both high, the practicality of the algorithm is guaranteed.
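The counts above can be checked with a short sketch (Python rather than the course's Octave), using the text's numbers of 200 samples with one cancer patient and a classifier that always predicts y = 0:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); 0 when undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Always-predict-negative on skewed data: 1 positive among 200 samples
tp, fp, fn, tn = 0, 0, 1, 199
accuracy = (tp + tn) / (tp + fp + fn + tn)   # high accuracy, looks great
p, r = precision_recall(tp, fp, fn)          # precision = recall = 0
```

Accuracy comes out at 99.5% while precision and recall are both zero, exactly the failure mode the text describes.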

Exercise, try the following:

Finally, a reminder about which class to treat as positive (true) and which as negative (false). In the problem above, cancer is the positive class. In practice, in binary classification we should designate the class with fewer samples as the positive class (y = 1) and the other as negative. Do not get this wrong!

===============================

(iv) Trading off precision and recall

The previous section defined precision and recall; in this section we trade the two off against each other by plotting the precision-recall relationship.

For a prediction problem, suppose we predict as follows: predict y = 1 when h(x) ≥ threshold, otherwise y = 0.

Here threshold = 0.5.

Different thresholds give the following two cases:

    • If we want to tell a patient they have cancer only when we are very sure (so as not to frighten them needlessly): if I tell you you have cancer, you definitely have cancer; if I tell you you don't, you might still have it. In this case: higher threshold, higher precision, lower recall.
    • If we do not want patients to miss early treatment, the opposite of the previous case holds: lower threshold, lower precision, higher recall.

If this is not obvious, draw the error-metrics table and check.

So we can plot the precision-recall diagram:

Different data give curves of different shapes, but one rule is invariant:

a high threshold corresponds to high precision and low recall;

a low threshold corresponds to low precision and high recall.
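This rule can be seen by sweeping the threshold over a classifier's outputs. A small sketch in Python, with made-up h(x) values and labels:

```python
def predict(probs, threshold):
    """Predict y = 1 when h(x) >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical h(x) outputs and true labels
probs  = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0,   0,   1,    1,   0,    1]
results = {t: precision_recall(labels, predict(probs, t))
           for t in (0.3, 0.5, 0.7)}
```

On this toy data, precision rises and recall falls as the threshold increases, matching the rule above.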

☆ So, among the {precision, recall} pairs produced by different algorithms or different thresholds, how do we choose?

Suppose we now have data from three algorithms (or thresholds):

As we can see, algorithm 3 has recall = 1, i.e. it predicts y = 1 for every sample, which clearly contradicts our intent. Now look at the criteria below, writing P for precision and R for recall.

If we choose the criterion (P + R)/2, then algorithm 3 wins, which is obviously unreasonable. So we introduce another evaluation criterion: the F1 score, F1 = 2PR / (P + R).

When P = 0 or R = 0, F1 = 0;

when P = 1 and R = 1, F1 = 1, the maximum.

Applying the F1 score to the three algorithms above, algorithm 1 scores highest and is the best; algorithm 3 scores lowest and is the worst. So we use the F1 score to measure an algorithm's performance, and this is what we call the trade-off between precision and recall.
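The comparison can be replayed numerically. The (precision, recall) pairs below are hypothetical, in the spirit of the lecture's table, where an all-positive predictor has near-zero precision and perfect recall:

```python
def f1(p, r):
    """F1 score: harmonic-mean style combination of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical (precision, recall) for three algorithms
algorithms = {"algo1": (0.5, 0.4), "algo2": (0.7, 0.1), "algo3": (0.02, 1.0)}
averages = {name: (p + r) / 2 for name, (p, r) in algorithms.items()}
f1_scores = {name: f1(p, r) for name, (p, r) in algorithms.items()}
```

The simple average ranks algo3 (the predict-everything-positive one) first, while F1 correctly ranks it last and algo1 first.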

Exercise, try it (this one is a bit simpler):

===============================

(v) Data for machine learning


For machine learning we can choose many different algorithms for prediction, for example:

We can see that as the training set grows, accuracy generally improves; but in practice that is not the whole story.

For example, if I give you only the size of a house, with no information about whether it is in the city centre or a remote area, what condition it is in, and so on, we cannot make a good prediction.

This section covers how to handle the training data and such problems properly.

Recall that the last lecture introduced the definitions of, and differences between, bias and variance; here we look at the circumstances that produce them:

Bias: J(train) large, J(CV) large, J(train) ≈ J(CV); bias arises when d is small, in the underfitting stage.

Variance: J(train) small, J(CV) large, J(train) << J(CV); variance arises when d is large, in the overfitting stage.

    • To keep bias small, ensure there are enough features: linear/logistic regression should have many parameters, and neural networks should have many hidden-layer neurons.
    • To keep variance small, ensure there is no overfitting, which requires a large data set. We need J(train) and J(CV) to both be small so that J(test) is relatively small.
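These rules of thumb can be encoded as a rough diagnostic. A sketch in Python; the error values and the 0.05 threshold are made up for illustration:

```python
def diagnose(j_train, j_cv, baseline=0.05):
    """Rough bias/variance diagnostic from training and CV error.

    High bias: both errors large and close together (underfitting).
    High variance: training error small, CV error much larger (overfitting).
    """
    if j_train > baseline and abs(j_cv - j_train) < baseline:
        return "high bias (underfitting): add features or a richer model"
    if j_train <= baseline and j_cv - j_train > baseline:
        return "high variance (overfitting): get more data or regularize"
    return "looks ok"

verdict_a = diagnose(0.20, 0.22)   # J(train) ≈ J(CV), both large
verdict_b = diagnose(0.01, 0.15)   # J(train) << J(CV)
```

The first case flags high bias, the second high variance, matching the two conditions listed above.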

As shown in the following:

In summary, rational analysis of the data yields two conclusions:

first, x must contain enough features to obtain low bias;

second, the training set must be large enough to obtain low variance;

Exercises:

==============================================

This chapter covered how to design machine learning systems, involving machine learning methods, strategies, and algorithms; hopefully a firm grasp of them will reduce unnecessary wasted time.
