This column (Machine Learning) covers single-variable linear regression, multivariate linear regression, the Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (support vector machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and other chapters. All of the content comes from the Stanford public course Machine Learning, taught by Andrew Ng. (https://class.coursera.org/ml/class/index)
Lecture 7 -- Machine learning system design
===============================
(i) Deciding on a basic strategy
(ii) Error analysis
☆ (iii) Setting up error metrics for skewed classes
☆ (iv) Trade-off between precision and recall
(v) Data for machine learning
===============================
(i) Deciding on a basic strategy
In this chapter we use a practical example, spam classification, to describe how to design a machine learning system.
First, let us look at two emails: a spam message on the left and a non-spam message on the right:
Looking at their style, you can see that spam has many distinctive characteristics. To build a spam classifier, we need to do supervised learning: extract features from the emails and hope those features distinguish spam from non-spam well.
As shown in the following illustration, we extract features such as deal, buy, discount, and now, and build a feature vector like this:
Note: in practice, a spam classifier does not hand-pick 100 spam-looking words as features; instead, it uses the 100 words that occur most frequently in the corpus.
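As a concrete illustration, here is a minimal Octave sketch of building such a feature vector; the vocabulary and the email text are made-up placeholders:

% Build a binary feature vector from an email, given a vocabulary
% of the most frequent words (placeholder values for illustration).
vocab = {'buy', 'deal', 'discount', 'now'};   % in practice: the top-100 words
email = lower('Buy now! Huge discount on our best deal');
x = zeros(length(vocab), 1);
for j = 1:length(vocab)
  % x(j) = 1 if the j-th vocabulary word occurs anywhere in the email
  x(j) = ~isempty(strfind(email, vocab{j}));
end
disp(x')   % prints: 1 1 1 1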
Here is the focus of this section, how to decide on a basic strategy. Some approaches that might help the classifier work better:
- Collect a large amount of data, e.g. the "honeypot" project;
- Build more sophisticated features from the email routing information, e.g. a sender address like cheapbuying@bug.com;
- Build a sophisticated and accurate feature set from the message body, e.g. deciding whether "discount" and "discounts" should be treated as the same word;
- Build algorithms that detect deliberate misspellings and use them as features, e.g. "Med1cine".
Of course, not all of these strategies work, as the following exercise shows:
===============================
(ii) Error analysis
At the start of designing an ML algorithm we often feel unsure about what kind of system to build, how to construct the model, how to extract features, and so on.
Here is a recommended way to build an ML system:
- Spend at most one day (24 hours) implementing a simple algorithm, e.g. logistic regression or linear regression, with simple features, rather than carefully exploring which features are most effective;
- Test it on the cross-validation set, and plot learning curves to decide whether the system would benefit from more data or from more features (see the sketch after this list);
- Error analysis: having measured performance on the cross-validation set, manually examine the examples that caused large errors and look for systematic trends the system could be changed to exploit.
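For the learning-curve step, a minimal Octave sketch might look like this; Xtrain/ytrain/Xcv/ycv are assumed given, and trainModel/computeCost are hypothetical stand-ins for whichever simple algorithm was chosen:

% Plot learning curves: train on growing subsets, evaluate both costs.
m = size(Xtrain, 1);
sizes = round(linspace(10, m, 20));
for i = 1:length(sizes)
  s = sizes(i);
  theta = trainModel(Xtrain(1:s,:), ytrain(1:s));         % hypothetical helper
  Jtrain(i) = computeCost(Xtrain(1:s,:), ytrain(1:s), theta);
  Jcv(i)    = computeCost(Xcv, ycv, theta);               % always the full CV set
end
plot(sizes, Jtrain, sizes, Jcv);
legend('J(train)', 'J(CV)'); xlabel('training set size');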
Using the spam-classifier example again, let us look at the steps of error analysis:
After the simple system is built and tested on the CV set, we carry out the error analysis step: divide all the misclassified spam into four categories, pharma, replica/fake, steal password, and other, and look for features that might help improve the classification. As shown in the following illustration:
Here we should not judge by feel; it is best to quantify the effect with numbers. For example, on the question of whether discount/discounts/discounted/discounting should all count as the single feature "discount": if treating them all as one feature gives 3% error, while treating them separately gives 5% error, then the numbers show which approach is better.
PS: the Porter Stemmer, easy to find via Google, is software that treats discount/discounts/discounted/discounting as forms of the same word.
The same numeric comparison applies when deciding whether upper- and lower-case forms should be treated as one feature.
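To make "use numbers, not feelings" concrete, here is a rough Octave sketch; the suffix-stripping rule is only a crude stand-in for a real Porter stemmer, and evaluateCvError is a hypothetical helper that retrains and measures CV error with stemming switched on or off:

% Crude stand-in for a stemmer: strip common suffixes.
stem = @(w) regexprep(lower(w), '(ing|ed|s)$', '');
% stem('discounts'), stem('discounted'), stem('discounting') -> 'discount'
errWith    = evaluateCvError(true);    % hypothetical: CV error with stemming
errWithout = evaluateCvError(false);   % hypothetical: CV error without it
printf('with: %.1f%%  without: %.1f%%\n', 100*errWith, 100*errWithout);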
===============================
(iii) Setting up error metrics for skewed classes
In some cases classification accuracy and classification error cannot describe the whole system, for example with the following skewed classes.
What are skewed classes? In a classification problem whose output takes only the two values y=0 and y=1, if one class has very many samples and the other very few, we say the problem has skewed classes.
For example, consider the following problem:
We use logistic regression to predict whether samples are cancer patients. Testing the model on the cross-validation set shows 1% error, i.e. a 99% correct diagnosis rate. In fact, however, only 0.5% of the samples are real cancer patients. So we build another "algorithm" to predict:
function y = predictCancer(x)
  y = 0;   % ignore the features in x entirely
end
This way, the algorithm predicts every sample to be a non-cancer patient, and so makes only 0.5% error. Judged purely by classification error it beats the logistic regression we trained before, yet we know this is just a trick and cannot be used in practice. Therefore, we introduce a different kind of error metric.
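To see the 0.5% figure concretely, a tiny Octave check on a made-up sample of 200 patients with the same class ratio:

% 200 samples; 0.5%, i.e. exactly 1 of them, actually has cancer.
y = zeros(200, 1); y(1) = 1;    % ground truth, heavily skewed
ypred = zeros(200, 1);          % the trivial "always non-cancer" prediction
err = mean(ypred ~= y);         % classification error
printf('error = %.1f%%\n', 100*err);   % prints: error = 0.5%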
Consider a binary problem in which each instance is classified as either positive or negative. Four situations can arise. If an instance is positive and is also predicted as positive, it is a true positive; if the instance is negative but is predicted as positive, it is called a false positive. Correspondingly, a negative instance predicted as negative is called a true negative, and a positive instance predicted as negative is a false negative.
TP: positives correctly identified; FN: false negatives, positives that were not found; FP: false positives, negatives incorrectly reported as positive; TN: negatives correctly rejected.
From these we can build the error metrics table (below left) and define precision and recall, as shown in the following illustration:
You can also refer to my earlier article about the ROC curve.
Precision = correctly predicted positive samples / all samples predicted positive = TP / (TP + FP);
Recall = correctly predicted positive samples / all samples whose true label is positive = TP / (TP + FN);
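These definitions translate directly into Octave; a minimal sketch, guarding the 0/0 case that appears just below:

function [p, r] = precisionRecall(y, ypred)
  % y, ypred: 0/1 column vectors, with 1 the (rare) positive class
  tp = sum((ypred == 1) & (y == 1));
  fp = sum((ypred == 1) & (y == 0));
  fn = sum((ypred == 0) & (y == 1));
  p = tp / max(tp + fp, 1);   % precision, treating 0/0 as 0
  r = tp / max(tp + fn, 1);   % recall, treating 0/0 as 0
end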
Only when both precision and recall are high can we be sure the prediction algorithm works well!
OK, now look at the earlier algorithm that predicts every sample as non-cancer. Here TP=0, FP=0, FN=1, TN=199 (assuming 200 samples).
Because TP=0, precision = recall = 0 (taking 0/0 as 0), which proves the algorithm is unusable.
Therefore, whether or not a problem has skewed classes, as long as both precision and recall are high, the practical usefulness of the algorithm is guaranteed.
Exercise, give it a try:
Finally, a reminder about which class is "positive" and which is "negative". In the problem above, cancer is the positive class; in practical binary classification we should designate the class with fewer samples as positive and the other as negative. Do not get this backwards.
===============================
(iv) Trade-off between precision and recall
The previous section defined precision and recall; in this section we trade them off against each other by plotting how the two vary together.
For a prediction problem, suppose we predict as follows:
Here the threshold is 0.5.
Under different thresholds there are two kinds of situations:
- If we only want to tell a patient they have cancer when we are quite sure, so as not to frighten anyone unnecessarily (if I say you have cancer, you certainly do; if I say you don't, you may still have it), then: higher threshold, higher precision, lower recall.
- If instead we don't want patients to miss early treatment, the opposite of the previous case, then: lower threshold, lower precision, higher recall.
If this is not obvious, draw the error metrics table from the previous section and check.
So we can plot the precision-recall curve:
Different data give curves of different shapes, but one rule is invariant:
high threshold: high precision, low recall;
low threshold: low precision, high recall;
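A minimal Octave sketch of sweeping the threshold to trace this curve, assuming hprob holds the hypothesis outputs h(x) on the CV set and reusing the precisionRecall sketch from section (iii):

% Sweep the decision threshold and record (precision, recall) pairs.
thresholds = 0.05:0.05:0.95;
for i = 1:length(thresholds)
  ypred = (hprob >= thresholds(i));       % predict y=1 when h(x) >= threshold
  [P(i), R(i)] = precisionRecall(ycv, ypred);
end
plot(R, P); xlabel('recall'); ylabel('precision');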
☆ So, given the {precision, recall} pairs produced by different algorithms (or different thresholds), how do we choose which algorithm is better?
Suppose we now have data for three algorithms (or thresholds):
Visibly, Algorithm3 has recall = 1, i.e. it predicts y=1 for everything, which obviously violates our original intent. Consider the evaluation criteria below, writing P for precision and R for recall.
If we choose the criterion (P+R)/2, then Algorithm3 wins, which is clearly unreasonable. So we introduce another evaluation measure: the F1 score, F1 = 2PR / (P + R).
When P=0 or R=0, F1=0;
When P=1 and R=1, F1=1, its maximum;
Applying the F1 score to the three algorithms above, Algorithm1 scores highest and is the best; Algorithm3 scores lowest and is the worst. So we use the F1 score to measure an algorithm's performance, i.e. the trade-off between precision and recall.
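Applying the F1 formula in Octave; the (P, R) pairs below are assumed, illustrative values chosen to match the ranking described in the text:

% F1 = 2PR/(P+R) for three candidate algorithms (illustrative values).
P  = [0.50  0.70  0.02];
R  = [0.40  0.10  1.00];
F1 = 2 .* P .* R ./ (P + R);
disp(F1)   % Algorithm1 scores highest, Algorithm3 lowest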
Exercise, try it (this one is a bit easier):
===============================
(v) Data for machine learning
For machine learning we can choose among many different algorithms for prediction, for example:
Visibly, as the training set grows, accuracy generally improves; but in fact this is not always the case.
For example, if I only give you the size of a house, without saying whether it is in the city center or a remote area, or giving any other information, we cannot make good predictions.
Below is how to deal properly with the training data and the problem.
Recall that the previous lecture introduced the definitions of bias and variance and the difference between them; here we look at the conditions under which each arises:
Bias: J(train) large, J(CV) large, J(train) ≈ J(CV); bias arises when the model complexity d is small, the underfitting stage;
Variance: J(train) small, J(CV) large, J(train) << J(CV); variance arises when d is large, the overfitting stage;
To keep bias small, we need enough features: many parameters in linear/logistic regression, and many hidden units in a neural network. To keep variance small, we need to avoid overfitting, which calls for a large data set. We need both J(train) and J(CV) to be small so that J(test) will also be relatively small.
As shown in the following illustration:
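Separately from the illustration, the two regimes can also be told apart mechanically from the final learning-curve costs; a sketch in which the thresholds are arbitrary, assumed values:

% Crude bias/variance diagnosis from the costs (assumed thresholds).
target = 0.1;  tol = 0.05;
jt = Jtrain(end);  jc = Jcv(end);   % final costs from the learning curves
if jt > target && abs(jt - jc) < tol
  disp('high bias / underfit: add features or parameters');
elseif jc > jt + tol
  disp('high variance / overfit: add data or regularization');
end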
To sum up, rational analysis of the data yields two conclusions:
First, x must contain enough features to achieve low bias;
Second, the training set must be large enough to achieve low variance;
Exercises:
==============================================
This chapter describes how to design a machine learning system, covering the relevant methods, strategies, and algorithms; I hope you grasp them firmly, so as to avoid unnecessary wasted time.
More machine learning study material will continue to be posted; please follow this blog and Sina Weibo sophia_qing.