Course address: https://class.coursera.org/ml-003/class/index
Instructor: Andrew Ng
1. Prioritizing what to work on: Spam classification example
Having covered some theory and diagnostic methods, this section works through a practical problem: a spam classification system. Most people who have used email know what spam is and have suffered from it. If you do not know what spam looks like, consider the following example:
The email on the left is obviously spam: first look at the strange sender address, then at the misspelled words throughout the message body. It was probably produced by a program rather than written by a person. In contrast, the email on the right is not spam. To tell spam from non-spam, the first thing we need is a set of features that characterize spam. Once we have these features, we can identify spam with supervised classification.
As for features, we can start with words:
As shown above, you can select 100 words to obtain a 100-dimensional vector, then check whether each of these words appears in an email. If a word appears, set the corresponding position to 1 (otherwise 0). In this way, each email corresponds to one 100-dimensional vector. In practice, 100 words are clearly not enough. To improve accuracy, you can use the following methods:
Collect a large amount of data (the obvious option)
Build more sophisticated features from the email routing information, such as the sender's address
Build more sophisticated and accurate features from the message body, for example, whether "discount" and "discounts" should be treated as the same word
Develop an algorithm to detect deliberate misspellings, such as "med1cine" for "medicine"
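The word-based feature vector described above can be sketched roughly as follows. The vocabulary and the helper name `email_to_features` are hypothetical and chosen only for illustration; a real system would use 100 words or many more.

```python
import re

# Hypothetical vocabulary; a real system would use 100 or more words.
VOCABULARY = ["buy", "deal", "discount", "now", "andrew"]

def email_to_features(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the email."""
    # Lowercase the text and split it into alphanumeric tokens.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

features = email_to_features("Deal of the week! Buy now")
print(features)
```

Each position in the vector answers only "does this word occur?", so word counts and word order are ignored in this sketch.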
2. Error Analysis
Now that we know what could be done, how should we proceed? Here Professor Andrew Ng gives his recommendations:
Implement a simple algorithm as quickly as possible, whether logistic regression or linear regression, using simple features at first, and test it on the cross-validation set;
Plot learning curves to decide whether adding more data or adding more features is likely to help the system;
Error analysis: manually inspect the examples the algorithm got wrong. Is there a systematic trend in the kinds of examples that produce errors?
After implementing and validating a simple algorithm, we perform error analysis on the model, sorting the misclassified emails into four types (pharma, replica/fake, steal passwords, other):
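One way to carry out this tally is sketched below. The list of labels is made up for illustration (it is not data from the course); in practice each label would come from reading a misclassified cross-validation email by hand.

```python
from collections import Counter

# Hypothetical manual labels for misclassified cross-validation emails.
misclassified_labels = [
    "pharma", "steal passwords", "replica/fake", "steal passwords",
    "other", "steal passwords", "pharma",
]

def tally_errors(labels):
    """Count misclassified emails per category, most frequent first."""
    return Counter(labels).most_common()

for category, count in tally_errors(misclassified_labels):
    print(f"{category}: {count}")
```

The category with the most errors is usually the most promising place to spend effort on new features.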
Next we can consider whether different forms of a word should be treated as the same word. We should not take the answer for granted, but decide by comparing the error rate with and without this treatment. For example, should the various forms of "discount" be treated as one word?
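To make the idea of merging word forms concrete, here is a deliberately crude suffix stripper. It is only an illustration and is not a real stemming algorithm such as Porter stemming; the function name `crude_stem` and the suffix list are assumptions.

```python
def crude_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["discount", "Discounts", "discounted", "discounting"]
stems = {crude_stem(w) for w in variants}
print(stems)
```

Whether such merging helps is an empirical question: train once with the words stemmed and once without, then keep whichever version gives the lower cross-validation error.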
3. Error metrics for skewed classes
What are skewed classes? Consider a classification problem with only two outcomes, y = 0 and y = 1, in which one class has many samples and the other has very few. Such classes are called skewed classes. For example, suppose we want to determine whether a patient has cancer and our classifier achieves an error rate of 1%, but only 0.5% of patients actually have cancer. By comparison, simply predicting that no one has cancer gives an error rate of only 0.5%.
Judged by classification error alone, this trivial predictor is better than our logistic regression. Of course, we know it is just a trick and useless in practice. This motivates error metrics built from a matrix of prediction outcomes:
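The skewed-class effect is easy to reproduce. The sketch below assumes a made-up population of 1000 patients, 5 of whom (0.5%) have cancer, and evaluates a "classifier" that always predicts y = 0.

```python
def error_rate(predictions, labels):
    """Fraction of examples where the prediction differs from the label."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Hypothetical population: 5 positive cases out of 1000 (0.5%).
labels = [1] * 5 + [0] * 995

# A trivial predictor that ignores its input and always answers 0.
always_negative = [0] * len(labels)

err = error_rate(always_negative, labels)
print(f"error rate of always predicting y = 0: {err:.3%}")
```

The error rate looks excellent even though this predictor never identifies a single patient with cancer, which is why classification error alone is misleading here.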
Consider a binary classification problem, in which each instance is assigned to the positive class (Positive) or the negative class (Negative). There are four possible outcomes. If an instance is positive and is predicted positive, it is a true positive (True Positive); if it is negative but predicted positive, it is a false positive (False Positive). Correspondingly, if an instance is negative and predicted negative, it is a true negative (True Negative); if it is positive but predicted negative, it is a false negative (False Negative). In summary:
TP: the number of positive instances correctly predicted as positive;
FN: the number of positive instances incorrectly predicted as negative (misses);
FP: the number of negative instances incorrectly predicted as positive (false alarms);
TN: the number of negative instances correctly predicted as negative;
So we obtain the matrix on the left and can define precision (Precision) and recall (Recall).
Precision: correctly predicted positive samples / all samples predicted positive, i.e. TP / (TP + FP);
Recall: correctly predicted positive samples / all samples that are actually positive, i.e. TP / (TP + FN);
For more information about these two values, see: http://en.wikipedia.org/wiki/Precision_and_recall
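The two definitions translate directly into code. In the sketch below, the example predictions and labels are invented, and returning 0.0 when a denominator is zero is a convention chosen here for convenience, not something stated in the course.

```python
def confusion_counts(predictions, labels):
    """Return (TP, FP, FN, TN) for 0/1 predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    return tp, fp, fn, tn

def precision_recall(predictions, labels):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(predictions, labels)
    # Assumed convention: report 0.0 when the ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(predictions, labels)
precision, recall = precision_recall(predictions, labels)
print(counts, precision, recall)
```

Applied to the always-predict-0 trick, recall comes out as 0, which exposes the problem that the plain error rate hides.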
4. Trading off precision and recall
With precision and recall defined, we can use these two values as evaluation criteria. Consider the cancer diagnosis problem again. There is a threshold to choose: above what predicted probability do we declare cancer? If the threshold is set high, for example 99%, precision is high but recall is very low; if it is set low, precision is very low but recall is very high. We would like both to be as large as possible, so there is a trade-off.
It can also be viewed this way: suppose we want to tell patients they have cancer only when we are very confident, so as not to alarm them unnecessarily. Then when we say you have cancer, you almost certainly do; but when we say you do not, you may still have it. This corresponds to a high threshold, high precision and low recall. If instead we do not want patients to miss early treatment, the opposite holds: a low threshold, low precision and high recall.
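The effect of the threshold can be illustrated with a small sweep. The predicted probabilities and labels below are invented for the example, and `precision_recall_at` is a hypothetical helper name.

```python
# Hypothetical predicted probabilities of cancer and the true labels.
probabilities = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold, probabilities, labels):
    """Predict 1 when probability >= threshold, then score the result."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.9, 0.5, 0.3):
    p, r = precision_recall_at(threshold, probabilities, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall, and lowering it does the reverse, matching the trade-off described above.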
Since we need to weigh the two values, how should we combine them? Pick the algorithm with the largest mean of the two?
For the three algorithms above, if the mean is used, Algorithm 3 wins, yet Algorithm 3 actually predicts y = 1 in every case, which is clearly not a useful classifier. Using the F1 score, F1 = 2PR / (P + R), instead, the score is large only when both P and R are large; if one is large and the other small, the F1 score stays small. So when selecting an algorithm, you only need to pick the one with the largest F1 score.
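The difference between the simple mean and the F1 score shows up clearly in a small comparison. The precision and recall values below are illustrative assumptions, chosen so that the third algorithm behaves like one that predicts y = 1 for everything; they are not taken from the text above.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R), defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs; assumed values, not measured.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),  # behaves like "always predict y = 1"
}

best_by_mean = max(algorithms, key=lambda name: sum(algorithms[name]) / 2)
best_by_f1 = max(algorithms, key=lambda name: f1_score(*algorithms[name]))

for name, (p, r) in algorithms.items():
    print(f"{name}: mean={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")
print("best by mean:", best_by_mean)
print("best by F1:", best_by_f1)
```

The mean rewards the degenerate algorithm for its perfect recall, while F1 penalizes its very low precision.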
5. Data for Machine Learning
In machine learning, many different methods can be applied to a prediction problem. In general, accuracy improves as the amount of data increases:
Of course, this is not always the case. As mentioned earlier, adding training samples does not improve prediction accuracy in some situations; we set that aside for now. What we need to consider is whether the data contains enough information for a model to make the prediction. If it does not, so that even a human expert could not make the prediction, why should we expect a machine to predict accurately?
To keep the bias small, we need enough features. To keep the variance small, we need to avoid overfitting, which calls for a large training set. Only when both J(train) and J(CV) are small will J(test) be relatively small.
----------------------------------------------------------------------------------------------------
This section described how to design a spam classification system and introduced an important set of error metrics based on the matrix of prediction outcomes. When judging an algorithm, we should not look only at the error rate but also at precision and recall, which must be traded off against each other. For convenience, the two values are combined into the F1 score, so from now on we only need to compare F1 scores to judge whether an algorithm is good or bad.