11.1 What to do first

11.2 Error Analysis

11.3 Error metrics for skewed classes

11.4 The tradeoff between recall and precision

11.5 Machine-Learning data

**11.1** **What to do first**

The next few videos discuss the design of machine learning systems. They cover the major problems you will encounter when designing a complex machine learning system, and offer advice on how to build such a system sensibly. The material that follows is not mathematically deep, but it is very useful and may save you a great deal of time when building large machine learning systems.

This week, consider a spam classifier as a running example. To solve such a problem, the first decision is how to choose and represent the feature vector x. We can pick a list of the 100 words that appear most often in spam and, for each email, record whether each word appears (1 if it appears, 0 if not), giving a feature vector of size 100x1. To improve this classifier, we could do many things, such as:

1. Collect more data, so that we have more spam and non-spam examples

2. Develop a set of sophisticated features based on the email routing information

3. Develop a set of sophisticated features based on the message body, including how to handle word variants (stemming)

4. Develop sophisticated algorithms to detect deliberate misspellings (e.g. writing "watch" as "w4tch")

Among these options, it is very hard to decide which deserves your time and effort, and making that choice deliberately beats going by gut feeling. When working on machine learning problems, we can always brainstorm a list of approaches to try; in fact, if you take the time to list out the possible methods, you are already ahead of most people. Most people never enumerate the alternatives; they simply wake up one morning and, for whatever reason, act on a whim: "Let's run a honeypot project to collect lots of spam data."
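The 100-word binary feature vector described above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation; the five-word `vocabulary` here is a tiny stand-in for the real top-100 spam word list.

```python
def make_feature_vector(email_text, vocabulary):
    """Return x where x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

# Stand-in for the list of the 100 most common spam words:
vocabulary = ["buy", "discount", "deal", "andrew", "now"]

x = make_feature_vector("Buy now and get a discount", vocabulary)
print(x)  # [1, 1, 0, 0, 1]
```

In practice the vector would have 100 entries, one per word in the chosen list, and each email would be mapped to such a vector before training.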

**11.2** **Error Analysis**

In this lesson, we will introduce the concept of error analysis, which will help you make decisions more systematically. When you start on a machine learning problem or build a machine learning application, the best practice is not to begin with a very complex system full of elaborate features, but to build a simple algorithm that you can implement quickly.

Whenever I work on a machine learning problem, I spend at most a day, literally 24 hours, building a quick implementation to get a result, even a poor one. Frankly, it has no sophistication at all; the point is simply to get a result quickly, run it, and test it on the cross-validation data. Once that is done, you can plot learning curves and examine the training and test errors to determine whether your algorithm suffers from high bias, high variance, or some other problem. Only after this analysis is it worth deciding whether to train on more data or to add more features.

The reason is this: when you first attack a machine learning problem, you do not know in advance whether you need more complex features, more data, or something else. Knowing what to do ahead of time is very difficult, because you lack the evidence, namely the learning curves. It is therefore hard to know where your time is best spent improving the algorithm. But once you have even a simple, imperfect implementation, plotting learning curves lets you make that choice. In this way you avoid the software-engineering problem of premature optimization: we should let evidence, not gut feeling, guide how we allocate our time, because intuition is usually wrong.

Besides plotting learning curves, another very useful practice is error analysis. By that I mean: when building the spam classifier, I look at my cross-validation set and personally examine the emails that the algorithm misclassified. Looking at these misclassified spam and non-spam messages, you can often spot systematic patterns: which types of email are consistently misclassified. Often this process suggests new features to construct, or reveals the weaknesses of the current system and hence how to improve it.

The recommended method for building a learning algorithm is:

1. Start with a simple algorithm that you can implement quickly; implement it and test it on your cross-validation data

2. Plot learning curves to decide whether more data, more features, or something else is likely to help

3. Error analysis: manually examine the examples in the cross-validation set that your algorithm misclassified, and see whether they show any systematic trend
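Step 2 above, plotting learning curves, can be sketched as follows. This is a toy example under stated assumptions (synthetic data `y = 2x + noise`, a least-squares linear fit); the shape of the printed errors, training error rising and cross-validation error falling as the training set grows, is what a healthy learning curve looks like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + Gaussian noise.
X = rng.uniform(0, 10, size=200)
y = 2 * X + rng.normal(0, 1, size=200)
X_train, y_train = X[:150], y[:150]
X_cv, y_cv = X[150:], y[150:]

def fit_and_errors(m):
    """Fit a line on the first m training examples; return (train_error, cv_error)."""
    A = np.column_stack([np.ones(m), X_train[:m]])
    theta, *_ = np.linalg.lstsq(A, y_train[:m], rcond=None)
    train_err = np.mean((A @ theta - y_train[:m]) ** 2) / 2
    A_cv = np.column_stack([np.ones(len(X_cv)), X_cv])
    cv_err = np.mean((A_cv @ theta - y_cv) ** 2) / 2
    return train_err, cv_err

# The learning curve: errors as a function of training set size m.
for m in (5, 20, 50, 150):
    tr, cv = fit_and_errors(m)
    print(f"m={m:3d}  train={tr:.3f}  cv={cv:.3f}")
```

Plotting these pairs (here simply printed) is what lets you diagnose high bias (both errors high and close) versus high variance (a large gap between them).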

Taking the spam filter as an example, error analysis means manually examining the emails in the cross-validation set that our algorithm misclassified, and seeing whether they can be grouped into categories, such as pharmaceutical spam, counterfeit-goods spam, or password-stealing (phishing) mail. Then look at which category the classifier gets wrong most often and start optimizing there. Think about what would improve the classifier: for example, check whether certain features are missing, and count how often each problem occurs. For instance, record how many misclassified emails contain deliberate misspellings, how many have unusual routing, and so on, then start with the most frequent case.

Error analysis does not always tell us what action to take. Sometimes we need to try different versions of the model and compare them, and when comparing models we need a single numerical metric, usually the error on the cross-validation set. In our spam classifier example: should we treat discount/discounts/discounted/discounting as the same word? If doing so improves our algorithm, we would use stemming software. Error analysis alone cannot settle such a question; we can only try the two variants, with and without stemming, and pick the one with the better numerical result.
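A minimal sketch of this kind of numerical comparison, assuming a toy stemmer (strip a trailing "s") and a generic `predict(x)` classifier interface; the two `classifier_*` names in the comments are hypothetical placeholders, not real course code.

```python
def crude_stem(word):
    """Toy stemmer: treat 'discount'/'discounts' as one token by stripping a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def cv_error(predict, examples):
    """Single-number metric: fraction of cross-validation examples classified wrongly."""
    return sum(predict(x) != y for x, y in examples) / len(examples)

# Hypothetical usage: build two variants and keep whichever scores lower, e.g.
#   err_plain   = cv_error(classifier_plain,   cv_set)
#   err_stemmed = cv_error(classifier_stemmed, cv_set)
```

The point is that one number per variant replaces hours of manually inspecting misclassified emails.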

When you are developing a learning algorithm, you will keep trying new ideas and implementing many versions of it. If every time you try a new idea you have to manually inspect examples to judge whether performance got better or worse, it becomes very hard to decide, say, whether to use stemming or whether to be case-sensitive. With a single quantitative evaluation metric, you can simply look at one number and see whether the error went up or down. This lets you test new ideas quickly: the number tells you almost immediately whether an idea improved the algorithm or made it worse, which greatly speeds up iteration. I strongly recommend doing error analysis on the cross-validation set rather than on the test set. Some people do error analysis on the test set, but that is methodologically inappropriate, so again: do it on the cross-validation set.

To summarize: when you start on a new machine learning problem, I always recommend implementing a simple, fast, even imperfect algorithm first. I rarely see people do this; what usually happens is that people spend a great deal of time up front building the elaborate system they believe is needed. So do not worry that your algorithm is too simple or too imperfect; implement it as quickly as possible. Once you have an initial implementation, it becomes a very powerful tool for deciding what to do next: you can look at the errors it makes and, through error analysis, determine what it gets wrong and how to optimize. Moreover, with a fast (if imperfect) implementation plus a numerical evaluation metric, you can try new ideas and quickly find out whether each one improves performance, letting you decide much faster what to discard from the algorithm and what to keep.

Error analysis can help us systematically choose what to do.

**11.3** **Error metrics for skewed classes**

In the previous lesson I mentioned error analysis and the importance of setting an error metric: a single real number that evaluates your learning algorithm and measures its performance. One important caveat is that the choice of error metric can have a subtle but significant effect on your learning algorithm. This matters especially for the problem of skewed classes.

Class skew arises when our training set contains very many examples of one class and very few (or none) of the others. For example, suppose we want an algorithm to predict whether a tumor is malignant, and in our training set only 0.5% of the examples are malignant tumors. Then a non-learning "algorithm" that predicts benign in every case has an error of only 0.5%, while the neural network we obtained through training has a 1% error. In this situation, the raw error rate cannot serve as the basis for evaluating the algorithm.
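The benign-always baseline can be checked in a few lines. This is an illustrative sketch with 1000 synthetic labels at the 0.5% malignancy rate from the example.

```python
# 1 = malignant, 0 = benign; 5 of 1000 cases (0.5%) are malignant.
labels = [1] * 5 + [0] * 995
always_benign = [0] * len(labels)  # the non-learning "classifier"

error = sum(p != y for p, y in zip(always_benign, labels)) / len(labels)
print(error)  # 0.005 -- beats the trained network's 1% error, yet is useless
```

This is exactly why accuracy alone is misleading on skewed classes and why the precision/recall metrics below are needed.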

**Precision** and **Recall**. We divide the algorithm's predictions into four cases:

1. **True positive** (TP): predicted positive, actually positive

2. **True negative** (TN): predicted negative, actually negative

3. **False positive** (FP): predicted positive, actually negative

4. **False negative** (FN): predicted negative, actually positive

Precision = TP/(TP+FP): of all the patients we predicted to have malignant tumors, the percentage who actually do; the higher the better.

Recall = TP/(TP+FN): of all the patients who actually have malignant tumors, the percentage we successfully predicted; the higher the better.

Measured this way, the algorithm that always predicts the tumor is benign has a recall of 0, which exposes its uselessness.
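The two definitions above translate directly to code. A minimal sketch (the function name and the toy labels are illustrative, not from the course):

```python
def precision_recall(predictions, labels):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN); 1 = malignant, 0 = benign."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# The always-benign classifier never predicts 1, so its recall is 0:
labels = [1, 0, 0, 1, 0]
print(precision_recall([0, 0, 0, 0, 0], labels))  # (0.0, 0.0)
```

Note the convention of defining 0/0 as 0 here, so the degenerate classifier scores 0 rather than raising an error.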

**11.4** **The tradeoff between recall and precision**

In the previous lesson, we discussed precision and recall as evaluation metrics for skewed-class problems. In many applications, we want to keep precision and recall in relative balance. In this lesson, I will show how to do that, and also how to combine precision and recall into a more effective single evaluation metric for an algorithm. Continuing with the tumor-prediction example: suppose our algorithm outputs a value between 0 and 1, and we use a threshold of 0.5 to decide between positive and negative.

Precision = TP/(TP+FP): of all the patients we predicted to have malignant tumors, the percentage who actually do; the higher the better.

Recall = TP/(TP+FN): of all the patients who actually have malignant tumors, the percentage we successfully predicted; the higher the better.

If we want to predict positive (tumor is malignant) only when we are very confident, that is, we want higher precision, we can use a threshold larger than 0.5, such as 0.7 or 0.9. Doing so reduces the number of patients wrongly told they have a malignant tumor, but increases the number of malignant tumors we fail to detect.

If we want to raise recall, so that as many patients with malignant tumors as possible receive further examination and diagnosis, we can use a threshold smaller than 0.5, such as 0.3.

We can plot recall against precision at different thresholds; the shape of the resulting curve varies with the data.
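The tradeoff can be made concrete by sweeping the threshold over a handful of hypothetical classifier scores (the `scores` and `labels` here are made up for illustration): as the threshold rises, precision tends to rise and recall falls.

```python
def prec_rec(preds, labels):
    """Precision and recall with the 0/0 := 0 convention."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    return (tp / (tp + fp) if tp + fp else 0.0,
            tp / (tp + fn) if tp + fn else 0.0)

scores = [0.1, 0.4, 0.35, 0.8, 0.95, 0.6]  # hypothetical h(x) outputs in [0, 1]
labels = [0,   0,   1,    1,   1,    0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    preds = [1 if s >= threshold else 0 for s in scores]
    p, r = prec_rec(preds, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Plotting the (recall, precision) pairs produced by such a sweep yields the precision-recall curve described above.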

We would like a method that helps us choose this threshold automatically. One approach is to compute the **F1 score**, defined as:

F1 Score = 2 * (P * R) / (P + R)

We choose the threshold that makes the F1 score highest.

**11.5** **Machine learning data**

In the previous video, we discussed evaluation metrics. In this video, I want to switch tracks a bit and discuss another important aspect of machine learning system design: how much data to use for training. In earlier videos I cautioned you not to blindly begin by spending lots of time collecting huge amounts of data, because more data only helps sometimes. But it turns out that under certain conditions, which I will spell out in this video, training a learning algorithm on a large amount of data is an effective way to obtain one with very good performance. When those conditions hold for your problem and you can get a lot of data, this can be an excellent route to a high-performance learning algorithm. So in this video, let us examine that question together.

Many years ago, two researchers I know, Michele Banko and Eric Brill, carried out an interesting study on distinguishing commonly confused words with machine learning algorithms. They tried many different algorithms and found that once the amount of data grew very large, all of these different algorithms performed well. What we want to explore next is when we should seek more data rather than modify the algorithm.

In general, first consider the question: "Given these features, could a human expert in the field confidently predict the outcome?" If the answer is yes, then think about our model. If the algorithm has low bias (for example, many parameters), its training error will be small, and increasing the amount of training data makes overfitting unlikely, which keeps the gap between the cross-validation error and the training error small. In this case, it is worth acquiring more data.

Put another way, we want our algorithm to have both low bias and low variance: we reduce bias by adding more features (a more expressive model), and reduce variance by increasing the amount of training data.

Ng, Lesson 11: Machine Learning System Design