Some problems needing attention in machine learning algorithms


To put machine learning to practical use, a superficial understanding is not enough; we need to dig deeper into the problems encountered in practice. Here I tidy up some of these scattered knowledge points.

1 Data imbalance issues

This problem is often encountered.

Take a supervised binary classification problem as an example. We need annotations for both positive and negative examples. Suppose the training data we get has very few positive examples and a great many negative examples; training a classifier on it directly will certainly not work.

The following approaches are usually used:

1.1 Data Set Angle

Resolve the data imbalance by adjusting the ratio of positive to negative samples in the data set:

1.1.1 Increase the number of positive samples

But positive samples are scarce, so how do we add more? The simplest method is to directly copy existing positive samples and throw the copies into the training set. This can slightly alleviate the shortage of positive samples, but it easily brings a new problem: the potential danger of overfitting, because such crude duplication adds no sample diversity to the dataset. There are techniques for deciding which positive samples to replicate, such as choosing ones that are representative in some particular sense.
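As an illustration, here is a minimal sketch of naive random oversampling using scikit-learn's resample; the toy data and the 5/95 split are made up purely for this example:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data (made-up numbers): 5 positives, 95 negatives.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([1] * 5 + [0] * 95)

X_pos, X_neg = X[y == 1], X[y == 0]

# Sample positive rows with replacement until they match the negatives in count.
X_pos_over = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_balanced = np.vstack([X_pos_over, X_neg])
y_balanced = np.concatenate([np.ones(len(X_pos_over)), np.zeros(len(X_neg))])
print(X_balanced.shape, y_balanced.mean())  # (190, 4) 0.5 -- classes now balanced
```

Note that the duplicated rows are identical copies, which is exactly why the overfitting risk mentioned above appears.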

1.1.2 Reducing the number of negative samples

This is a common and reasonable method, but reducing the negative samples inevitably loses data diversity. There is a way to alleviate this, similar in spirit to random forests: keep the positive samples fixed each time, randomly select an equal number of different negative samples, and train a model; repeat this several times to obtain multiple models, and finally let all the models vote to determine the classification result.
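A minimal sketch of this repeated-undersampling-plus-voting idea (an EasyEnsemble-style approach); logistic regression and 10 rounds are arbitrary choices here, not something prescribed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def undersample_ensemble_predict(X, y, X_test, n_models=10, seed=0):
    """Train several models, each on all positives plus an equally sized random
    subset of negatives, then take a majority vote over their predictions."""
    rng = np.random.RandomState(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    votes = np.zeros(len(X_test))
    for _ in range(n_models):
        sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, sampled_neg])
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        votes += model.predict(X_test)
    return (votes * 2 >= n_models).astype(int)  # positive if at least half vote 1
```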

1.2 Loss Function Angle

We can also change the loss function used for training the model, so that misclassifying a positive sample incurs a larger loss while misclassifying a negative sample incurs a smaller loss. A model trained this way will then make more reasonable inferences on both positive and negative samples.
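In practice, one common way to get this effect is class weighting, which scales each class's contribution to the loss. A minimal sketch with scikit-learn (the weights shown are illustrative, not from the original text):

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency, so mistakes on the
# rare positive class are penalized more heavily during training.
clf = LogisticRegression(class_weight="balanced")

# An explicit weighting works too, e.g. make positive-class errors cost 10x more:
# clf = LogisticRegression(class_weight={0: 1, 1: 10})
# clf.fit(X_train, y_train)
```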

For more information on this topic, please visit:
Experience in solving the problem of data imbalance in classification
The problem of data imbalance in machine learning

2 Outlier handling issues

Speaking of outliers, we first have to talk about data volume. An outlier is not a missing value, and it is not an error value; it behaves just like other data. The reason a data point looks anomalous is that the amount of data available to us is not large enough to accurately represent the full distribution it comes from. If the same outlier were placed in the context of a much larger amount of data, it would look far less unusual.

An excerpt from an expert's blog:

An outlier is not an error value; it behaves the same as other data, and it only feels abnormal because our data is not big enough. From the point of view of actual industry practice, considering computing power and effect, most companies will "de-noise" their big data, and what gets removed in the de-noising process is not only noise but also "anomalies". Yet it is precisely these "anomalies" that the broad coverage of big data smooths away, which is why using big data instead of small data more easily produces convergence. Especially for recommender systems, these "anomalous" observations are in fact the ultimate in "personalization".

Since we are on the topic of big data, here is another passage from the same expert:

To put it academically, big data can be seen as a powerful counterattack by the frequentist school against the Bayesian school.

With that in mind, it is worth asking whether there is hope for the Bayesian school to strike back: can prior information plus small data complete a counterattack against big data?

Some machine learning algorithms are very sensitive to outliers, for example K-means clustering and AdaBoost. Outliers must be handled before using such algorithms.


Other algorithms are by nature insensitive to outliers, such as KNN and random forests.

How do we deal with outliers? The simplest way is to simply throw them away. Other methods I will continue to study later.
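As a concrete example of "just throw them away", here is a minimal sketch of a simple interquartile-range (IQR) filter on one numeric column; the 1.5 multiplier is only a common rule of thumb, not something from the original text:

```python
import numpy as np

def drop_iqr_outliers(x, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x >= lo) & (x <= hi)]

x = np.array([1.0, 1.2, 0.9, 1.1, 50.0])  # 50.0 is the obvious outlier
print(drop_iqr_outliers(x))               # [1.  1.2 0.9 1.1]
```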

3 Overfitting problem

Overfitting is maddening: you finally manage to train a model, and then its classification results on test data are very poor. Reasons for overfitting:

    • Too little training data
    • The model is too complex
    • There are noise points in the training data (even if the training data is large enough)

Almost all machine learning algorithms can easily run into the overfitting problem.

So let's talk about some common approaches to dealing with overfitting. Of course, the first thing to ensure is that the training data is not too scarce.

3.1 Regularization

Regularization adds a penalty term to the model's optimization objective, so the model's optimization strategy changes from empirical risk minimization to structural risk minimization.

    • The regularization of linear regression gives ridge regression and Lasso regression, corresponding to L2 and L1 penalties respectively (see the sketch after this list).

    • The regularization of decision trees is pruning, usually with the number of nodes as the penalty.
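A minimal sketch of the L2 and L1 regularized versions of linear regression in scikit-learn; the alpha values below are arbitrary and only control the strength of the penalty:

```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)  # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1)  # L1 penalty: drives some coefficients exactly to zero
# ridge.fit(X_train, y_train)
# lasso.fit(X_train, y_train)
```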
3.2 Cross-validation

When there is enough data, cross-validation can be used to help avoid overfitting. It can even be done on top of regularization.
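A minimal sketch of k-fold cross-validation with scikit-learn; the iris data and 5 folds are just convenient defaults for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Each of the 5 folds is held out once while the model trains on the other 4.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # a stable mean across folds suggests no wild overfitting
```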

For other details, see:
Some reasons for over-fitting of machine learning

4 Feature engineering issues

There is a sentence that must be put up front: data and features determine the upper limit of machine learning, and models and algorithms only approach that upper limit.

This shows that feature engineering, especially feature selection, plays a very important role in machine learning.

4.1 What is feature engineering

First, here is an English definition:

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

In a word, feature engineering is manually designing what the input x's should be.

4.2 Why feature reduction and feature selection

This is mainly due to considerations such as:
1. The higher the feature dimension, the easier a model overfits, and more complex models suffer more from this.
2. The more mutually independent feature dimensions there are, the more training samples are needed to reach the same performance on the test set with the same model.
3. More features increase the overhead of training, testing, and storage.
4. In some models, for example distance-based models such as KMeans and KNN, high dimensionality hurts both accuracy and performance when distances are computed.
5. The need for visual analysis.

In low-dimensional cases, such as two or three dimensions, we can plot the data and analyze it visually; as the dimension increases, this becomes difficult.

In machine learning there is a very classic concept, the curse of dimensionality. It describes the various problems that arise when analyzing and organizing data in high-dimensional spaces, because volume grows exponentially with dimension. For example, 100 evenly spaced points can sample a unit interval with a distance of no more than 0.01 between adjacent points; when the dimension increases to 10, sampling a unit hypercube so that adjacent points are no more than 0.01 apart requires 10^20 sample points.
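The arithmetic behind that example, assuming a regular grid with 100 points per axis:

```python
points_per_axis = 100                 # spacing of 0.01 on the unit interval
dimensions = 10
print(points_per_axis ** dimensions)  # 100**10 = 10**20 grid points needed
```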

It is precisely because of these problems with high-dimensional features that we need to carry out feature dimensionality reduction, feature selection, and similar work.

4.3 Feature extraction

For high-dimensional features (hundreds of dimensions), such as image, text, or audio features, where no single dimension has an obvious meaning, it is best to reduce the dimensionality first. That is, extract the useful information from the raw data by dimensionality reduction: map the dataset from a high-dimensional space to a low-dimensional one while losing as little information as possible, or while keeping the reduced data points as easy to separate as possible.

In this way, salient features can be extracted, the curse of dimensionality can be avoided, and linear correlations between features can also be avoided.

Algorithms commonly used for feature dimensionality reduction include PCA, LDA, and so on.

PCA obtains the principal components of the data via eigenvalue decomposition of the covariance matrix. Take two-dimensional features as an example: there may be a linear relationship between the two features (e.g. the same movement speed recorded per second and per hour), which makes the second dimension's information redundant. The goal of PCA is to discover such linear relationships between features and remove them.

The LDA algorithm takes the labels into account, so that the reduced-dimension data points are as easy to distinguish as possible.
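A minimal sketch of both on the iris data (choosing 2 components is arbitrary; note that PCA ignores the labels while LDA uses them):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```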

4.4 Feature Selection

The usual situation is that the features are insufficient, in which case we need to dig for more features before designing the algorithm.

For logistic regression and decision trees, each feature dimension has a definite meaning, so we need to extract, from every angle, all available information related to the target as features.

This process can be rather painful.

Then again, are more features always better? Actually, no. Borrowing a picture as an example:

It can be seen that the model's accuracy initially rises as features are added, and after a certain point it levels off. Forcibly adding even more features beyond that brings no benefit and easily leads to overfitting.

Then, if overfitting has already occurred because there are too many features, the number of features needs to be reduced appropriately.

For logistic regression, if the parameter corresponding to some feature dimension is close to zero, that feature has little influence and can be removed. Therefore, our feature selection process generally looks like this:

    1. Select as many features as possible and reduce dimensions first if necessary
    2. Select features, retaining the most representative ones

While doing this, keep observing how the model's accuracy changes.

Finally, what algorithms are there for feature selection?
-Filter methods: evaluate each feature on its own and select the most effective ones. Examples: chi-squared test, information gain, correlation coefficient scores.
-Wrapper methods: treat the choice of a feature subset as a search problem in the feature space. Examples: random hill climbing, heuristic search methods, and so on.
-Embedded methods: embed feature selection into the model training process; in effect this is a regularization method.

For example, Lasso regression, ridge regression, elastic net, and so on.
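A minimal sketch of one method from each family, using scikit-learn and the iris data; the particular estimators and the k/alpha values are only illustrative (for the wrapper family, recursive feature elimination is used here in place of the hill-climbing search mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import Lasso, LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: score each feature on its own (chi-squared here) and keep the best k.
X_filter = SelectKBest(chi2, k=2).fit_transform(X, y)

# Wrapper: search over feature subsets by repeatedly fitting a model.
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit_transform(X, y)

# Embedded: selection happens inside training, via the L1 penalty of Lasso.
X_embedded = SelectFromModel(Lasso(alpha=0.1)).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```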

Other details will be supplemented later.

Recommended, a technical report from Meituan:
Review of data cleansing and feature processing in machine learning
Another article:
The problem of feature selection in machine learning
And one last good article on feature selection:
An Introduction to Feature Selection
