Review of data cleansing and feature processing in machine learning

Source: Internet
Author: User

As Meituan's transaction volume has grown, the company has accumulated more and more business and transaction data, and this data is the most valuable asset Meituan holds as a group-buying platform. Analyzing and mining it not only supports decisions about the direction of Meituan's business but also points the way for business iteration. Machine learning and data mining are now widely used in Meituan's group-buying system, for example in personalized recommendation, ranking, search, and user modeling, and they create substantial value for the company.

This article mainly describes the data cleaning and feature mining methods used in practice by Meituan's recommendation and personalization team. The main content was previously presented in the internal open course "Machine Learning in Action Series"; this blog post refines and summarizes that lecture.


The accompanying figure shows a classic machine learning framework. Data cleaning and feature mining correspond to the first two steps inside the gray box of the pipeline "data cleaning, feature and labeled-data generation, model learning, model application".

The blue arrows in the gray box correspond to the offline processing part. The main tasks are:

1. Extract feature data and labeled data from raw data such as text, images, or application data.

2. Process the cleaned features and labeled data, for example sample sampling, sample weighting, outlier removal, feature normalization, feature transformation, and feature combination. The resulting data is used mainly for model training.

The green arrows in the gray box correspond to the online processing part. It differs from offline processing in two main ways: 1. There is no need to clean labeled data; only feature data is processed, and the online model uses the features to predict the likely label of a sample. 2. The data serves a different purpose: it is used mainly for model prediction rather than training.

In the offline part we can run more experiments and iterations, trying different sampling schemes, sample weights, feature processing methods, and feature combinations, until we arrive at the best approach. Once offline evaluation shows good results, the final scheme is put into use online.

In addition, because the online and offline environments differ, the ways data is stored and retrieved differ considerably. Offline, for example, data can be stored in Hadoop and processed in batches, and a certain amount of failure can be tolerated. Online, data retrieval must be stable and low-latency, so the data can be built into an index or kept in a KV store. These points are described in detail in the corresponding sections below.

This article uses click and order rate prediction as a running example of how to clean data and process features. First, a word about the task. Its business goal is to improve the experience of group-buying users by helping them find the deals they want to buy faster and better. This goal is rather abstract, so we need to turn it into a technical goal that is easy to measure and implement. The technical goal we settled on is click and order rate estimation: predicting the probability that a user clicks on or purchases a deal. Deals with a high predicted click or order rate are ranked near the top. The more accurate the prediction, the more clicks and orders the top-ranked deals receive, which saves users from paging repeatedly and lets them find what they want quickly. Offline we measure ranking quality with the commonly used AUC metric; online we use A/B tests to measure the algorithm's effect on order rate, user conversion rate, and other metrics.
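As a rough illustration of the offline metric mentioned above, AUC can be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The sketch below computes it by direct pair counting. The function name `auc` and the toy labels and scores are made up for this example, and the quadratic loop is only meant for small inputs.

```python
def auc(labels, scores):
    """AUC by pair counting: fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical data: 1 = the user clicked or ordered, 0 = no action.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of positives above negatives.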

Feature usage Scenarios

Once the goal is set, the next step is to decide which data to use to reach it. We first need to work out which feature data might be related to whether a user clicks on or orders a deal. We can draw on business experience, and we can also use feature selection and feature analysis methods, which are described in detail later. Judging from business experience, the factors that may affect whether a user clicks or orders include:

1. Distance. This is clearly a very important feature. If a deal is far from the user, the user pays a high cost to redeem it. That is not to say users never buy distant deals, but such purchases are a small share.

2. User historical behavior. Returning users may have purchased or clicked on Meituan before.

3. The user's real-time interest.

4. Deal quality. The features above are relatively easy to measure; deal quality may be a more complex feature.

5. Popularity, such as the number of user reviews and the number of purchases.

After determining what data to use, we need to evaluate the availability of the data, including the difficulty of obtaining data, the size of the data, the accuracy of the data, the coverage of the data, etc.

1. Difficulty of data acquisition

For example, a user ID is easy to obtain, but a user's age and gender are harder, because they are not required fields at registration or purchase. Even when they are filled in, they are not fully accurate. Such features may have to be predicted by an additional model, which brings its own accuracy problems.

2. Data coverage

Coverage is also an important consideration. Take the distance feature: we cannot obtain a distance for every user. There is no distance on the PC side, and many users do not allow their location to be used.

User historical behavior exists only for returning users.

User real-time behavior is empty when the user has just opened the app, so it also faces a cold-start problem.

3. Data accuracy

Deal quality, user gender, and similar features all have accuracy problems.

Feature acquisition Scenarios

After choosing the features to use, we need to consider where the data can be obtained. Only data we can actually acquire is usable; proposing a feature that cannot be acquired is pointless. Feature acquisition schemes are described below.

1. Offline feature acquisition scheme

Offline, we can use massive amounts of data, relying on distributed file storage such as HDFS and on processing tools such as MapReduce and Spark.

2. Online feature acquisition scheme

Online features are sensitive to data-retrieval latency. Because this is an online service, the data must be fetched within a very short time, which places high demands on lookup performance; the data can be stored in an index, a KV store, and so on. Lookup performance and data volume pull against each other, so a compromise is needed. We use a layered approach to feature acquisition, as shown in the accompanying figure.

For performance reasons, the coarse-ranking stage uses more basic features that are built directly into the index. The fine-ranking stage then adds personalized features and so on.

Features and labeling data cleaning

Once we know where the feature data lives and how to obtain it, the next step is to consider how to process the features and the labeled data. The following three sections cover the main methods for processing features and labels.

Labeling Data Cleaning

First, we describe how to clean feature data. The methods fall into two kinds: offline cleaning and online cleaning.

Clean data offline

The advantage of offline cleaning is that the effect of a new feature is easy to evaluate. The disadvantage is poor timeliness: the cleaned data deviates somewhat from the live online environment, and it is hard to learn suitable weights for real-time features.

Cleaning Data Online

The advantage of online cleaning is strong timeliness: it records the real online data in full. The disadvantage is that a newly added feature needs some time to accumulate data.

Sample sampling and sample filtering

Feature data can be used for model training only after it is merged with the labeled data. The following describes how to clean the labeled data, which mainly involves data sampling and sample filtering.

Data sampling: for classification problems, this means selecting positive and negative examples; for regression problems, it means collecting the data. Sampled examples should be given sample weights as needed. When the model cannot be trained on all the data, the data must be sampled at some sampling rate. Sampling methods include random sampling, fixed-ratio sampling, and others.
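One way to combine a sampling rate with sample weights is sketched below: keep every positive example, keep negatives at a fixed rate, and give each kept negative a weight that compensates for the ones dropped. The function name `downsample_negatives`, the 1 / rate weighting, and the toy data are assumptions made for this illustration, not a description of the team's actual pipeline.

```python
import random


def downsample_negatives(samples, rate, seed=0):
    """Keep every positive sample and a random `rate` fraction of negatives.

    Each kept negative gets weight 1 / rate so that the weighted negative
    count stays roughly equal to the original count.
    `samples` is a list of (features, label) pairs with label 0 or 1.
    Returns a list of (features, label, weight) triples.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    out = []
    for features, label in samples:
        if label == 1:
            out.append((features, label, 1.0))
        elif rng.random() < rate:
            out.append((features, label, 1.0 / rate))
    return out


# Hypothetical log: 5 positives and 95 negatives.
samples = [({"id": i}, 1 if i < 5 else 0) for i in range(100)]
sampled = downsample_negatives(samples, rate=0.2)
print(len(sampled))
```

Whether the weights are needed depends on the model and the metric; they matter most when predicted probabilities have to stay calibrated against the unsampled data.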

In addition to sampling, samples often need to be filtered, including:

1. Data filtering: remove data such as crawler traffic, spam, and cheating, based on business conditions.

2. Anomaly detection: analyze the samples with an anomaly detection algorithm. Common anomaly detection algorithms include:

Deviation detection, such as clustering and nearest-neighbor methods.

Statistics-based anomaly detection

Examples are the range, interquartile range, mean deviation, and standard deviation. This approach suits single-variable numerical data. The range is the difference between the maximum and minimum values and is a statistical measure of variation. The interquartile range is typically used to build box plots and to give a brief overview of a probability distribution.

Distance-based anomaly detection detects anomalies by distance: a point whose distance to most points in the data set exceeds a threshold is treated as an anomaly. The main distance measures are absolute distance (Manhattan distance), Euclidean distance, and Mahalanobis distance.

Density-based anomaly detection examines the density around the current point and can detect local anomalies; the LOF algorithm is an example.
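As a small illustration of the statistics-based approach, the sketch below flags single-variable outliers with the interquartile range (the quartile spacing mentioned above). The 1.5 x IQR fence is the usual box-plot convention and is an assumption here, as are the function name `iqr_outliers` and the sample values.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical per-user click counts; 300 looks like crawler traffic.
clicks = [10, 12, 11, 13, 12, 11, 10, 300]
print(iqr_outliers(clicks))
```

The multiplier `k` controls how aggressive the filter is; a larger value keeps more of the tail.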

Feature classification

Having covered how features and labels are cleaned, we now turn to feature processing. We first classify features, because different kinds of features call for different treatment.

Depending on the criterion, features can be divided into (1) low-level and high-level features, (2) stable and dynamic features, and (3) binary, continuous, and enumerated features.

Low-level features are the more basic ones: mostly raw features that need little or no manual processing, such as word-vector features in text, pixels in images, user IDs, and item IDs. Low-level features are generally high-dimensional, so they should not be used with overly complex models. High-level features are obtained through more complex processing that combines business logic, rules, or model output, such as manual scores and model scores, and can be used in more complex nonlinear models. Low-level features are more targeted but have small coverage. Predictions for long-tail samples are driven mainly by high-level features, while predictions for high-frequency samples are driven mainly by low-level features.

Stable features change (are updated) infrequently, for example the average review score or the deal price, which do not change over a fairly long period. Dynamic features are updated frequently, and some are even computed in real time, such as distance or sales in the last 2 hours. The two kinds can also be called non-real-time and real-time features. The distinction lets us design storage and update strategies for each: stable features can be built into the index and updated at long intervals, and if they are cached the cache lifetime can be long. Dynamic features must be updated in real time or near real time, and if they are cached the expiration time must be short.

Binary features are 0/1 features: the feature takes only two values, 0 or 1. Examples are a user ID feature (whether the current ID is a specific ID) and a word-vector feature (whether a particular word appears in the article). Continuous features take rational-number values, and the number of possible values is unbounded; for example, a distance feature can take any value from 0 to positive infinity. Enumerated features have a fixed number of possible values; for example, the day of the week has only 7: Monday, Tuesday, ..., Sunday. In practice we may convert between types, for example turning an enumerated or continuous feature into binary features. To turn an enumerated feature into binary features, map it to multiple features, one per enumeration value: the day of the week becomes 7 binary features (whether today is Monday, whether today is Tuesday, ..., whether today is Sunday). To turn a continuous feature into binary features, first discretize it (discretization is described later), then split the discretized feature into N binary features, each indicating whether the value falls in a given interval.
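The two conversions described above can be sketched as follows. The names `one_hot` and `interval_one_hot`, the day abbreviations, and the distance cut points are illustrative assumptions.

```python
import bisect

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot(value, categories):
    """Map an enumerated value to len(categories) binary features."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def interval_one_hot(x, boundaries):
    """Map a continuous value to binary 'falls in this interval' features.

    `boundaries` are sorted inner cut points; k cut points give k + 1
    intervals: (-inf, b0), [b0, b1), ..., [b_{k-1}, +inf).
    """
    index = bisect.bisect_right(boundaries, x)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]


print(one_hot("Wed", DAYS))                     # 7 binary features, one per weekday
print(interval_one_hot(250, [100, 300, 500]))   # hypothetical distance cut points
```

Exactly one output feature is 1 in each case, so a linear model can learn a separate weight per day or per interval.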

Feature processing and analysis

After classifying the features, we describe the commonly used feature processing methods: 1. feature normalization, discretization, and default-value handling; 2. feature dimensionality reduction; 3. feature selection.

Feature normalization, discretization, default value processing

These methods are used mainly for processing individual features.

Normalization

Different features have different value ranges. In some algorithms, such as linear models and distance-based models like clustering and KNN, the value range of a feature strongly affects the final result. For example, a binary feature takes values in [0,1], while a distance feature may take values in [0, positive infinity); in practice the distance is truncated, for example to [0,3000000]. Because the ranges are so different, the model may lean toward the feature with the larger range. To balance features with inconsistent ranges, they need to be normalized so that their values fall in the [0,1] interval. Common normalization methods include: 1. Function normalization: a mapping function maps feature values into [0,1]. Min-max normalization is a linear mapping of this kind; nonlinear mappings, such as the log function, are also used. 2. Per-dimension normalization: min-max normalization can still be used, but the maximum and minimum are those of the category the sample belongs to, that is, local rather than global extremes. 3. Rank normalization: whatever the original values are, the feature values are sorted by size and each is given a new value according to its rank.
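The normalization variants listed above might look like this in outline. All function names are made up for the example; the log variant assumes non-negative inputs, and the rank variant uses dense ranks, which is one of several reasonable choices.

```python
import math


def min_max(values):
    """Linear mapping of values onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def log_min_max(values):
    """Nonlinear mapping: compress with log(1 + v), then min-max scale.

    Assumes v >= 0, as for a distance.
    """
    return min_max([math.log1p(v) for v in values])


def per_group_min_max(values, groups):
    """Min-max scaling that uses the min and max of each value's own group."""
    bounds = {}
    for v, g in zip(values, groups):
        lo, hi = bounds.get(g, (v, v))
        bounds[g] = (min(lo, v), max(hi, v))
    out = []
    for v, g in zip(values, groups):
        lo, hi = bounds[g]
        out.append(0.0 if hi == lo else (v - lo) / (hi - lo))
    return out


def rank_normalize(values):
    """Replace each value by its rank scaled to [0, 1]; equal values share a rank."""
    order = sorted(set(values))
    rank = {v: i for i, v in enumerate(order)}
    denom = max(len(order) - 1, 1)
    return [rank[v] / denom for v in values]


distances = [0, 100, 1000, 3000000]  # hypothetical distances
print(min_max(distances))
print(log_min_max(distances))
print(rank_normalize(distances))
```

On skewed data such as these distances, the linear mapping squeezes almost everything near 0, while the log and rank mappings spread the values more evenly.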

Discretization

The value space of the continuous features described above may be infinite, so for ease of representation and processing in the model, continuous features need to be discretized. Common discretization methods include equal-width division and equal-frequency division. Equal-width division splits the value range of the feature evenly, and the values within each segment are treated alike. For example, if a feature ranges over [0,10], we can divide it into 10 segments: [0,1), [1,2), ..., [9,10]. Equal-frequency division splits evenly by the total number of samples, so that each segment holds the same number of samples. Take the distance feature with range [0,3000000], which we want to cut into 10 segments. With equal-width division, most of the samples fall into the first segment. Equal-frequency division avoids this problem; the final split might be [0,100), [100,300), [300,500), ..., [10000,3000000], with narrow intervals at the front and wide ones at the back.
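The difference between the two discretization schemes described above (splitting by value range versus splitting by sample count) shows up clearly on skewed data. The sketch below uses made-up distances and helper names.

```python
import bisect


def equal_width_edges(values, n_bins):
    """Inner cut points that split [min, max] into n_bins equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Inner cut points that put roughly the same number of samples in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, edges):
    """Count samples per bin; bin i holds values in [edges[i-1], edges[i])."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts


# Hypothetical, heavily skewed distances.
distances = [50, 80, 120, 200, 260, 350, 400, 900, 5000, 3000000]
print(bin_counts(distances, equal_width_edges(distances, 5)))
print(bin_counts(distances, equal_frequency_edges(distances, 5)))
```

With equal-width cut points nearly all samples land in the first bin, while the equal-frequency cut points give every bin the same share of samples.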

Default value Handling

Some features may have missing values because they could not be sampled or were not observed. For the distance feature, for example, the user may have disabled location access or the location lookup may have failed. Such cases need special handling: the feature is given a default value. There are several ways to choose it, such as a separate dedicated value, the mode, or the mean.
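The three default-value choices mentioned above can be sketched as one helper. The function name `fill_missing`, the use of `None` to mark a missing value, and the sentinel value of -1.0 are assumptions for illustration.

```python
import statistics


def fill_missing(values, strategy="mean", sentinel=-1.0):
    """Replace None entries using one of three simple strategies.

    "sentinel": a dedicated value that represents "missing" on its own
    "mode":     the most common observed value
    "mean":     the average of the observed values
    """
    if strategy not in ("sentinel", "mode", "mean"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    if strategy == "sentinel":
        default = sentinel
    else:
        if not observed:
            raise ValueError("no observed values to compute a default from")
        if strategy == "mode":
            default = statistics.mode(observed)
        else:
            default = statistics.fmean(observed)
    return [default if v is None else v for v in values]


# Hypothetical distances; None means the location was unavailable.
distances = [100.0, None, 300.0, 100.0, None]
print(fill_missing(distances, "sentinel"))
print(fill_missing(distances, "mode"))
print(fill_missing(distances, "mean"))
```

A dedicated sentinel keeps "missing" distinguishable from real values, which matters when the absence of a feature is itself informative.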

Feature Dimension reduction

Before introducing feature dimensionality reduction, we first introduce raising the feature dimension. Machine learning has the theory of the VC dimension. According to it, the higher the VC dimension, the stronger the shattering ability and the higher the model complexity that can be tolerated. Data that cannot be separated in a low-dimensional space may become separable once it is mapped to a higher-dimensional space. Think about how the human brain classifies a pile of objects: it looks for features of the objects, such as color, shape, size, and texture, and then sorts the objects by these features. This is in fact a process of first raising the dimension and then dividing. Take recognizing a banana. We may first notice that a banana is yellow; that is a cut along the color dimension. But many things are yellow, such as Hami melons. How do we tell a banana from a Hami melon? We notice that a banana is curved while a Hami melon is round, so we can use shape to separate them. In other words, we introduce a new dimension, shape. This is an example of raising a one-dimensional "color" feature to two dimensions.

This raises a question: if raising the dimension makes the model stronger, is a higher feature dimension always better? Why do we need feature dimensionality reduction and feature selection? Mainly for the following reasons: 1. The higher the feature dimension, the more easily the model overfits, and a more complex model does not help at that point. 2. The more mutually independent feature dimensions there are, the more training samples are needed to reach the same performance on the test set with the model unchanged. 3. The cost of training, testing, and storage grows with the number of features. 4. In some models, such as distance-based models like KMeans and KNN, too high a dimension hurts both the accuracy and the performance of the distance computation. 5. Visualization. In low dimensions, such as two or three, we can plot and inspect the data; as the dimension grows, plotting becomes difficult. Machine learning has a classic concept for this, the curse of dimensionality. It describes the problems encountered when analyzing and organizing high-dimensional spaces, which arise because volume grows exponentially with dimension. For example, 100 evenly spaced points are enough to sample a unit interval so that adjacent points are no more than 0.01 apart; in 10 dimensions, sampling the unit hypercube on a grid with spacing no more than 0.01 requires 10^20 sample points.
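The sample count in the example above follows from a simple grid argument. As a sketch, assume a regular grid with spacing 0.01 along each axis, and let N(d) (a symbol introduced here for illustration) be the number of grid points needed in d dimensions:

```latex
\[
N(d) = \left(\frac{1}{0.01}\right)^{d} = 100^{d},
\qquad
N(1) = 100,
\qquad
N(10) = 100^{10} = 10^{20}.
\]
```

Each added dimension multiplies the required number of samples by 100, which is why the count grows exponentially with the dimension.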

Because high-dimensional features bring the problems described above, we need feature dimensionality reduction and feature selection. Common dimensionality reduction algorithms include PCA and LDA. The goal of dimensionality reduction is to map a data set from a high-dimensional space to a low-dimensional one while losing as little information as possible, or while keeping the reduced data points as easy to separate as possible.

PCA algorithm

The principal components of the data are obtained by eigenvalue decomposition of the covariance matrix. Take two-dimensional features as an example: the two features may be linearly related (for example, speed per hour and speed per second), which makes the information in the second dimension redundant. The goal of PCA is to find such linear relationships between features and remove them.

LDA algorithm

LDA takes the labels into account and makes the data points as easy to separate as possible after dimensionality reduction.
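For the two-dimensional case the eigenvalue decomposition can be written in closed form, which makes the redundancy easy to see. The sketch below is a toy, not a general PCA implementation; the function name `pca_2d` and the speed values are assumptions.

```python
import math


def pca_2d(points):
    """PCA for 2-D points via eigen-decomposition of the 2x2 covariance matrix.

    Returns (eigenvalues, first_component): eigenvalues in descending order
    and the unit-length direction of largest variance.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector for lam1 is (b, lam1 - a) unless the matrix is already diagonal.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Hypothetical data: the same speeds in km/h and in m/s (km/h = 3.6 * m/s).
kmh = [36, 54, 72, 90, 108]
ms = [10, 15, 20, 25, 30]
eigenvalues, component = pca_2d(list(zip(kmh, ms)))
print(eigenvalues, component)
```

Because the two columns are perfectly linearly related, the second eigenvalue is zero up to rounding: all the variance lies along the first component, and the second dimension can be dropped without losing information.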

Feature Selection

The goal of feature selection is to find the optimal feature subset. Feature selection removes irrelevant or redundant features, which reduces the number of features, improves model accuracy, and shortens running time. In addition, selecting the truly relevant features simplifies the model and helps us understand the process that generated the data.

The general process for feature selection is as follows:

It is mainly divided into the generation procedure, the evaluation procedure, the stopping criterion, and the validation procedure.

Feature selection: generation procedure and methods for generating feature subsets

Complete search (Complete)

Breadth-first search (Breadth First Search)

Traverses the feature subspace breadth-first. It enumerates all combinations, an exhaustive search that is rarely practical.

Branch and bound search (Branch and Bound)

Adds branch pruning to exhaustive search, for example cutting off branches that cannot lead to a solution better than the current best.

Others, such as beam search (Beam Search) and best-first search (Best First Search).

Heuristic search (Heuristic)

Sequential forward selection (SFS, Sequential Forward Selection)

Starts from the empty set and adds the single best feature each time.

Sequential backward selection (SBS, Sequential Backward Selection)

Starts from the full set and removes one feature each time, choosing the removal that gives the best result.

Plus-L minus-R selection (LRS, Plus-L Minus-R Selection)

Starts from the empty set, adds L features and removes R each round, keeping the best result (L > R); or starts from the full set, removes R features and adds L each round, keeping the best result (L < R).

Others, such as bidirectional search (BDS, Bidirectional Search) and sequential floating selection (Sequential Floating Selection).
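Of the heuristic methods above, sequential forward selection is the simplest to sketch. The version below takes the scoring function as a parameter; in practice it would train and evaluate a model on the candidate subset. The stopping rule (stop when no single addition improves the score), the function names, and the toy gains are all assumptions for this illustration.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedy SFS: start from the empty set, add the best single feature each round.

    `score` is a caller-supplied function mapping a list of feature names to
    a number (higher is better). Stops when no candidate improves the score
    or when max_features is reached.
    """
    selected = []
    best_score = score(selected)
    remaining = list(features)
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda t: t[0])
        if top_score <= best_score:
            break  # stopping criterion: no single addition helps
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical scoring function standing in for "train a model and evaluate it".
GAIN = {"distance": 0.30, "history": 0.20, "popularity": 0.10, "noise": 0.0}


def toy_score(subset):
    # Each feature contributes its gain; every feature costs 0.01 in complexity.
    return sum(GAIN[f] for f in subset) - 0.01 * len(subset)


selected, best = sequential_forward_selection(list(GAIN), toy_score)
print(selected)
```

Sequential backward selection follows the same pattern in reverse, starting from the full set and removing one feature per round. Both are greedy, so neither is guaranteed to find the best subset.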

Random search (Random)

Random generation plus sequential selection (RGSS, Random Generation plus Sequential Selection)

Randomly generates a feature subset and then runs the SFS and SBS algorithms on that subset.

Simulated annealing (SA, Simulated Annealing)

Accepts a solution worse than the current one with a certain probability, and this probability gradually decreases over time.

Genetic algorithm (GA, Genetic Algorithms)

Breeds the next generation of feature subsets through crossover and mutation. The higher a subset's score, the more likely it is to be selected for reproduction.

The common drawback of random algorithms is that they depend on random factors, so experimental results are hard to reproduce.

Feature Selection-Validity analysis

Validity analysis assigns a weight to each feature. Depending on whether the weight is tied to a model, the methods fall into two groups: 1. Model-dependent feature weights. Train the model on all the feature data and look at each feature's weight in the model. Because a model must be trained, the weights are tied to the model that was learned, and different models measure weight differently; in a linear model, for example, it is the feature's weight coefficient. 2. Model-independent feature weights. These analyze the relationship between a feature and the label, independently of the model used. Model-independent methods include (1) cross-entropy, (2) information gain, (3) odds ratio, (4) mutual information, and (5) KL divergence.
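As one example of a model-independent weight, the information gain of a discrete feature with respect to the label can be computed directly from counts. The helper names and the toy features below are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(label) - H(label | feature) for a discrete feature."""
    total = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


# Hypothetical samples: label 1 = ordered, 0 = did not order.
labels = [1, 1, 0, 0]
is_nearby = [1, 1, 0, 0]    # matches the label exactly
is_weekend = [1, 0, 1, 0]   # carries no information about the label
print(information_gain(is_nearby, labels))
print(information_gain(is_weekend, labels))
```

A feature that determines the label recovers the full label entropy, while a feature that is independent of the label scores zero. Continuous features would need to be discretized first.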

Feature monitoring

In machine learning tasks, features are very important.

In the author's experience, 80% of the effect comes from features. The accompanying figure shows how the correlation coefficient between the final model's predictions and the actual values changes as the number of features increases.

Important features should be monitored and their validity analyzed, so that we know whether the features the model relies on have problems. When a particularly important feature fails, a contingency plan must be in place to prevent disastrous results. A long-term monitoring mechanism for feature validity needs to be established.

We monitor key features; the accompanying figure shows one of our feature monitoring pages. Through monitoring we found a feature whose coverage was falling day after day. After contacting the provider of the feature data, we learned that the provider's data source had a problem. Once it was fixed, the feature returned to normal and its coverage rose sharply.

When we find a feature behaving abnormally, we promptly degrade the service and contact the provider of the feature data to get it fixed as soon as possible. Where the process that generates the feature data lacks monitoring, we also push for monitoring to be added so that problems are solved at the source.

Machine Learning in Action lecture series: drawing on Meituan's machine learning practice, we are publishing an introductory "in action" series (5 articles carrying the "Machine Learning in Action Series" label) that covers the basic techniques, experience, and tricks needed to solve real problems with machine learning. This article covers data cleaning and feature processing; the other four cover the machine learning problem-solving process, model training, model optimization, and so on.
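A coverage check of the kind described above can be very small. The sketch below computes the share of samples in which a feature is present and flags days on which that share falls sharply. The function names, the use of `None` for a missing value, and the 0.05 threshold are hypothetical; the monitoring system described in this article is not shown in the source.

```python
def coverage(values):
    """Share of samples in which the feature is present (not None)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def coverage_alerts(daily_coverage, max_drop=0.05):
    """Indexes of days whose coverage fell more than `max_drop`
    below the previous day's coverage."""
    return [
        day
        for day in range(1, len(daily_coverage))
        if daily_coverage[day - 1] - daily_coverage[day] > max_drop
    ]


# Hypothetical daily coverage of one feature.
daily = [0.92, 0.91, 0.80, 0.65, 0.66]
print(coverage_alerts(daily))
```

A day-over-day comparison catches sudden breaks; a slow decline would also need a comparison against a longer baseline.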
