Data cleaning and feature processing in machine learning, based on Meituan's click-to-order rate prediction


This article introduces the data cleaning and feature mining methods used in practice by Meituan's recommendation and personalization team, illustrated throughout with a concrete example: click-to-order rate prediction.

Machine learning and data mining techniques are now widely applied in Meituan's group-buying system, for example in personalized recommendation, filtered ranking, search ranking, user modeling, and so on. This article focuses on the data cleaning and feature mining methods used in practice by Meituan's recommendation and personalization team.

Overview

Machine Learning Framework

The figure above shows a classic framework for a machine learning problem. Data cleaning and feature mining correspond to the first two steps framed in the gray box, i.e. the flow "data cleaning => feature and label data generation => model learning => model application".

The blue arrows in the gray box correspond to the offline processing part. The main work is:

Extracting feature and label data from raw data such as text, images, or application logs.

Processing the cleaned feature and label data, e.g. sample sampling, sample weight adjustment, outlier removal, feature normalization, feature transformation, feature combination, and so on. The resulting data is mainly used for model training.

The green arrows in the gray box correspond to the online processing part. The work is similar to offline processing, with two main differences:

There is no label data to clean; only feature data needs to be processed, and the online model uses the feature data to predict the likely label of each sample.

The purpose of the generated data is different: it is used mainly for model prediction, not for training.

In the offline part we can run more experiments and iterations, trying different sampling schemes, sample weights, feature processing methods, feature combinations, and so on, until we arrive at a good solution; once offline evaluation shows good results, the final scheme is deployed online.

In addition, because the online and offline environments differ, the ways data is stored and retrieved also differ significantly. Offline, for example, data can be stored in Hadoop and processed analytically in batches, and a certain failure rate can be tolerated. Online services, by contrast, need stable, low-latency access, so the data is built into indexes or loaded into KV storage systems. The corresponding sections below describe this in more detail.
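As an illustration of the online side only, here is a minimal sketch in Python in which a plain dict stands in for the real index or KV storage system; the keys, feature names, and fallback behavior are all hypothetical:

    # Features are assumed to have been exported by offline batch jobs
    # (e.g. Hadoop/Spark) into a KV store; a dict stands in for it here.
    feature_store = {
        "deal:123": {"avg_rating": 4.6, "price": 35.0},   # stable, indexed features
        "user:42":  {"recent_clicks": 3},                 # real-time features, short TTL
    }

    def get_features(key, default=None):
        # Online serving needs stable, low-latency reads; fall back to a
        # default when the key is missing rather than failing the request.
        return feature_store.get(key, default or {})

    print(get_features("deal:123"))
    print(get_features("user:999"))   # missing key -> empty default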

In this article we take click-to-order rate prediction as the example to show how data cleaning and feature processing are done. First, let us define the task. The business goal is to improve the group-buying user experience and help users find the deals they want to buy faster and better. This goal may sound abstract, so we need to translate it into a technical goal that is easy to measure and implement. The resulting technical goal is click-to-order rate estimation: predicting the probability that a user will click on or purchase a deal. Deals with higher predicted click or order rates are ranked further forward; the more accurate the prediction, the earlier in the ranked list the user's clicks and orders occur, which saves the user the overhead of paging again and again to find the right deal. Offline, we use the common AUC metric to evaluate results; online, we run A/B tests to measure the algorithm's impact on order rate, user conversion rate, and other metrics.
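The offline AUC evaluation mentioned above takes only a few lines. A minimal sketch in Python using scikit-learn; the labels and predicted probabilities are made-up illustrative values:

    from sklearn.metrics import roc_auc_score

    y_true  = [1, 0, 0, 1, 0, 1]              # 1 = the user clicked/ordered the deal
    y_score = [0.8, 0.3, 0.4, 0.7, 0.2, 0.9]  # model-predicted click/order probabilities

    print("offline AUC:", roc_auc_score(y_true, y_score))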

Feature usage Scenarios

Once the goal is identified, the next step is to decide what data to use to achieve it. We need to work out in advance which feature data may be related to whether a user clicks and orders. We can draw on business experience, and we can also use feature selection, feature analysis, and other methods to help us choose; these methods are detailed later. From business experience, factors that may affect whether a user clicks and orders a deal include:

Distance. This is obviously a very important feature: if a deal is far away from the user, the user has to pay a high cost to consume it. It is not that users never buy deals that are far away, but the proportion is relatively small.

User historical behavior. Returning users may have previously purchased or clicked deals on Meituan, and so on.

Real-time user interest.

Deal quality. The features above are relatively easy to measure; deal quality may be a more complex feature.

Popularity: whether the deal is hot, the number of user reviews, the number of purchases, and so on.

After deciding what data to use, we need to assess its usability, including how hard it is to obtain, its size, its accuracy, its coverage, and so on.

Data Access Difficulty

For example, obtaining a user ID is easy, but obtaining the user's age and gender is hard, because these fields are not required at registration or purchase time, and even when they are filled in they are not necessarily accurate. Such attributes may have to be predicted by additional models, which introduces the question of model accuracy.

Data coverage

Data coverage is also an important consideration. For example, the distance feature is not available for all users: there is no distance on the PC side, and many users block access to their geographic information.

User historical behavior: only returning users have any.

Real-time user behavior: if the user has just opened the app there is no behavior yet, which is also a cold-start problem.

Data accuracy

Deal quality and user gender, for example, both have accuracy problems.

Feature acquisition Scheme

After selecting the features we want to use, we need to consider where this data comes from. We can only use a feature if we can actually obtain the data; proposing a feature that cannot be obtained is pointless. The feature acquisition scheme is described below.

Off-line feature acquisition scheme

Offline, we can use large amounts of data with the help of distributed file storage platforms such as HDFS, and process it with tools such as MapReduce and Spark.

On-line feature acquisition scheme

Online feature acquisition cares more about latency: because it serves online requests, the data must be fetched in a very short time, so lookup performance requirements are high, and the data can be stored in indexes, KV stores, and so on. There is a tension between lookup performance and data volume, so we need a compromise; we use a layered feature approach, as shown in the following illustration.

Service architecture

For performance reasons, the coarse-ranking stage uses only relatively basic features that are built directly into the index, while the fine-ranking stage then uses more personalized features, as sketched below.
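A minimal sketch in Python of this coarse-rank / fine-rank split; rank_deals, fine_score, base_score, and affinity are all hypothetical names used only for illustration:

    def fine_score(user, deal):
        # Fine ranking: adds a personalized term on top of the basic score.
        return deal["base_score"] + 0.5 * user.get("affinity", {}).get(deal["category"], 0.0)

    def rank_deals(user, candidate_deals, top_k=200):
        # Coarse ranking: cheap, index-backed features only (e.g. sales, rating).
        coarse = sorted(candidate_deals, key=lambda d: d["base_score"], reverse=True)[:top_k]
        # Fine ranking: fetch per-user features (distance, real-time behavior)
        # only for the surviving candidates, keeping expensive lookups small.
        return sorted(coarse, key=lambda d: fine_score(user, d), reverse=True)

    deals = [{"base_score": 0.6, "category": "food"}, {"base_score": 0.8, "category": "movie"}]
    user = {"affinity": {"food": 0.9}}
    print(rank_deals(user, deals, top_k=2))   # personalization moves the food deal first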

Feature and label data cleaning

Once we know where the feature data lives and how to obtain it, the next step is to consider how to process the feature and label data. The following three sections cover feature and label processing methods.

1. Label data cleaning

First, we describe how to clean feature data; the cleaning methods can be divided into offline cleaning and online cleaning.

Offline data cleaning

The advantage of offline cleaning is that it is convenient for evaluating the effect of new features; the disadvantage is poor timeliness and some discrepancy from the online real-time environment, which makes it hard to train correct weights for real-time features.

Online data cleaning

The advantage of online cleaning is strong timeliness and a faithful record of actual online data; the disadvantage is that adding a new feature requires time to accumulate data.

Sample sampling and sample filtering

Feature data can only be used for model training after it is merged with label data. The following describes how to clean label data, mainly via data sampling and sample filtering.

Data sampling: for classification problems, this means selecting positive and negative examples; for regression problems, it means collecting data points. Sampled examples may also need sample weights. When the model cannot be trained on all the data, we sample the data at a certain rate; sampling methods include random sampling, fixed-proportion sampling, and so on.
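A minimal sketch in Python of negative downsampling with sample weights, one common form of the fixed-proportion sampling mentioned above; the 10% sampling rate and the weight correction are illustrative:

    import random

    def downsample_negatives(samples, neg_rate=0.1, seed=0):
        # samples: iterable of (features, label) pairs with binary labels.
        rng = random.Random(seed)
        kept = []
        for x, label in samples:
            if label == 1:
                kept.append((x, label, 1.0))                 # keep all positives
            elif rng.random() < neg_rate:
                kept.append((x, label, 1.0 / neg_rate))      # re-weight the kept negatives
        return kept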

Besides sampling, samples usually also need to be filtered, including:

Filtering based on the business situation, e.g. removing crawler traffic, spam, cheating data, and so on.

Anomaly detection, using anomaly detection algorithms to screen the samples; common algorithms include the following (see the sketch after this list):

Deviation-based detection, e.g. clustering, nearest neighbors, and so on.

Statistics-based outlier detection, e.g. using the range, the interquartile range, the mean deviation, or the standard deviation; this approach suits univariate numerical data. The range, a measure of variation, is the gap between the maximum and the minimum value; the interquartile range is commonly used to build box plots and to give a brief overview of the probability distribution.

Distance-based anomaly detection, which treats points whose distance to most other points in the dataset exceeds a threshold as anomalies; common distance measures are absolute (Manhattan) distance, Euclidean distance, and Mahalanobis distance.

Density-based anomaly detection, which examines the density around the current point to find local outliers, e.g. the LOF algorithm.
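A minimal sketch in Python of two of the detectors listed above: a statistics-based IQR rule for a single numeric feature, and density-based LOF via scikit-learn; the data and thresholds (1.5 * IQR, n_neighbors=2) are illustrative:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    values = np.array([1.0, 1.2, 0.9, 1.1, 50.0])        # one suspicious point
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    print("IQR outliers:", values[iqr_outliers])

    X = values.reshape(-1, 1)
    labels = LocalOutlierFactor(n_neighbors=2).fit_predict(X)   # -1 marks outliers
    print("LOF outliers:", X[labels == -1].ravel())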

2. Feature classification

Having covered how features and labels are cleaned, we now describe feature processing in detail. We first classify features, because different kinds of features call for different processing methods.

Depending on the classification criterion, features can be divided into: (1) low-level features and high-level features; (2) stable features and dynamic features; (3) binary features, continuous features, and enumerated features.

Low-level features are primitive features that require little or no manual processing or intervention, such as word-vector features in text, pixel values in images, user IDs, product IDs, and so on. Low-level features generally have high dimensionality and do not suit overly complex models. High-level features are obtained through more complex processing, combining business logic or rules with models, e.g. manual scores and model scores; they can be used in more complex non-linear models. Low-level features are relatively specific and have small coverage. The predicted value of long-tail samples is mainly influenced by high-level features, while the predicted value of high-frequency samples is mainly influenced by low-level features.

Stable features change (are updated) infrequently, e.g. average rating or deal price, which do not change over a fairly long period. Dynamic features are updated frequently, some even computed in real time, such as distance or sales in the last 2 hours; the two kinds can also be called real-time and non-real-time features. The distinction guides how feature storage and updates are designed: stable features can be built into an index and updated over long intervals, and if cached, the cache can live longer; dynamic features need real-time computation or real-time updates, and if cached, the cache expiration time must be set shorter.

Binary features take only the values 0 or 1, for example a user-ID feature (whether the current ID is a specific ID) or a word feature (whether a particular word appears in the article). Continuous features take rational values with an indefinite number of possible values, for example distance, which ranges from 0 to positive infinity. Enumerated features have a fixed, finite set of possible values, for example the day of the week, which has only 7 possible values: Monday, Tuesday, ..., Sunday. In practice we may convert between feature types, for example turning enumerated or continuous features into binary features. An enumerated feature is converted by mapping it to multiple binary features, one per enumeration value; the day of the week becomes 7 binary features: whether today is Monday, whether today is Tuesday, ..., whether today is Sunday. A continuous feature is converted by first discretizing it (discretization is discussed later) and then turning the discretized feature into N binary features, each indicating whether the value falls in the corresponding interval. Both conversions are sketched below.
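A minimal sketch in Python of the two conversions just described; the bin edges for the distance feature are illustrative:

    def one_hot_weekday(day):
        # Enumerated -> binary: one feature per possible value; day is 0..6.
        return [1 if i == day else 0 for i in range(7)]

    def bucketize(value, edges):
        # Continuous -> binary: one feature per interval [edges[i], edges[i+1]).
        return [1 if lo <= value < hi else 0 for lo, hi in zip(edges[:-1], edges[1:])]

    print(one_hot_weekday(2))                                   # e.g. day index 2
    print(bucketize(850.0, [0, 100, 300, 500, 1000, 3000000]))  # falls in [500, 1000)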

3. Feature processing and analysis

Having classified the features, we now describe common feature processing methods, including: (1) feature normalization, discretization, and default-value handling; (2) feature dimensionality reduction; (3) feature selection.

Feature normalization, discretization, and default-value handling

These methods are mainly used to process individual features.

Normalization

Different features have different value ranges. In some algorithms, for example linear models or distance-based models such as clustering and KNN, the value range of a feature has a large impact on the final result. For instance, a binary feature ranges over [0, 1], while a distance feature can range over [0, +∞); in practice distance is truncated, e.g. to [0, 3000000], but because of the mismatch in ranges the model may still be biased toward the feature with the larger range. To balance features with inconsistent ranges, features need to be normalized, mapping their values into the [0, 1] interval. Common normalization methods include the following (see the sketch after this list):

Function normalization: a mapping function maps feature values into [0, 1], e.g. min-max normalization, which is a linear mapping; non-linear functions such as log can also be used.

Per-dimension normalization: min-max normalization where the maximum and minimum are taken within the category the sample belongs to, i.e. local extremes rather than global ones.

Rank normalization: regardless of the original value, the feature is sorted by size and assigned a new value according to its rank.
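A minimal sketch in Python of the three normalization methods above, applied to a toy distance-like feature; all three map the values into [0, 1]:

    import numpy as np

    x = np.array([10.0, 200.0, 3000.0, 150000.0])

    min_max  = (x - x.min()) / (x.max() - x.min())      # function (linear) normalization
    log_norm = np.log1p(x) / np.log1p(x.max())          # non-linear mapping via log
    ranks    = x.argsort().argsort() / (len(x) - 1)     # rank normalization

    print(min_max, log_norm, ranks, sep="\n")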

Discretization

Continuous values may have an infinite value space, so continuous features are discretized to make them easier to represent and process in the model. Commonly used discretization methods are equal-width and equal-frequency division. Equal-width division splits the feature by its value range, treating all values within a segment identically; for example a feature ranging over [0, 10] can be split into 10 segments: [0, 1), [1, 2), ..., [9, 10). Equal-frequency division splits by the total number of samples, putting the same number of samples in each segment. For example, take the distance feature with range [0, 3000000] and split it into 10 segments: with equal-width division most samples end up in the first segment, whereas equal-frequency division avoids this, producing segments such as [0, 100), [100, 300), [300, 500), ..., [10000, 3000000], dense at the front and sparse at the back.
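A minimal sketch in Python of equal-width versus equal-frequency discretization using pandas; the skewed distance values are illustrative:

    import pandas as pd

    distance = pd.Series([5, 20, 40, 80, 120, 300, 900, 5000, 80000, 3000000])

    equal_width = pd.cut(distance, bins=5)    # same interval length; most samples land in one bin
    equal_freq  = pd.qcut(distance, q=5)      # same sample count per bin; denser bins near zero

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())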

Default value Processing

Some features may be missing because they cannot be collected or have no observations. For example, the distance feature is missing when the user blocks access to their location or when acquiring the location fails. Such features need special handling and must be given a default value. There are many ways to choose the default, e.g. a separate indicator, a sentinel number, the mean, and so on, as sketched below.
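A minimal sketch in Python of one common way to handle a missing distance value, assuming we add a missing-indicator feature and impute with the observed mean; both choices are illustrative:

    def impute_distance(distance, observed_mean):
        # Returns (is_missing, value): a binary indicator plus the imputed value.
        missing = distance is None
        return (1 if missing else 0), (observed_mean if missing else distance)

    print(impute_distance(None, 1200.0))    # (1, 1200.0)
    print(impute_distance(350.0, 1200.0))   # (0, 350.0)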

Feature dimensionality reduction

Before introducing dimensionality reduction, let us first talk about raising dimensionality. In machine learning there is VC dimension theory: roughly, the higher the VC dimension, the stronger the ability to shatter data and the higher the model complexity that can be afforded. Data that is not separable in a low-dimensional space may become separable when mapped to a higher-dimensional one. Think about how the brain classifies a pile of objects: it finds characteristics of those objects, such as color, shape, size, texture, and so on, and then classifies the objects along those characteristics, which is in effect a process of first raising the dimension and then partitioning. For example, to recognize a banana the brain may first notice that it is yellow, which is a split along the color dimension; but many things are yellow, such as cantaloupe. How do we tell bananas and cantaloupes apart? We notice that bananas are curved while cantaloupes are round, so we introduce a new dimension, shape, to distinguish them. This is an example of raising a one-dimensional feature (color) to a two-dimensional feature (color, shape).

This raises a question: if raising the dimension can make the model stronger, is a higher feature dimension always better? Why do we need feature dimensionality reduction and feature selection at all? Mainly for the following reasons:

The higher the feature dimension, the easier it is for the model to overfit, and then a more complex model does not help.

The higher the number of mutually independent feature dimensions, the larger the number of training samples needed to achieve the same performance on the test set.

Training, prediction, and storage costs all grow as the number of features increases.

In distance-based models such as KMeans and KNN, overly high dimensionality hurts both accuracy and performance when distances are computed.

The need for visual analysis: in low dimensions, e.g. two or three, we can plot the data and inspect it visually, but as the dimension grows this becomes hard to do. In machine learning there is also the classic notion of the curse of dimensionality, which describes how analyzing and organizing data in high-dimensional spaces runs into problems because volume grows exponentially with dimension. For example, 100 evenly spaced points can sample the unit interval with adjacent points no more than 0.01 apart, but in 10 dimensions, keeping adjacent sample points within 0.01 of each other along every axis requires 100^10 = 10^20 points.

Because of these problems with high-dimensional features, we need feature dimensionality reduction and feature selection. Common dimensionality reduction algorithms include PCA, LDA, and so on. The goal of dimensionality reduction is to map a dataset from a high-dimensional space to a low-dimensional one while losing as little information as possible, or keeping the data points as distinguishable as possible after the reduction.

PCA algorithm

The principal components of the data are obtained by eigenvalue decomposition of the covariance matrix. Taking two-dimensional features as an example, there may be a linear relationship between the two features (for example, speed per hour and speed per second), which makes the second dimension redundant. The goal of PCA is to discover such linear relationships between features and remove them.

LDA algorithm

LDA takes the labels into account, so that the data points remain as distinguishable as possible after dimensionality reduction, as sketched below.
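A minimal sketch in Python of PCA versus LDA on toy data using scikit-learn; note that LDA uses the class labels y while PCA does not:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2],
                  [8.0, 16.1], [9.0, 18.2], [10.0, 19.8]])
    y = np.array([0, 0, 0, 1, 1, 1])

    X_pca = PCA(n_components=1).fit_transform(X)                            # unsupervised
    X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # uses labels

    print(X_pca.ravel())
    print(X_lda.ravel())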

Feature Selection

The goal of feature selection is to find an optimal feature subset. Feature selection removes irrelevant or redundant features, reducing the number of features, improving model accuracy, and cutting running time. It also simplifies the model by keeping only the truly relevant features, which helps in understanding how the data was generated.

The general process of feature selection is shown in the following illustration:

Process of Feature selection (M. Dash and H. Liu 1997)

It mainly consists of a generation procedure, an evaluation function, a stopping criterion, and a validation procedure.

Feature selection: generation procedure and methods for generating feature subsets

Full Search (Complete)

Breadth First Search

Traverses the feature subspace breadth-first, enumerating all combinations; this exhaustive search is not very practical.

Branch and Bound Search

Adds branch pruning on top of exhaustive search, e.g. cutting off branches that cannot possibly beat the current best solution.

Others, such as Beam Search and Best First Search.

Heuristic Search (heuristic)

Sequential forward selection (SFS, sequential Forward Selection)

Start from the empty set and each time add the one feature that makes the evaluation optimal.

Sequential back selection (SBS, sequential backward Selection)

Start from the full set and each time remove the one feature that keeps the evaluation optimal.

L-R selection algorithm (LRS, plus-l minus-r Selection)

Either start from the empty set, add L features then remove R features each round, and keep the best subset (L > R); or start from the full set, remove R features then add L features each round, and keep the best subset (L < R).

Others, such as Bidirectional Search (BDS) and Sequential Floating Selection.
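Sequential forward and backward selection as described above can be prototyped with scikit-learn's SequentialFeatureSelector (available in recent versions). A minimal sketch, using logistic regression on the iris toy dataset purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    est = LogisticRegression(max_iter=1000)

    # SFS: grow from the empty set; SBS: shrink from the full set.
    sfs = SequentialFeatureSelector(est, n_features_to_select=2, direction="forward").fit(X, y)
    sbs = SequentialFeatureSelector(est, n_features_to_select=2, direction="backward").fit(X, y)

    print("SFS keeps features:", sfs.get_support())
    print("SBS keeps features:", sbs.get_support())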

Random Search (Random)

Random generation Sequence Selection algorithm (RGSS, Random generation plus sequential Selection)

A feature subset is randomly generated and then the SFS and SBS algorithms are executed on the subset.

Simulated annealing algorithm (SA, simulated annealing)

Accepts a solution worse than the current one with a certain probability, and that probability decreases over time.

Genetic algorithm (GA, genetic algorithms)

The next generation of feature subsets is produced by crossover and mutation, and the higher a subset's score, the higher its probability of being selected for reproduction.

A common drawback of randomized algorithms: they depend on random factors, so experimental results are hard to reproduce.

Feature selection: effectiveness analysis

By analyzing feature effectiveness we obtain a weight for each feature. Depending on whether the analysis depends on a model, the weights can be divided into:

Model-dependent feature weights: all the feature data is used to train a model, and we then inspect each feature's weight in that model. These weights depend on the model being learned; different models yield different weights and measures, for example the feature coefficients in a linear model.

Model-independent feature weights: these mainly analyze the relationship between a feature and the label, independently of any particular model. Methods include: (1) cross entropy; (2) information gain; (3) odds ratio; (4) mutual information; (5) KL divergence. Both kinds of weights are sketched below.
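A minimal sketch in Python of both kinds of weights using scikit-learn: the coefficients of a linear model as model-dependent weights, and mutual information between each feature and the label as a model-independent score; the iris dataset is used purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    lr = LogisticRegression(max_iter=1000).fit(X, y)
    print("model-dependent weights (coefficients):", lr.coef_)

    print("model-independent weights (mutual information):",
          mutual_info_classif(X, y, random_state=0))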

Feature monitoring

In machine learning tasks, features are very important.

In our experience, features account for roughly 80% of the final effect. The following figure shows how the correlation coefficient between the model's predictions and the actual values changes as the number of features increases.

Feature importance

We monitor important features and analyze their effectiveness, so that we understand the features the model relies on and notice promptly when a particularly important feature has a problem, preventing catastrophic results. A long-term mechanism for monitoring feature effectiveness should be established.

We monitor key features; a screenshot of the feature monitoring interface is shown below. Through monitoring we found a feature whose coverage was declining day by day. After contacting the provider of that feature data, we found a problem with their data source; once it was fixed, the feature returned to normal and its coverage improved greatly.

Feature monitoring

When an abnormal feature is detected, we take timely measures to degrade the service gracefully, and we contact the provider of the feature data to fix the problem as soon as possible. Where monitoring is missing in the feature data generation pipeline, we also push for it to be added so that problems are solved at the source.
