(vii) Feature Engineering in machine learning

Source: Internet
Author: User
Tags: svm

Machine learning is a hot topic in both industry and academia, but the two focus on different things: academia studies machine learning theory, while industry focuses on how to use machine learning to solve practical problems. Drawing on Meituan's practice with machine learning, this series presents the basic skills and experience needed to apply machine learning to real industrial problems. This article walks through the whole process of solving a practical problem with machine learning, including the key points of modeling the problem, preparing training data, extracting features, training the model, and optimizing the model, with a somewhat deeper look at each of these key steps.

The article is organized as follows:

1) Overview of machine learning; 2) Modeling the problem; 3) Preparing training data; 4) Extracting features; 5) Training the model; 6) Optimizing the model; 7) Summary.

Machine learning can be divided into unsupervised learning and supervised learning. In industry, supervised learning is the more common and more valuable approach, and the rest of this article is organized around it. When supervised machine learning is used to solve a practical problem there are two processes. One is the offline training process (the blue arrows in the figure), which includes data filtering and cleaning, feature extraction, model training, and model optimization. The other is the application process (the green arrows): for the data to be estimated, extract features, apply the model obtained by offline training, and use the resulting estimates in the actual product. Of the two, offline training is the most technically challenging work (much of the online estimation process can reuse work from the offline training process), so the following focuses on the offline training process.

The figure shows a classic machine learning problem framework. The work of data cleaning and feature mining corresponds to the part framed in the gray box, namely "data cleaning => features and labeled data generation => model learning => model application".
The blue arrows inside the gray box correspond to the offline processing part. The main work is:

1) Clean feature data and labeled data out of raw data such as text, images, or application data.

2) Process the cleaned features and labeled data, for example sampling, sample re-weighting, outlier removal, feature normalization, feature transformation, and feature combination. The resulting data is mainly used for model training.

A model is an important concept in machine learning. Simply put, it is a mapping from the feature space to the output space, and generally consists of a hypothesis function and the model parameters w. For example, the logistic regression model can be written as hw(x) = 1 / (1 + e^(-w·x)); a slightly more detailed explanation is given in the section on training the model. A model's hypothesis space is the set of outputs corresponding to all possible values of w for a given model. Models commonly used in industry include Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), Support Vector Machine (SVM), and Deep Neural Network (DNN). Model training means obtaining, from the training data, a set of parameters w that makes a specific objective optimal, i.e. obtaining the optimal mapping from the feature space to the output space; for how this is done, see the section on training the model.

This article takes the deal (group-purchase order) turnover estimation problem as an example (i.e. estimating how much a given deal will sell over a period of time) and describes how to solve it with machine learning. First, you need to:

Collect information about the problem, understand the problem, and become an expert on it;

Decompose and simplify the problem, and transform it into a problem that a machine can predict.

After an in-depth understanding and analysis of deal turnover, it can be broken down into several sub-problems:

A single model? Multiple models? How to choose?
Following the decomposition, there are two possible ways to estimate deal turnover: one is to estimate the turnover directly; the other is to estimate sub-problems, for example building a user-number model and a purchase-rate model (how many orders users who visit the deal will purchase), and then computing the turnover from the estimates of these sub-problems.
Different approaches have different advantages and disadvantages, specifically:

Which mode do you choose?

1) The difficulty of estimating the problem directly: if it is high, consider using multiple models;

2) The importance of the problem itself: if the problem is very important, consider using multiple models;

3) Whether the relationship between the multiple models is clear: if it is, multiple models can be used.

If multiple models are used, how can they be fused?

The fusion can be a linear combination or something more complex, depending on the characteristics and requirements of the problem. Taking this problem as an example, there are at least the following two options.

Model selection

For the deal turnover problem, we believe that estimating it directly is very difficult, so we split it into sub-problems and use the multi-model approach. That requires building a user-number model and a purchase-rate model; since machine learning handles both in a similar way, the purchase-rate model is used as the example below. To solve the purchase-rate problem, the model must be chosen first, with the following considerations:

Main considerations

1) Select a model that is consistent with the business objectives;

2) Select a model that matches the training data and features:

If training data is limited and High Level features dominate, use a "complex" non-linear model (e.g. the popular GBDT, Random Forest);

If the training data is massive and Low Level features abound, use a "simple" linear model (e.g. the popular LR, Linear-SVM).

Supplementary considerations

1) Whether the model is widely used in industry;

2) Whether the model has a relatively mature open-source toolkit (inside or outside the company);

3) Whether the amount of data the toolkit can handle meets the requirements;

4) Whether you are familiar with the model's theory and have used it to solve problems before.

To select a model for a practical problem, the business objective needs to be translated into a model evaluation target, and the evaluation target into a model optimization target (loss function); the specific relationship depends on the business objective.

In general, estimating the true value of the target (regression), its relative order (ranking), and the correct interval it falls in (classification) are, in that order, decreasingly difficult; choose the least difficult target that still meets the application's needs. For the purchase-rate estimation target we need at least the relative order or the true value, so we can choose Area Under Curve (AUC) or Mean Absolute Error (MAE) as the evaluation target (a small sketch of computing both follows the list below), with Maximum Likelihood as the model loss function (i.e. the optimization target). In summary, we choose the Spark version of GBDT or LR, based on the following considerations:

1) It can solve ranking or regression problems;

2) We have implemented the algorithms ourselves, use them often, and the results are good;

3) They support massive data;

4) They are widely used in industry.
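To make the two evaluation targets named above concrete, below is a minimal sketch of computing AUC and MAE with scikit-learn; the labels and predictions are made-up illustrative numbers, not from the article.

```python
# Minimal sketch of the two evaluation targets, computed with scikit-learn
# on made-up labels and predictions.
from sklearn.metrics import roc_auc_score, mean_absolute_error

y_true = [0.00, 0.10, 0.30, 0.80]   # actual purchase rates of four deals (illustrative)
y_pred = [0.05, 0.20, 0.25, 0.60]   # model estimates

# MAE checks the true values; AUC checks the relative order (here we
# binarize "high purchase rate" at 0.25 purely for illustration).
print("MAE:", mean_absolute_error(y_true, y_pred))
print("AUC:", roc_auc_score([t > 0.25 for t in y_true], y_pred))
```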

Prepare training data

After understanding the problem in depth and choosing an appropriate model, the next step is to prepare the data. Data is the foundation of machine learning: if the data is not chosen well, the problem cannot be solved, so preparing training data requires extra care and attention.

Points to note:

The distribution of the data for the problem to be solved should be as consistent as possible;

The training/test set distribution should be as consistent as possible with the data distribution of the online prediction environment; "distribution" here means the distribution of (x, y), not just the distribution of y;

Noise in the y data should be as small as possible; try to remove data whose y is noisy;

Do not sample unless necessary, since sampling often changes the actual data distribution; but if the data is too large to train on, or the positive/negative ratio is severely imbalanced (for example beyond 100:1), sampling is needed (a minimal downsampling sketch follows).
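A minimal sketch of negative downsampling under the conditions just described, assuming a pandas DataFrame with a 0/1 `label` column; the DataFrame and column name are illustrative assumptions.

```python
# Downsample negatives so the positive:negative ratio is at most 1:10;
# the DataFrame layout and "label" column are assumptions for illustration.
import pandas as pd

def downsample_negatives(df: pd.DataFrame, label_col: str = "label",
                         max_neg_per_pos: int = 10, seed: int = 0) -> pd.DataFrame:
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = min(len(neg), max_neg_per_pos * len(pos))
    # Keep all positives, sample the negatives, then shuffle the result.
    return pd.concat([pos, neg.sample(n=n_neg, random_state=seed)]).sample(frac=1, random_state=seed)
```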

Frequently asked questions and solutions

The data distribution of the problem to be solved is inconsistent:

1) The deal data in the purchase-rate problem may differ greatly: for example, food deals and hotel deals have very different influencing factors or behavior. This needs special treatment: either normalize the data in advance, treat the factors whose distributions differ as features, or train separate models for different types of deals.

The data distribution changes over time:

1) A model trained on data from six months ago may perform poorly when predicting current data, because the data distribution can change over time. Try to train on recent data when predicting current data; historical data can still be used with reduced weight, or via transfer learning.

Y data is noisy:

1) When building a CTR model, treating items the user did not see as negative examples is noisy: those items were not clicked because the user never saw them, not necessarily because the user dislikes them. Simple rules can remove such noisy negatives, for example the skip-above idea: assuming the user scans items from top to bottom, only unclicked items ranked above the last item the user clicked are kept as negatives (a small sketch follows).
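A minimal sketch of the skip-above rule, assuming each impression is represented as a list of items in display order plus the positions the user clicked; the function name and data layout are illustrative.

```python
# Skip-above: keep as negatives only the unclicked items ranked above the
# last clicked position (the user is assumed to scan from top to bottom).
def skip_above_negatives(items, clicked_positions):
    if not clicked_positions:
        return []                      # nothing clicked: we cannot tell what was seen
    last_click = max(clicked_positions)
    return [item for pos, item in enumerate(items)
            if pos < last_click and pos not in clicked_positions]

# Example: items below position 3 are dropped instead of being counted as negatives.
print(skip_above_negatives(["a", "b", "c", "d", "e"], clicked_positions=[0, 3]))
# -> ['b', 'c']
```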

The sampling method is biased and does not cover the entire collection:

1) In the purchase-rate problem, if only deals from single-store merchants are sampled, deals from multi-store merchants cannot be estimated well. Deal data from both single-store and multi-store merchants should be included;

2) When a binary classification problem has no objective labeled data and rules are used to obtain positive/negative examples, the rules may not cover positives/negatives comprehensively. Randomly sampled data should be manually annotated to ensure that the sampled data and the actual data are distributed consistently.

Training data for the purchase-rate problem

Collect N months of deal data (x) and the corresponding purchase rates (y);

Collect the most recent N months, excluding holidays and other atypical periods (keep the distribution consistent);

Collect only deals with online duration > T and number of users > U (reduce noise in y; see the filtering sketch after this list);

Consider the deal's sales life cycle (keep the distribution consistent);

Consider differences between cities, business districts, and categories (keep the distribution consistent).
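These collection rules translate into simple filters; here is a minimal pandas sketch, where the DataFrame and all column names (`online_days`, `exposed_users`, `is_holiday_period`) and thresholds are illustrative assumptions, not from the article.

```python
import pandas as pd

T_DAYS, U_USERS = 30, 1000   # illustrative thresholds for online duration and user count

def build_training_set(deals: pd.DataFrame) -> pd.DataFrame:
    """Apply the collection rules listed above to a raw deal table (hypothetical schema)."""
    mask = (
        (deals["online_days"] > T_DAYS)        # online duration > T (reduce noise in y)
        & (deals["exposed_users"] > U_USERS)   # enough users (reduce noise in y)
        & (~deals["is_holiday_period"])        # exclude holidays / atypical periods
    )
    return deals.loc[mask]
```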

Extracting features

After data filtering and cleaning, features need to be extracted from the data, i.e. the conversion from input space to feature space is completed. Linear and non-linear models require different feature extraction: a linear model requires more feature-extraction work and skill, while a non-linear model has relatively lower requirements.

For a non-linear model, features can be fed to the model after only simple processing.

For a linear model, domain knowledge is needed to perform non-linear feature transformations (such as feature combinations) before selecting features. For example, combining user IDs with user features yields a very large feature set, from which features are then selected. This practice is common in recommendation and advertising systems, and is the main source of the so-called hundred-million or even billion-scale feature sets: because user data is sparse, combined features can take both the global model and the personalized model into account. This topic deserves a fuller discussion of its own.
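A minimal sketch of such feature crossing with the hashing trick, assuming a linear model that consumes a sparse index-to-value mapping; the function, column names, and dimensionality are illustrative, and a production system would use a stable hash (e.g. scikit-learn's FeatureHasher) rather than Python's built-in `hash`.

```python
# Hashing-trick sketch for crossed features such as user_id x category.
D = 2 ** 24  # dimensionality of the hashed feature space (illustrative)

def cross_features(user_id, deal_category, deal_city):
    """Return sparse feature indices for one sample (hypothetical feature names)."""
    raw = [
        f"user={user_id}",                         # low-level id feature
        f"cat={deal_category}",                    # high-level feature
        f"user_x_cat={user_id}_{deal_category}",   # crossed (combined) feature
        f"user_x_city={user_id}_{deal_city}",
    ]
    # Each raw feature string is mapped to one of D slots; collisions are tolerated.
    # Note: Python's hash() is per-process salted; use a stable hash in practice.
    return {hash(s) % D: 1.0 for s in raw}

print(cross_features("u_42", "food", "beijing"))
```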

[Figure: feature extraction]

In general, features can be divided into High Level and Low Level features. High Level features have a more general meaning; Low Level features have a more specific meaning. For example:

Deal A1 belongs to POI A, per-capita spend below 50, purchase rate is high;

Deal A2 belongs to POI A, per-capita spend above 50, purchase rate is high;

Deal B1 belongs to POI B, per-capita spend below 50, purchase rate is high;

Deal B2 belongs to POI B, per-capita spend above 50, purchase rate is low;

From the above data, two features can be drawn: POI (store) and per-capita spend. The POI feature is a Low Level feature; per-capita spend is a High Level feature. Suppose the model learns the following estimates:

If Deal X belongs to POI A (Low Level feature), the purchase rate is high;

If Deal X's per-capita spend is below 50 (High Level feature), the purchase rate is high.

Therefore, in general: Low Level features are more targeted, a single feature covers little data (not much data contains the feature), and the number of features (dimensionality) is very large. High Level features are more generalized, a single feature covers a lot of data, and the number of features (dimensionality) is not large. The predicted values of long-tail samples are mainly influenced by High Level features, while the predicted values of high-frequency samples are mainly influenced by Low Level features.

There are many High Level and Low Level features for the purchase-rate problem; some of them are shown below:

[Figure: partial list of features for the purchase-rate problem]

Features for non-linear models

1) High Level features can be used as the main input; because the computational complexity of non-linear models is high, a very high feature dimensionality is not appropriate;

2) The target can be fitted well through the model's non-linear mapping of High Level features.

Features for linear models

1) The feature set should be as comprehensive as possible, including both High Level and Low Level features;

2) High Level features can be converted into Low Level features to enhance the model's fitting capability.

Normalization of features

After feature extraction, if the value ranges of different features differ greatly, it is best to normalize the features to achieve better results. For common normalization methods, see (i) Linear regression and feature normalization (feature scaling).
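A minimal normalization sketch with scikit-learn; the two-feature array below is made-up example data with very different scales.

```python
# Minimal sketch of feature normalization with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[ 3000.0, 0.2],
              [  500.0, 0.8],
              [12000.0, 0.5]])

X_minmax = MinMaxScaler().fit_transform(X)     # scale each feature to [0, 1]
X_zscore = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
print(X_minmax)
print(X_zscore)
```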

Discretization of features

To make features easy to represent and process in the model, continuous-valued features need to be discretized. Common discretization methods include equal-width division and equal-frequency division. Equal-width division splits the feature by its value range, and each segment is treated equally: for example, a feature with value range [0,10] can be divided into 10 segments [0,1], [1,2], ..., [9,10]. Equal-frequency division splits according to the total number of samples, so that each segment contains roughly the same number of samples. For example, take a distance feature with value range [0,3000000] that needs to be cut into 10 segments: with equal-width division, most samples end up in the first segment. Equal-frequency division avoids this problem; the final segments might be [0,100], [100,300], [300,500], ..., [10000,3000000], dense at the front and sparse at the back.
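A minimal sketch of the two discretization schemes with pandas; the skewed "distance" values are made-up example data.

```python
# Equal-width vs. equal-frequency discretization on a long-tailed feature.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
distance = pd.Series(rng.exponential(scale=500, size=1000))  # long-tailed, made-up data

equal_width = pd.cut(distance, bins=10)    # equal-width: most samples fall into the first bin
equal_freq  = pd.qcut(distance, q=10)      # equal-frequency: each bin holds ~10% of samples

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```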

Feature Selection

After feature extraction and normalization, if there are so many features that the model cannot be trained, or the model easily overfits, feature selection is needed to pick out the valuable features. Feature selection methods fall into the following types:

Filter: compute the correlation between each feature and the response variable. Common engineering practice computes the Pearson coefficient and the mutual information coefficient: the Pearson coefficient can only measure linear correlation, while the mutual information coefficient can measure correlation more generally but is more complex to compute. Fortunately many toolkits include such tools (for example minepy's MINE or scikit-learn's mutual information estimators), so features can be ranked by correlation and then selected;
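A minimal filter-style sketch on made-up data: rank features by absolute Pearson correlation and by mutual information using scikit-learn.

```python
# Filter selection sketch: Pearson correlation vs. mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # 5 candidate features (made-up)
y = 3 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=500)

pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
mi = mutual_info_regression(X, y, random_state=0)

# Feature 0 scores high on both; the non-linear feature 1 is caught mainly by MI.
print("abs Pearson:", np.abs(pearson).round(2))
print("mutual info:", mi.round(2))
```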

Wrapper: select a subset of features to add to the existing feature set, train with the model, and compare the effect before and after the addition; if the effect improves, the subset is considered effective. Wrapper feature selection can eliminate irrelevant or redundant features, which reduces the number of features, improves model accuracy, and reduces running time. It also selects the truly relevant features, simplifying the model and helping us understand the data-generating process. The general process is as follows:
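A minimal greedy forward-selection sketch of that process: repeatedly add the candidate feature that most improves cross-validated score. The model, dataset, and scoring are illustrative choices, not prescribed by the article.

```python
# Wrapper selection sketch: greedy forward selection with cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:      # stop when no candidate improves the score
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected features:", selected, "cv R^2:", round(best_score, 3))
```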

Embedded: combine feature selection with model training by adding L1/L2 regularization to the loss function. L1 regularization yields sparse solutions and therefore naturally performs feature selection. Note, however, that a feature not chosen by L1 is not necessarily unimportant: of two highly correlated features, only one may be retained. To determine which features are important, cross-check with L2 regularization.
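A minimal embedded-selection sketch on made-up data: L1-regularized logistic regression keeps only the features with non-zero coefficients. The dataset and regularization strength are illustrative.

```python
# Embedded selection sketch: L1-regularized logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(l1_model.coef_[0])   # features with non-zero weights
print("features kept by L1:", selected)
```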

In summary, the steps above (cleaning, feature extraction, normalization, discretization, and selection) together make up the feature-engineering workflow.

Optimization Model

After the data filtering and cleaning, feature design and selection, and model training described above, a model is obtained. But what if the results are not good? What should be done?

First

Check whether the target can actually be estimated, and whether there are bugs in the data or features.

Then

Analyze whether the model is overfitting or underfitting, and optimize accordingly from the perspectives of data, features, and model.

Underfitting & Overfitting

Underfitting means the model has not learned the intrinsic relationships in the data: as shown on the left of the figure, the resulting decision surface cannot distinguish the X and O data well. The underlying cause is that the model's hypothesis space is too small, or the hypothesis space is biased.

Overfitting means the model over-fits the intrinsic relationships of the training data: as shown on the right of the figure, the resulting decision surface separates the X and O data "too well", while the true decision surface may not look like that, so performance on non-training data is poor. The underlying cause is the mismatch between a huge model hypothesis space and sparse data.

[Figure: underfitting (left) vs. overfitting (right)]

In practice, whether the current model is underfitting or overfitting can be judged from its performance on the training set and the test set: roughly, good performance on the training set but poor performance on the test set indicates overfitting, while poor performance on both indicates underfitting.
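A minimal sketch of that train/test diagnosis, using an illustrative GBDT classifier on made-up data; the thresholds are rough rules of thumb, not values from the article.

```python
# Diagnose over/underfitting from training vs. test performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
train_acc, test_acc = model.score(X_tr, y_tr), model.score(X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f}")

if train_acc - test_acc > 0.1:
    print("large gap -> likely overfitting")
elif train_acc < 0.7:
    print("poor fit even on training data -> likely underfitting")
```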

How to solve the underfitting and overfitting problems? Broadly, underfitting calls for adding features, using a more complex model, or weakening regularization; overfitting calls for more training data, fewer or better-selected features, or stronger regularization.

Summary

In summary, solving a problem with machine learning involves modeling the problem, preparing training data, extracting features, training the model, and optimizing the model, with the following key points:

Understand the business, decompose the business objectives, and plan a roadmap for what the model should predict.

Data

The y data should be as realistic and objective as possible;

The training/test set distribution should be as consistent as possible with the data distribution of the online application environment.

Features

Use domain knowledge for feature extraction and selection;

Design different features for different types of models.

Model

Choose different models for different business goals, different data and features;

If the model does not meet expectations, first check the data, features, model, and other processing steps for bugs;

Consider model underfitting and overfitting, and optimize accordingly.

Reference: http://tech.meituan.com/machinelearning-data-feature-process.html
