"**Feature engineering**" is a fancy term for making sure that your predictors are encoded in the model in a way that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and the response differs markedly between weekdays and weekends, then encoding the date as weekday versus weekend makes it easier to get good results.
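As a minimal sketch of the date example, the base R snippet below turns a date column into a weekday/weekend indicator. The dates and the variable names (`dates`, `is_weekend`, `day_type`) are made up for illustration and are not part of the data discussed in this post.

```r
# Hypothetical dates; all names here are illustrative only.
dates <- as.Date(c("2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, regardless of locale.
wday <- as.POSIXlt(dates)$wday
is_weekend <- wday %in% c(0, 6)

# A two-level factor that a model can use directly as a predictor.
day_type <- factor(ifelse(is_weekend, "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(data.frame(dates, day_type))
```

Whether a two-level factor is the right encoding depends on the response; a seven-level day-of-week factor would be the obvious alternative if individual days differ.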

However, how well this works depends on several things.

First, it is model-dependent. For example, a tree may have trouble with a classification data set whose class boundary is a diagonal line, because a tree builds its class boundaries from orthogonal (axis-aligned) slices of the data (oblique trees excepted).
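The point about orthogonal slices can be illustrated with a small simulation. This is only a sketch on made-up data: `best_split_accuracy` is a hypothetical helper that finds the best single axis-aligned cut on one feature (a one-split "stump"), not a full tree algorithm.

```r
set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
cls <- x1 > x2  # simulated classes separated by a diagonal line

# Accuracy of the best single threshold on one feature (either orientation).
best_split_accuracy <- function(x, y) {
  thresholds <- sort(unique(x))
  acc <- vapply(thresholds, function(threshold) {
    pred <- x > threshold
    max(mean(pred == y), mean(pred != y))
  }, numeric(1))
  max(acc)
}

acc_x1   <- best_split_accuracy(x1, cls)       # one cut on a raw predictor
acc_diff <- best_split_accuracy(x1 - x2, cls)  # one cut on an engineered feature
print(c(raw = acc_x1, engineered = acc_diff))
```

A single cut on `x1` cannot follow the diagonal, while the same cut on the engineered difference `x1 - x2` separates the simulated classes completely. A deeper tree would approximate the diagonal with a staircase of many splits.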

Second, the process of encoding predictors benefits most from subject-specific knowledge of the problem. In the date example above, you need to know the patterns in your data before you can improve the format of the predictor. Feature engineering looks very different in image processing, information retrieval, RNA expression profiling, and so on. You need to know something about the problem and about your particular data set to do it well.

Here is some training set data in which two predictors are used to model a two-class system (I'll reveal the source of the data later):

There is also a corresponding test set that we will use below.

We can make a few observations:

- The two predictors are highly correlated (correlation = 0.85).
- Each predictor appears to be right-skewed.
- They appear to be informative, in the sense that you might be able to draw a diagonal line to separate the classes.
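Observations like the first two can be checked numerically. The sketch below uses simulated stand-in data, since the real training set is not reproduced here; the variables `a` and `b` and the helper `skew` are assumptions made for illustration.

```r
set.seed(42)
n <- 1000

# Simulated stand-ins for two positive predictors driven by a shared "size" factor.
size <- rlnorm(n, meanlog = 0, sdlog = 0.5)
a <- size^2 * rlnorm(n, meanlog = 0, sdlog = 0.1)
b <- size   * rlnorm(n, meanlog = 0, sdlog = 0.1)

# Simple moment-based sample skewness; positive values indicate right skew.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

correlation <- cor(a, b)
skew_a <- skew(a)
skew_b <- skew(b)
print(c(correlation = correlation, skew_a = skew_a, skew_b = skew_b))
```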

Depending on the model we choose, the correlation between the two predictors might bother us. We should also check whether the individual predictors are important. To measure this, we will use the area under the ROC curve computed directly on the predictor data.

Here are univariate box plots of each predictor (on the log scale):

There is some mild differentiation between the two classes, but the boxes overlap a great deal. The areas under the ROC curve for predictors A and B are 0.61 and 0.59, respectively. Not so good.
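An area under the ROC curve can be computed for a single predictor without fitting any model, using the rank-based (Mann-Whitney) formula. The function below is a base R sketch of that idea; the name `auc` and the toy inputs are illustrative, and this is not necessarily the routine used to produce the numbers above.

```r
# Area under the ROC curve when a single numeric predictor is used as the score.
# `is_positive` is a logical vector marking the class treated as the event.
auc <- function(score, is_positive) {
  r     <- rank(score)          # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy check: three of the four (positive, negative) pairs are ordered correctly.
toy_auc <- auc(c(1, 2, 3, 4), c(FALSE, TRUE, FALSE, TRUE))
print(toy_auc)
```

A value of 0.5 means the predictor orders the classes no better than chance. Values below 0.5 mean the predictor is informative in the opposite direction, so `max(auc, 1 - auc)` is often what gets reported.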

So what can we do? Principal component analysis (PCA) is a pre-processing method that rotates the predictor data in a way that creates new synthetic predictors (i.e., the principal components, or PCs). The rotation is chosen so that the first component accounts for the majority of the (linear) variation, or information, in the predictor data. The second component does the same for whatever information remains after the first component has been extracted, and so on. For these data there are two possible components (because there are only two predictors). Using PCA in this way is often referred to as **feature extraction**.
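For two centered and scaled predictors, the rotation can be worked out by hand from the correlation matrix. The sketch below assumes only the correlation of 0.85 quoted earlier; the variable names are illustrative.

```r
rho  <- 0.85
corr <- matrix(c(1, rho,
                 rho, 1), nrow = 2)

# PCA on standardized data is an eigen-decomposition of the correlation matrix.
eig <- eigen(corr, symmetric = TRUE)

# The eigenvalues are the component variances: 1 + rho and 1 - rho.
print(eig$values)

# The eigenvectors are the rotation: PC1 is (up to sign) the scaled sum of the
# two predictors and PC2 is the scaled difference.
print(eig$vectors)

# Share of the total variance carried by each component.
var_share <- eig$values / sum(eig$values)
print(var_share)
```

With a correlation of 0.85 the first component carries (1 + 0.85) / 2 = 92.5% of the variance, which is close to the 92.4% reported further down; the small gap is presumably because 0.85 is a rounded value.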

Let's compute the components:

    > library(caret)
    > head(example_train)
       PredictorA PredictorB Class
    2    3278.726  154.89876   One
    3    1727.410   84.56460   Two
    4    1194.932  101.09107   One
    12   1027.222   68.71062   Two
    15   1035.608   73.40559   One
    16   1433.918   79.47569   One
    > pca_pp <- preProcess(example_train[, 1:2],
    +                      method = c("center", "scale", "pca"))
    > pca_pp

    Call:
    preProcess.default(x = example_train[, 1:2], method = c("center", "scale", "pca"))

    Created from 1009 samples and 2 variables
    Pre-processing: centered, scaled, principal component signal extraction

    PCA needed 2 components to capture 95 percent of the variance
    > train_pc <- predict(pca_pp, example_train[, 1:2])
    > test_pc <- predict(pca_pp, example_test[, 1:2])
    > head(test_pc, 4)
            PC1         PC2
    1 0.8420447  0.07284802
    5 0.2189168  0.04568417
    6 1.2074404 -0.21040558
    7 1.1794578 -0.20980371

Note that we computed all of the necessary information from the training set and then applied those calculations to the test set. So what do the test set data look like?

These are simply the test set predictors, rotated.
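The same fit-on-training, apply-to-test discipline can be sketched in base R with `prcomp`, as a rough analogue of the caret calls above. The data here are simulated by a hypothetical helper, `make_cells`, because `example_train` and `example_test` are not reproduced in this post.

```r
set.seed(123)

# Hypothetical generator standing in for example_train / example_test.
make_cells <- function(n) {
  size <- rlnorm(n, meanlog = 0, sdlog = 0.5)
  data.frame(PredictorA = 1000 * size^2 * rlnorm(n, 0, 0.1),
             PredictorB = 80 * size * rlnorm(n, 0, 0.1))
}
train <- make_cells(200)
test  <- make_cells(100)

# Centering values, scaling values and rotation are estimated on the training set only.
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# predict() reuses the stored training statistics for any new data.
train_pc <- predict(pca, newdata = train)
test_pc  <- predict(pca, newdata = test)

# Equivalent by hand: standardize with the training statistics, then rotate.
manual <- scale(as.matrix(test), center = pca$center, scale = pca$scale) %*%
  pca$rotation
print(head(test_pc, 4))
```

Re-estimating the centering, scaling or rotation on the test set would leak test information into the pre-processing and make the two sets of components incomparable.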

PCA is unsupervised, which means that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve is 0.5 for the first component and 0.81 for the second. These results agree with the plot above: the first component contains a random mixture of the classes, while the second seems to separate the classes well. Box plots of the two components show the same thing:

There is much more separation between the two classes in the second component.

This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Second, it is the last component that matters most to the classes, yet it matters least to the predictors. It is often said that PCA does not guarantee that any of the components will be predictive, and this is true. Here we got lucky, and it produced something good.

But imagine that there are hundreds of predictors. We may only need the first X components to capture most of the information in the predictors, and in doing so we would discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors, so a similar strategy would probably discard the most effective predictor.
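The "keep the first X components" rule is usually a cumulative-variance cutoff. The sketch below takes the 92.4% figure from the text and assumes the remaining 7.6% belongs to the second component (there are only two); `n_components` is a hypothetical helper, not a caret function.

```r
# Variance shares: 92.4% is from the text, the remainder is implied.
var_share <- c(PC1 = 0.924, PC2 = 0.076)

# Smallest number of leading components whose cumulative share reaches `thresh`.
n_components <- function(var_share, thresh) {
  unname(which(cumsum(var_share) >= thresh)[1])
}

keep_at_95 <- n_components(var_share, 0.95)  # both components are retained
keep_at_90 <- n_components(var_share, 0.90)  # only PC1 is retained
print(c(keep_at_95 = keep_at_95, keep_at_90 = keep_at_90))
```

At a 95% cutoff both components survive, which matches the caret output shown earlier. Lowering the cutoff to 90% would keep only the first component and drop the one that actually separates the classes.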

How does feature engineering come into play? Given these two predictors and the first scatter plot shown above, the first thing that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". So what do those data look like?

The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
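The effect of taking a ratio can be reproduced on simulated data. Everything below is an assumption made for illustration: the data-generating process, the variable names, and the choice of A/B rather than B/A. It is a sketch of the idea, not a reconstruction of the real data.

```r
set.seed(2024)
n <- 1000
is_class_one <- rbinom(n, 1, 0.5) == 1

# Both predictors share a large "size" effect; the class shifts them in
# opposite directions by a small amount (on the log scale).
size <- rnorm(n, sd = 0.5)
a <- exp(size + 0.15 * is_class_one + rnorm(n, sd = 0.18))
b <- exp(size - 0.15 * is_class_one + rnorm(n, sd = 0.18))

# Rank-based area under the ROC curve, reported regardless of direction.
auc <- function(score, is_positive) {
  r     <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  area  <- (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(area, 1 - area)
}

auc_a     <- auc(a, is_class_one)
auc_b     <- auc(b, is_class_one)
auc_ratio <- auc(a / b, is_class_one)
print(round(c(A = auc_a, B = auc_b, ratio = auc_ratio), 2))
```

The ratio cancels the shared size effect, which is what drowns out the class signal in each predictor on its own.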

These data are from the cell segmentation experiment of Hill et al., where predictor A is the "surface area of a sphere created by rotating the equivalent circle about its diameter" (labeled EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high-content screening might naturally take the ratio of these two cell features because it makes good scientific sense (I am not that person). In the context of this problem, their intuition should drive the feature engineering process.

However, in defense of an algorithm such as PCA, the machine has some advantages. In total there are almost 60 predictors in these data, and many of them are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer, and they may be completely unintuitive from the start.

Another plus for feature extraction relates to correlation. The predictors in this particular data set tend to be highly correlated with one another, and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e., how elongated it is). Also, the size of a cell's nucleus is probably correlated with the overall size of the cell, and so on. PCA can mitigate the effect of these correlations in one fell swoop. Manually taking ratios of many predictors seems less likely to be effective and would take more time.
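As a small illustration of how PCA removes between-predictor correlation, the sketch below simulates three correlated size-like measurements and compares the largest pairwise correlation before and after the rotation. The column names and the helper `max_offdiag` are hypothetical.

```r
set.seed(7)
n    <- 300
size <- rnorm(n)

# Three simulated measurements that all track the same underlying size.
x <- cbind(cell_area    = size + rnorm(n, sd = 0.3),
           nucleus_area = size + rnorm(n, sd = 0.3),
           perimeter    = size + rnorm(n, sd = 0.3))

# Largest absolute pairwise correlation in a correlation matrix.
max_offdiag <- function(m) max(abs(m[upper.tri(m)]))

before <- max_offdiag(cor(x))
scores <- prcomp(x, center = TRUE, scale. = TRUE)$x
after  <- max_offdiag(cor(scores))
print(c(before = before, after = after))
```

The component scores are uncorrelated by construction, up to floating-point error, however many correlated predictors go in.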

Last year, in one of the R&D groups that I support, there was something of a war between the scientists who focused on biased analysis (that is, modeling what we already know) and those who favored unbiased analysis (that is, letting the machine figure it out). I sit somewhere in between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard body of "known things".

**Original link:** Feature Engineering versus Feature Extraction: Game On! (Translator: Heredity; reviewers: Liu Diwei, Zhu Zhengju; editor: Zhou Jianding)
