Seven common feature engineering techniques

Source: Internet
Author: User
Tags: numeric, scalar

Do machine learning like the great engineer you are, not like the great machine learning expert you aren't. --- Google

In data mining and data analysis, the data itself is the basis of every problem and shapes the entire engineering process. Compared with tuning complex algorithms, handling the data well and flexibly often brings unexpected benefits, and feature engineering is the tool for doing exactly that.

I. What is feature engineering

Simply put, feature engineering is a technique for presenting data like art. Why say that? Because good feature engineering blends professional domain knowledge, intuition, and basic mathematical ability. Yet the most effective way of presenting the data often does not involve any elaborate manipulation of it at all.

Essentially, the data handed to the algorithm should convey the relevant structure and attributes of the underlying problem. Feature engineering is the process of converting raw data attributes into data features. The raw attributes cover all the dimensions of the data, and if you model them directly, the model may fail to find the underlying trends. When you preprocess the data with feature engineering, the model suffers less noise interference and can identify trends better. In fact, good features can let you achieve good results even with a simple model.


For any new feature introduced during feature engineering, however, you need to verify that it actually improves prediction accuracy; otherwise you are only adding a useless feature that increases the computational complexity of the algorithm.

This article presents only some simple feature engineering techniques, in the hope that they will help in your future analyses.

II. Common methods

1. Representing timestamps

Timestamp attributes often need to be separated into multiple dimensions such as year, month, day, hour, minute, and second. In many applications, however, much of that information is unnecessary. For example, in a supervised system that tries to predict traffic levels in a city from a 'location + time' feature, trying to learn trends that vary at the level of seconds would mostly be misleading. The 'year' dimension likewise adds little value to the model; we may only need the hour, day, and month. So when presenting time, try to make sure that all of the data you provide is what your model actually needs.

And don't forget time zones: if your data comes from different geographic sources, remember to standardize it by time zone.
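As a rough sketch of this idea (the column name `ts`, the sample values, and the choice of which components to keep are purely illustrative), a pandas timestamp column can first be standardized to one time zone and then decomposed into only the dimensions the model needs:

```python
import pandas as pd

# Hypothetical event timestamps coming from two different regions.
df = pd.DataFrame({
    "ts": ["2016-06-25 08:15:00+02:00", "2016-06-25 23:40:00+08:00"]
})

# Parse and standardize to UTC so sources from different time zones line up.
df["ts"] = pd.to_datetime(df["ts"], utc=True)

# Keep only the granularity the model needs (here: month, day, hour),
# dropping year/minute/second, which would mostly add noise for this task.
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour

print(df[["month", "day", "hour"]])
```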

2. Decomposing categorical attributes

Some attributes are categorical rather than numeric. A simple example is a color attribute with the values {red, green, blue}. The most common approach is to convert each possible category into a binary attribute taking a value from {0, 1}. The number of added attributes thus equals the number of categories, and for each instance in your dataset only one of them is 1 (the others are 0). This is one-hot encoding (similar to converting to dummy variables).

If you are unfamiliar with this encoding, you may feel that the decomposition adds unnecessary hassle (since it adds a large number of dimensions to the dataset) and be tempted instead to convert the categorical attribute into a single scalar value, for example mapping the colors to {red=1, green=2, blue=3}. There are two problems with this. First, for a mathematical model it implies that red is somehow more "similar" to green than to blue (because |1-2| < |1-3|). Unless your categories have a natural ordering (such as stations along a railway line), this can mislead your model. Second, it can make statistical indicators (such as the mean) meaningless, or worse, misleading. Returning to the color example: if your dataset contains equal numbers of red and blue instances but no green ones, the mean of the color value is still 2, which reads as green.

Converting a categorical attribute to a scalar is most effective when there are only two categories, with {0, 1} corresponding to {category 1, category 2}. In that case no ordering is needed, and you can interpret the value as the probability of belonging to category 1 or category 2.
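A minimal sketch of this decomposition using pandas `get_dummies`, applied to the color example from the text (the data and column names are illustrative):

```python
import pandas as pd

# Illustrative categorical attribute from the text: {red, green, blue}.
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One-hot encoding: one binary column per category; exactly one of them
# is 1 for each row (the "dummy variable" representation).
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)

# For a two-category attribute, a single 0/1 column is enough and can be
# read as an indicator of belonging to one of the two classes.
df["is_red"] = (df["color"] == "red").astype(int)
```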

3. Binning / partitioning

Sometimes it makes more sense to convert a numeric attribute into a categorical representation, letting the algorithm reduce noise by grouping a range of values into a single bin. For example, suppose we are predicting whether a person owns a certain item of clothing. Exact age is certainly a factor, but the age group is really the more relevant one, so we can divide ages into ranges such as 1-10, 11-18, 19-25, 26-40, and so on. Moreover, instead of decomposing these bins as in point 2, you can use scalar values, because neighboring age groups behave similarly to each other.

Partitioning is meaningful only when you understand the domain knowledge behind the attribute well enough to decide that it can be divided into tidy ranges, that is, when all values that fall into one partition share a common characteristic. In practice, partitioning also helps avoid overfitting when you don't want your model to keep trying to distinguish between values that are too close together. For example, if what you care about is a city as a whole, you can merge all the coordinate values that fall inside that city into a single bin. Binning can also reduce the impact of small errors by snapping a given value to the nearest block. However, if the number of partitions is close to the number of possible values, or if fine-grained accuracy matters to you, binning is not suitable.
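A small sketch of such binning with pandas, using the age ranges mentioned above (the ages themselves are made up for illustration):

```python
import pandas as pd

# Illustrative ages; the bin edges follow the ranges mentioned above.
ages = pd.Series([4, 15, 22, 33, 9, 18])

# Bin the exact ages into the coarser groups 1-10, 11-18, 19-25, 26-40.
age_group = pd.cut(ages, bins=[0, 10, 18, 25, 40],
                   labels=["1-10", "11-18", "19-25", "26-40"])

# Alternatively, use ordered integer codes (0, 1, 2, 3) instead of one-hot
# columns, since neighboring age groups behave similarly.
age_code = age_group.cat.codes
print(pd.DataFrame({"age": ages, "group": age_group, "code": age_code}))
```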

4. Feature crosses

Feature crosses are one of the most important methods in feature engineering. A feature cross combines two or more categorical attributes into a single one, and it is extremely useful when the combination expresses something that the individual features cannot. Mathematically, it is the Cartesian product of all possible values of the categorical features.

Suppose you have a feature A with two possible values {A1, A2}, and a feature B with possible values {B1, B2}. The cross features between A and B are then {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}, and you can give these combined features any names you like. Just keep in mind that each combined feature represents the joint information of A and B.
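A small sketch of this kind of cross, assuming two illustrative string-valued columns named `A` and `B` in a pandas DataFrame:

```python
import pandas as pd

# Two illustrative categorical features A and B.
df = pd.DataFrame({"A": ["A1", "A1", "A2"], "B": ["B1", "B2", "B2"]})

# The cross feature enumerates every observed (A, B) combination
# as a single new categorical attribute.
df["A_x_B"] = df["A"] + "_" + df["B"]

# It can then be one-hot encoded like any other categorical feature.
crossed = pd.get_dummies(df["A_x_B"])
print(crossed)
```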

For example, consider the figure below:


All the blue points belong to one class, and the red ones belong to another. Setting the actual model aside, it would first be useful to bin the x values into {x < 0, x >= 0} and the y values into {y < 0, y >= 0}, naming the results {Xn, Xp} and {Yn, Yp}. It is clear that quadrants I & III correspond to the red class and quadrants II & IV to the blue class. So if you now cross feature X with feature Y, you get four "quadrant" features, {I, II, III, IV}, corresponding to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)}.

A good illustration of a useful cross feature is (longitude, latitude). The same longitude corresponds to many places on the map, and so does the same latitude. But once you combine latitude and longitude, they identify a specific geographic region, and locations within that region tend to share similar characteristics.

Sometimes attributes can also be combined into a single feature with simple mathematical tricks. In the previous example, suppose the transformed features are X_sign and Y_sign, defined as:

X_sign = x / |x|
Y_sign = y / |y|

The new feature is then defined as:

Quadrant_odd = X_sign * Y_sign

If Quadrant_odd = 1, the class is red; if it takes the other value (-1), the class is blue.
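A minimal sketch of this trick with NumPy; the points and variable names below simply follow the quadrant example above, and the sketch assumes no point lies exactly on an axis (where the sign would be zero):

```python
import numpy as np

# Illustrative points: quadrants I & III are "red", II & IV are "blue".
x = np.array([2.0, -1.5, -3.0, 0.5])
y = np.array([1.0, 2.5, -2.0, -1.0])

# Sign features x/|x| and y/|y| (np.sign gives the same +1/-1 values
# for nonzero inputs).
x_sign = np.sign(x)
y_sign = np.sign(y)

# The single cross feature: +1 for quadrants I & III (red),
# -1 for quadrants II & IV (blue).
quadrant_odd = x_sign * y_sign
print(quadrant_odd)   # [ 1. -1.  1. -1.]
```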

5. Feature Selection

In order to get a better model, some algorithms automatically select a subset of the original features. In this process you do not build or modify the features you already have; instead, you reduce noise and redundancy by pruning features.

Not every attribute is relevant to the problem we are dealing with, and some need to be removed. Among our data features, some contribute more to the accuracy of the model than others, and some are redundant with other features. Feature selection addresses these problems by automatically choosing the subset of features that is most useful for solving the problem.

Feature selection algorithms may use scoring methods to rank and choose features, such as correlation or other measures of feature importance; other methods search for a subset of features by trial and error.

Auxiliary models can also be built for this purpose: stepwise regression is an example of automatic feature selection embedded in the model construction process, and regularization methods such as LASSO and ridge regression also count as feature selection, since they add extra constraints or penalties to the existing model (its loss function) that shrink the influence of less useful features, preventing overfitting and improving generalization.
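As one possible illustration (not the only approach), scikit-learn's `SelectKBest` scores each feature against the target and keeps only the highest-ranked ones; the synthetic dataset here is purely for demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only a few are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Score every feature against the target and keep the 3 best ones.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", np.flatnonzero(selector.get_support()))
print("new shape:", X_selected.shape)   # (200, 3)
```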

6. Feature Scaling

Sometimes you may notice that some features span a much larger range of values than others, for example when comparing a person's income to their age. Some models (such as ridge regression) require the feature values to be scaled to comparable ranges. Scaling prevents certain features from receiving disproportionately large weights simply because of their magnitude.
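A small sketch of scaling with scikit-learn's `StandardScaler`; the income and age numbers are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: income spans tens of thousands, age only tens.
X = np.array([[30000.0, 25],
              [52000.0, 40],
              [110000.0, 61]])

# Standardize each column to zero mean and unit variance so that neither
# feature dominates a scale-sensitive model such as ridge regression.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```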

7. Feature Extraction

Feature extraction refers to a family of algorithms that automatically generate a new set of features from the raw attributes; dimensionality reduction algorithms belong to this category. Feature extraction is the process of automatically reducing the observations to a much smaller set that is still sufficient for modeling. For tabular data, available methods include projection methods such as principal component analysis, as well as unsupervised clustering algorithms. For image data, there are line detection and edge detection, and each domain has its own methods.

The key point of feature extraction is that these methods are automated (even if they may need to be designed and built from simpler components) and can tackle the problem of unmanageably high-dimensional data. In most cases, these different kinds of data (such as images, speech, and video) are simply stored as large raw digital observations, which makes them hard to model directly.
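As an illustrative sketch, principal component analysis with scikit-learn can compress the 64-pixel digit images into a handful of automatically derived features (the choice of 10 components is arbitrary here):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# The digits images are 64-dimensional pixel vectors (8x8 images).
X, _ = load_digits(return_X_y=True)

# Project them onto the 10 principal components that capture the most
# variance, producing a much smaller automatically derived feature set.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (1797, 64) -> (1797, 10)
print("explained variance:", pca.explained_variance_ratio_.sum().round(3))
```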





Original link: https://codesachin.wordpress.com/2016/06/25/non-mathematical-feature-engineering-techniques-for-data-science/
