Beginner's Guide to Learn Dimension Reduction Techniques

Introduction

Brevity is the soul of wit

This powerful quote by William Shakespeare applies well to the techniques used in data science and analytics too. Intrigued? Allow me to prove it using a short story.

Recently, we conducted a Data Hackathon (a data science competition) in Delhi-NCR, India.

Register for Data Hackathon 3.0 – The Battle of Survival

We challenged participants to build a model on the Human Activity Recognition Using Smartphones Data Set. The data set had 561 variables for training a model to identify human activity in the test data set.

The participants in the hackathon had varied levels of experience and expertise. As expected, the experts did a commendable job at identifying the human activity. However, beginners and intermediates struggled with the sheer number of variables in the dataset (561 variables). Under the pressure of time, these people tried using variables without really understanding the significance of each variable. They lacked the skill to filter information out of seemingly high dimensional problems and reduce it to a few relevant dimensions – the skill of dimension reduction.

Further, this lack of skill came across in several forms, in the questions asked by various participants:

    • There are too many variables – do I need to explore each and every variable?
    • Are all variables important?
    • All variables are numeric – what if they have multicollinearity? How can I identify such variables?
    • I want to use a decision tree; it can automatically select the right variables. Is this the right technique?
    • I am using random forest, but it is taking a long time to run because of the high number of features.
    • Is there any machine learning algorithm that can identify the most significant variables automatically?
    • As this is a classification problem, can I use SVM with all the variables?
    • Which is the best tool to deal with a high number of variables, R or Python?

If you have faced similar questions, you are reading the right article. In this article, we'll look at various methods to identify significant variables using the most common dimension reduction techniques and methods.

Table of Contents
    1. Why is Dimension Reduction important in machine learning and predictive modeling?
    2. What are Dimension Reduction techniques?
    3. What are the benefits of using Dimension Reduction techniques?
    4. What are the common methods to reduce the number of dimensions?
    5. Is Dimension Reduction good or bad?

Why is Dimension Reduction important in machine learning & predictive modeling?

The problem of an unwanted increase in dimensions is closely related to the fixation of measuring/recording data at a far more granular level than was done in the past. This is in no way suggesting that it is a recent problem; it has started gaining more importance lately due to the surge in data.

Lately, there has been a tremendous increase in the number of sensors being used in industry. These sensors continuously record data and store it for analysis at a later point. In the data captured, there can be a lot of redundancy. For example, take the case of a motorbike rider in a racing competition. Today, his position and movement are measured by a GPS sensor on the bike, gyro meters, multiple video feeds and his smart watch. Because of the respective errors in recording, the data would not be exactly the same. However, there is very little incremental information on position gained from adding these sources. Now assume that an analyst sits with all this data to analyze the racing strategy of the biker – he/she would have a lot of variables/dimensions which are similar and of little (or no) incremental value. This is the problem of unwanted high dimensionality, and it calls for a treatment of dimension reduction.

Let's look at other examples of new ways of data collection:

    • Casinos are capturing data using cameras, tracking each and every move of their customers.
    • Political parties are capturing data by expanding their reach in the field.
    • Your smartphone apps collect a lot of personal details about you.
    • Your set-top box collects data about your program preferences and timings.
    • Organizations are evaluating their brand value through social media engagement (comments, likes), followers, and positive and negative sentiment.

With more variables comes more trouble! And to avoid this trouble, dimension reduction techniques come to the rescue.

What are Dimension Reduction techniques?

Dimension Reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely. These techniques are typically used while solving machine learning problems to obtain better features for a classification or regression task.

Let's look at the image shown below. It shows 2 dimensions, x1 and x2, which are, let us say, measurements of several objects in centimeters (x1) and inches (x2). If you were to use both of these dimensions in machine learning, they would convey similar information and introduce a lot of noise into the system, so you are better off using just one dimension. Here we convert the data from 2D (x1 and x2) to 1D (z1), which makes it relatively easier to explain.

Similarly, we can reduce n dimensions of a data set to k dimensions (k < n). These k dimensions can be directly identified (filtered), or they can be combinations of dimensions (weighted averages of dimensions), or new dimension(s) that represent the existing multiple dimensions well.
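
To make this concrete, here is a minimal sketch in Python (not from the original article) of the cm/inches example above: two nearly redundant columns are projected onto a single new axis z1 using scikit-learn's PCA. The simulated data and the noise level are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical measurements: x1 in centimeters, x2 the same lengths in
    # inches (plus small recording noise), so the columns are almost redundant.
    rng = np.random.default_rng(0)
    x1 = rng.uniform(10, 100, size=50)               # lengths in cm
    x2 = x1 / 2.54 + rng.normal(0, 0.1, size=50)     # same lengths in inches

    X = np.column_stack([x1, x2])

    # Project onto the single direction of maximum variance: 2D -> 1D
    pca = PCA(n_components=1)
    z1 = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)  # ~1.0: one dimension keeps nearly all the information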

One of the most common applications of this technique is image processing. You might have come across the Facebook application "Which celebrity do you look like?". But have you ever thought about the algorithm behind it?

Here's the answer: to identify the matched celebrity image, we use pixel data, and each pixel is equivalent to one dimension. Every image has a high number of pixels, i.e. a high number of dimensions, and every dimension is important here. You can't omit dimensions randomly to make better sense of your overall data set. In such cases, dimension reduction techniques help you find the significant dimension(s) using various methods. We'll discuss these methods shortly.

What are the benefits of Dimension Reduction?

Let's look at the benefits of applying the Dimension Reduction process:

    • It helps in compressing data and reducing the required storage space.
    • It reduces the time required to perform the same computations. Fewer dimensions mean less computing; fewer dimensions can also allow the use of algorithms unfit for a large number of dimensions.
    • It takes care of multicollinearity, which improves model performance. It removes redundant features. For example, there is no point in storing the same value in two different units (meters and inches).
    • Reducing the dimensions of data to 2D or 3D allows us to plot and visualize it precisely, so we can observe patterns more clearly. Below you can see how 3D data is converted into 2D: first the 2D plane is identified, then the points are represented on the new axes z1 and z2.
    • It is also helpful in noise removal, and as a result we can improve the performance of models.

What are the common methods to perform Dimension Reduction?

There are many methods to perform Dimension Reduction. I have listed the most common methods below:

1. Missing Values: While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason, and then impute the missing values or drop the variables using appropriate methods. But what if we have too many missing values? Should we impute them or drop the variables?

I would prefer the latter, because such a variable would not carry much information about the data set, and it would not help in improving the power of the model. Next question: is there any threshold on missing values for dropping a variable? It varies from case to case. If the information contained in the variable is not critical, you can drop the variable if it has more than ~40-50% missing values.
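
As a rough sketch of this rule of thumb (assuming a pandas DataFrame named df; the 40% default cutoff simply follows the range mentioned above):

    import pandas as pd

    def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.40) -> pd.DataFrame:
        """Drop columns whose fraction of missing values exceeds `threshold`."""
        missing_ratio = df.isna().mean()     # per-column fraction of NaNs
        keep = missing_ratio[missing_ratio <= threshold].index
        return df[keep]

    # Toy example: column 'a' is 75% missing and gets dropped
    df = pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, 4]})
    print(drop_sparse_columns(df).columns.tolist())   # ['b']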

2. Low Variance: Let's think of a scenario where we have a constant variable (all observations have the same value, say 5) in our data set. Do you think it can improve the power of the model? Of course not, because it has zero variance. In the case of a high number of dimensions, we should drop variables having low variance compared to the others, because such variables won't explain the variation in the target variables.
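
A minimal sketch of this filter using scikit-learn's VarianceThreshold (the toy matrix and the zero-variance cutoff are illustrative; since variance depends on scale, features are usually normalized before comparing variances):

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[5, 1.0, 10],
                  [5, 1.1, 20],
                  [5, 0.9, 30]])    # first column is constant (zero variance)

    selector = VarianceThreshold(threshold=0.0)   # drop zero-variance features
    X_reduced = selector.fit_transform(X)

    print(selector.get_support())   # [False  True  True]: constant column removed
    print(X_reduced.shape)          # (3, 2)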

3. Decision Trees: This is one of my favorite techniques. It can be used as an all-round solution to tackle multiple challenges such as missing values, outliers, and identifying significant variables. It worked well in our Data Hackathon too: several data scientists used decision trees and it worked well for them.

4. Random Forest: Similar to decision trees is random forest. I would also recommend using the in-built feature importance provided by random forests to select a smaller subset of input features. Just be careful: random forests have a tendency to be biased towards variables that have more distinct values, i.e. they favor numeric variables over binary/categorical variables.
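
Here is a hedged sketch of that idea: fit a random forest on a synthetic classification problem and keep only the features whose impurity-based importance clears a cutoff. The dataset, the 200 trees, and the 0.05 cutoff are all illustrative choices, not values from the article.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=5, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)

    # Keep only features whose impurity-based importance clears the cutoff
    importances = forest.feature_importances_
    selected = np.where(importances > 0.05)[0]
    print("selected feature indices:", selected)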

5. High Correlation: Dimensions exhibiting high correlation can lower the performance of a model. Moreover, it isn't good to have multiple variables carrying similar information or variation, a situation also known as "multicollinearity". You can use a Pearson (continuous variables) or polychoric (discrete variables) correlation matrix to identify the variables with high correlation, and select one of them using the VIF (Variance Inflation Factor). Variables with a higher value (VIF > 5) can be dropped.
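
A minimal sketch of both checks follows; the DataFrame, its columns, and the |r| > 0.9 and VIF > 5 cutoffs are illustrative assumptions, and variance_inflation_factor comes from the statsmodels package:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"meters": rng.uniform(1, 10, 100)})
    df["inches"] = df["meters"] * 39.37 + rng.normal(0, 0.1, 100)  # redundant copy
    df["weight"] = rng.uniform(1, 10, 100)                         # independent

    # 1) Pearson correlation matrix: flag one variable of any pair with |r| > 0.9
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    print("highly correlated:", to_drop)          # ['inches']

    # 2) VIF: values above ~5 indicate multicollinearity
    vif = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    print(dict(zip(df.columns, np.round(vif, 1))))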

6. Backward Feature Elimination: In this method, we start with all n dimensions. We compute the sum of squared errors (SSR) after eliminating each variable in turn (n times). Then we identify the variable whose removal produces the smallest increase in SSR and remove it, finally leaving us with n-1 input features.

Repeat this process until no further variables can be dropped. Recently, in an online hackathon organised by Analytics Vidhya (11-12 June), the data scientist who held second position used Backward Feature Elimination in linear regression to train his model.

In reverse, we can use the "Forward Feature Selection" method. In this method, we select one variable and analyse the performance of the model when adding another variable. Here, the selection of variables is based on the larger improvement in model performance.
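
Both directions are available in scikit-learn's SequentialFeatureSelector (version 0.24 or later). Note one difference from the description above: it scores candidate feature subsets by cross-validated model performance rather than raw SSR. The diabetes dataset and the choice of 5 features are illustrative.

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)     # 10 input features

    # Backward elimination: start with all features, drop the least useful ones
    backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                         direction="backward").fit(X, y)
    print("kept (backward):", backward.get_support())

    # Forward selection: start with none, add the most useful ones
    forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                        direction="forward").fit(X, y)
    print("kept (forward):", forward.get_support())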

7. Factor Analysis: Let's say some variables are highly correlated. These variables can be grouped by their correlations, i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with variables of other group(s). Here each group represents a single underlying construct or factor. These factors are small in number compared to the large number of dimensions. However, these factors are difficult to observe. There are basically two methods of performing factor analysis (a small sketch follows the list below):

    • EFA (Exploratory Factor Analysis)
    • CFA (Confirmatory Factor Analysis)
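
For the exploratory flavor, here is a minimal sketch using scikit-learn's FactorAnalysis; the two hidden factors and six observed variables are simulated purely for illustration:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    f1, f2 = rng.normal(size=(2, 300))    # two hidden factors

    # Six observed variables: three load on f1, three on f2 (plus noise)
    X = np.column_stack([f1, 0.9 * f1, 1.1 * f1,
                         f2, 0.8 * f2, 1.2 * f2]) + rng.normal(0, 0.1, (300, 6))

    fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
    print(np.round(fa.components_, 1))    # loadings: each row is one factor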

8. Principal Component Analysis (PCA): In this technique, variables are transformed into a new set of variables, which are linear combinations of the original variables. This new set of variables is known as principal components. They are obtained in such a way that the first principal component accounts for most of the possible variation in the original data, after which each succeeding component has the highest possible remaining variance.

The second principal component must be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset, there can be only two principal components. Below is a snapshot of the data with its first and second principal components; you can notice that the second principal component is orthogonal to the first. The principal components are sensitive to the scale of measurement, so to fix this issue we should always standardize variables before applying PCA. Note that applying PCA transforms your variables, so the data loses its original meaning; if interpretability of the results is important for your analysis, PCA is not the right technique for your project.
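
A minimal standardize-then-PCA sketch (the Iris dataset and the choice of two components are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)     # 4 original dimensions

    # Standardize first so no variable dominates merely because of its scale
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
    Z = pipeline.fit_transform(X)

    pca = pipeline.named_steps["pca"]
    print(Z.shape)                          # (150, 2)
    print(pca.explained_variance_ratio_)    # share of variance kept per component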

Is Dimension Reduction good or bad?

Recently, we received this question in our Data Science forum. Here's the complete answer.

End Note

In this article, we looked at a simplified view of Dimension Reduction, covering its importance, its benefits, the commonly used methods, and the discretion as to when to choose a particular technique. In a future post, I will write about PCA and Factor Analysis in more detail.

Did you find the article useful? Do let us know your thoughts on this article in the comment box below. I would also like to know which dimension reduction technique you use the most.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.
