Data classification: Feature processing

Source: Internet
Author: User

Question 1: How should continuous and discrete features be handled when both are present?
This question was asked on Quora: "What are good ways to deal with problems where you have both discrete and continuous features?" The main idea is to give the discrete features special treatment. One of the answers uses the following example:

x = [price (continuous feature), category (discrete feature)]; y = number of products sold  
[35.99 Red]  
[42.95 Green]  
[10.50 Red]  
[74.99 Blue]

Since the category of the products above has 3 possible values, we can replace the category feature with 3 dummy variables. This is the same one-hot encoding used in natural language processing. After this step, each sample can be represented by a 4-dimensional vector:

[35.99 1 0 0]  
[42.95 0 1 0]  
[10.50 1 0 0]   
[74.99 0 0 1]
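
The encoding above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from the original answer; the function name `one_hot_encode` and the choice to order the indicator columns by first appearance (Red, Green, Blue) are assumptions made for the example.

```python
# Toy rows from the example above: (price, category).
rows = [(35.99, "Red"), (42.95, "Green"), (10.50, "Red"), (74.99, "Blue")]

def one_hot_encode(rows):
    """Replace the category in each row with one 0/1 indicator per category."""
    # Assumption: indicator columns follow the order of first appearance.
    categories = []
    for _, cat in rows:
        if cat not in categories:
            categories.append(cat)

    encoded = []
    for price, cat in rows:
        indicators = [1 if cat == c else 0 for c in categories]
        encoded.append([price] + indicators)
    return categories, encoded

categories, encoded = one_hot_encode(rows)
for vec in encoded:
    print(vec)
```

Each printed vector has one price entry followed by three indicators, exactly one of which is 1.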

After preprocessing, these features can be used for regression and other machine learning tasks. With linear regression, we need to learn 5 weights: one for each feature, plus the bias, which can also be regarded as a weight, giving $w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_0$. In addition, each feature column should be standardized, i.e. $(x^{col}_i - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that column. In fact, two dimensions are also enough to represent the three categories, as follows:

[35.99 0 0]  
[42.95 0 1]  
[10.50 0 0]   
[74.99 1 0]  
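
The standardization step and the linear model described above can be sketched as follows. This is an illustrative example under stated assumptions, not the original author's code: the helper names `standardize` and `predict` are made up, the population standard deviation is used for σ, and the weights and bias are hypothetical values rather than learned ones.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return (x - mu) / sigma for every value in one feature column."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in column]

def predict(weights, bias, features):
    """Linear model: w1*x1 + w2*x2 + w3*x3 + w4*x4 + w0."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# One-hot encoded samples from the example: [price, Red, Green, Blue].
X_raw = [
    [35.99, 1, 0, 0],
    [42.95, 0, 1, 0],
    [10.50, 1, 0, 0],
    [74.99, 0, 0, 1],
]

# Standardize column by column, then rebuild the rows.
scaled_columns = [standardize(list(col)) for col in zip(*X_raw)]
X = [list(row) for row in zip(*scaled_columns)]

# Hypothetical weights and bias; a real model would learn these from y.
weights = [0.5, 1.0, 2.0, 3.0]
bias = 10.0
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

Four feature weights plus the bias give the 5 parameters mentioned in the text.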

In general, when the features include both continuous and discrete ones, almost all discrete features are handled this way. There is one big problem, however: when a discrete feature can take many values, the processed feature vector becomes very high-dimensional (a general property of the one-hot representation) and very sparse. For storage and computation, you can use a matrix library that supports sparse representations (for example, Armadillo has a sparse matrix type).
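
To illustrate why a sparse representation helps, here is a small sketch that stores only the nonzero entries of each row as a `{column: value}` dictionary. It is a stand-in for a real sparse matrix library, not an example of Armadillo's API; the helper names `to_sparse_row` and `sparse_dot` and the weight values are hypothetical.

```python
def to_sparse_row(price, category, category_index):
    """Store the price and the single active indicator as {column: value}."""
    # Column 0 holds the price; columns 1.. hold the one-hot indicators.
    return {0: price, 1 + category_index[category]: 1.0}

def sparse_dot(sparse_row, weights):
    """Dot product that touches only the stored (nonzero) entries."""
    return sum(value * weights[col] for col, value in sparse_row.items())

category_index = {"Red": 0, "Green": 1, "Blue": 2}
rows = [(35.99, "Red"), (42.95, "Green"), (10.50, "Red"), (74.99, "Blue")]
sparse_rows = [to_sparse_row(p, c, category_index) for p, c in rows]

# Hypothetical weights for [price, Red, Green, Blue].
weights = [0.5, 1.0, 2.0, 3.0]
scores = [sparse_dot(r, weights) for r in sparse_rows]
print(sparse_rows)
```

Each row stores two entries no matter how many categories exist, so the storage and the cost of the dot product do not grow with the number of distinct category values.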

Some comparisons of different regression models:
1. 7 Types of Regression Techniques you should know
2. Types of regressions. Which one to use?
3. Regression Analysis using Python
4. Scikit Learn Logistic regression


From: http://yongyuan.name/blog/feature-engineering-note.html
