Data mining: handling categorical independent variables and time variables

Source: Internet
Author: User

Some data mining methods can handle categorical independent variables directly, but many, such as linear regression and neural networks, accept only numerical variables. To use these methods, categorical independent variables must first be converted into numerical ones.

For ordinal independent variables, one of the most common conversions is to replace each category with its rank in the ordering, which yields a numeric variable. For nominal independent variables, the most common conversion is to create dummy variables. For gender, for example, you can generate one binary dummy variable that takes the value 1 for "female" and 0 for "male".
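
The two conversions above can be sketched in a few lines of Python. This is only a minimal illustration: the `education` variable, its category ordering, and the helper names `encode_ordinal` and `encode_binary` are assumptions made up for the example, not part of the original article.

```python
# Hypothetical ordinal variable: categories listed from lowest to highest.
EDUCATION_ORDER = ["primary", "secondary", "bachelor", "master"]


def encode_ordinal(value, ordered_categories):
    """Return the 1-based rank of `value` within `ordered_categories`."""
    return ordered_categories.index(value) + 1


def encode_binary(value, positive_label):
    """Return 1 if `value` equals `positive_label`, otherwise 0."""
    return 1 if value == positive_label else 0


# Two made-up observations used only to demonstrate the helpers.
records = [
    {"education": "bachelor", "gender": "female"},
    {"education": "primary", "gender": "male"},
]

encoded = [
    {
        "education_rank": encode_ordinal(r["education"], EDUCATION_ORDER),
        "gender_female": encode_binary(r["gender"], "female"),
    }
    for r in records
]
print(encoded)
```

Note that rank encoding treats the gaps between adjacent categories as equal, which may or may not suit a given variable.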

For nominal independent variables with more than two values, a series of binary dummy variables can be generated. For example, mainland China has 31 provinces, autonomous regions and municipalities, which yield 30 dummy variables, with the one remaining category serving as the baseline. However, if a nominal independent variable has too many values, generating too many dummy variables can lead to overfitting. A simple and effective remedy is to generate dummy variables only for the categories that contain many observations and to assign the remaining categories to an "other" category. Another approach is to use domain knowledge to group the categories into a few broad classes and then generate dummy variables for those classes. For example, the 31 provinces, autonomous regions and municipalities can be grouped into regions such as north, central, east, south, northwest, northeast and southwest, and dummy variables can then be generated for the region.
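
As a rough sketch of both remedies, the Python below collapses infrequent categories into "other", maps provinces to regions, and then generates one dummy variable per category except a baseline. The sample data, the `min_count` threshold, the partial `REGION_OF` mapping, and the helper names are illustrative assumptions rather than anything prescribed by the article.

```python
from collections import Counter


def collapse_rare(values, min_count, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def dummy_encode(values, baseline):
    """Build one 0/1 column per category, leaving out the `baseline` category."""
    categories = sorted(set(values) - {baseline})
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


# Made-up sample: two frequent provinces and two that occur only once.
provinces = ["Guangdong", "Guangdong", "Guangdong",
             "Jiangsu", "Jiangsu", "Tibet", "Qinghai"]

# Remedy 1: keep frequent categories, send the rest to "other" (the baseline).
collapsed = collapse_rare(provinces, min_count=2)
columns, rows = dummy_encode(collapsed, baseline="other")

# Remedy 2: group by domain knowledge (partial mapping, for illustration only).
REGION_OF = {"Guangdong": "south", "Jiangsu": "east",
             "Tibet": "southwest", "Qinghai": "northwest"}
regions = [REGION_OF[p] for p in provinces]
region_columns, region_rows = dummy_encode(regions, baseline="south")

print(columns, rows[0], rows[-1])
print(region_columns, region_rows[0], region_rows[-1])
```

Leaving out a baseline category is what turns k categories into k - 1 dummy variables, matching the 31-to-30 example above.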

Time variables cannot enter the modeling dataset directly. Time keeps increasing without bound, so the time values in historical data will always differ from those in the future datasets to which the model is applied, and a model built directly on raw historical time values cannot be applied to future data. To use time variables in modeling, you must convert them. There are several common conversions:

1. Convert to the length of time relative to a reference time, for example "the number of days since month xx, day xx, year xx" or "the number of weeks until the next Spring Festival".

2. Convert to seasonal information, for example the quarter or month of the year, with each quarter or month corresponding to a binary dummy variable.
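
The two conversions in this list might look like the following in Python, using only the standard `datetime` module. The helper names, the feature names, and the sample purchase date are assumptions for illustration; the reference date is the 2024 Spring Festival (10 February 2024), hard-coded here because the standard library has no lunar-calendar support.

```python
from datetime import date


def relative_time(event_day, reference_day):
    """Days and whole weeks from `event_day` until `reference_day`."""
    days = (reference_day - event_day).days
    return {"days_to_reference": days, "weeks_to_reference": days // 7}


def quarter_of(day):
    """Quarter of the year (1 to 4) that `day` falls in."""
    return (day.month - 1) // 3 + 1


def period_dummies(value, n_periods, prefix):
    """One 0/1 variable per period 2..n_periods; period 1 is the baseline."""
    return {f"{prefix}_{k}": int(value == k) for k in range(2, n_periods + 1)}


purchase_day = date(2023, 12, 30)    # made-up observation date
spring_festival = date(2024, 2, 10)  # reference date, supplied by hand

features = {}
features.update(relative_time(purchase_day, spring_festival))
features.update(period_dummies(quarter_of(purchase_day), 4, "quarter"))
print(features)
```

For monthly rather than quarterly seasonality, `period_dummies(purchase_day.month, 12, "month")` produces the eleven month dummies instead. Using quarter and month dummies together would be redundant, since the quarter is fully determined by the month.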

In many cases, several transformations of time can be used together, so that all the time information that might affect the dependent variable enters the model. For example, purchases of some foods show both a holiday effect and a seasonal effect, so both of these conversions are needed.

Reprint Address: http://www.cangfengzhe.com/sjwj/2895.html
