Some data mining methods can handle categorical independent variables directly, but many, such as linear regression and neural networks, accept only numerical variables. When using these methods, categorical independent variables must first be converted into numerical ones.
For ordinal independent variables, one of the most common conversions is to replace each category with its sequence number, which turns the variable directly into a numerical one. For nominal independent variables, the most common conversion is to dummy variables. For example, gender can be represented by a single binary dummy variable that takes the value 1 for "female" and 0 for "male".
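The two conversions above can be sketched in a few lines of Python. This is only an illustration: the education levels, their ordering, and the helper names encode_ordinal and encode_binary are assumptions made for the example and do not come from any particular library.

```python
# Hypothetical ordinal variable: education level, listed from lowest to highest.
EDUCATION_ORDER = ["primary", "secondary", "bachelor", "master"]


def encode_ordinal(value, ordered_categories):
    """Return the 1-based sequence number of value in ordered_categories."""
    return ordered_categories.index(value) + 1


def encode_binary(value, positive_category):
    """Return 1 if value equals positive_category, otherwise 0."""
    return 1 if value == positive_category else 0


# Ordinal variable: each category is replaced by its sequence number.
education_codes = [
    encode_ordinal(v, EDUCATION_ORDER) for v in ["secondary", "master", "primary"]
]

# Nominal variable with two values: one binary dummy, 1 for "female", 0 for "male".
gender_dummy = [encode_binary(v, "female") for v in ["female", "male", "female"]]

print(education_codes)  # [2, 4, 1]
print(gender_dummy)     # [1, 0, 1]
```

Note that the sequence-number encoding treats the gaps between adjacent categories as equal, which is an assumption the modeler accepts when choosing it.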
For nominal independent variables with more than two values, a series of binary dummy variables can be generated. For example, mainland China has 31 provinces, autonomous regions and municipalities, which yield 30 dummy variables, with the remaining category serving as the reference level. However, if a nominal independent variable has too many values, generating that many dummy variables can lead to overfitting. A simple and effective remedy is to generate dummy variables only for the categories that contain many observations and to merge the remaining categories into an "other" category. Another remedy is to use domain knowledge to group the categories into a few major classes before generating dummy variables. For example, the 31 provinces, autonomous regions and municipalities of mainland China can be grouped into north, central, east, south, northwest, northeast, southwest and other regions, and dummy variables are then generated for the region.
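A minimal sketch of both remedies follows, using only the Python standard library. The function name dummy_encode, the min_count threshold, the sample data, and the partial province-to-region mapping are all assumptions chosen for illustration; a real mapping would cover all 31 units and the threshold would depend on the dataset.

```python
from collections import Counter


def dummy_encode(values, min_count=2, other_label="other"):
    """Dummy-encode values after merging rare categories into other_label.

    Categories with fewer than min_count observations are merged into
    other_label. The first level in sorted order is dropped as the
    reference level, so k levels yield k - 1 dummy columns.
    """
    counts = Counter(values)
    lumped = [v if counts[v] >= min_count else other_label for v in values]
    levels = sorted(set(lumped))
    dummy_levels = levels[1:]  # drop the reference level
    rows = [[1 if v == level else 0 for level in dummy_levels] for v in lumped]
    return dummy_levels, rows


provinces = ["Guangdong", "Guangdong", "Sichuan", "Sichuan", "Sichuan",
             "Tibet", "Qinghai"]

# Remedy 1: keep frequent categories, merge the rest into "other".
columns, rows = dummy_encode(provinces, min_count=2)
print(columns)  # ['Sichuan', 'other']
print(rows[0])  # [0, 0]  (Guangdong is the reference level)
print(rows[5])  # [0, 1]  (Tibet was merged into "other")

# Remedy 2: group categories with domain knowledge, then encode the groups.
# Illustrative partial mapping only.
PROVINCE_TO_REGION = {"Guangdong": "south", "Sichuan": "southwest"}
regions = [PROVINCE_TO_REGION.get(p, "other") for p in provinces]
region_columns, region_rows = dummy_encode(regions, min_count=1)
print(region_columns)  # ['south', 'southwest']
```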
Time variables cannot enter the modeling dataset directly. Time keeps increasing, so the time values in historical data are bound to differ from those in the future datasets the model will be applied to, and a model built on raw historical time values cannot be applied to future data. To take time into account in the modeling process, the variable must be converted. Common conversions include:
1. Convert to the length of time relative to a reference time, for example "the number of days since a given date" or "the number of weeks until the next Spring Festival".
2. Convert to seasonal information, for example which quarter or month of the year the time falls in, with each quarter or month corresponding to a binary dummy variable.
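The two conversions in the list can be sketched with Python's datetime module. The reference time used here, a customer's sign-up date, and the sample dates are assumptions for illustration; the helper names days_since and quarter_dummies are likewise hypothetical.

```python
from datetime import date


def days_since(d, reference):
    """Conversion 1: length of time between d and a reference date, in days."""
    return (d - reference).days


def quarter_dummies(d):
    """Conversion 2: binary dummies for Q2, Q3 and Q4 (Q1 is the reference level)."""
    quarter = (d.month - 1) // 3 + 1
    return [1 if quarter == q else 0 for q in (2, 3, 4)]


signup_date = date(2020, 1, 1)   # hypothetical reference time
order_date = date(2020, 8, 15)   # hypothetical observation

elapsed = days_since(order_date, signup_date)
season = quarter_dummies(order_date)

print(elapsed)  # 227
print(season)   # [0, 1, 0]  (August falls in Q3)
```

Because the elapsed time is measured relative to a reference that is defined per record, its values stay comparable between historical and future data, unlike the raw date.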
In many cases, several conversions of time can be used together so that all the time information that might affect the dependent variable enters the modeling process. For example, purchases of some foods show both a holiday effect and a seasonal effect, so both of the conversions above are needed.
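As a rough sketch of combining a holiday feature with a seasonal one, the example below builds a single feature row from one date. Spring Festival follows the lunar calendar, so its dates are looked up from a caller-supplied list rather than computed; the list, the function names, and the choice of January as the reference month are assumptions made for this example.

```python
from datetime import date

# Spring Festival dates supplied by the caller (assumed lookup table).
SPRING_FESTIVALS = [date(2024, 2, 10), date(2025, 1, 29), date(2026, 2, 17)]


def weeks_until_next(d, holidays):
    """Holiday effect: whole weeks from d until the next holiday on or after d."""
    upcoming = [h for h in sorted(holidays) if h >= d]
    if not upcoming:
        raise ValueError("no holiday on or after the given date")
    return (upcoming[0] - d).days // 7


def month_dummies(d):
    """Seasonal effect: binary dummies for February..December (January is the reference)."""
    return [1 if d.month == m else 0 for m in range(2, 13)]


def time_features(d, holidays):
    """Combine both conversions into one feature row."""
    return [weeks_until_next(d, holidays)] + month_dummies(d)


features = time_features(date(2024, 12, 25), SPRING_FESTIVALS)
print(features[0])   # 5  (35 days until 2025-01-29)
print(features[-1])  # 1  (December dummy)
```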
Reprinted from: http://www.cangfengzhe.com/sjwj/2895.html
Data mining: handling categorical independent variables and time variables