Regression
Regression is the simplest technique to use, but it is probably also the least powerful (funny how those two always go hand in hand). The model can be as simple as one input variable and one output variable (known as a scatter chart in Excel, or an XYDiagram in OpenOffice.org). Of course, it can get far more complex than that and include many input variables. In fact, all regression models follow the same general pattern: a number of independent variables, taken together, produce a result, the dependent variable. The regression model is then used to predict the value of an unknown dependent variable from the given values of the independent variables.
Everyone has probably used or seen a regression model, or even built one in their head. The example that comes to mind right away is pricing a house. The price of a house (the dependent variable) is the result of many independent variables, including the square footage of the house, the size of the lot, whether the kitchen has granite, and whether the bathrooms have been upgraded. So if you have ever bought or sold a house, you have probably created a regression model to price it. You built the model from the prices of comparable houses in the neighborhood (the model), then put the values of your own house into it to produce an expected price.
Let's continue with this house-pricing regression model and work through some real data. These are houses for sale in my neighborhood, and I am trying to find a reasonable price for my own house. I also need the output of this model for my property taxes.
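To make the one-input, one-output case concrete, here is a minimal sketch of a least-squares fit in plain Python. The function names and the three toy points are made up for illustration and are not part of WEKA or of the house data; the points lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of co-deviations of x and y) / (sum of squared deviations of x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the dependent variable for a new value of the independent variable."""
    return slope * x + intercept


# Hypothetical toy points lying exactly on y = 2x + 1
slope, intercept = fit_simple_regression([1, 2, 3], [3, 5, 7])
print(slope, intercept)               # 2.0 1.0
print(predict(slope, intercept, 4))   # 9.0
```

With more than one input variable the idea is the same, but each input gets a coefficient of its own.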
Table 1. Housing values of the regression model
House Area (square feet) | Lot Size | Bedrooms | Granite | Bathroom Upgraded? | Sales Price
3529 | 9191 | 6 | 0 | 0 | $205,000
3247 | 10061 | 5 | 1 | 1 | $224,900
4032 | 10150 | 5 | 0 | 1 | $197,900
2397 | 14156 | 4 | 1 | 0 | $189,900
2200 | 9600 | 4 | 0 | 1 | $195,000
3536 | 19994 | 6 | 1 | 1 | $325,000
2983 | 9365 | 5 | 0 | 1 | $230,000
3198 | 9669 | 5 | 1 | 1 | ????
The good news (or the bad news, depending on your point of view) is that this short introduction only scratches the surface of regression models, and barely even that. There are entire college courses devoted to regression models, and they will teach you more about them than you probably want to know. Still, this introduction gets you familiar enough with the concept for the WEKA trial in this article. If you want to dig deeper into regression models and statistics, use your favorite search engine to look up the following terms: least squares, homoscedasticity, normal distribution, White tests, Lilliefors tests, R-squared, and p-values.
Build a dataset for WEKA
To load data into WEKA, we have to put it in a format that WEKA can understand. WEKA's preferred format for loading data is the Attribute-Relation File Format (ARFF), in which you define the type of the data being loaded and then supply the data itself. In the file, we define each column and what it contains. For regression models, columns can only be NUMERIC or DATE. Finally, each row of data is supplied in comma-separated format. The ARFF file we will use with WEKA is shown below. Note that my own house is not included in the data rows: since we are building the model and the price of my house is unknown, we cannot enter it.
Listing 1. WEKA File Format
@RELATION house

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
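If you want to sanity-check an ARFF file before loading it into WEKA, a few lines of Python are enough for a file this simple. The sketch below is not WEKA's own parser: `parse_arff` is a hypothetical helper that handles only NUMERIC attributes, `%` comment lines, and comma-separated data rows, which is all Listing 1 needs.

```python
ARFF_TEXT = """\
@RELATION house

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
"""


def parse_arff(text):
    """Parse a minimal ARFF document that has NUMERIC attributes only."""
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if in_data:
            # After @DATA every line is one comma-separated instance
            rows.append([float(value) for value in line.split(",")])
        elif line.upper().startswith("@ATTRIBUTE"):
            attributes.append(line.split()[1])  # second token is the column name
        elif line.upper().startswith("@DATA"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
print(attributes)
print(len(rows), "instances")
```

Each parsed row should have exactly as many values as there are attributes; a mismatch usually means a missing comma or a stray column.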
Load data into WEKA
Once the data file is created, you can build the regression model. Start WEKA and choose Explorer. The Explorer screen appears with the Preprocess tab selected. Click Open File and select the ARFF file created in the previous section. After you select the file, WEKA Explorer should look similar to the screenshot in Figure 3.
Figure 3. WEKA after house data loading
In this view, WEKA lets you inspect the data you are working with. The left side of the Explorer window lists all the columns in your data (attributes) and the number of rows supplied (instances). When you select a column, the right side of the Explorer window shows information about that column. For example, select the houseSize column (it should be selected by default), and its statistics appear on the right side of the screen. They show that the maximum value in this column is 4,032 square feet and the minimum is 2,200 square feet. The average size is 3,132 square feet, with a standard deviation of 655 square feet (standard deviation is a statistical measure of how spread out the values are). There is also a visual way to examine the data: click the Visualize All button. Because this dataset has so few rows, the visualization is not as powerful as it would be with more data points (hundreds, for example).
Enough looking at the data. Let's build a model and get a price for my house.
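The column statistics are easy to reproduce outside WEKA. The short sketch below uses Python's standard `statistics` module on the seven houseSize values from the table; `stdev` is the sample standard deviation (dividing by n - 1), which is the variant that yields a figure of about 655 for this column.

```python
import statistics

# houseSize values of the seven houses in the dataset
house_sizes = [3529, 3247, 4032, 2397, 2200, 3536, 2983]

smallest = min(house_sizes)
largest = max(house_sizes)
mean_size = statistics.mean(house_sizes)
std_dev = statistics.stdev(house_sizes)  # sample standard deviation (n - 1)

print("min:", smallest)
print("max:", largest)
print("mean:", round(mean_size))
print("std dev:", round(std_dev))
```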
Create a regression model with WEKA
To create the model, click the Classify tab. The first step is to select the model we want to build, so that WEKA knows how to work with the data and how to create the appropriate model:
- Click the Choose button, then expand the functions branch.
- Select the LinearRegression leaf.
This tells WEKA that we want to build a regression model. As the other choices show, there are many models that could be built. A lot! That is another sign that this article only introduces the topic. One thing is worth noting: the same branch contains another option called SimpleLinearRegression. Do not select it, because simple regression looks at only one independent variable, and we have five. Once the right model is selected, WEKA Explorer should look like Figure 4.
Figure 4. Linear Regression Model in WEKA
Can I do this with a spreadsheet?
Short answer: no. Long answer: yes. Most popular spreadsheet programs cannot easily do what we are doing with WEKA, which is defining a linear model with multiple independent variables. However, you can build a simple linear regression model (one independent variable) quite easily. If you are feeling brave, you can even do a multi-variable regression, but it is confusing and difficult, and definitely not as easy as WEKA. The reference section of this article includes a Microsoft® Excel® resource.
Now that the desired model has been selected, we have to tell WEKA which data to use to build it. Although it may be obvious that we want to use the data supplied in the ARFF file, there are actually several options, some of them far more advanced than what we will use here. The other three options are: Supplied test set, where you provide a different dataset to build the model; Cross-validation, which lets WEKA build models on subsets of the supplied data and then average them to create a final model; and Percentage split, where WEKA uses a percentage subset of the supplied data to build a final model. These other options are useful with different models, as we will see in later articles in this series. For regression, we can simply choose Use training set. This tells WEKA to build the model from the data we supplied in the ARFF file.
The last step in creating the model is to choose the dependent variable (the column we want to predict). In this example, that is the selling price of the house. Right below the test options there is a combo box for choosing the dependent variable. The sellingPrice column should be selected by default. If it is not, select it.
We are now ready to create the model. Click Start. Figure 5 shows the output.
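As a rough picture of how Percentage split and Cross-validation carve up a dataset, the sketch below divides a list of row ids both ways. It is only an illustration under simplifying assumptions: `percentage_split` and `cross_validation_folds` are hypothetical helpers, there is no shuffling, and WEKA's own handling (randomization, choice of folds) is not reproduced here.

```python
def percentage_split(rows, train_percent):
    """Put the first train_percent% of the rows in one part and the rest in another."""
    cut = round(len(rows) * train_percent / 100)
    return rows[:cut], rows[cut:]


def cross_validation_folds(rows, k):
    """Yield k (train, held_out) pairs; every row is held out exactly once."""
    for i in range(k):
        held_out = rows[i::k]  # every k-th row, starting at position i
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, held_out


row_ids = list(range(10))  # stand-in row ids, not real data

train, rest = percentage_split(row_ids, 70)
print(train, rest)

folds = list(cross_validation_folds(row_ids, 5))
for fold_train, fold_held_out in folds:
    print(fold_held_out, "held out;", len(fold_train), "rows used for training")
```

With only seven houses in our dataset, either scheme would leave very few rows in each part, which is one more reason Use training set is the practical choice here.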
Figure 5. Housing Price Regression Model in WEKA
Interpret the regression model
WEKA gets straight to the point: it puts the regression model right in the output, as shown in Listing 2.
Listing 2. Regression output
sellingPrice = (-26.6882 * houseSize) + (7.0551 * lotSize) + (43166.0767 * bedrooms) + (42292.0901 * bathroom) - 21661.1208
Listing 3 shows the result of plugging the values for my house into the model.
Listing 3. House price using the regression model
sellingPrice = (-26.6882 * 3198) + (7.0551 * 9669) + (43166.0767 * 5) + (42292.0901 * 1) - 21661.1208
sellingPrice = 219,328
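The same arithmetic can be checked in a few lines of Python. The coefficients are copied from the regression output in Listing 2; the dictionary layout and the `predict_price` name are just one convenient way to write it down, not something WEKA produces.

```python
# Coefficients copied from the regression output in Listing 2.
# granite has no coefficient because the model does not use it.
COEFFICIENTS = {
    "houseSize": -26.6882,
    "lotSize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}
INTERCEPT = -21661.1208


def predict_price(house):
    """Apply the linear model: sum of coefficient * value, plus the intercept."""
    return sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS) + INTERCEPT


# My house, from the last row of Table 1
my_house = {"houseSize": 3198, "lotSize": 9669, "bedrooms": 5, "granite": 1, "bathroom": 1}

price = predict_price(my_house)
print(f"${price:,.0f}")  # $219,328
```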
However, as noted at the beginning of this article, data mining is not just about outputting a single number: it is about identifying patterns and rules. It is not strictly used to produce an absolute value, but to create a model that lets you detect patterns, predict output, and draw conclusions backed by the data. Let's take a closer look at what our model tells us besides the price of the house:
- Granite doesn't matter: WEKA only uses the columns that contribute statistically to the accuracy of the model (measured in R-squared, which is beyond the scope of this article). It discards and ignores the columns that do not help the model. So this regression model tells us that granite in the kitchen does not affect the value of the house.
- Bathrooms do matter: Because we use a simple 0 or 1 value for the bathroom column, we can use its coefficient in the regression model to determine the effect of an upgraded bathroom on the value of the house. The model tells us that it adds $42,292 to the house value.
- Bigger houses reduce the value: WEKA is telling us that the bigger the house, the lower the selling price? You can see this in the negative coefficient in front of the houseSize variable. The model says that every additional square foot of the house reduces its price by $26? That makes no sense at all. This is America! Bigger is better, especially where I live in Texas. So how should we interpret this? It is a good example of garbage in, garbage out. The size of the house is not really an independent variable: it is related to the bedrooms variable, since bigger houses tend to have more bedrooms. So our model is not perfect. But we can fix that. Remember that you can remove columns from the dataset on the Preprocess tab. For this example, remove the houseSize column and create another model. How does that affect the price of my house? Does the new model make more sense? (The price of my house after this change: $217,894.)
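One quick way to see that house size and bedroom count move together is to compute their Pearson correlation on the seven houses in the dataset. This check is not part of the WEKA workflow described here; it is a plain-Python sketch, and `pearson` is a hand-written helper, added only to illustrate why the two columns are not independent.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# houseSize and bedrooms columns of the seven houses in the dataset
house_sizes = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
bedrooms = [6, 5, 5, 4, 4, 6, 5]

r = pearson(house_sizes, bedrooms)
print(round(r, 2))  # 0.77
```

A value this close to 1 means the two columns carry much of the same information, so the model can trade one coefficient off against the other, which is how a negative coefficient for house size can appear.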
A note for statisticians
This model violates several requirements of a proper linear regression model, because the columns are not truly independent and there are not enough rows of data to produce a valid model. The main purpose of this article is to introduce WEKA as a data mining tool, so the sample data has been greatly simplified.
To take this simple example to the next level, let's look at a data file that the WEKA web site provides as a regression example. In theory, it is much more complex than our simple example of seven houses. The sample data file is meant for building a regression model that predicts a car's fuel economy in miles per gallon (MPG) from several features of the car (keep in mind that the data is from 1970 to 1982). The model includes the following attributes of a car: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and make. The dataset also has 398 rows of data, which is enough to satisfy the statistical requirements that our house price model could not meet. In theory, a regression model this complex could take WEKA a long time to build from so much data (but I am guessing you have already foreseen that WEKA handles this data just fine).
To produce a regression model from this dataset, follow exactly the same steps we used for the house data, so I will not repeat them here. Go ahead and create the regression model. It should produce the output shown in Listing 4.
Listing 4. MPG data regression model
class (aka MPG) =
     -2.2744 * cylinders=6,3,5,4 +
     -4.4421 * cylinders=3,5,4 +
      6.74   * cylinders=5,4 +
      0.012  * displacement +
     -0.0359 * horsepower +
     -0.0056 * weight +
      1.6184 * model=75,71,76,74,77,78,79,81,82,80 +
      1.8307 * model=77,78,79,81,82,80 +
      1.8958 * model=79,81,82,80 +
      1.7754 * model=81,82,80 +
      1.167  * model=82,80 +
      1.2522 * model=80 +
      2.1363 * origin=2,3 +
     37.9165
When you build this model yourself, you will see that WEKA produces it in under a second. So computation is not a problem, even for a powerful regression model with a lot of data. The model may look much more complex than the one for the house data, but it is not. For example, the first line of the regression model, -2.2744 * cylinders=6,3,5,4, means that the term cylinders=6,3,5,4 takes the value 1 if the car's cylinder count is one of the listed values and 0 otherwise. A car with 6 cylinders gets a 1 in this column, and a car with 8 cylinders gets a 0. Let's take one sample row (row 10) from the dataset and plug its values into the regression model to see whether the model's output is close to the value given in the dataset.
Listing 5. Sample MPG data
data = 8,390,190,3850,8.5,70,1,15

class (aka MPG) =
     -2.2744 * 0 +
     -4.4421 * 0 +
      6.74   * 0 +
      0.012  * 390 +
     -0.0359 * 190 +
     -0.0056 * 3850 +
      1.6184 * 0 +
      1.8307 * 0 +
      1.8958 * 0 +
      1.7754 * 0 +
      1.167  * 0 +
      1.2522 * 0 +
      2.1363 * 0 +
     37.9165

Expected Value = 15 mpg
Regression Model Output = 14.2 mpg
So when we test the model with a randomly chosen row of data, it performs very well: for a car whose actual value is 15 MPG, the model predicts 14.2 MPG.
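The 0/1 terms are easier to follow when written as code. The sketch below transcribes the model from Listing 4 into Python and evaluates it for the sample row. It assumes the data row is ordered as cylinders, displacement, horsepower, weight, acceleration, model year, origin, MPG, which is consistent with the substitution in Listing 5; `indicator` and `predict_mpg` are hypothetical names.

```python
def indicator(value, members):
    """Return 1 if value is one of the listed members, else 0."""
    return 1 if value in members else 0


def predict_mpg(cylinders, displacement, horsepower, weight, model, origin):
    """Evaluate the regression model from Listing 4 for one car."""
    return (
        -2.2744 * indicator(cylinders, {6, 3, 5, 4})
        + -4.4421 * indicator(cylinders, {3, 5, 4})
        + 6.74 * indicator(cylinders, {5, 4})
        + 0.012 * displacement
        + -0.0359 * horsepower
        + -0.0056 * weight
        + 1.6184 * indicator(model, {75, 71, 76, 74, 77, 78, 79, 81, 82, 80})
        + 1.8307 * indicator(model, {77, 78, 79, 81, 82, 80})
        + 1.8958 * indicator(model, {79, 81, 82, 80})
        + 1.7754 * indicator(model, {81, 82, 80})
        + 1.167 * indicator(model, {82, 80})
        + 1.2522 * indicator(model, {80})
        + 2.1363 * indicator(origin, {2, 3})
        + 37.9165
    )


# Sample row: 8,390,190,3850,8.5,70,1,15
# (acceleration, 8.5, does not appear in the model; 15 is the actual MPG)
mpg = predict_mpg(cylinders=8, displacement=390, horsepower=190,
                  weight=3850, model=70, origin=1)
print(round(mpg, 1))  # 14.2
```

Because this car has 8 cylinders, model year 70, and origin 1, every indicator term is 0 and only displacement, horsepower, weight, and the constant contribute.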
Conclusion
This article covered the first data mining model, the regression model (specifically, the multi-variable linear regression model), and showed how to use it in WEKA. The regression model is easy to use and can be applied to many datasets. You may find it to be one of the most useful models discussed in this series. However, data mining is not limited to simple regression. Depending on the dataset and the output you need, other models may turn out to be better solutions.
Finally, I want to reiterate that this article and the following articles in this series give only a brief introduction to the fields of statistics and data mining. People who spend a full semester studying statistics and data mining can still only be called "beginners". Our goal is to let beginners fully appreciate the strengths of this freely available open-source tool and to raise their understanding of, and interest in, the value that data mining provides.
Reprinted from: http://www.ibm.com/developerworks/cn/opensource/os-weka1/
The data used in the text is available at the end of the original article.