IBM SPSS Statistics Multi-variable predictive modeling

Last Update: 2015-10-23 Source: Internet

Author: User


1. Application background

1.1 Problems addressed

1) Before each application upgrade, large enterprises test their IT systems in a test environment. How can the validity of those tests be ensured, and how can production performance be inferred from the test results?

2) As resource usage grows, resources such as CPU, memory, hard disk, and I/O interact with one another and may be correlated. How can an enterprise gain insight into these correlations to guide reasonable capacity planning?

3) As the business expands, the load on the production environment increases. How can forecasts of future business volume and user growth help the company estimate the capacity it will need?

4) How can enterprise users be given an automated, adaptive modeling and predictive analysis process that sets up their scenario and adjusts the predictive model automatically, reducing complexity of use? And how can the validity and accuracy of the predictive analysis be ensured?

1.2 Business Value

1) Avoid over-investing in test resources, maximize the value of test resources, and integrate test and production resources.

2) Optimize the utilization of enterprise data center resources, balance the ratio of resources, and provide more accurate performance analysis and capacity planning to save costs.

3) Forecast business growth reasonably, improve the enterprise's insight into future business, and help it develop more complete capacity estimates and contingency plans.

4) Enhance business sustainability and user experience by providing a life-cycle solution based on source data, with automatic selection, modeling, adjustment, and verification.

2. Data preparation

Take a website as an example: before a new service is launched, the results from the test environment are used to predict resource utilization after the service goes live in production. We start with a small scope: for a single server, the relevant metric data is selected for correlation analysis and predictive modeling. For example, from among many servers, one Web server (192.168.119.9) is selected. Data was collected once per minute from 00:00 to 24:00 on January 1, 2013, giving 1440 data points for quantitative analysis.

The main purpose of this article is to predict the future trend of user access frequency, Frequency_user. The relationship between Frequency_user, memory utilization MEM, hard disk utilization DISK, and CPU utilization CPU therefore needs to be considered. The data is merged into a new data file and saved in IBM SPSS Statistics SAV format. As shown in Figure 1, the file contains the following fields: date, time (acquisition unit: minutes), user access frequency Frequency_user (unit: times), memory utilization MEM (unit: %), hard disk utilization DISK (unit: %), and user CPU utilization CPU (unit: %).

Figure 1. Data file variables

3. Using IBM SPSS Statistics

3.1 Multivariate correlation analysis

In this article, partial correlation analysis is used to determine how user access frequency Frequency_user relates to CPU utilization CPU, memory utilization MEM, and disk utilization DISK. When two variables are both related to several other variables, partial correlation analysis removes the effects of those other variables and examines only the correlation between the two variables of interest. It is therefore suited to the multiple variables considered here. For example, to analyze the relationship between Frequency_user and CPU, the effects of MEM and DISK must be removed so that the partial correlation is computed for Frequency_user and CPU alone. The correlation coefficient r is then used to judge whether Frequency_user is linearly correlated with CPU. If the correlation is clearly linear, the relationship is taken directly from the correlation result. If it is not, regression is used to judge the relationship between the target variable and the other variables, that is, the importance of each predictor to the target variable. The multivariate correlation analysis flowchart is shown in Figure 2.

Figure 2. Multivariate correlation analysis flowchart
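To make the idea of "removing the effect of other variables" concrete, the following is a minimal sketch in plain Python, not the SPSS implementation. It uses the standard recursive formula for partial correlation. The function names `pearson` and `partial_corr` and the toy, mean-centered series are assumptions made for illustration; they are not the article's data.

```python
import math

def pearson(x, y):
    """Zero-order (ordinary) Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y after removing the linear effect of each control.

    Uses the recursion
      r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    applied once per control series.
    """
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical toy data: a shared signal plus or minus a memory component.
signal = [1, 1, -1, -1, 1, 1, -1, -1]
mem = [1, -1, 1, -1, 1, -1, 1, -1]
freq = [s + m for s, m in zip(signal, mem)]
cpu = [s - m for s, m in zip(signal, mem)]

print("zero-order r:", pearson(freq, cpu))
print("partial r, controlling for mem:", partial_corr(freq, cpu, [mem]))
```

In this toy data the memory component masks the shared signal, so the partial correlation is higher than the zero-order one. The article's result moves in the same direction (0.622 without controls, 0.771 with MEM and DISK controlled).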

3.1.1 Partial correlation analysis

1) Partial correlation procedure

Open IBM SPSS Statistics and select from the menu: Analyze > Correlate > Partial, which opens the Partial Correlations dialog box, as shown in Figure 3.

Figure 3. Partial Correlation Analysis Interface

In the Partial Correlations dialog box, move Frequency_user and CPU into the Variables box and MEM and DISK into the Controlling for box. The Test of Significance group offers a one-tailed or two-tailed test of the correlation coefficients; this article selects the two-tailed test, as shown in Figure 4.

Figure 4. Selecting variables and Parameters

Click the Options button to open the Partial Correlations: Options dialog box, which sets the statistics to output, as shown in Figure 5. This article selects "Means and standard deviations" and "Zero-order correlations" for Frequency_user, CPU, MEM, and DISK. Click the Continue button to return to the Partial Correlations dialog box.

Figure 5. Partial correlation options

2) Result description

According to the results of the partial correlation analysis, the mean of Frequency_user is 85778.15992 with a standard deviation of 43387.93355; the mean of CPU is 33.84895% with a standard deviation of 9.304364; the mean of MEM is 36.93768% with a standard deviation of 6.954192; and the mean of DISK is 30.71943% with a standard deviation of 13.372261, as shown in Figure 6.

Figure 6. Descriptive statistics

The results of the two correlation analyses are shown in Figure 7. First, with no control variables, the pairwise correlation coefficients among Frequency_user, CPU, MEM, and DISK are listed, together with the two-tailed significance and the degrees of freedom. Second, with MEM and DISK set as control variables, the correlation coefficient between Frequency_user and CPU is listed, again with its significance and degrees of freedom. The two results show that when the effects of MEM and DISK on Frequency_user and CPU are not removed, the correlation coefficient between Frequency_user and CPU is 0.622; when those effects are removed, it is 0.771.

Figure 7. Correlation

The correlation value is usually referred to as the correlation coefficient r. It measures the degree of linear correlation between two variables and lies in [-1, +1]. If 0 < r ≤ 1, the variables are positively correlated; if -1 ≤ r < 0, they are negatively correlated. r = 1 means perfect positive correlation and r = -1 means perfect negative correlation; in both cases there is a functional relationship between the variables. r = 0 means no linear relationship. |r| > 0.8 indicates strong correlation; |r| < 0.3 indicates weak correlation, which can be treated as no correlation. In this article the value for Frequency_user and CPU is 0.771, which falls short of the strong-correlation threshold, so the relationship is studied further by regression analysis.
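The thresholds above can be written as a small helper. This is only a sketch of the rule of thumb stated in the text; the function name `describe_correlation` and the label "intermediate" for values between 0.3 and 0.8 are assumptions chosen for illustration.

```python
def describe_correlation(r):
    """Return (direction, strength) for a correlation coefficient r in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    size = abs(r)
    if size > 0.8:
        strength = "strong"
    elif size < 0.3:
        strength = "weak"
    else:
        strength = "intermediate"  # label assumed for this sketch
    return direction, strength

# The partial correlation reported for Frequency_user and CPU.
print(describe_correlation(0.771))  # ('positive', 'intermediate')
```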

3.1.2 Regression analysis

1) Regression analysis steps

Open IBM SPSS Statistics and select from the menu: Analyze > Regression > Automatic Linear Modeling, which opens the Automatic Linear Modeling dialog box, as shown in Figure 8.

Figure 8. Automatic Linear Modeling interface

In the Automatic Linear Modeling dialog box, select Frequency_user as the target and CPU, DISK, and MEM as the predictors (inputs), as shown in Figure 9.

Figure 9. Automatic Linear Modeling interface

2) Result description

The predictor importance chart shows that CPU accounts for more than 80% of the importance for Frequency_user, while DISK and MEM account for no more than 20%, as shown in Figure 10. This indicates that CPU is the predictor most strongly associated with Frequency_user and has the greatest explanatory power.

Figure 10. Predictor importance
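One way to build intuition for a predictor importance chart is a leave-one-out comparison: fit a linear model with all predictors, refit it with each predictor removed, and normalize the resulting increases in residual error so they sum to 1. The sketch below does this in plain Python. It is an illustration under that assumption and does not claim to reproduce the SPSS computation; the helper names and the toy, mean-centered data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on columns plus intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def importance(predictors, y):
    """Normalized increase in residual error when each predictor is left out."""
    names = list(predictors)
    full = rss([predictors[n] for n in names], y)
    increase = {}
    for name in names:
        reduced = rss([predictors[n] for n in names if n != name], y)
        increase[name] = max(reduced - full, 0.0)
    total = sum(increase.values())
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: value / total for name, value in increase.items()}

# Hypothetical toy data in which the target depends mostly on cpu.
toy = {
    "cpu": [1, -1, 1, -1, 1, -1, 1, -1],
    "mem": [1, 1, -1, -1, 1, 1, -1, -1],
    "disk": [1, 1, 1, 1, -1, -1, -1, -1],
}
target = [3 * c + m for c, m in zip(toy["cpu"], toy["mem"])]
print(importance(toy, target))
```

In this toy data CPU dominates by construction, which mirrors the shape of the chart described above but says nothing about the article's actual values.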

3.2 Predictive Modeling

This article takes user access frequency Frequency_user as the subject of the predictive model. First, the relationship between the target variable Frequency_user and the other variables (CPU utilization CPU, memory utilization MEM, and disk utilization DISK) is determined. Based on the multivariate correlation analysis in section 3.1, CPU utilization is chosen as the related variable for predicting Frequency_user. Second, the optimal predictive model is selected; candidate models are built with the Expert Modeler and with ARIMA. Third, the model parameters are adjusted. Finally, the user judges whether the forecast is satisfactory. If it is, the model is taken as the optimal model. If it is not, the Expert Modeler and all parameters of the ARIMA model are made available to the user, who selects a model and adjusts its parameters, repeating these steps until the forecast is satisfactory. The modeling flowchart for the predictive model is shown in Figure 11.

Figure 11. Modeling flowchart for predictive models

3.2.1 Selecting the optimal predictive model

1) Modeling Steps

Open IBM SPSS Statistics and select from the menu: Analyze > Forecasting > Create Models, which opens the Time Series Modeler dialog box, as shown in Figure 12. In the Time Series Modeler dialog box, select Frequency_user as the dependent variable and CPU as the independent variable to build several predictive models.

Figure 12. Time Series Modeler

On the Statistics tab, select the fit measures to output, for example R-squared, root mean square error (RMSE), and mean absolute percentage error (MAPE). On the Plots tab, select what each chart displays: observed values, forecasts, and fit values. On the Save tab, save the model's predictions to the SAV file and export the predictive model as an XML file. When new data needs to be predicted, the saved model can be applied directly without rebuilding it, as shown in Figure 13. On the Options tab, specify the future point in time up to which to forecast. For example, this article has observations for minutes 1 to 1440; specifying 1500 as the end of the forecast period produces forecasts for minutes 1441 to 1500.

Figure 13. Saving a predictive model

2) Result description

Based on the fitting results, ARIMA(1,1,0) is selected as the optimal predictive model, as shown in Figure 14.

Figure 14. Model description

The fit measures that are output, for example R-squared, root mean square error (RMSE), and mean absolute percentage error (MAPE), are shown in Figure 15. This article uses R-squared, RMSE, and MAPE to assess the prediction results: the closer R-squared is to 1 and the closer MAPE is to 0, the better the model fits. RMSE indicates how widely the observations are dispersed around the model's predictions.

Figure 15. Model statistics
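For readers who want to check these measures by hand, here is a minimal sketch of their textbook definitions in plain Python. SPSS may apply adjustments (for example, a degrees-of-freedom correction in RMSE), so values can differ slightly from Figure 15. The function names and the sample numbers are hypothetical.

```python
import math

def r_squared(observed, fitted):
    """1 minus the ratio of residual variation to total variation."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, fitted):
    """Root mean square error, using a plain divisor of n (an assumption)."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n)

def mape(observed, fitted):
    """Mean absolute percentage error; observed values must be non-zero."""
    n = len(observed)
    return 100 * sum(abs((o - f) / o) for o, f in zip(observed, fitted)) / n

# Hypothetical observed and fitted access counts.
observed = [100, 200, 300, 400]
fitted = [110, 190, 310, 390]
print(r_squared(observed, fitted), rmse(observed, fitted), mape(observed, fitted))
```

R-squared and MAPE are scale-free, which is why the text gives them fixed targets (1 and 0), while RMSE is in the units of Frequency_user and is best compared between models fitted to the same series.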

The observed, forecast, and fit values of Frequency_user are shown in Figure 16. The horizontal axis represents time (interval: minutes), and the vertical axis represents user access frequency Frequency_user (unit: times).

Figure 16. Prediction results of the predictive model

3.2.2 Model parameter adjustment

In the Time Series Modeler dialog box, click the Criteria button, as shown in Figure 17, to adjust the parameters of the predictive model.

Figure 17. Model parameter adjustment

This opens the Time Series Modeler: ARIMA Criteria dialog box. ARIMA(p,d,q) is the autoregressive integrated moving average model: AR stands for autoregressive and p is the number of autoregressive terms; MA stands for moving average and q is the number of moving average terms; d is the number of times the series is differenced to make it stationary. The values of p, d, and q generally lie in [0,2], as shown in Figure 18. Different parameter values can be set for predictive modeling.

Figure 18. ARIMA classification of predictive models
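To show what p = 1, d = 1, q = 0 means in practice, the sketch below fits and forecasts an ARIMA(1,1,0) model in plain Python. The series is differenced once (d = 1), and each difference is regressed on the previous difference (p = 1) by ordinary least squares, with no moving average terms (q = 0). This is a simplified illustration, not the SPSS estimation procedure: it omits the CPU independent variable used in the article, and the function names and sample series are hypothetical.

```python
def fit_arima_110(series):
    """Estimate c and phi in d[t] = c + phi * d[t-1] + e[t], where d is the first difference."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("differences are constant; phi cannot be estimated")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi

def forecast_arima_110(series, c, phi, steps):
    """Forecast future levels by predicting differences and accumulating them."""
    level = series[-1]
    diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(steps):
        diff = c + phi * diff   # next predicted change
        level += diff           # undo the differencing
        forecasts.append(level)
    return forecasts

# Hypothetical per-minute access counts whose growth is slowing.
history = [100, 110, 117, 122.5, 127.25, 131.625]
c, phi = fit_arima_110(history)
print(forecast_arima_110(history, c, phi, steps=2))
```

With 1440 observed minutes, as in this article, calling `forecast_arima_110` with `steps=60` would correspond to forecasting minutes 1441 to 1500.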

4. Conclusion

The intelligent capacity planning management solution predicts user access frequency Frequency_user using the analysis capabilities of IBM SPSS Statistics. On the one hand, it takes into account the relationships among Frequency_user, CPU utilization CPU, memory utilization MEM, and hard disk utilization DISK, analyzes the relationship between user growth and resources, and uses these correlations to guide enterprises toward reliable capacity analysis. On the other hand, it forecasts the trend of Frequency_user, improves the enterprise's insight into future business, and helps it develop more complete capacity estimates and contingency plans.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
