IBM SPSS Modeler and Database Integration modeling and optimization
In the first two parts of this three-part series on IBM SPSS Modeler and Database Integration modeling and optimization, we talked about basic operations and integration modeling related to databases, and this section focuses on performance optimization in the process of using the database for SPSS Modeler integration, Includes the introduction of a variety of functions and the use of skills:
Data manipulation loopback
Model Loopback
Database Plug-ins-Rating service adapters
Calling database functions
Rearrange execution order
Summary of performance promotion skills
This paper assumes that readers are familiar with how to establish database connection and some other basic operations in SPSS Modeler, such as creating data streams, editing nodes and so on.
SQL Loopback Functionality
SQL Loopback is one of the most powerful features of IBM SPSS Modeler, which performs many data preparation and data mining operations directly in the database. As the SQL loopback name literally means, its rationale is to execute it by generating an SQL statement that can be executed inside the database and then loopback to the database. By executing in a database without having to perform on the same computer as IBM SPSS Modeler or IBM SPSS Modeler Server, you can reduce the amount of data that needs to be transferred (such as sampling, summarizing, Connection and so on) and using the high performance of database parallel processing and indexing function, the whole performance optimization of data mining process is obtained. It is for this reason that the design of SPSS Modeler makes the most possible use of SQL loopback functionality. However, due to different database differences, and standard SQL can express the content constraints, in the SPSS Modeler or a part of the operation can not support SQL loopback. If these features are used in streaming files, they will be executed on the SPSS Modeler server side rather than in the database.
If you need to use the SQL loopback feature, you first need to purchase the appropriate license code for this feature, and then enable Generate SQL, see the Help documentation. Of course, all SQL loopback needs to have one or more database source nodes as the data source.
Data manipulation loopback
Data operations that gain performance gains by loopback to a database can be broadly grouped into two categories that reduce the size of data operations and the operations that can take advantage of database performance processing. The former such as selection (select), Summary (aggregate), sampling (sample), filtering (filter) and other operations, such as sorting (sort), field derivation (derive) and so on. We are here to make a simple introduction to the selection exercise. As shown in Figure 1, write out the Modeler sample file drug1n to the database (use DB2 9.7 here) and then read it from the database.
Figure 1. Data Loopback Sample Flow
Assuming that we do not need all the data, such as only data for people older than 45 years, then the expression in the Select node can be written as "Age > 45", as shown in Figure 2.
Data Loopback Sample Select node
As shown in Figure 1, the top toolbar has a purple "SQL" icon, and when you select an executable node (such as a Table node in the figure) in the operation area of the Modeler, the button becomes clickable and clicks to preview the SQL loopback of the stream branch where the currently selected executable node is located. It does not really perform data operations, it simply generates the SQL statements for the current stream branch for further use by the user as a reference, and we will use this preview function repeatedly to generate SQL statements later in this article. After clicking on this icon, the stream will briefly enter the Non-editable state, generate the stream SQL statements, and show whether the convection branch can be loopback. A node highlighted in purple can be echoed to the database for execution, and other unchanged nodes will be executed locally in SPSS Modeler. As shown in Figure 3, the first two nodes in the lower branch of this example can be loopback to the database execution, and the last node is executed locally.
Data Loopback Sample Preview