SPSS Clementine data mining (2)

Source: Internet
Author: User

Using the Target Mail data in the Adventure Works database as an example, we build a classification tree and a neural network: the classification tree predicts who will respond to a promotion, and the neural network predicts annual income.

The Target Mail data is stored in the dbo.vTargetMail view of the SQL Server sample database AdventureWorksDW. For details about Target Mail, see:

http://technet.microsoft.com/zh-cn/library/ms124623.aspx#DataMining

Or my earlier post:

http://www.cnblogs.com/esestt/archive/2007/06/06/773705.html

1. Define the data source

Add the Database source component to the data flow design area, double-click the component, and set the data source to the dbo.vTargetMail view.

 

On the Types tab, click "Read Values" to read each field's Type and Values automatically.

 

Values lists the values a field takes. In this dataset, for example, NumberCarsOwned ranges from 0 to 4, and HouseOwnerFlag has only two values, 1 and 0. Type is the field type, which is determined from Values: Flag holds exactly two values, similar to a Boolean; Set holds a finite number of values, similar to an enumeration; Range holds continuous values, similar to a float. Knowing the type and values of each field tells us which fields can serve as predictors. Fields such as AddressLine, Phone, and DateFirstPurchase are useless for this purpose, because their values are unordered and carry no meaning for the prediction.
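
To make the three field types concrete, here is a minimal sketch of how a type could be guessed from a column's values. The rule used here (two distinct values is a Flag, all-numeric is a Range, anything else is a Set) is a simplifying assumption for illustration and is not Clementine's actual logic; the sample values are made up.

```python
from numbers import Number


def infer_type(values):
    """Guess a Clementine-style field type from the observed values."""
    distinct = {v for v in values if v is not None}
    if len(distinct) == 2:
        return "Flag"   # exactly two values, like a Boolean
    if distinct and all(isinstance(v, Number) for v in distinct):
        return "Range"  # numeric values, treated as continuous
    return "Set"        # finite list of labels, like an enumeration


# Hypothetical sample values, not taken from the real view.
columns = {
    "HouseOwnerFlag": [1, 0, 1, 1, 0],
    "YearlyIncome": [90000.0, 60000.0, 30000.0, 60000.0, 40000.0],
    "Region": ["Europe", "Pacific", "North America", "Europe", "Pacific"],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```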

Direction indicates how a field is used. "In" corresponds to "Input" in SQL Server, "Out" corresponds to "PredictOnly", and "Both" corresponds to "Predict". "Partition" is used to split the data into groups.

 

 

2. Understand the data

Before modeling, we need to know which fields the dataset contains, how their values are distributed, and whether the fields are correlated. Only with this information can we decide which fields to use, which mining algorithm to apply, and how to set the algorithm's parameters.

Besides the types and values that Clementine reads when the data source is defined, we can also use the output and graph components to explore the data.

 

For example, drag a statistics component and a bar chart component into the data flow design area, connect them to the data source component, configure them, and click the green arrow at the top to execute.

 

After a while, the two components produce a statistical report and a bar chart. Output is saved in the management area (the bar chart is an advanced visualization component, so its output does not appear there). Later, you only need to double-click an output in the management area to reopen the report.
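
The kind of information these two components report can be sketched in plain Python: summary statistics for a continuous field, its correlation with another field, and a text bar chart of a discrete field. The values below are made up for illustration and the function name `pearson` is my own.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical values for two fields of the view.
yearly_income = [10000, 30000, 30000, 60000, 90000, 40000, 20000, 70000]
cars_owned = [0, 1, 1, 2, 4, 2, 0, 3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Statistics report: centre, spread, and range of a continuous field.
summary = {
    "min": min(yearly_income),
    "max": max(yearly_income),
    "mean": mean(yearly_income),
    "std": pstdev(yearly_income),
}
r = pearson(yearly_income, cars_owned)
print(summary, round(r, 3))

# Bar chart: how often each value of a discrete field occurs.
counts = Counter(cars_owned)
for value in sorted(counts):
    print(f"{value}: {'#' * counts[value]}")
```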

 

 

3. Prepare the data

Delete the output and graph components added earlier from the data flow design area.

Add the Filter component from Field Ops to the data flow. Filter lets you remove unnecessary fields.

 

We only need these fields: MaritalStatus, Gender, YearlyIncome, TotalChildren, NumberChildrenAtHome, EnglishEducation, EnglishOccupation, HouseOwnerFlag, NumberCarsOwned, CommuteDistance, Region, Age, and BikeBuyer.
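
Outside Clementine, the same filtering step amounts to keeping a fixed list of columns in every record. A minimal sketch follows; the function name `filter_fields` and the sample record are assumptions for illustration.

```python
# Fields kept by the Filter step, as listed above.
KEEP = (
    "MaritalStatus", "Gender", "YearlyIncome", "TotalChildren",
    "NumberChildrenAtHome", "EnglishEducation", "EnglishOccupation",
    "HouseOwnerFlag", "NumberCarsOwned", "CommuteDistance", "Region",
    "Age", "BikeBuyer",
)


def filter_fields(record, keep=KEEP):
    """Return a copy of the record that contains only the kept fields."""
    return {name: record[name] for name in keep if name in record}


# Hypothetical record with two fields that should be dropped.
row = {
    "Phone": "555-0100",
    "DateFirstPurchase": "2003-01-19",
    "Gender": "M",
    "Age": 42,
    "BikeBuyer": 1,
}
filtered = filter_fields(row)
print(filtered)
```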

Add a Sample component to perform random sampling: 70% of the source data is extracted as the training set, and the remaining 30% serves as the test set.

Be sure to specify a value for the seed. A pseudo-random sequence is fully determined by its seed, so as long as the seed stays the same, the same records are drawn every time.

Because we use two mining models whose input and prediction fields differ, we need to add two Type components to split the data flow.

The decision tree model predicts who will respond to the promotion and buy a bicycle, so the BikeBuyer field is set as the prediction column.
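
The effect of a fixed seed is easy to check with Python's standard `random` module. This is only a sketch of the idea; the function name `sample_flags` is my own, and Clementine's sampler need not work the same way internally.

```python
import random


def sample_flags(n_records, fraction, seed):
    """Mark each record as sampled (True) with the given probability."""
    rng = random.Random(seed)  # private generator, fully fixed by the seed
    return [rng.random() < fraction for _ in range(n_records)]


first = sample_flags(1000, 0.7, seed=42)
second = sample_flags(1000, 0.7, seed=42)

# Same seed, same sequence: both runs mark exactly the same records.
print(first == second)
# Roughly 70% of the records are marked.
print(sum(first))
```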

 

The neural network predicts annual income, so YearlyIncome is set as the prediction field.

 

Sometimes there are too many input fields, which makes training slow. You can use Feature Selection to pick out the fields that have the greatest impact on the prediction field.

Drag Feature Selection out of the Modeling palette, connect it after the Type component used for the neural network model, and click Execute Selection.

 

After training, the Feature Selection model appears in the management area. Right-click the model and select Browse to view its content. The model selected 11 of the 12 fields. These 11 fields have a large impact on annual income, so we only need to use them as input columns.
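
The idea of screening inputs by their relationship to the target can be sketched as follows. This example ranks numeric inputs by absolute Pearson correlation and keeps those above a threshold; that criterion, the threshold of 0.3, the names `pearson` and `select_features`, and the data are all assumptions for illustration, not the method the Feature Selection component actually uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(inputs, target, threshold=0.3):
    """Return input names ordered by importance, dropping weak ones."""
    scores = {name: abs(pearson(col, target)) for name, col in inputs.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical data: income in thousands and three candidate inputs.
yearly_income = [10, 20, 30, 40, 50, 60]
inputs = {
    "Age": [25, 30, 35, 40, 45, 50],
    "NumberCarsOwned": [0, 1, 1, 2, 3, 4],
    "Constant": [1, 1, 1, 1, 1, 1],  # carries no information
}
selected = select_features(inputs, yearly_income)
print(selected)
```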

 

Drag the model from the management area into the data flow design area to replace the original Feature Selection component.

4. Modeling

Add the Neural Net and CHAID model components. In the CHAID component settings, set the Mode item to "Launch interactive session". Click the green arrow at the top to execute the entire data flow.

While training CHAID, Clementine opens the interactive session window. In the interactive session, you can control how the tree grows and prune it to avoid overfitting. Once you are satisfied with the model, click the yellow icon at the top.
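
For readers unfamiliar with CHAID, the core of each growth step is a chi-square test of every candidate predictor against the target, with the most strongly associated predictor chosen for the split. The sketch below shows only that step under simplifying assumptions: full CHAID also merges similar categories and compares adjusted p-values rather than raw statistics, so comparing raw values as done here is only fair between predictors with the same number of categories. The function names and data are made up.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic of a predictor/target contingency table."""
    n = len(target)
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    observed = Counter(zip(predictor, target))
    stat = 0.0
    for r, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def best_split(fields, target):
    """Name of the predictor most strongly associated with the target."""
    return max(fields, key=lambda name: chi_square(fields[name], target))


# Hypothetical records: BikeBuyer and two two-valued predictors.
bike_buyer = [1, 1, 1, 1, 0, 0, 0, 0]
fields = {
    "HouseOwnerFlag": [1, 1, 1, 1, 0, 0, 0, 0],              # separates buyers
    "Gender": ["M", "F", "M", "F", "M", "F", "M", "F"],      # unrelated
}
chosen = best_split(fields, bike_buyer)
print(chosen)
```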

 

After that, two new models appear in the management area. Drag them into the data flow design area to evaluate them.

5. Model evaluation

Modify the Sample component and change Mode to "Discard sample". This discards the 70% of the data that was used to train the model and leaves the remaining 30% for testing. Do not change the seed.
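
Why the seed must stay the same can be shown with a short sketch: with an identical seed, the "include" and "discard" modes select complementary records, so no training record ends up in the test set. The function `sample` and its `mode` argument are assumptions that only imitate the behavior described above.

```python
import random


def sample(records, fraction, seed, mode="include"):
    """Keep the randomly picked records ("include") or the rest ("discard")."""
    if mode not in ("include", "discard"):
        raise ValueError("mode must be 'include' or 'discard'")
    rng = random.Random(seed)
    keep_if_picked = mode == "include"
    # One random draw per record, in record order, for both modes.
    return [r for r in records if (rng.random() < fraction) == keep_if_picked]


record_ids = list(range(100))
train = sample(record_ids, 0.7, seed=7, mode="include")
test = sample(record_ids, 0.7, seed=7, mode="discard")

# Same seed: the two sets never overlap and together cover every record.
print(set(train).isdisjoint(test), len(train) + len(test))

# Different seed: the "test" set may now contain training records.
leaky_test = sample(record_ids, 0.7, seed=8, mode="discard")
print(len(set(train) & set(leaky_test)))
```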

 

Here I only test the CHAID decision tree model. Connect the evaluation components to the CHAID model.

 

After execution, we get the lift chart and the prediction accuracy table ......
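
The two figures behind these outputs can be computed by hand. The sketch below assumes the model returns a score per record (higher means more likely to buy) and uses made-up test data; `accuracy` and `lift_at` are my own helper names.

```python
def accuracy(actual, predicted):
    """Share of records whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def lift_at(actual, scores, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical test set: actual BikeBuyer values and model scores.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(actual, predicted)
lift = lift_at(actual, scores, 0.4)
print(acc, lift)
```

A lift above 1 for the top 40% means that mailing only the highest-scored customers reaches buyers at a higher rate than mailing at random.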

6. Deploy the model

The export component can publish the data flow with Publish. Two files are generated: a PIM file and a PAR file.

The PIM file stores all information about the stream, and the PAR file stores the parameters. With these two files, you can use clemrun.exe to execute the stream. clemrun.exe is the executable of Clementine Solution Publisher, which requires a separate license. By analogy with SSIS, the PIM and PAR files correspond to a dtsx file, and clemrun.exe corresponds to dtexec.exe.
If you want to use the models in other programs, you can use the Clementine Runtime Library (CLEMRTL). Compared with Microsoft's OLE DB for DM, the APIs provided by SPSS are not very convenient for development.
