Model Selection
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also known as tuning. Tuning can be done for individual estimators such as logistic regression, or for an entire Pipeline that includes multiple algorithms, featurization, and other steps. Users can tune the whole Pipeline at once rather than tuning each element in the Pipeline separately.
MLlib supports model selection using tools such as CrossValidator and TrainValidationSplit. These tools require the following items (a minimal sketch assembling them appears after the list):
Estimator: the algorithm or Pipeline to tune
Set of ParamMaps: the parameters to choose from, sometimes called a "parameter grid" to search over
Evaluator: a metric that measures how well a fitted Model performs on held-out test data
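As a minimal sketch, the three pieces above might be assembled for a binary classification task as follows; the `training` DataFrame and the specific parameter values are illustrative assumptions, not part of the original text.

```python
# Sketch: assembling an Estimator, a parameter grid, and an Evaluator
# into a CrossValidator. Assumes a DataFrame `training` with
# "features" and "label" columns already exists.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(maxIter=10)                      # Estimator

paramGrid = (ParamGridBuilder()                          # Set of ParamMaps
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())

evaluator = BinaryClassificationEvaluator()              # Evaluator

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)
```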
At a high level, these model selection tools work as follows (a sketch of the workflow appears after these steps):
They split the input data into separate training and test datasets.
For each (training, test) pair, they iterate through the set of ParamMaps:
For each ParamMap, they fit the Estimator with those parameters, obtain the fitted Model, and evaluate the Model's performance using the Evaluator.
They select the Model produced by the best-performing set of parameters.
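The same workflow can be sketched with TrainValidationSplit, which uses a single (training, validation) split instead of multiple folds; the `lr`, `paramGrid`, and `evaluator` objects from the sketch above and the `training`/`test` DataFrames are assumed for illustration.

```python
# Sketch: fitting over the parameter grid and keeping the best model.
# Reuses `lr`, `paramGrid`, `evaluator`, `training`, and `test` from above.
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)  # 80% training, 20% validation

# fit() evaluates every ParamMap and retains the best-performing model
tvsModel = tvs.fit(training)

# The selected model can then be applied to held-out test data
predictions = tvsModel.transform(test)
print(evaluator.evaluate(predictions))
```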
The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. The default metric used to choose the best ParamMap can be overridden by the setMetricName method in each of these evaluators.
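For illustration, a hedged sketch of choosing an evaluator per problem type and overriding its default metric; the metric names shown are the standard ones accepted by each evaluator.

```python
# Sketch: choosing an evaluator per problem type and overriding the
# default metric used to select the best ParamMap.
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)

reg_eval = RegressionEvaluator()                  # default metric: "rmse"
bin_eval = BinaryClassificationEvaluator()        # default metric: "areaUnderROC"
multi_eval = MulticlassClassificationEvaluator()  # default metric: "f1"

# setMetricName overrides the default metric
bin_eval.setMetricName("areaUnderPR")
multi_eval.setMetricName("accuracy")
reg_eval.setMetricName("mae")
```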
To help construct the parameter grid, users can use the ParamGridBuilder utility.
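As a brief sketch, ParamGridBuilder builds the cross product of the values supplied for each parameter; the logistic-regression parameters and values below are illustrative assumptions.

```python
# Sketch: a grid over two hyperparameters yields 2 x 3 = 6 ParamMaps.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LogisticRegression(maxIter=10)

paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .build())

print(len(paramGrid))  # 6
```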