Build a Spark Machine Learning Model with KNIME (2): Titanic Survival Forecast


In this article, we use KNIME to build a Spark decision tree model: we train on the Titanic training data set, which contains the characteristics of the passengers, obtain a decision tree survival model, and then test the model against the test data set.

1. Download the training and test data sets from the Kaggle website

2. In KNIME, create a new workflow named Titanicknimespark

3. Read the training data set

KNIME supports reading data from a Hadoop cluster; to keep things simple, we read the data sets directly from the local file system instead.

In the search box of the Node Repository, type "CSV Reader", locate the CSV Reader node, and drag it onto the canvas.

Double-click or right-click the CSV Reader node to open its configuration dialog and set the path of the data set file.

Right-click the node, click Execute, then right-click it again and select File Table to view the result.
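
For reference, the following Python sketch does roughly what the CSV Reader node does here: it loads the Kaggle training file from a local path. The path data/titanic/train.csv is only an assumption; use whatever location you downloaded the file to.

```python
import pandas as pd

# Load the Kaggle training set from a local path (the path is an
# assumption; adjust it to your download location).
train_df = pd.read_csv("data/titanic/train.csv")

# Quick look at the data, similar to viewing the File Table in KNIME.
print(train_df.shape)
print(train_df.head())
```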


4. Handle missing values with the Missing Value node

As in step 3, find the Missing Value node and drag it onto the canvas (the remaining nodes are added in the same way, so this is not repeated), and set its options as needed; here missing values are filled with a simple column mean. Connect the CSV Reader node to the Missing Value node.

Right-click the node, click Execute, then right-click it again and select Output Table to view the result.
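
For comparison, here is a minimal Python sketch of the same mean-based imputation; it assumes the train_df frame from the previous sketch and only touches the numeric columns.

```python
# Fill missing numeric values (e.g. Age) with the column mean, roughly
# what the Missing Value node is configured to do in this step.
numeric_cols = train_df.select_dtypes(include="number").columns
train_df[numeric_cols] = train_df[numeric_cols].fillna(train_df[numeric_cols].mean())

# Verify that the numeric columns no longer contain missing values.
print(train_df[numeric_cols].isna().sum())
```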


5. Add the Create Spark Context node and configure the Spark context
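
In code terms, the Create Spark Context node provides a handle to a Spark environment. The sketch below starts a local SparkSession purely for illustration; in KNIME the node would normally point at a remote cluster, so the local master setting is an assumption.

```python
from pyspark.sql import SparkSession

# Start a local Spark session for illustration; the KNIME node would
# usually connect to a remote cluster instead.
spark = (SparkSession.builder
         .appName("Titanicknimespark")
         .master("local[*]")
         .getOrCreate())
```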


6. Add the Table to Spark node, which converts the KNIME data table into a Spark DataFrame/RDD. Configure the node, connect the Missing Value node to the Table to Spark node, and connect the Create Spark Context node to the Table to Spark node.

The default configuration is used here.
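
With the pandas frame and the Spark session from the earlier sketches, the hand-over that the Table to Spark node performs looks roughly like this; keeping only the columns used later is an assumption made for simplicity.

```python
# Convert the local table into a Spark DataFrame, which is what the
# Table to Spark node does with the KNIME table. Only the columns used
# in the later steps are kept here (an assumption).
cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
spark_train = spark.createDataFrame(train_df[cols])
spark_train.printSchema()
```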


7. Add the Spark Normalizer node to convert the Survived column from a numeric type to a string type. Configure the Spark Normalizer node and connect the Table to Spark node to the Spark Normalizer node.

Right-click the node, click Execute, then right-click it again and select Normalized Spark DataFrame/RDD to view the result.
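
The point of this conversion is to make Survived a categorical label rather than a number, so the decision tree treats it as a classification target. A rough Spark equivalent, assuming the spark_train frame from the previous sketch:

```python
from pyspark.sql.functions import col

# Recode the Survived column from a number to a string so the learner
# treats it as a class label instead of a continuous value.
spark_train = spark_train.withColumn("Survived", col("Survived").cast("string"))
```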


8. Add the Spark Decision Tree Learner node, configure the decision tree parameters, and connect the Spark Normalizer node to the Spark Decision Tree Learner node.

Right-click the node, click Execute, then right-click it again and select Decision Tree Model to view the result.
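
Below is a minimal sketch of the training step using Spark MLlib. The chosen feature columns (Pclass, Age, SibSp, Parch, Fare) and the tree depth are assumptions; in KNIME the columns and parameters are set in the node dialog.

```python
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index the string label and assemble the feature vector, then fit a
# decision tree. The feature choice and maxDepth are assumptions.
label_indexer = StringIndexer(inputCol="Survived", outputCol="label").fit(spark_train)
assembler = VectorAssembler(
    inputCols=["Pclass", "Age", "SibSp", "Parch", "Fare"],
    outputCol="features")

train_vec = assembler.transform(label_indexer.transform(spark_train))
tree_model = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                    maxDepth=5).fit(train_vec)

# Print the learned tree, similar to viewing the Decision Tree Model.
print(tree_model.toDebugString)
```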


9. Test the model with the test data set and the Spark Predictor node

Copy the CSV Reader, Missing Value, and Table to Spark nodes, and follow steps 3, 4, and 6 to read the test data set, handle its missing values, and convert it to Spark. Add the Spark Predictor node, configure it, and connect both the newly added Table to Spark node and the Spark Decision Tree Learner node to the Spark Predictor node.

The CSV Reader node is configured to read the test data set.

The Spark Predictor node configuration specifies the name of the prediction column.

Right-click the node, click Execute, then right-click it again and select Labeled Data to view the result.
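
Sketched in code, this step prepares the test set the same way as the training set and then applies the trained model, which is what the Spark Predictor node does. The file path and the kept columns are assumptions, and the sketch reuses the assembler and tree_model from the previous sketch.

```python
import pandas as pd

# Read and impute the test set as in steps 3 and 4 (the path is an
# assumption), then hand it over to Spark as in step 6.
test_df = pd.read_csv("data/titanic/test.csv")
numeric_cols = test_df.select_dtypes(include="number").columns
test_df[numeric_cols] = test_df[numeric_cols].fillna(test_df[numeric_cols].mean())

test_cols = ["PassengerId", "Pclass", "Age", "SibSp", "Parch", "Fare"]
spark_test = spark.createDataFrame(test_df[test_cols])

# Apply the trained decision tree to the test data.
predictions = tree_model.transform(assembler.transform(spark_test))
predictions.select("PassengerId", "prediction").show(5)
```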


10. You can add further nodes to post-process the results; here only the Spark Column Filter node is added, to filter out unwanted columns.

Add the Spark Column filter node and configure it.

Right-click the node, click Execute, then right-click it again and select Filtered Spark DataFrame/RDD to view the result.
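
In Spark terms, the column filter is just a select; the sketch below keeps PassengerId and the prediction column, which is an assumed choice.

```python
# Keep only the columns of interest, which is what the Spark Column
# Filter node does; the kept columns here are an assumption.
result = predictions.select("PassengerId", "prediction")
result.show(10)
```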

Finally, the complete workflow is shown in the figure below.

[Figure: the complete Titanic KNIME/Spark workflow]
