Random forest is a machine learning algorithm that is trusted by many data scientists for its robustness, accuracy and scalability.
The algorithm trains multiple decision trees through bootstrap aggregation and then combines their predictions into a single output. Because it is an ensemble of independently trained trees, random forest is well suited to distributed computing environments: trees can be trained in parallel across processes and machines in a cluster, which is much faster than training with a single process.
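For intuition, here is a minimal sketch of bootstrap aggregation on a toy dataset, assuming scikit-learn and NumPy are available (it is not part of the benchmark below). Each tree is fit on its own bootstrap sample, independently of the others, which is exactly what makes the workload easy to parallelize:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=1000)

trees = []
for _ in range(10):
    # Each tree sees a bootstrap sample drawn with replacement from the data.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

# The forest's prediction is the average of the individual trees' predictions.
pred = np.mean([tree.predict(X) for tree in trees], axis=0)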
In this article, we explored the use of Apache Spark to implement distributed random forest training on a cluster of CPU machines, and compared it with the training performance on a cluster of GPU machines using NVIDIA RAPIDS and Dask.
Although GPU computing is traditionally reserved for deep learning applications, RAPIDS is a library that performs data processing and non-deep-learning ML work on the GPU, which can greatly improve performance compared to execution on the CPU.
We trained a random forest model on 300 million instances: Spark took 37 minutes on a 20-node CPU cluster, while RAPIDS took 1 second on a 20-node GPU cluster. That is a GPU speedup of more than 2,000x!
Experiment overview
We use the publicly available New York taxi dataset and train a random forest regressor that predicts the taxi fare amount from attributes of the passenger pickup. Taking the taxi trips from 2017, 2018, and 2019 as the training set gives a total of 300,700,143 instances.
Saturn Cloud can also launch Dask clusters with NVIDIA Tesla V100 GPUs, but we chose g4dn.xlarge for this exercise to keep an hourly cost profile similar to that of the Spark cluster.
Spark
Apache Spark is an open source big data processing engine built in Scala. It has a Python interface and can call Scala/JVM code.
It is an important part of the Hadoop processing ecosystem, built around the MapReduce paradigm, and has interfaces for data frames and machine learning.
Setting up a Spark cluster is beyond the scope of this article, but once the cluster is ready, you can run the following command in Jupyter Notebook to initialize Spark:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .config('spark.executor.memory', '36g')
         .getOrCreate())
The findspark package detects the Spark installation location on the system; if your environment already knows where Spark is installed, you may not need this step.
To get performant Spark code, several configuration settings need to be set, depending on the cluster setup and the workflow. In this case, we set spark.executor.memory to ensure that we do not run into out-of-memory or Java heap errors.
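As an illustration, other settings that often matter for a workload like this include executor cores and shuffle partitions; the values below are placeholders rather than the settings used in the benchmark, so tune them for your own cluster:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .config('spark.executor.memory', '36g')          # heap per executor
         .config('spark.executor.cores', '4')             # cores per executor (placeholder)
         .config('spark.sql.shuffle.partitions', '400')   # partitions used for shuffles (placeholder)
         .getOrCreate())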
RAPIDS
NVIDIA RAPIDS is an open-source Python framework that executes data science code on GPUs instead of CPUs. Similar to the gains seen when training deep learning models, this can bring huge performance improvements to data science work.
RAPIDS has interfaces for data frames, ML, graph analytics, and more. RAPIDS uses Dask to handle parallelization on machines with multiple GPUs, as well as on clusters of machines that each have one or more GPUs.
Setting up a GPU machine can be a bit tricky, but Saturn Cloud has pre-built images for launching GPU clusters, so you can be up and running in just a few minutes! To initialize the Dask client pointing to the cluster, you can run the following command:
from dask.distributed import Client
from dask_saturn import SaturnCluster
cluster = SaturnCluster()
client = Client(cluster)
To set up a Dask cluster yourself, please refer to this docs page: https://docs.dask.org/en/latest/setup.html
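For example, on a single machine with one or more NVIDIA GPUs, a minimal self-managed setup could use dask_cuda's LocalCUDACluster, which starts one Dask worker per GPU; this is just an illustration, not what we used for the benchmark:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# LocalCUDACluster launches one Dask worker per visible GPU on this machine.
cluster = LocalCUDACluster()
client = Client(cluster)
print(client)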
Data loading
The data files are hosted in a public S3 bucket, so we can read the CSVs directly from there. All the files in the S3 bucket are in the same directory, so we use s3fs to select the files we want:
import s3fs

fs = s3fs.S3FileSystem(anon=True)
files = [f"s3://{x}" for x in fs.ls('s3://nyc-tlc/trip data/')
         if 'yellow' in x and ('2019' in x or '2018' in x or '2017' in x)]

cols = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
        'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount',
        'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
With Spark, we need to read each CSV file individually and then combine them together:
import functools
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# Manually specify the schema, because inferSchema in read.csv is very slow
schema = StructType([
    StructField('VendorID', DoubleType()),
    StructField('tpep_pickup_datetime', TimestampType()),
    ...
    # See the notebook for the full schema
])
def read_csv(path):
    df = spark.read.csv(path,
                        header=True,
                        schema=schema,
                        timestampFormat='yyyy-MM-dd HH:mm:ss',
                        )
    df = df.select(cols)
    return df
dfs = []
for tf in files:
    df = read_csv(tf)
    dfs.append(df)

taxi = functools.reduce(DataFrame.unionAll, dfs)
taxi.count()
Using Dask+RAPIDS, we can read all CSV files at once:
import dask_cudf

taxi = dask_cudf.read_csv(files,
                          assume_missing=True,
                          parse_dates=[1, 2],
                          usecols=cols,
                          storage_options={'anon': True})
len(taxi)
Feature engineering
We will generate some time-based features and then persist the data frame. In both frameworks, this triggers all of the CSV loading and preprocessing and stores the results in RAM (GPU RAM in the case of RAPIDS). The features we will use for training are:
features = ['pickup_weekday', 'pickup_hour', 'pickup_minute',
            'pickup_week_hour', 'passenger_count', 'VendorID',
            'RatecodeID', 'store_and_fwd_flag', 'PULocationID',
            'DOLocationID']
For Spark, we need to assemble the features into a vector column:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
taxi = taxi.withColumn('pickup_weekday', F.dayofweek(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_hour', F.hour(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_minute', F.minute(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_week_hour', ((taxi.pickup_weekday * 24) + taxi.pickup_hour).cast(DoubleType()))
taxi = taxi.withColumn('store_and_fwd_flag', F.when(taxi.store_and_fwd_flag == 'Y', 1).otherwise(0))
taxi = taxi.withColumn('label', taxi.total_amount)
taxi = taxi.fillna(-1)
assembler = VectorAssembler(
    inputCols=features,
    outputCol='features',
)
pipeline = Pipeline(stages=[assembler])
assembler_fitted = pipeline.fit(taxi)
X = assembler_fitted.transform(taxi)
X.cache()
X.count()
For RAPIDS, we convert all floating-point values to float32 for GPU computation:
from dask import persist
from dask.distributed import wait
taxi['pickup_weekday'] = taxi.tpep_pickup_datetime.dt.weekday
taxi['pickup_hour'] = taxi.tpep_pickup_datetime.dt.hour
taxi['pickup_minute'] = taxi.tpep_pickup_datetime.dt.minute
taxi['pickup_week_hour'] = (taxi.pickup_weekday * 24) + taxi.pickup_hour
taxi['store_and_fwd_flag'] = (taxi.store_and_fwd_flag == 'Y').astype(float)
taxi = taxi.fillna(-1)
X = taxi[features].astype('float32')
y = taxi['total_amount']
X, y = persist(X, y)
_ = wait([X, y])
len(X)
Training random forest
We only need a few lines of code to train a random forest.
Spark:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(numTrees=100, maxDepth=10, seed=42)
fitted = rf.fit(X)
RAPIDS:
from cuml.dask.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, max_depth=10, seed=42)
_ = rf.fit(X, y)
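The benchmark only times training, but both fitted models can make predictions in the usual way. As a quick sketch (not part of the timed workflow), using each framework's own X from the feature engineering step:
# Spark: transform() appends a 'prediction' column to the assembled data frame.
preds_spark = fitted.transform(X).select('prediction')

# RAPIDS: predict() returns the predictions as a distributed collection.
preds_rapids = rf.predict(X)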
Results
We trained a random forest model on 300,700,143 New York taxi instances on both the Spark (CPU) and RAPIDS (GPU) clusters. Both clusters have 20 worker nodes and roughly the same hourly price. Here are the results for each part of the workflow: