With the increasing adoption of Spark for scaling ML pipelines, the ability to install and deploy our own R libraries on the cluster becomes especially important if we want to use UDFs.
In my previous post, I talked about scaling our ML pipelines in R with the use of SparkR UDFs.
Today I am going to discuss setting up a virtual environment for our SparkR runs, ensuring that the runtime dependencies and libraries are installed on the cluster.
For any Spark cluster, we can either install R and the required libraries on all the nodes in a one-size-fits-all fashion, or create virtual environments as required.
In my case, we have a Cloudbreak cluster with non-sudo access only to the edge node for submitting Spark jobs; none of the other cluster nodes are accessible.
Because of these constraints, I cannot install R or any of its dependencies system-wide, on either the edge node or the cluster. …
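One common way around this constraint (a sketch, not our exact setup) is to build R and the required libraries into a relocatable archive on the edge node, for example with a conda environment packed by conda-pack, and have Spark ship it to the executors at submit time. Assuming a YARN cluster and an archive named `r_env.tar.gz`, the session configuration could look roughly like this:

```r
# Sketch: point SparkR at a packaged R environment shipped to the cluster.
# Assumes "r_env.tar.gz" was built on the edge node (e.g. with conda-pack)
# and contains R plus the libraries our UDFs need. The archive name and
# app name are illustrative, not from our actual pipeline.
library(SparkR)

sparkR.session(
  appName = "forecasting-pipeline",
  sparkConfig = list(
    # Ship the archive; YARN unpacks it into each container's working
    # directory under the alias after the "#".
    "spark.yarn.dist.archives" = "r_env.tar.gz#r_env",
    # Tell executors to use the shipped R binary when running UDFs.
    "spark.r.command" = "./r_env/bin/Rscript"
  )
)
```

Because the archive is distributed per application, different jobs can ship different environments without touching the cluster nodes at all.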
Scaling machine learning algorithms in R with SparkR
To scale the machine learning algorithms currently running in R, we recently rewrote the entire data preprocessing and machine learning pipeline in SparkR.
We are using the time series models STL and Holt for forecasting. The data is preprocessed by casting date columns to date datatypes, imputing missing values, standardizing, and finally sorting by the relevant columns to prepare input for the ML algorithms. Since I was unfamiliar with R, the major challenge was understanding the data manipulations performed in the existing setup and finding the corresponding transformations in SparkR. …
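To illustrate the shape of such a pipeline, here is a minimal sketch of running a per-group Holt forecast as a SparkR UDF with `gapply`. The column names (`item_id`, `month`, `sales`) and the 12-month horizon are my own assumptions for the example, not details of our actual code:

```r
library(SparkR)

# Assumed input: a SparkDataFrame `df` with columns item_id (string),
# month (date) and sales (double).
schema <- structType(
  structField("item_id", "string"),
  structField("step", "integer"),
  structField("forecast", "double")
)

result <- gapply(
  df,
  "item_id",
  function(key, pdf) {
    # This function runs on the executors, so the `forecast` package
    # must be present in the R environment shipped to the cluster.
    library(forecast)
    # Sort within the group before building the time series.
    ts_data <- ts(pdf$sales[order(pdf$month)], frequency = 12)
    fc <- holt(ts_data, h = 12)  # 12-step-ahead Holt forecast
    data.frame(
      item_id = key[[1]],
      step = seq_len(12),
      forecast = as.numeric(fc$mean)
    )
  },
  schema
)
```

Each group's rows are collected into a local R `data.frame` (`pdf`) on an executor, so any package used inside the function is exactly the kind of runtime dependency the virtual environment has to provide.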