Motivation

With the increasing adoption of Spark for scaling ML pipelines, the ability to install and deploy our own R libraries on the cluster becomes especially important if we want to use UDFs.
In my previous post, I talked about scaling our ML pipelines in R using SparkR UDFs.
Today I am going to discuss setting up a virtual environment for our SparkR run, ensuring that the runtime dependencies and libraries are installed on the cluster.
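As a rough illustration of the idea, here is a minimal sketch of one way to ship R dependencies to the executors, assuming a YARN cluster and a pre-built tarball of the required R libraries; the archive path, alias, and app name are hypothetical, not the exact setup described in this post:

```r
library(SparkR)

sparkR.session(
  appName = "sparkr-udf-deps",
  sparkConfig = list(
    # Ship a packed R library directory to every executor and unpack it as ./rlibs
    # (hypothetical HDFS path and alias)
    "spark.yarn.dist.archives"      = "hdfs:///envs/r-libs.tar.gz#rlibs",
    # Point the executor-side R processes at the shipped library path
    "spark.executorEnv.R_LIBS_USER" = "./rlibs"
  )
)
```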

Constraints

For any Spark cluster, we can either install R and the required libraries on all the nodes in the cluster, in a one…


Scaling machine learning algorithms in R with SparkR

To scale our machine learning algorithms currently running in R, we recently rewrote the entire data preprocessing and machine learning pipeline in SparkR.

We use the STL and Holt time series models for forecasting. The preprocessing casts date columns to date datatypes, imputes missing values, standardizes numeric columns, and finally sorts by the relevant columns to prepare the input for the ML algorithms. Since I was unfamiliar with R, the major challenge was understanding the data manipulations performed in the existing setup and finding the corresponding transformations in SparkR. …
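To give a flavour of what those transformations look like, here is a rough SparkR sketch covering the steps above and a per-group Holt forecast run through a UDF. The column names (store_id, week, sales), the local input data frame, and the forecast horizon are assumptions for illustration, not the actual pipeline:

```r
library(SparkR)

# Hypothetical input: one row per (store_id, week) with a numeric 'sales' column
sdf <- createDataFrame(local_sales_df)

# Cast the date column to a date datatype
sdf <- withColumn(sdf, "week", to_date(sdf$week, "yyyy-MM-dd"))

# Simple imputation: replace missing sales with 0
sdf <- fillna(sdf, list(sales = 0))

# Standardize the sales column using collected mean and standard deviation
stats <- collect(agg(sdf, mu = mean(sdf$sales), sigma = stddev(sdf$sales)))
sdf   <- withColumn(sdf, "sales_std", (sdf$sales - stats$mu) / stats$sigma)

# Sort by the columns relevant to the downstream ML step
sdf <- arrange(sdf, "store_id", "week")

# Run a Holt forecast per store through a SparkR UDF; the 'forecast' package
# must be installed on the executors, which is exactly why library deployment matters
result_schema <- structType(
  structField("store_id", "string"),
  structField("horizon", "integer"),
  structField("forecast", "double")
)

forecasts <- gapply(
  sdf, "store_id",
  function(key, pdf) {
    library(forecast)
    pdf <- pdf[order(pdf$week), ]
    fit <- holt(ts(pdf$sales_std), h = 4)   # 4-step-ahead Holt forecast
    data.frame(store_id = key[[1]],
               horizon  = 1:4,
               forecast = as.numeric(fit$mean))
  },
  result_schema
)

head(forecasts)
```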

Shubham Raizada

Senior Software Engineer @WalmartLabs Bangalore
