Setting up Isolated Virtual Environments in SparkR
Motivation
With the increasing adoption of Spark for scaling ML pipelines, being able to install and deploy our own R libraries becomes especially important if we want to use UDFs.
In my previous post, I talked about scaling our ML pipelines in R with the use of SparkR UDFs.
Today I am going to discuss setting up a virtual environment for our SparkR run, ensuring that the runtime dependencies and libraries are available on the cluster.
Constraints
For any Spark cluster, we can either install R and the required libraries on all the nodes in a one-size-fits-all fashion, or create isolated virtual environments as required.
In my case, we have a Cloudbreak cluster with non-sudo access only to the edge node, which is used for submitting Spark jobs. None of the other cluster nodes are accessible.
Due to these constraints, I cannot install R or any of its dependencies system-wide on either the edge node or the cluster nodes.
Generating the environment
Since we were already running our ML algorithms in R, we had a Docker image with R and all the ML libraries installed. I created a new image with Spark (v2.3.0, the same version as the Cloudbreak cluster) installed on top of it.
Successfully executing the SparkR implementation of the ML algorithms (with a smaller dataset) in this container ensured that I could use this R installation directory to set up the virtual environment on the Cloudbreak cluster.
Since we cannot install R directly on the Cloudbreak cluster due to the permission constraints, I intended to ship the R installation directory from the container to the edge node.
install_spark.sh: Shell script for installing Scala and Spark.
# Install Scala 2.11.8 (required by Spark 2.3.0)
yum -y install wget
wget --no-check-certificate https://www.scala-lang.org/files/archive/scala-2.11.8.tgz
tar xvf scala-2.11.8.tgz
mv scala-2.11.8 /usr/lib
ln -sf /usr/lib/scala-2.11.8 /usr/lib/scala
export PATH=$PATH:/usr/lib/scala/bin
# Install Spark 2.3.0, matching the version on the Cloudbreak cluster
wget --no-check-certificate https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar xvf spark-2.3.0-bin-hadoop2.7.tgz
mkdir /usr/local/spark
cp -r spark-2.3.0-bin-hadoop2.7/* /usr/local/spark
export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar
ln -sf /usr/bin/python3 /usr/bin/python
export PATH=$PATH:/usr/local/spark/bin
# Installation directory of R on the container: /usr/lib64/R
Dockerfile: Creates a new image with Spark installed, based on the existing image with R and the ML libraries installed.
FROM <image_with_R_and_ML_libs_installed>:latest
COPY install_spark.sh ./
RUN bash install_spark.sh
ENV SPARK_EXAMPLES_JAR="/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar"
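A minimal sketch of building and validating the image; the image tag r-ml-spark is purely illustrative:
# Build the new image with Spark layered on top of the existing R/ML image
docker build -t r-ml-spark:latest .
# Start a container and run the SparkR implementation against a smaller dataset
docker run -it r-ml-spark:latest bash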
Bootstrapping the environment
SparkR running in Local Mode
I created a folder sparkr_packages in the edge node's home directory and copied the R installation directory and the packages from the container into it.
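One possible way to do this copy, assuming Docker is available on the machine running the container; the container id, user, and host names are placeholders:
mkdir -p sparkr_packages
# Copy the R installation (/usr/lib64/R in the container) out of the container
docker cp <container_id>:/usr/lib64/R sparkr_packages/
# Ship the directory to the edge node home directory
scp -r sparkr_packages <user>@<edge_node_host>:~/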
We also need to set some required environment variables.
export PATH=$HOME/sparkr_packages/R/bin:$PATH
export R_LIBS=$HOME/sparkr_packages/R/library
export RHOME=$HOME/sparkr_packages/R
export R_HOME=$HOME/sparkr_packages/R
The R installation requires certain compile-time dependencies which are not needed after the installation. Since we have already installed and validated R on the container, we do not need these dependencies on the edge node.
We still need the runtime dependencies that are required during Rscript execution. Without these libs present, starting the R console fails with an error like this:
$HOME/sparkr_packages/R/bin/exec/R: error while loading shared libraries: libtre.so.5: cannot open shared object file: No such file or directory
In my case, I needed libtre.so.5 and libpcre2-8.so.0 on the edge node.
These libs are also present in the container at /usr/lib64/. Just like the R installation directory, I copied them to sparkr_packages on the edge node.
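If you are unsure which runtime libs are needed, ldd on the R binary inside the container lists the shared objects it links against; anything reported as "not found" on the edge node has to be shipped along. A sketch, with the container id, user, and host as placeholders:
# Inside the container: list the shared libraries the R binary links against
ldd /usr/lib64/R/bin/exec/R
# Copy the required runtime libs and ship them to sparkr_packages on the edge node
docker cp <container_id>:/usr/lib64/libtre.so.5 sparkr_packages/
docker cp <container_id>:/usr/lib64/libpcre2-8.so.0 sparkr_packages/
scp sparkr_packages/libtre.so.5 sparkr_packages/libpcre2-8.so.0 <user>@<edge_node_host>:~/sparkr_packages/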
We need to set LD_LIBRARY_PATH to point to this location so that the R runtime can load these libs. Alternatively, we can add these libs to R/libs to make them available to the R runtime.
export LD_LIBRARY_PATH=$HOME/sparkr_packages:$LD_LIBRARY_PATH
We can now start the SparkR console in local mode and run the UDF to validate the installation on the edge node.
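As a quick smoke test, we can also run a trivial UDF through spark-submit in local mode; the script name test_udf.R and the doubling UDF below are illustrative stand-ins for the real ML code:
cat > test_udf.R <<'EOF'
library(SparkR)
sparkR.session()
df <- createDataFrame(data.frame(x = 1:10))
# Trivial UDF executed by the R runtime we copied into sparkr_packages
res <- dapplyCollect(df, function(p) { p$y <- p$x * 2; p })
print(res)
sparkR.session.stop()
EOF
spark-submit --master "local[2]" test_udf.R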
SparkR running in Cluster Mode
To use UDFs with SparkR running in cluster mode, the R installation directory and the runtime dependencies must be present on all the executors, and the corresponding environment variables must be set on each of them.
We can use the spark-submit option --archives to ship the zipped sparkr_packages directory to all the executors.
--archives: Takes a comma-separated list of archives to be extracted into the working directory of each executor.
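To prepare the archive, we simply zip the sparkr_packages directory on the edge node (assuming zip is available there); the #environment fragment used with --archives below controls the directory name the archive is extracted under in each executor's working directory:
cd $HOME
zip -r sparkr_packages.zip sparkr_packages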
We can set environment variables such as R_HOME, LD_LIBRARY_PATH and PATH for each executor by passing the config spark.executorEnv.<property_name>=<property_value> to spark-submit.
Finally, start the SparkR session
sparkR --master yarn \
  --conf spark.executorEnv.RHOME=./environment/sparkr_packages/R \
  --conf spark.executorEnv.R_HOME_DIR=./environment/sparkr_packages/R \
  --conf spark.executorEnv.PATH=./environment/sparkr_packages/R/bin:$PATH \
  --conf spark.executorEnv.LD_LIBRARY_PATH=./environment/sparkr_packages:$LD_LIBRARY_PATH \
  --num-executors 10 \
  --executor-cores 3 \
  --executor-memory 10g \
  --archives sparkr_packages.zip#environment
Conclusion
Setting up a virtual environment like this is a bit cumbersome, as we have to manually maintain the R executables and packages.
Still, this approach served us quite well and allowed us to set up a virtual environment without access to the cluster’s nodes.