
Spark & R: Downloading data and Starting with SparkR using Jupyter notebooks

Published Sep 17, 2015 · Last updated Mar 07, 2017

In this tutorial we will use the 2013 American Community Survey dataset and start up a SparkR cluster using IPython/Jupyter notebooks. Both steps are necessary in order to go any further with Spark and R using notebooks. After downloading the files we will have them locally and won't need to download them again. However, we will need to initialise the cluster in each notebook in order to use it.

In the next tutorial, we will use our local files to load them into SparkSQL data frames. This will open the door to exploratory data analysis and linear methods in future tutorials.

All the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.

Instructions

As we already mentioned, for this series of tutorials/notebooks, we have used Jupyter with the IRkernel R kernel. You can find
installation instructions for your specific setup here.

A good way of using these notebooks is by first cloning the repo, and then
starting your Jupyter notebook server in pySpark mode. For example,
if we have a standalone Spark installation running on our localhost with a
maximum of 6GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific
installation. So, as a requirement, you need to have
Spark installed on
the same machine where you are going to start the IPython notebook server.

For more Spark options see here. In general, the rule is that an option
described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when
calling IPython/pySpark.

Datasets

2013 American Community Survey dataset

Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million
households are asked detailed questions about who they are and how they live. Many topics are covered, including
ancestry, education, work, transportation, internet use, and residency. You can go directly to
the source
in order to learn more about the data and get files for different years, longer periods, individual states, etc.

In any case, the starting up notebook
will download the 2013 data locally for later use with the rest of the notebooks.

The idea of using this dataset came from it being recently announced on Kaggle
as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site, while sharing your results with other Kaggle
users. Highly recommended!

Getting and Reading Data

Let's first download the data files using R as follows.

population_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip'
housing_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip'

library(RCurl)
    Loading required package: bitops

population_data_file <- getBinaryURL(population_data_files_url)
housing_data_file <- getBinaryURL(housing_data_files_url)

Now we want to persist the files, so we don't need to download them again in further notebooks.

population_data_file_path <- '/nfs/data/2013-acs/csv_pus.zip'
population_data_file_local <- file(population_data_file_path, open = "wb")
writeBin(population_data_file, population_data_file_local)
close(population_data_file_local)

housing_data_file_path <- '/nfs/data/2013-acs/csv_hus.zip'
housing_data_file_local <- file(housing_data_file_path, open = "wb")
writeBin(housing_data_file, housing_data_file_local)
close(housing_data_file_local)
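
Notice that writeBin expects the destination folder to already exist. If you are reproducing this with a different path, a minimal sketch to create the folder beforehand (the path here is just the one used in these notebooks; adapt it to your own setup) could be:

# create the destination folder if it is not already there
data_dir <- '/nfs/data/2013-acs'
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)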

From the previous step we got two zip files, csv_pus.zip and csv_hus.zip. We can now unzip them.

data_file_path <- '/nfs/data/2013-acs'
unzip(population_data_file_path, exdir=data_file_path)

unzip(housing_data_file_path, exdir=data_file_path)
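
If we want a quick sanity check of what was extracted, we can simply list the contents of the data folder:

# list the files extracted from both zip archives
list.files(data_file_path)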

Once you unzip the contents of both files you will see six files in total. Each zip contains three files: a PDF explanatory document and two data files in csv format. Each housing/population data set is divided into two pieces, "a" and "b" (where "a" contains states 1 to 25 and "b" contains states 26 to 50). Therefore:

  • ss13husa.csv: housing data for states from 1 to 25.
  • ss13husb.csv: housing data for states from 26 to 50.
  • ss13pusa.csv: population data for states from 1 to 25.
  • ss13pusb.csv: population data for states from 26 to 50.

We will work with these four files in our notebooks.
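
Even though we will load these files into SparkSQL data frames in the next notebook, a quick peek using plain R can be handy to get familiar with their structure. This is just an illustrative sketch; we read only a few rows, since the full files are quite large.

# read just the first rows of one of the population files to inspect its columns
pop_sample <- read.csv(file.path(data_file_path, 'ss13pusa.csv'), nrows = 5)
dim(pop_sample)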

Starting up a SparkR cluster

In further notebooks, we will explore our data by loading them into SparkSQL data frames. But first we need to initialise a SparkR cluster and use it to create a SparkSQL context.

The first thing we need to do is to set up some environment variables and library paths as follows. Remember to replace the value assigned to SPARK_HOME with your Spark home folder.

Sys.setenv(SPARK_HOME='/home/cluster/spark-1.5.0-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
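
As an optional sanity check (a minimal sketch, assuming a standard Spark binary distribution layout), we can verify that the SparkR library folder actually exists under that path:

# should return TRUE if SPARK_HOME points to a valid Spark installation
file.exists(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib', 'SparkR'))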

Now we can load the SparkR library as follows.

library(SparkR)
Attaching package: 'SparkR'

The following object is masked from ‘package:RCurl’:
    
  base64
    
The following objects are masked from ‘package:stats’:
    
 filter, na.omit
    
The following objects are masked from ‘package:base’:
    
 intersect, rbind, sample, subset, summary, table, transform
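
These masking messages just mean that SparkR defines functions with the same names as some base and stats functions. If we ever need the original versions after loading SparkR, we can still call them explicitly through their package namespace:

# the masked originals remain reachable via the :: operator
base::intersect(c(1, 2, 3), c(2, 3, 4))
stats::na.omit(c(1, NA, 3))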

And now we can initialise the Spark context as in the official documentation. In our case we are using a standalone Spark cluster with one master and seven workers. If you are running Spark in local mode, just use master='local'.

sc <- sparkR.init(master='spark://169.254.206.2:7077')
    Launching java with spark-submit command /home/cluster/spark-1.5.0-bin-hadoop2.6/bin/spark-submit   sparkr-shell /tmp/RtmpPm0py4/backend_port29c24c141b34 

And finally we can start the SparkSQL context as follows.

sqlContext <- sparkRSQL.init(sc)
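
As a side note, sparkR.init can also take Spark properties directly through its sparkEnvir argument, so settings like the executor memory we passed on the command line earlier can be given from R instead. A minimal sketch, assuming the same cluster master (the memory value is just an example):

# stop the current context before re-initialising with different settings
sparkR.stop()

# equivalent initialisation passing the executor memory as a Spark property
sc <- sparkR.init(master='spark://169.254.206.2:7077',
                  sparkEnvir=list(spark.executor.memory='6g'))
sqlContext <- sparkRSQL.init(sc)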

Conclusions

And that's it. Once we get this going, we are ready to load data into SparkSQL data frames. We will do this in the next notebook.

By using R on Spark, we will get the power of Spark clusters into our regular R workflow. In fact, as we will see in the following tutorials, the SparkR implementation tries to use the same function names we normally use with regular R data frames.

And finally, remember that all the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.
