Apache Spark Java Tutorial: Simplest Guide to Get Started
This article is a complete Apache Spark Java tutorial in which you will learn how to write a simple Spark application. No previous knowledge of Apache Spark is required to follow this guide. Our Spark application will find the most popular words in US YouTube video titles.
First, I introduce Apache Spark: its history, what it is, and how it works. Then you will see how to write a simple Spark application.
History of Apache Spark
Apache Spark originated at UC Berkeley, where a group of researchers recognized the lack of interactivity in MapReduce jobs. Depending on the dataset's size, a large MapReduce job could take hours or even days to complete. Additionally, the whole Hadoop and MapReduce ecosystem was complicated and challenging to learn.
The Apache Hadoop framework was an excellent solution for distributed systems, introducing the parallel programming paradigm on distributed datasets. The worker nodes of a cluster execute computations and aggregate results, producing an outcome. However, Hadoop has a few shortcomings: the system involves an elaborate set-up and is not very interactive. Fortunately, Apache Spark brought simplicity and speed to the picture.
What is Apache Spark?
Apache Spark is a computational engine that can schedule and distribute an application's computation across many tasks. This means your application won't execute sequentially on a single machine. Instead, Apache Spark splits the computation into separate smaller tasks and runs them on different servers within the cluster, maximizing the power of parallelism.
Another critical improvement over Hadoop is speed. Using in-memory storage for intermediate computation results makes Apache Spark much faster than Hadoop MapReduce.
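As a quick illustration of both ideas, here is a minimal sketch (not part of the tutorial project) that runs a small computation in parallel and caches the intermediate result in memory. The class name ParallelismSketch and the numbers are made up for demonstration, and it assumes Spark runs in local mode on your own machine.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class ParallelismSketch {
    public static void main(String[] args) {
        // "local[2]" runs the driver and two worker threads on this machine;
        // on a real cluster the same code is distributed across executors.
        SparkConf conf = new SparkConf().setAppName("parallelismSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Spark splits this dataset into two partitions and maps each partition in parallel.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);
        JavaRDD<Integer> squares = numbers.map(n -> n * n);

        // cache() keeps the intermediate result in memory, so reusing it below
        // does not recompute the whole lineage (a key speed-up over MapReduce).
        squares.cache();

        System.out.println("Sum of squares: " + squares.reduce(Integer::sum));
        System.out.println("Count: " + squares.count());

        sc.close();
    }
}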
Architecture with examples
Apache Spark uses a master-slave architecture, meaning one node coordinates the computations that execute on the other nodes.
The master node is the central coordinator that runs the driver program. The driver program splits a Spark job into smaller tasks and executes them across many distributed workers. The driver program communicates with the worker nodes through a SparkSession.
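To make this concrete, here is a minimal sketch of a driver program connecting to a cluster. It is only illustrative: the class name DriverSketch is made up, it uses SparkSession from the spark-sql module, and "local[3]" simply runs three worker threads on your own machine instead of a real cluster. In the tutorial program below we create the equivalent connection with a SparkConf and a JavaSparkContext.

import org.apache.spark.sql.SparkSession;

public class DriverSketch {
    public static void main(String[] args) {
        // The driver program connects to the cluster through a SparkSession.
        // On a real cluster you would pass the cluster manager's master URL
        // instead of "local[3]".
        SparkSession spark = SparkSession.builder()
                .appName("driverSketch")
                .master("local[3]")
                .getOrCreate();

        // The underlying SparkContext is what schedules tasks on the executors.
        System.out.println("Connected as application: " + spark.sparkContext().appName());

        spark.stop();
    }
}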
Write an Apache Spark Java Program
And finally, we arrive at the last step of the Apache Spark Java tutorial: writing the code of the Apache Spark Java program. So far, we have created the project and downloaded a dataset, so you are ready to write a Spark program that analyses this data. Specifically, we will find the most frequently used words in trending YouTube titles.
More Detailed Explanation of the Code Below
import org.apache.commons.lang3.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class YoutubeTitleWordCount {

    private static final String COMMA_DELIMITER = ",";

    public static void main(String[] args) {
        // Silence Spark's verbose INFO logging so the word counts stay readable.
        Logger.getLogger("org").setLevel(Level.ERROR);

        // CREATE SPARK CONTEXT
        // "local[3]" runs the job with three worker threads on this machine.
        SparkConf conf = new SparkConf().setAppName("wordCounts").setMaster("local[3]");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // LOAD DATASETS
        JavaRDD<String> videos = sparkContext.textFile("data/youtube/USvideos.csv");

        // TRANSFORMATIONS
        // Extract the title column and drop rows where no title could be parsed.
        JavaRDD<String> titles = videos
                .map(YoutubeTitleWordCount::extractTitle)
                .filter(StringUtils::isNotBlank);

        // Normalize each title and split it into individual words.
        JavaRDD<String> words = titles.flatMap(title -> Arrays.asList(title
                .toLowerCase()
                .trim()
                .replaceAll("\\p{Punct}", "")
                .split(" ")).iterator());

        // COUNTING
        Map<String, Long> wordCounts = words.countByValue();
        List<Map.Entry<String, Long>> sorted = wordCounts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .collect(Collectors.toList());

        // DISPLAY
        for (Map.Entry<String, Long> entry : sorted) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }

        sparkContext.close();
    }

    public static String extractTitle(String videoLine) {
        // The title is the third comma-separated column of USvideos.csv.
        try {
            return videoLine.split(COMMA_DELIMITER)[2];
        } catch (ArrayIndexOutOfBoundsException e) {
            return "";
        }
    }
}
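One small note on the output: Map.Entry.comparingByValue() sorts in ascending order, so the most popular words are printed last. If you would rather see them first, you can reverse the comparator, for example:

// Sort descending instead, so the most frequent words come first.
List<Map.Entry<String, Long>> sorted = wordCounts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .collect(Collectors.toList());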
I hope you enjoyed this article, and thank you so much for reading and supporting this blog!