Apache Beam

Published Oct 16, 2018

INTRODUCTION

Apache Beam is an evolution of the Dataflow model created by Google to process massive amounts of data. The name Beam (Batch + strEAM) comes from the idea of having a unified model for both batch and stream data processing. Programs written with Beam can be executed on different processing frameworks (via runners) and can read from and write to a variety of systems (via a set of IOs).

On January 10, 2017, Beam got promoted as a Top-Level Apache Software Foundation project. It was an important milestone that validated the value of the project, legitimacy of its community, and heralded its growing adoption. In the past year, Apache Beam has experienced tremendous momentum, with significant growth in both its community and feature set.

Why do we actually need Apache Beam when we already have Hadoop/Spark/Flink?

Well, many frameworks such as Hadoop, Spark, Flink, and Google Cloud Dataflow already exist, but there has been no unified API that binds all these frameworks and data sources together and abstracts the application logic from the big data ecosystem. The Apache Beam framework provides that abstraction between your application logic and the big data ecosystem. For a comparison of Spark vs Beam, check this.

Hence, you do not need to worry about the following aspects when writing your data processing application:

Data source — The data source can be batches, micro-batches, or streaming data.
SDK — You may choose the SDK (Java or Python) that you are comfortable with to program your logic.
Runner — Once the application logic is written, you may choose one of the available runners (Apache Spark, Apache Flink, Google Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), or Apache Samza) to run your application, based on your requirements.

This is how Beam lets you write your application logic once, without mixing your code with input-specific or runner-specific parameters.
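As a rough illustration of this separation, the runner and its parameters come in through PipelineOptions (typically parsed from command-line flags) rather than being hard-coded into the application logic. A minimal sketch, assuming flags such as --runner=DirectRunner or --runner=DataflowRunner are passed at launch time:

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// The application logic never names a runner directly; the choice is read
// from the command line, e.g. --runner=DirectRunner or --runner=FlinkRunner.
// Here, 'args' are the String[] arguments of your main method.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();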

Before we start implementing our Beam application, we need to become familiar with some core ideas that will be used all the time later on. There are five main concepts in Beam: Pipeline, PCollection, PTransform, ParDo, and DoFn.

Pipeline: A Pipeline encapsulates the workflow of your entire data processing tasks from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
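A minimal driver program built around those options might look like the following sketch (the class name is just a placeholder):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyBeamApp {
  public static void main(String[] args) {
    // Execution options tell the Pipeline where and how to run.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // The Pipeline encapsulates the whole workflow from reading input to writing output.
    Pipeline pipeline = Pipeline.create(options);

    // ... read input, apply transforms, write output ...

    // Submit the pipeline to the chosen runner and block until it completes.
    pipeline.run().waitUntilFinish();
  }
}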

PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
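For example, reading a fixed text file with the built-in TextIO source yields a bounded PCollection of lines. A small sketch, continuing from the pipeline created above (the file path is a placeholder):

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Reading from a fixed file produces a bounded PCollection<String>,
// with one element per line of the file.
PCollection<String> lines =
    pipeline.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"));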

PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
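Built-in transforms such as Count follow exactly this pattern: applied to an input PCollection, they produce a new output PCollection. A short sketch, continuing from the lines collection above:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Count.perElement() is a PTransform that takes a PCollection<String> as input
// and produces a PCollection of (element, occurrence count) pairs as output.
PCollection<KV<String, Long>> counts = lines.apply("CountLines", Count.perElement());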

ParDo: ParDo is a Beam transform for generic parallel processing. The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.

DoFn: A DoFn applies your logic to each element of the input PCollection and lets you populate the elements of an output PCollection. To be included in your pipeline, it is wrapped in a ParDo PTransform.
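Putting the two together, here is a sketch of a DoFn wrapped in a ParDo that splits each input line into words and emits one output element per word (the splitting logic is only an illustration):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// The DoFn holds the per-element user code; ParDo.of(...) wraps it into a PTransform.
static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Emit zero, one, or many output elements for each input element.
    for (String word : c.element().split("\\s+")) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

PCollection<String> words = lines.apply("ExtractWords", ParDo.of(new ExtractWordsFn()));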

The Google Cloud Dataflow Runner uses the Cloud Dataflow managed service. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform. When using Java, you must specify your dependency on the Dataflow Runner in your pom.xml.

<dependency>
    <groupId>com.google.cloud.dataflow</groupId>
    <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
    <version>2.5.0</version>
</dependency>
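With that dependency on the classpath, the Dataflow-specific options can also be set programmatically instead of via command-line flags. A rough sketch, assuming the org.apache.beam.runners.dataflow classes are available and using placeholder values for the project and bucket:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Placeholder values; replace with your own GCP project and Cloud Storage bucket.
DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-gcp-project");
options.setTempLocation("gs://my-bucket/temp");

// Pipelines created with these options are submitted to the Cloud Dataflow service when run.
Pipeline pipeline = Pipeline.create(options);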