Enabling Spark UI and Ganglia for EMR Cluster

Published Aug 01, 2019

If you are here, you are probably already running an EMR cluster and trying to figure out which metrics you can monitor to tune your cluster's resource usage.

What is EMR?
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis. It offers an expandable, low-configuration service as an easier alternative to running cluster computing in-house.

EMR Console
Screenshot 2019-07-31 at 16.18.31.png
This is a typical EMR console.

While running a cluster, especially when a pipeline has just been developed, there is a lot of scope to improve cluster resource usage. A few issues that the Spark UI and Ganglia can help solve are:

  • Fix resource allocation errors (believe me, this will probably be your first issue on EMR)
  • Monitor memory, CPU, and network usage
  • Identify chokepoints in your pipeline

Spark UI
Screenshot 2019-07-31 at 16.30.35.png

Ganglia UI
gangalia.png

How to make these services available in your browser

There are multiple ways you can do it. I will only explain how I do it.

It is a two-step process.
- Step 1: Set up an SSH tunnel to the master node with port forwarding.
- Step 2: Configure the browser's proxy settings to access web pages hosted on the master node.

Step 1

Case: You can reach the master node directly from your local machine.

ssh -i ~/emr_key.pem -ND 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

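###-##-##-### stands for the public IP address of the cluster's master node, as it appears in the node's public DNS name. Here -N tells ssh not to run a remote command, and -D 8157 opens a local SOCKS proxy on port 8157.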

Screenshot 2019-08-01 at 12.13.29.png
Look for Master public DNS in the Summary tab of the cluster console.
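
If you prefer the command line to the console, the same value can be fetched with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that may describe the cluster; `emr_master_dns` is a hypothetical helper name and `j-XXXXXXXXXXXXX` is a placeholder cluster ID.

```shell
# Sketch: look up the master node's public DNS name with the AWS CLI.
# Assumption: the AWS CLI is installed and configured for the cluster's account and region.

emr_master_dns() {
  local cluster_id="$1"
  # --query picks the single field we need; --output text strips the JSON quoting.
  aws emr describe-cluster \
    --cluster-id "$cluster_id" \
    --query 'Cluster.MasterPublicDnsName' \
    --output text
}

# Usage (j-XXXXXXXXXXXXX is a placeholder cluster ID):
#   EMR_MASTER_DNS="$(emr_master_dns j-XXXXXXXXXXXXX)"
#   ssh -i ~/emr_key.pem -ND 8157 "hadoop@${EMR_MASTER_DNS}"
```

Wrapping the call in a function keeps the lookup and the ssh command separate, so you can check the returned name before opening the tunnel.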

Case: You cannot reach the master node directly from your local machine. In this type of setup, access (inbound and outbound traffic) is normally limited by EC2 security groups.

Screenshot 2019-08-01 at 13.02.22.png

In this case, first establish a tunnel through an instance that does have access to the master node.

export EMR_MASTER=ip-xxx-xx-xxx-xx.eu-central-1.compute.internal
ssh -L 8821:"$EMR_MASTER":22 user@instance-which-has-access-to-emr-master

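After this command, traffic sent to local port 8821 is automatically “tunneled” over the SSH connection and delivered to port 22 on the master node. The intermediate instance's SSH server sits in the middle, forwarding traffic back and forth. With that tunnel open, start the SOCKS proxy through it from a second terminal: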

ssh -i ~/emr_key.pem -ND 8157 -p 8821 hadoop@localhost
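
As a possible alternative to keeping two terminals open, OpenSSH 7.3 and later can hop through the intermediate instance in a single command with the -J (ProxyJump) option. This is not the method used above, only a sketch of the same idea. `emr_socks_via_jump` is a hypothetical helper, and the default port and key path are taken from the commands above. The key passed with -i is meant for the master node; the sketch assumes the intermediate instance accepts your default identity or agent, as in the first command above.

```shell
# Sketch: open the SOCKS proxy through an intermediate host in one command,
# using OpenSSH's -J (ProxyJump) option. Requires OpenSSH 7.3 or later.
# Set DRY_RUN=1 to print the command instead of running it.

emr_socks_via_jump() {
  local jump_host="$1"                   # e.g. user@instance-which-has-access-to-emr-master
  local master="$2"                      # e.g. the value of $EMR_MASTER
  local socks_port="${3:-8157}"          # local SOCKS port, same default as above
  local key="${4:-$HOME/emr_key.pem}"    # key for the hadoop user on the master node

  # Reuse the positional parameters to hold the final command line.
  set -- ssh -i "$key" -J "$jump_host" -N -D "$socks_port" "hadoop@${master}"

  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1 emr_socks_via_jump user@instance-which-has-access-to-emr-master "$EMR_MASTER"
```

With -J, the master's internal host name is resolved on the intermediate instance, so it does not need to be resolvable from your local machine.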

Step 1 is complete.

Note: If you want to SSH into the master node, just run

ssh -i ~/emr_key.pem -p 8821 hadoop@localhost

Step 2

  • Add FoxyProxy Chrome Extension
  • Create a foxyproxy-settings.xml file and paste this configuration into it as is.
<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
   <proxies>
      <proxy name="emr-socks-proxy" id="2322596116" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
         <matches>
            <match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
            <match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
            <match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
            <match enabled="true" name="*10*.amazonaws.com*" pattern="*10*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
            <match enabled="true" name="*10*.compute*" pattern="*10*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" /> 
            <match enabled="true" name="*.compute.internal*" pattern="*.compute.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
            <match enabled="true" name="*.ec2.internal*" pattern="*.ec2.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
         </matches>
         <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
      </proxy>
   </proxies>
</foxyproxy>
  • After you have saved this configuration, click the FoxyProxy extension icon in Chrome.

Screenshot 2019-08-01 at 13.28.04.png

Then select Options, click Import/Export, browse to the foxyproxy-settings.xml file, and select it.
Screenshot 2019-08-01 at 13.29.58.png

  • Go to your EMR console, reload the page

The EMR Console page will change from this
Screenshot 2019-08-01 at 13.31.04.png

to this

Screenshot 2019-08-01 at 13.31.45.png

Congratulations! You can now open the Spark UI and Ganglia in your browser.
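
If the links do not light up, it can help to test the tunnel without the browser. The sketch below builds the web UI addresses and probes them with curl through the SOCKS proxy on port 8157. The ports and paths are the defaults Amazon EMR documents (Spark History Server on 18080, YARN ResourceManager on 8088, Ganglia under /ganglia/), so treat them as assumptions for your cluster; `emr_ui_url` and `probe_via_proxy` are hypothetical helper names.

```shell
# Sketch: build web UI URLs on the master node and probe them through the SOCKS proxy.
# Assumption: default EMR ports and paths; adjust if your cluster is configured differently.

emr_ui_url() {
  local master="$1" service="$2"
  case "$service" in
    spark-history) echo "http://${master}:18080/" ;;
    yarn)          echo "http://${master}:8088/" ;;
    ganglia)       echo "http://${master}/ganglia/" ;;
    *)             echo "unknown service: ${service}" >&2; return 1 ;;
  esac
}

# Print the HTTP status code for a URL fetched through the proxy.
# --socks5-hostname lets the proxy side resolve the host name,
# which matters for internal names such as *.compute.internal.
probe_via_proxy() {
  curl --silent --output /dev/null --max-time 10 \
    --socks5-hostname "localhost:${SOCKS_PORT:-8157}" \
    --write-out '%{http_code}\n' "$1"
}

# Usage (requires the tunnel from Step 1 to be running):
#   probe_via_proxy "$(emr_ui_url "$EMR_MASTER" ganglia)"
```

A normal HTTP status code here means the tunnel works and any remaining problem is in the FoxyProxy setup; a connection error points back to Step 1.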

Official Links:

I hope you have learnt something. Please share your experience and feedback.
Namaste 🙏

Discover and read more posts from Amit Kushwaha