Enabling Spark UI and Ganglia for EMR Cluster
If you are here, you have probably been running an EMR cluster for a while and are trying to figure out which metrics you can monitor to tune its resource usage.
What is EMR?
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon EMR offers an expandable, low-configuration service as an easier alternative to running in-house cluster computing.
EMR Console
This is a typical EMR Console.
While running a cluster, especially when a pipeline has just been developed, there is a lot of scope to improve cluster resource usage. A few issues that the Spark UI and Ganglia can help you solve are:
- Fixing resource allocation errors (believe me, this will probably be your first issue on EMR)
- Tracking memory, CPU, and network usage
- Identifying chokepoints in your pipeline
Spark UI
Ganglia UI
How to make these services available in your browser
There are multiple ways to do it. I will only explain how I do it.
It is a two-step process.
- Step 1: Set up an SSH tunnel to the master node with port forwarding.
- Step 2: Configure browser proxy settings to access web pages hosted on the master node.
Step 1
Case: You can reach the master node directly from your local machine.
ssh -i ~/emr_key.pem -ND 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
Here ###-##-##-### stands for the IP of the master node in the cluster.
You can find the full host name under Master public DNS in the Summary tab of the cluster console.
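To make the flags in the command above easier to read, here is a minimal sketch that wraps it in a shell function. The function name emr_socks_cmd and its arguments are hypothetical and not part of any AWS tooling; it only prints the command so you can inspect it before running it. It writes -N -D separately, which is equivalent to the combined -ND used above.

```shell
# Hypothetical helper (not an AWS tool): prints the tunnel command for review.
# Arguments: $1 key file, $2 master public DNS, $3 local SOCKS port (default 8157).
emr_socks_cmd() {
  # -i  private key used to authenticate against the master node
  # -N  do not run a remote command, only forward ports
  # -D  open a dynamic (SOCKS) port forward on the given local port
  printf 'ssh -i %s -N -D %s hadoop@%s\n' "$1" "${3:-8157}" "$2"
}

# Example call with a placeholder host name:
emr_socks_cmd "$HOME/emr_key.pem" master-public-dns.example.com
```

Once the printed command looks right, you can copy it into your terminal and leave it running while you use the browser.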
Case: You cannot reach the master node directly from your local machine. In this type of setup, access (inbound and outbound traffic) is normally limited by EC2 Security Groups.
In this case, establish a tunnel through an instance that does have access to the master node.
export EMR_MASTER=ip-xxx-xx-xxx-xx.eu-central-1.compute.internal
ssh -L 8821:"$EMR_MASTER":22 user@instance-which-has-access-to-emr-master
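After this command, traffic sent to local port 8821 is tunneled over the SSH connection and forwarded to port 22 on the master node. The intermediate instance's SSH server sits in the middle, forwarding traffic back and forth. Leave this session open and, in a second terminal, start the dynamic port forward through the tunnel: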
ssh -i ~/emr_key.pem -ND 8157 -p 8821 hadoop@localhost
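As an alternative sketch, OpenSSH clients from version 7.3 onward can make both hops in a single command with the -J (ProxyJump) option. This is a different technique from the two-command approach above, and it assumes the intermediate instance accepts your default SSH credentials (agent or ~/.ssh/config), since -i here applies to the master node. The function name emr_jump_cmd is hypothetical; it only prints the command.

```shell
# Hypothetical helper: prints a single-command variant that uses ProxyJump.
# Arguments: $1 key file for the master node, $2 user@jump-instance,
#            $3 master node private DNS (for example the value of $EMR_MASTER).
emr_jump_cmd() {
  # -J  connect to the jump instance first, then on to the master node
  # -N -D 8157  same dynamic (SOCKS) forward as in the commands above
  printf 'ssh -i %s -N -D 8157 -J %s hadoop@%s\n' "$1" "$2" "$3"
}

# Example call with placeholder names:
emr_jump_cmd "$HOME/emr_key.pem" user@instance-which-has-access-to-emr-master master.example.internal
```

With this variant there is no local port 8821, so the note about connecting with -p 8821 does not apply to it.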
Step 1 is complete.
Note: If you want to SSH into the master node through this tunnel, run:
ssh -i ~/emr_key.pem -p 8821 hadoop@localhost
Step 2
- Add the FoxyProxy Chrome extension.
- Create a foxyproxy-settings.xml file and paste in this configuration as it is.
<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
<proxies>
<proxy name="emr-socks-proxy" id="2322596116" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
<matches>
<match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
<match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
<match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
<match enabled="true" name="*10*.amazonaws.com*" pattern="*10*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
<match enabled="true" name="*10*.compute*" pattern="*10*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
<match enabled="true" name="*.compute.internal*" pattern="*.compute.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
<match enabled="true" name="*.ec2.internal* " pattern="*.ec2.internal*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false"/>
</matches>
<manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
</proxy>
</proxies>
</foxyproxy>
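To get a feel for which addresses this configuration sends through the tunnel, the sketch below approximates the wildcard patterns with a shell case statement. It is only an illustration: it assumes shell globbing behaves closely enough to FoxyProxy's wildcard matching for these patterns, and the function name route_for is hypothetical.

```shell
# Hypothetical helper: reports whether a URL would match one of the
# wildcard patterns from the configuration above.
# Note: FoxyProxy matches case-insensitively here; this sketch is case-sensitive.
route_for() {
  case "$1" in
    *ec2*.amazonaws.com*) echo "proxy" ;;
    *ec2*.compute*)       echo "proxy" ;;
    http://10.*)          echo "proxy" ;;
    *10*.amazonaws.com*)  echo "proxy" ;;
    *10*.compute*)        echo "proxy" ;;
    *.compute.internal*)  echo "proxy" ;;
    *.ec2.internal*)      echo "proxy" ;;
    *)                    echo "direct" ;;
  esac
}

# Example calls with placeholder addresses:
route_for "http://ip-10-0-0-1.eu-central-1.compute.internal:8088/"
route_for "https://www.example.com/"
```

Everything that does not match one of the patterns keeps using your normal connection, so regular browsing is unaffected while the tunnel is open.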
- After you have saved this configuration, click the FoxyProxy extension icon in Chrome, select Options, then click Import/Export, browse to the file foxyproxy-settings.xml, and select it.
- Go to your EMR console and reload the page.
The EMR Console page will change from this
to this
Congratulations! You can now open the Spark UI and Ganglia in your browser.
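If you would also like to check the tunnel from a terminal, here is a minimal sketch. It assumes EMR's usual default ports (8088 for the YARN ResourceManager, 18080 for the Spark History Server, and Ganglia served under /ganglia/ on port 80), which may differ on your cluster. The function name emr_ui_urls is hypothetical.

```shell
# Hypothetical helper: prints the web UI addresses for a given master host.
# The ports are assumed EMR defaults; adjust them if your cluster differs.
emr_ui_urls() {
  printf 'http://%s:8088/\n' "$1"      # YARN ResourceManager
  printf 'http://%s:18080/\n' "$1"     # Spark History Server
  printf 'http://%s/ganglia/\n' "$1"   # Ganglia
}

# Optional check through the SOCKS proxy from Step 1 (the tunnel must be running):
# for url in $(emr_ui_urls "$EMR_MASTER"); do
#   curl --socks5-hostname localhost:8157 -s -o /dev/null -w "%{http_code} $url\n" "$url"
# done
```

The --socks5-hostname option makes curl resolve the host name on the far side of the tunnel, which matters for private names such as *.compute.internal.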
Official Links:
- Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding
- Configure Proxy Settings to View Websites Hosted on the Master Node
I hope you have learnt something. Please share your experience and feedback.
Namaste 🙏