How I learned PySpark
About me
I'm Raviteja. I currently work at American Express in India as a Business Analyst, with a specific focus on Data Analytics and Decision Sciences. I graduated from IIT Kharagpur with a major in Electrical Engineering.
Why I wanted to learn PySpark
I wanted to learn PySpark, which is the Python API for Spark, because I had read in several places that processing in a Spark environment is much faster than processing in Hive for many of the data processing operations I was required to perform.
How I approached learning PySpark
First, I set out to understand what can be achieved with Hive and what can be achieved with PySpark. In particular, I wanted to know whether there was anything I couldn't do in PySpark that would be a limitation for the kind of jobs in my daily routine. These included row-wise operations, aggregate functions, inter-row operations, and so on.
Challenges I faced
Hive is mostly SQL-based, while working with PySpark's RDDs felt closer to working with Pandas: you call methods on an object rather than write a query. Since I had been working in SQL-based Hive, Teradata, and SAS environments, the pipe-based processing of RDDs proved to be a challenge at first.
Key takeaways
For tools that use pipe-based processing, i.e. data_frame.operation1.operation2.operation3, the best way to picture it is that the operations are applied from left to right: operation1 first, then operation2 on its result, and then operation3.
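The left-to-right reading can be sketched in plain Python. This is only a toy illustration: `MiniFrame` and its methods are hypothetical stand-ins written for this example, not part of PySpark, and the data is made up. The point is that each operation returns a new object, so the chained form is the same as naming every intermediate result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class MiniFrame:
    """Toy stand-in for a data frame (hypothetical, NOT the PySpark API)."""
    rows: List[Row]

    def filter(self, predicate: Callable[[Row], bool]) -> "MiniFrame":
        # Keep only the rows for which the predicate is true.
        return MiniFrame([r for r in self.rows if predicate(r)])

    def with_column(self, name: str, fn: Callable[[Row], float]) -> "MiniFrame":
        # Add (or overwrite) one column computed from each row.
        return MiniFrame([{**r, name: fn(r)} for r in self.rows])

    def select(self, *names: str) -> "MiniFrame":
        # Keep only the named columns.
        return MiniFrame([{n: r[n] for n in names} for r in self.rows])


# Made-up sample data for the illustration.
data = MiniFrame([
    {"id": 1, "amount": 50.0},
    {"id": 2, "amount": 200.0},
    {"id": 3, "amount": 120.0},
])

# Chained form: filter runs first, then with_column, then select.
chained = (
    data
    .filter(lambda r: r["amount"] > 100)
    .with_column("amount_doubled", lambda r: r["amount"] * 2)
    .select("id", "amount_doubled")
)

# Equivalent step-by-step form, one named result per operation.
step1 = data.filter(lambda r: r["amount"] > 100)
step2 = step1.with_column("amount_doubled", lambda r: r["amount"] * 2)
step3 = step2.select("id", "amount_doubled")
```

Here `chained` and `step3` hold the same rows, which is a handy way to debug a long chain: break it apart and inspect each intermediate result.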
Tips and advice
In my firsthand experience, using Spark for big data processing tasks is faster than performing similar operations in a native Hive environment.
In Hive, the data is written back to disk after every job, while Spark can keep the data in memory until the object is released. If memory is not a constraint, Spark is the preferred choice.
If you are used to writing code in SAS, moving to a PySpark environment can be challenging because the code is written quite differently.
Overall, the functionality you can achieve in PySpark is similar to what you can achieve in SAS, since both are data processing tools.
When I first adopted Hadoop-based Hive, I ran into the problem that some functionality readily available in SAS wasn't present, but PySpark overcomes this. Examples include macro functions, global macro variables, and operations on macro variables.
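As a rough sketch of how this can look, ordinary Python variables and functions can play the role of SAS macro variables and macros. The table name, column names, cutoff value, and the `build_query` helper below are all hypothetical, chosen only for illustration.

```python
# Plays the role of a global macro variable, e.g. %let cutoff = 100;
CUTOFF = 100
TABLE = "transactions"  # hypothetical table name


def build_query(table: str, cutoff: int, columns=("id", "amount")) -> str:
    """Plays the role of a SAS macro: returns query text for the given arguments."""
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {table} WHERE amount > {cutoff}"


query = build_query(TABLE, CUTOFF)

# Operations on the "macro variable" are ordinary Python arithmetic.
stricter_query = build_query(TABLE, CUTOFF * 2)
```

In a PySpark session, a string built this way could then be passed to `spark.sql`, assuming a table with that name has been registered.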
Final thoughts and next steps
I'm presently in the early phases of adopting PySpark, using Jupyter notebooks. I would like to learn enough to use it with the ease and convenience that SAS used to offer, and to study the options available for making my everyday work easier and faster.
This article from Analytics Vidhya was pretty helpful.