How to truncate/delete a database table from the AWS Glue job?
AWS Glue is an ETL tool developed & managed by Amazon, which uses PySpark underneath. An ETL job can have many types of data sources & destinations. It is pretty easy working with files as source/destination, just provide credentials, and a location to read & write.
Recently I was working on converting a SQL stored procedure to Glue job. My requirement was very simple. The interesting part was truncating the target table before inserting new data. Glue data frame doesn't provide overwrite
mode while loading into the table. I research a bit but didn't find any resources handling this & neither getting any direction. So how can we do this?
Something came to my mind, yes I can do this. Let us use driver's capability to do this. For this, I used PyMySQL
Python package.
import pymysql
import pymysql.cursors
# Connect to the database using db details or fetch these from Glue connections
connection = pymysql.connect(host='localhost',
user='user',
password='passwd',
database='db',
cursorclass=pymysql.cursors.DictCursor)
with connection:
with connection.cursor() as cursor:
# Create a new record
table_to_truncate = "test_table"
sql = f"truncate table {table_to_truncate}"
cursor.execute(sql)
# connection is not autocommit by default. So you must commit to save
# your changes.
connection.commit()
If you are working with Glue Studio to create a job in drag & drop way by selecting notes. Write a custom transformation & write this snippet inside this transformation. In other case, you can put it anywhere in the code file.
Still struggling to achieve this or stuck with something complex in this big data domain. Feel free to reach me.
Should the heading be “a database” or “MySQL” database?
Hi Rakesh,
We have a requirement werein we have to backup data from RDS(MYSQL and Postgres) to S3 and to delete the already copied data from RDS.
please suggest me how can i acheive this thru AWS GLUE.
Sure Sumant, DM me or schedule a session.
hi, i’m trying this but i got the next error:
IndentationError: expected an indented block (test_bookmark_2.py, line 25)
my line 25 is:
with connection.cursor() as cursor:
You have problem in your code, missing indentation. Indentation is base requirement for python.