Parquet: The right file format for ETL
ETL is an essential job in Data Engineering: it turns raw data into something that is easy to analyze and to train models on. At Arc we use Apache Airflow to run our ETL jobs.
Airflow models each workflow as a DAG (directed acyclic graph). A DAG is an easy way to model how your data flows through an ETL job. For example, a typical ETL job might have three tasks: extract, transform, and load.
Thus your DAG would look like this:
extract -> transform -> load
Sounds awesome, right? But reality is always harder than you think. To improve overall performance and availability, each task in your DAG runs independently, which means the tasks might run on different machines. So you can't simply pass data to the next task the way you would in everyday programming. You need to save the result as a file, and the next task reads that file as its input to complete its job.
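To make this concrete, here is a minimal sketch of such a DAG, assuming Airflow's PythonOperator. The task bodies, file paths, and DAG name are placeholders for illustration, not our actual pipeline, and the import path differs between Airflow versions:

import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator in older versions

def extract():
    # pull raw data from the source and save it to a file, e.g. /tmp/raw_data
    ...

def transform():
    # read the extracted file, clean it up, and save the result to another file
    ...

def load():
    # read the transformed file and load it into the data warehouse
    ...

with DAG('example_etl', start_date=datetime.datetime(2019, 1, 1), schedule_interval='@daily') as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task

Each task only ever sees the file the previous task wrote, which is exactly why the choice of file format matters.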
So here comes the question: which file format should we use to pass data between tasks?
The first file format that comes to mind is CSV. CSV is good: humans can read the data directly, and our main data processing library, Pandas, has native support for CSV, which makes it quite easy to read and write.
However, we ran into a problem when using CSV: it doesn't carry data types. Although Pandas can guess the data types for you, it doesn't guess right every time, which means we need extra effort to take care of data types ourselves, and that is quite tedious.
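A small illustration of what we mean (the column names and file path are made up for the example): type information that is obvious when you write the file is gone after a round trip.

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'created_at': pd.to_datetime(['2019-01-01', '2019-01-02']),
})
print(df.dtypes)   # created_at is datetime64[ns]

df.to_csv('/tmp/df.csv', index=False)
df2 = pd.read_csv('/tmp/df.csv')
print(df2.dtypes)  # created_at comes back as a plain object (string)
                   # unless you remember to pass parse_dates=['created_at']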
OK, maybe we need a file format that supports common data types. The first one that comes to mind is JSON! We all love JSON. We use it everywhere, so let's save the data in JSON format!
JSON works pretty well until we need to store a date or timestamp and write the data to BigQuery. It always causes problems: every time we write a new DAG, we can't get it right on the first try.
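The root cause is that JSON has no date or timestamp type. A small illustration with the standard json module (the field names are made up for the example):

import json
from datetime import datetime

record = {'id': 1, 'created_at': datetime(2019, 1, 1, 12, 34, 56)}

try:
    json.dumps(record)
except TypeError as err:
    print(err)  # datetime objects are not JSON serializable

# The usual workaround is to serialize the timestamp as a string...
print(json.dumps({'id': 1, 'created_at': record['created_at'].isoformat()}))
# ...but then it is just a string, and whatever reads the file (or loads it
# into BigQuery) has to know how to turn it back into a timestamp.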
Therefore we tried to find a file format that works better than JSON. It needs to support all the necessary data types, including dates and timestamps, and here comes the hero: Parquet!
Apache Parquet is a columnar file format for data serialization. It supports all frequently used data types (number, string, date, etc.) and is designed to be space- and IO-efficient.
Unlike CSV and JSON, Parquet stores data in binary, which is one of the reasons it can store data so efficiently. It also means the file is not human-readable; you need to decode it first.
To use Parquet in Python, you need to install pyarrow first. pyarrow is the Python API of Apache Arrow, another library for data processing. Currently, we only use it to read and write Parquet files.
pip install pyarrow
Below is the example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# write data: convert the dataframe to an Arrow table, then write it as Parquet
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(
    table, '/tmp/df.parquet', coerce_timestamps='ms',
    allow_truncated_timestamps=True
)

# read data: read the Parquet file back into a Pandas dataframe
df = pq.read_pandas('/tmp/df.parquet').to_pandas()
We add coerce_timestamps and allow_truncated_timestamps when writing the Pandas dataframe to Parquet because timestamps in BigQuery have microsecond precision, but Pandas stores timestamps with nanosecond precision, so it will raise an error if we do nothing.
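Here is a minimal sketch of the difference the two options make, assuming a timestamp column with sub-millisecond precision (the column name and path are only for the example):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'created_at': pd.to_datetime(['2019-01-01 12:34:56.789123'])})
table = pa.Table.from_pandas(df, preserve_index=False)

# Coercing nanosecond timestamps down to milliseconds would lose precision,
# so without allow_truncated_timestamps pyarrow refuses to write the file.
try:
    pq.write_table(table, '/tmp/ts.parquet', coerce_timestamps='ms')
except pa.ArrowInvalid as err:
    print(err)

# With allow_truncated_timestamps=True the extra precision is silently dropped.
pq.write_table(
    table, '/tmp/ts.parquet', coerce_timestamps='ms',
    allow_truncated_timestamps=True
)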
After switching to Parquet, we no longer have data type problems. It is fantastic: you can focus on the logic of your ETL rather than getting stuck on annoying data type issues. If you need a file format for ETL, give Parquet a try!