Codementor Events

Amazon Athena? Amazon Redshift? Which Should You Use?

Published Jan 26, 2018

Amazon Athena and Amazon Redshift are cloud-based data services provided by Amazon Web Services.

This post will help you choose between both services by detailing some pros and cons for Amazon Athena and Amazon Redshift and a comparison in terms of pricing, performance, and user experience.

When you finish reading, you'll be better informed on whether Athena or Redshift can meet your data needs.

Amazon Athena - Pros and Cons

Amazon Athena provides serverless querying of stored data on Amazon S3 using standard SQL syntax. Athena is based on the Presto SQL query engine, enabling you to query data in several formats including JSON, CSV, log files, text with custom delimiters, Apache Parquet, and Apache ORC

With a serverless architecture, AWS dynamically manages the allocation of computing resources in Athena by handling server management and capacity planning decisions for the user.

Pros of Amazon Athena

Quick and easy to set up because you don't provision computing resources or manage servers—AWS does it for you
You can query data where it is housed, in its native format
Simple cost structure, with the price based only on queried data

The serverless architecture is a particularly useful advantage of Athena—since AWS abstracts away infrastructural management from the user, operational costs are lower.

Cons of Amazon Athena

You can only read data; you can't write back to S3.
There are limitations on query times, concurrent queries, and Amazon S3 buckets. You need to contact AWS if you want to increase the limits.
There is no support for user-defined functions (UDFs) in Amazon Athena.

Amazon Redshift: Pros and Cons

Amazon Redshift is a petabyte-scale data warehouse that handles analytic workloads on large data sets, including business intelligence and OLAP applications.

Pros of Amazon Redshift

A massively parallel processing (MPP) architecture and columnar storage combine to handle complex analytical queries effortlessly.
Redshift is linearly scalable to meet growing needs for data storage and querying—simply add more nodes (servers) to scale the service.
Supports data compression at the column-level to conserve storage space and reduce the disk I/O for queries.

Cons of Amazon Redshift

ETL is non-trivial—you'll need to understand particular nuances for loading data efficiently into Redshift with the COPY command.
Because there is no separation of compute and storage, you pay for unused computing resources when you're not analyzing the data.

Feature and Usage Comparison: Athena vs. Redshift

Initialization Time

Amazon Athena is much quicker to initialize— you can get started straight away by querying data on Amazon S3. Amazon Redshift requires preparing clusters and loading data into tables, both of which can take a significant amount of time compared with Amazon Athena.

UDFs

Redshift supports custom scalar user-defined functions (UDFs) using SQL SELECT or a Python program that executes when the function is called and returns a single value (scalar).

Athena has no support for any type of UDF, which is not helpful for users with requirements that call for a UDF.

Data Formats

Athena supports several data formats, including TSV, JSON, CSV, Parquet, ORC, and all compressed formats apart from LZO. Redshift supports CSV, JSON, TSV, and Apache web logs.
Data Types

Athena edges out Redshift because it supports complex data types, including arrays and Object Identifier data types. Redshift doesn't support complex data types.

Pricing

Amazon Athena costs $5 per terabyte of data scanned during a query. The caveat with this pricing structure is that it is a challenge to estimate the cost of the service. You can overcome the difficulty of estimating the cost of the service by by compressing, partitioning, and converting all your data into columnar formats, which saves 30% to 90% on your per-query costs. You can expect Athena to work out cheaper than Redshift.

Redshift operates two types of pricing structures. The on-demand cost is calculated hourly based on the type of instance you choose. A DC instance costs $0.250 to $4.800 per hour, while a DS instance costs $0.850 to $6.800 per hour. By committing to using the service for a time period between one and three years, you can pay for Reserved Instances, which come at a discount over the on-demand costs.

Performance

Creating tables in Redshift is faster than in Athena. A simple SELECT query runs much quicker in Athena than Redshift. Both services perform similarly for aggregated queries, although Athena has a slight edge. Redshift outperforms Athena when performing complex JOIN queries because of its computational capacity.

The below table displays a performance comparison for each service using the same table structure and underlying data in S3. All measurements are in seconds to execute the query.

User Experience

Athena provides a simple interface. Getting started is easy—simply create a database, choose a table name, and specify where the data is on Amazon S3. You can add columns in bulk and easily partition tables in Athena.

Redshift requires you to configure cluster properties. The user interface is relatively intuitive, with explanations to help guide you on the details of each field. Bear in mind that you'll need to wait for your cluster to become active and ready to use, which can take from 10-30 minutes.

Syntax

Anyone familiar with SQL can use standard syntax to query data on Athena. Redshift is based on PostgreSQL, meaning anyone familiar with basic SQL can easily adapt to the syntax. However, there are important differences between Redshift and PostgreSQL, including a significant change in the functioning of the VACUUM command, which is used to clean up and reorganize tables.

Summary of Amazon Athena vs Redshift

Choosing between Amazon Athena and Amazon Redshift comes down to your preferences and what you need to do with your data.

The comparison here shows that Amazon Athena stands out for its portability, ease of use, and the fact that it supports more data formats and data types. Athena is also cheaper than Redshift. In addition, Athena's performance is better for basic table scans and small aggregations, typical of the type of querying required by most small to medium sized businesses. Considering you can also use BI tools with Athena with JDBC, Amazon Athena works out as the preferred option for the majority of users.

For huge numbers of transactions on large datasets, Redshift offers better performance and horizontal scalability. Furthermore, there are no query limits on Redshift, which is beneficial when you need to trigger large numbers of queries. Data analysis at this scale is only typical of large enterprises, though.

Discover and read more posts from Ronan
get started