GCE BigQuery vs AWS Redshift vs AWS Athena

Published on Jul. 20, 2017 by Gabriel Bordeaux

Updates

7/20/17 update

Following Tino Tereshko's advice (he is Big Data Lead at Google Cloud Office of CTO), I added the metrics for BigQuery standard SQL and re-calculated the data loading time (from Google Cloud Storage to BigQuery) following their recent optimizations.

7/6/17 update

With Joe Harris' help (he is a Redshift Database Engineer at AWS), I measured the performance of an optimized schema on both dc1.large and ds2.xlarge Redshift instances. The benchmark below has been updated with these new tests.

6/22/17 update

Almost 3,000 people read the article and I have received a lot of feedback. This is the first update of the article and I will try to update it further later.

I converted the CSV format to Parquet and re-tested Athena which did give much better results as expecte (Thanks Rahul Pathak, Alex Casalboni, openasocket, Robert Synnott, the amazon Redshift team with Joe Harris, Jenny Chen, Maor Kleider and the Amazon Athena/EMR team with Abhishek Sinha)
Using Redshift admin tables I was able to add the data scanned per query for Redshift (Thanks rockostrich)
I added a note about partitioning
I added a links section with useful articles Finally, a few people reached out asking for the dataset to try to load it and benchmark the performance on other databases. I will definitely share a link to their article if they publish one!

Introduction

GCE BigQuery vs AWS Redshift vs AWS Athena

This article is a basic comparison on data loading and simple queries between Google BigQuery and Amazon Redshift and its cousin Athena. For this test we will be loading a CSV/Parquet file which is basically an enlarged version of the STAR Experiment star2002 dataset. The final file is close to 1Tb (997Gb). Some information regarding the dataset:

CSV size: 997GB (~1TB)
Parquet size: 232GB
7 928 812 500 lines (~8 billion)
16 coma separated columns
All columns are either integers, double precision or floats

Loading time

To calculate load time I initially sent the files to both Amazon S3 and Google Cloud Storage then loaded them into each datastore. The times below reflects the load time from S3 / Google Cloud Storage to the datastore and not the time it took to transfer the file from a server to both cloud storage solutions.

BigQuery Legacy SQL	BigQuery Standard SQL	Redshift Dense Compute dc1.8xlarge x2 nodes	Redshift Dense Storage ds2.xlarge x1 node	Redshift Dense Compute dc1.large x3 nodes optimized schema	Redshift Dense Storage ds2.xlarge x2 nodes optimized schema	Athena (CSV)	Athena (Parquet)
9m	9h 30m	8h 23m	31m (with import file split in 50 files)	28m (with import file split in 50 files)	0s (no need to load the data)	0s (no need to load the data)

BigQuery

Legacy SQL

BigQuery

Standard SQL

Redshift

Dense Compute

dc1.8xlarge

x2 nodes

Redshift

Dense Storage

ds2.xlarge x1 node

Redshift

Dense Compute

dc1.large

x3 nodes

optimized schema

Redshift

Dense Storage

ds2.xlarge

x2 nodes

optimized schema

Athena

(CSV)

Athena

(Parquet)

9h 30m

8h 23m

31m

(with import file split in 50 files)

28m

(with import file split in 50 files)

(no need to load the data)

The above information may or may not be relevant in your use case. For the purpose of this test it was easier to load a large dataset from S3 / Google Cloud Storage but in most cases data will be streamed directly from an application. During my last round of tests with Redshift I split my CSV in 50 2GB files instead of using one 1TB file as shown above. It allowed Redshift to allocate the different files to different nodes and load them in parallel, which dramatically improved performance. (read more)

Queries speed

After loading the same exact dataset in each big datastore I have tested the query time of a few sample queries against each one. As expected and since none of the datastores were MongoDB, the number of rows and results were consistent across each datastore.

Query	BigQuery Legacy SQL	BigQuery Standard SQL	Redshift Dense Compute dc1.8xlarge x2 nodes	Redshift Dense Storage ds2.xlarge	Redshift Dense Compute dc1.large x3 nodes optimized schema	Redshift Dense Storage ds2.xlarge x2 nodes optimized schema	Athena (CSV)	Athena (Parquet)	Rows found
SELECT count(*) FROM t	2.2s	2.8s	3.2s	16.7s	3.0s	3.0s	1m 16s	3.76s	7,928,812,500
SELECT count(*) FROM t WHERE eventnumber > 1	2.6s	3.2s	4.3s	56.2s	6.56s	16.9s	1m 34s	8.72s	7,928,486,500
SELECT count(*) FROM t WHERE eventnumber > 20000	3.0s	4.3s	3.0s	44.2s	1.99s	1.65s	1m 32s	7.75s	3,871,550,500
SELECT count(*) FROM t WHERE eventnumber > 500000	4.1s	2.9s	1.6s	10.6s	0.07s	0.08s	1m 32s	7.25s	42,853,500
SELECT eventFile, count(*) FROM t GROUP BY eventFile	17.5s	7.1s	15.2s	2m 8s	15.01s	12.66s	1m 36s	8.1s	102,021
SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile	2.8s	8.6s	3.0s	19.4s	1.52s	1.28s	1m 33s	9.45s	137
SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC	7.2s	7.3s	12.4s	27.1s	3.13s	2.95s	1m 33s	11.79s	3,007
SELECT MAX(runNumber) FROM t	2.4s	2.8s	3.8s	41.6s	9.53s	8.73s	1m 34s	6.29s	1
SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000	3.3s	3.3s	13.0s	1m 44s	6.11s	5.80s	1m 44s	9.71s	1
SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile	14.5s	23.2s	18.2s	4m 18s	20.49s	20.38s	1m 54s	22.49s	38,400

The fastest datastore has its query time highlighted in green. In the cases where 2 datastores have very similar query times (< .5s difference) they will be both highlighted.

Queries cost

BigQuery and Athena both cost $5/TB. Looking at the amount of data scanned, I was able to calculate approximately the cost of each query. Redshift charges hourly which makes it very difficult to compare. If you have a dataset with a linear growth and an expectable volume of data scanned per query you could calculate the point at which it would be cheaper to pay an hourly price versus a data scanned based price. I could have calculated this data here but it would have very little value since these queries and this dataset is probably very different from most usages.

Query	BigQuery Legacy SQL	BigQuery Standard SQL	Redshift Dense Compute dc1.8xlarge x2 nodes	Redshift Dense Storage ds2.xlarge	Redshift Dense Compute dc1.large x3 nodes optimized schema	Redshift Dense Storage ds2.xlarge x2 nodes optimized schema	Athena (CSV)	Athena (Parquet)
Dataset storage cost	$19.80 / month		$0				$23.00 / month (S3 pricing)	$5.34 / month (S3 pricing)
SELECT count(*) FROM t	$0 (cached)	$0 (cached)	$4.8 / hour / node x2 nodes	$0.85 / hour / node	$0.25 / hour / node x3 nodes	$0.85 / hour / node x2 nodes	$4.43	$0 (cached)
SELECT count(*) FROM t WHERE eventnumber > 1	$0.30	$0.30					$4.43	$0.14
SELECT count(*) FROM t WHERE eventnumber > 20000	$0.30	$0.30					$4.43	$0.14
SELECT count(*) FROM t WHERE eventnumber > 500000	$0.30	$0.30					$4.43	$0.14
SELECT eventFile, count(*) FROM t GROUP BY eventFile	$0.30	$0.30					$4.43	$0.01
SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile	$0.59	$0.59					$4.43	$0.15
SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC	$0.89	$0.89					$4.43	$0.18
SELECT MAX(runNumber) FROM t	$0.30	$0.30					$4.43	$0.01
SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000	$0.59	$0.59					$4.43	$0.17
SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile	$1.48	$1.48					$4.43	$0.28

Data scanned per query

Query	BigQuery Legacy SQL	BigQuery Standard SQL	Redshift Dense Compute dc1.8xlarge x2 nodes	Redshift Dense Storage ds2.xlarge	Redshift Dense Compute dc1.large x3 nodes optimized schema	Redshift Dense Storage ds2.xlarge x2 nodes optimized schema	Athena (CSV)	Athena (Parquet)
SELECT count(*) FROM t	0 GB	0 GB	59.07 GB	59.07 GB	59.07 GB	59.07 GB	885.5 GB	0 GB
SELECT count(*) FROM t WHERE eventnumber > 1	59.1 GB	59.1 GB	118.14 GB	118.14 GB	118.14 GB	118.14 GB	885.47 GB	27.91 GB
SELECT count(*) FROM t WHERE eventnumber > 20000	59.1 GB	59.1 GB	57.69 GB	57.69 GB	57.69 GB	57.69 GB	885.5 GB	27.91 GB
SELECT count(*) FROM t WHERE eventnumber > 500000	59.1 GB	59.1 GB	0.64 GB	0.64 GB	0.64 GB	0.64 GB	885.49 GB	27.9 GB
SELECT eventFile, count(*) FROM t GROUP BY eventFile	59.1 GB	59.1 GB	118.15 GB	118.15 GB	118.15 GB	118.15 GB	885.5 GB	1.64 GB
SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile	118 GB	118 GB	0.68 GB	0.68 GB	0.68 GB	0.68 GB	885.47 GB	29.53 GB
SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC	177 GB	177 GB	0.90 GB	0.90 GB	0.90 GB	0.90 GB	885.51 GB	36.33 GB
SELECT MAX(runNumber) FROM t	59.1 GB	59.1 GB	118.15 GB	118.15 GB	118.15 GB	118.15 GB	885.49 GB	1.41 GB
SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000	118 GB	118 GB	86.53 GB	86.53 GB	86.53 GB	86.53 GB	885.51 GB	34.66 GB
SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile	295 GB	295 GB	173.07 GB	173.07 GB	173.07 GB	173.07 GB	885.51 GB	56.38 GB

A note about partitioning

For the tests above, the data was not partitioned. BigQuery, Redshift and Athena all support partitioning but it seems that it would defeat the purpose of trying to query a large file if the queries ended up hitting a much smaller subset of the file. However, it is likely that some datastores have better performance than others and I will do my best to write a new article about partition performance in the future.

Conclusion

Athena is a great solution to analyze large files in a variety of formats (CSV, Json, Nginx logs...) stored on Amazon S3. Using columnar storage like Parquet or ORC it ends up being a powerful and cost effective solution as well.

Redshift benefits from being the big datastore living in the AWS ecosystem. Since Redshift was created on top of PostgreSQL, a lot of the features and syntax is identical which greatly reduces the learning curve. BigQuery recently rolled out Standard SQL which brings the classic SQL syntax to BigQuery as well.

BigQuery also offers a Streaming API which allows you to ingest millions of rows per second for immediate real-time analysis. I found it extremely convenient to use.

Costs are hard to compare since the pricing model is different. If you query your data extensively and can assess efficiently your disk usage and data growth, Redshift will be cheaper. If you want the ability to store a lot of data for a low price and pay for your exact usage, BigQuery is a better solution.

GCE BigQuery vs AWS Redshift vs AWS Athena

Updates

7/20/17 update

7/6/17 update

6/22/17 update

Introduction

Loading time

Queries speed

Queries cost

Data scanned per query

A note about partitioning

Conclusion

Links