You’re paying too much for egress
If you’re like most companies, the top line items on your AWS bill are EC2 (compute) and S3 (storage). In data-intensive fields such as biotech, the first thing you do to avoid runaway cloud costs is remind your scientists not to leave their 96-CPU machines idling over the weekend (the second thing you do is write a script that alerts them so a weekend doesn’t turn into a month). Unlike VMs, however, you can’t just “turn off” storage; you have to trudge through terabytes of data and decide whether any of it is worth the cost of keeping around.
Although it’s easy to evaluate the cost of storing data per GB on S3, estimating the cost of egress (downloading data off the cloud) is anything but, as illustrated by this glorious mess of a diagram from the Duckbill Group:
The Problem with Egress
What I’ve always found surprising about egress is just how expensive it is. On AWS, downloading a file from S3 to your computer once costs 4 times more than storing it for an entire month; on Google Cloud, it’s 6 times more expensive.
But surprise turns into dismay once you learn that bandwidth doesn’t actually cost cloud providers very much. In fact, AWS charges an 8000% markup on moving data off their cloud, which is certainly an effective way to ensure that users analyze data on the same platform where they store it.
As a result, you have to be careful when you share data with collaborators, copy data to other services within the same cloud, or, heaven forbid, analyze data on a different cloud. The only way to avoid egress fees is to keep the data within the same zone of the same cloud provider. Even then, if you have a VPC with a private subnet, be careful to avoid the extra $45/TB egress fees ¯\_(ツ)_/¯.
That’s no way to live.
This is why Cloudflare R2 is such a breath of fresh air: you get cloud storage that is cheaper than S3 (R2 costs $15/TB/mo, S3 costs $23/TB/mo) and, much more importantly, has no egress fees: $0 instead of $90/TB!
If that wasn’t enough, data on R2 is distributed all over the world for no extra charge (“Region: Earth”, as the Cloudflare UI likes to point out). This would cost a fortune to set up on other clouds.
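To make the price gap concrete, here is a small shell sketch that plugs the per-TB prices quoted above into a one-month bill. The monthly_cost helper and the 10 TB workload are hypothetical, and the prices are simply the ones cited in this post, so check current pricing before relying on the numbers.

```shell
# Hypothetical helper: estimate one month's bill in USD.
# usage: monthly_cost <stored_tb> <downloaded_tb> <storage_usd_per_tb> <egress_usd_per_tb>
monthly_cost() {
  awk -v stored="$1" -v downloaded="$2" -v storage_price="$3" -v egress_price="$4" \
    'BEGIN { printf "%.2f\n", stored * storage_price + downloaded * egress_price }'
}

# Illustrative workload: 10 TB stored, and all of it downloaded once in the month.
echo "S3: $(monthly_cost 10 10 23 90) USD"   # $23/TB/mo storage, $90/TB egress
echo "R2: $(monthly_cost 10 10 15 0) USD"    # $15/TB/mo storage, no egress fee
```

With these inputs the script prints 1130.00 USD for S3 and 150.00 USD for R2, and about 80% of the S3 figure is egress.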
Test-driving R2: What’s the catch?
To test out a common genomics use case, I stored data from the 1000 Genomes Project in R2 and ran queries to extract subsets of that data. As a comparison, I ran the same queries on S3 using the same data hosted for free on the AWS Registry of Open Data, which, by the way, you can query without providing your S3 credentials, so you’re extra sure to avoid billing surprises (egress is exhausting).
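Querying a public dataset without credentials can be sketched with the aws CLI's --no-sign-request option, which skips request signing so no credentials are loaded. The aws_anon wrapper name is hypothetical, and the bucket name in the usage comment is an assumption about where the dataset lives, so confirm it on the registry page first.

```shell
# Hypothetical wrapper: run any aws CLI command anonymously.
# --no-sign-request tells the CLI not to sign requests, so no credentials are used.
aws_anon() {
  aws --no-sign-request "$@"
}

# Example usage (not run here; the bucket name is an assumption):
#   aws_anon s3 ls s3://1000genomes/
```

Because nothing is signed with your account, a request made this way cannot be billed to you.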
Is egress really free though?
Yes, it is: I downloaded 3.2 TB of data from R2 and paid nothing for it. At $90/TB, this little experiment alone would have cost me close to $300 on AWS 😬.
Is it slow?
No, quite the opposite! On my local computer, I saw download rates of 48 MB/s from R2. On S3, I saw download rates of 35 MB/s, but the bucket lives in us-east-1 and I live closer to us-west, so that could be why (yet another reason why having R2’s distributed storage is useful). Keep in mind: this is what I observed running tests on a few afternoons with a dozen files ranging in size from 1 GB to 15 GB, so your mileage may vary.
Is it difficult to use?
It is not. As with other providers, R2 provides an S3-compatible API, so you can keep using the aws command-line tool as you normally would, but with a different endpoint URL:
alias aws="aws --endpoint-url https://<id>.r2.cloudflarestorage.com"
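The alias works well in an interactive terminal, but bash does not expand aliases in non-interactive scripts by default. A shell function is one way around that. The sketch below assumes a hypothetical R2_ACCOUNT_ID variable holding the <id> part of the endpoint, and the r2 function name and bucket names are placeholders as well.

```shell
# Hypothetical variable: your Cloudflare account ID, i.e. the <id> in the endpoint URL.
R2_ACCOUNT_ID="${R2_ACCOUNT_ID:-<id>}"

# Forward every argument to the aws CLI with the R2 endpoint set.
r2() {
  aws --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" "$@"
}

# Example usage (not run here; my-bucket is a placeholder):
#   r2 s3 ls s3://my-bucket/
#   r2 s3 cp ./sample.vcf.gz s3://my-bucket/
```

Keeping the plain aws command untouched also means the same script can talk to S3 and R2 side by side.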
R2 is currently in open beta so features such as public buckets are not yet supported. To get around this, I set up a Cloudflare Worker that streams data from R2 — this was very easy thanks to the render package.
Another limitation is that, although R2 is distributed, data will be mostly stored in North America during the beta, so download rates will vary in other regions.
Try it for yourself
If you want to test it out for yourself, log in to your Cloudflare dashboard and enable R2.
Or you can download some genomics data from the Telomere-to-Telomere project that I hosted on my R2 bucket at https://r2-article.robert.workers.dev (I wouldn’t be sharing this link if the data was hosted on S3 🙂).
Overall, I’m very excited about R2. A cloud provider that provides fast, reliable, and distributed storage with no egress fees is a welcome change.
I’m especially excited to see how R2 gets picked up in my own field of bioinformatics, where it will make sharing data much easier, and where compute will no longer be tied down to where data is located!
It’s not just bioinformatics, of course: this is a game changer for any data-intensive field, or really any app where egress of assets makes up a large portion of the cloud bill, such as audio and video sharing platforms.
Alternative cloud storage providers
Finally, I’ll mention a few other storage providers you may be interested in.
One provider is Wasabi, which features even cheaper storage fees ($5.9/TB/mo) and no egress fees. However, I do want to point out some caveats that are not immediately obvious. First, Wasabi has a minimum storage policy, so if you upload a file and immediately delete it, you still pay for 90 days’ worth of storage. Also, Wasabi’s free egress policy is not unlimited; it’s free as long as the amount of data you download each month does not exceed, roughly, the amount of data you store, so on average you can only download your data for free once a month.
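The two Wasabi caveats can be turned into a rough back-of-the-envelope check. This is only a sketch of the policy as described above, not Wasabi's actual billing logic: both helper names are hypothetical, and it assumes 30-day months, the $5.9/TB/mo price, and a strict "downloads must not exceed stored volume" rule.

```shell
# Hypothetical helper: storage charge in USD under a 90-day minimum.
# usage: wasabi_storage_charge <tb> <days_stored>
wasabi_storage_charge() {
  awk -v tb="$1" -v days="$2" 'BEGIN {
    billable = (days < 90) ? 90 : days          # never billed for fewer than 90 days
    printf "%.2f\n", tb * 5.9 * billable / 30   # assumes $5.9/TB/mo and 30-day months
  }'
}

# Hypothetical helper: succeeds if the month's downloads fit the free allowance.
# usage: wasabi_egress_free <stored_tb> <downloaded_tb>
wasabi_egress_free() {
  awk -v stored="$1" -v downloaded="$2" 'BEGIN { exit !(downloaded <= stored) }'
}

wasabi_storage_charge 1 1                          # 1 TB deleted after one day
wasabi_egress_free 10 12 || echo "over allowance"  # 12 TB downloaded, 10 TB stored
```

Under these assumptions, a 1 TB file deleted after a single day is still charged 17.70 USD, three months of storage.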
Another alternative is Backblaze B2, which offers $5/TB/mo storage and $10/TB egress fees. The egress fees are higher than Wasabi’s, but one little-known fact is that Cloudflare and Backblaze have an agreement such that streaming data from Backblaze through Cloudflare will cost you $0 in egress fees. I’ve used this myself in the past by streaming B2 data through a Cloudflare Worker, and it really works! This is a good alternative if you’re looking for the cheapest way to store data and don’t need the added performance and distribution that R2 offers.