Companies on a mission to become data-driven face the challenge of having their data scattered across various SaaS tools. Today, an SMB uses on average 16-24 different tools, each of which produces and stores data in its own database. As you can imagine, extracting a single source of truth from this data can be quite challenging. However, once you implement the ‘modern data stack’ and centralize your data storage in one place, the complexity of analytics drops significantly, and you can get valuable insights from a centralized data set to drive your business forward.
A popular option for centralizing your data is a Data Warehouse (DWH). Though data warehouses have been in development since the 1980s, they surged in popularity in the 2010s thanks to advances in cloud storage, and a wide variety of excellent cloud-based warehouse solutions now exist, with BigQuery, Amazon’s Redshift, Snowflake, and Firebolt being just a few examples. These cloud-based data warehouses are easy to set up and use, and they offer cheap storage and almost unlimited computing power.
But what’s the catch? While these powerful tools let you do nearly anything with your data, they offer a variety of settings and choices which, over time, can lead to a rapid increase in cost. In this article we will explain the cost structures of the most widely used Data Warehouses, so you can figure out which DWH works best (and cheapest) for your setup.
There is no doubt that monitoring and limiting data-related expenses was easier before the cloud DWH. What the cloud DWH really did for data analysis was make it easy for engineering teams to add new resources, at times automatically, to match their computational needs. However, the ability to constantly increase your resources can quickly get out of hand, and it can be hard to know which resources are necessary and what purpose they serve. If you want to control the cost of your data warehouse, it is imperative to understand what the DWH bill consists of.
Understand your data warehouse bill: compute and storage
The main purpose of your DWH is to centralize your data storage and to efficiently extract insights. Therefore, most DWH costs are related to compute (the ability to quickly extract insights) and storage (the amount of data you have in your DWH). Most often there are additional services that you can buy on top, but they are provider dependent and we will not cover these in this article. In this following section we will dive a bit deeper into the different pricing models the various vendors offer, and how they quantify and measure the processing you do.
Compute-related pricing
DWH pricing models related to compute are categorized as ‘on-demand’, ‘flat-rate’, or ‘hybrid’. You will encounter these different options regardless of which DWH you end up choosing as they are used universally across all providers.
Flat-rate pricing lets you buy a pre-specified amount of resources, which you then have at your disposal. This is the most predictable model when it comes to budgeting, so the prerequisite to using it is a stable and consistent need for resources. Flat-rate pricing is usually cheaper per resource, as the DWH providers know in advance how many resources to allocate.
On-demand means that your storage and compute resources scale automatically as you need them. This is a great option if you plan to scale your resources up and down quickly, but be aware that it is also where you can rack up the most cost. One important, and often overlooked, detail is that DWH providers will charge you either per data scanned (measured in bytes) or per processing time. The billing intervals are often very small, but predefined minimum thresholds are common. For example, in BigQuery you will be billed for a minimum of 10 MB even if your query scans only 10 KB. In Snowflake you will be billed for a minimum of one minute of compute, and per second after that. The on-demand model is more expensive because the DWH providers cannot plan the required resources in advance, so you pay extra for flexibility.
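To see how such minimum thresholds play out, here is a minimal sketch of a bytes-scanned billing model. The 10 MB minimum mirrors the BigQuery rule above; the $5/TB rate is an illustrative assumption, not a current price list.

```python
# Sketch of on-demand, bytes-scanned billing with a per-query minimum.
# MIN_BILLED_BYTES follows the 10 MB rule described above; PRICE_PER_TB
# is an assumed rate for illustration only.

MIN_BILLED_BYTES = 10 * 1024**2   # 10 MB minimum billed per query
PRICE_PER_TB = 5.00               # assumed $ per TB scanned (hypothetical)

def query_cost(bytes_scanned: int) -> float:
    """Cost of one query: billed bytes are rounded up to the minimum."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / 1024**4 * PRICE_PER_TB

# A 10 KB query is billed exactly like a 10 MB query:
small = query_cost(10 * 1024)
full_minimum = query_cost(10 * 1024**2)
```

The takeaway: many tiny queries can cost far more than their actual bytes scanned would suggest, because each one is rounded up to the minimum.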
Hybrid models allow you to buy a base of resources, which can then be supplemented with on-demand resources. This model is ideal for teams that have a very consistent daily consumption, and only occasionally run larger loads.
Storage-related pricing
Where compute pricing falls into neat categories, storage pricing differs significantly between DWH providers. So let’s look at them individually.
Snowflake differentiates between two storage categories: ‘pre-purchased capacity’ and ‘on-demand storage’. On-demand storage carries the larger bill: the rate increases from $24.50/TB for pre-purchased capacity to $45/TB for on-demand. For this reason, it is worth estimating your needs beforehand and pre-purchasing an adequate number of TB to save cost.
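A quick back-of-the-envelope calculation, using the two Snowflake per-TB rates above, shows how much pre-purchasing saves (rates change over time, so check the provider’s current price list before budgeting):

```python
# Monthly Snowflake-style storage cost: pre-purchased vs on-demand,
# using the per-TB rates cited in the text.

PRE_PURCHASED_PER_TB = 24.50  # $/TB/month, pre-purchased capacity
ON_DEMAND_PER_TB = 45.00      # $/TB/month, on-demand storage

def monthly_storage_cost(tb_stored: float, pre_purchased: bool) -> float:
    rate = PRE_PURCHASED_PER_TB if pre_purchased else ON_DEMAND_PER_TB
    return tb_stored * rate

# Storing 10 TB: $245/month pre-purchased vs $450/month on-demand.
saving = monthly_storage_cost(10, False) - monthly_storage_cost(10, True)
```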
BigQuery, on the other hand, charges differently for ‘long-term storage’ and ‘active storage’. Long-term storage refers to tables that have not been modified for 90 days; their cost drops by 50%, with no change in performance, durability, or availability. The catch is that modifying a long-term table moves it back to the active-storage rate, so even the smallest change to a table in long-term storage can come with a hefty price tag.
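The tiering rule can be sketched as a simple function of how long a table has sat unmodified. The 50% discount after 90 days is the rule described above; the active-storage rate here is a hypothetical placeholder.

```python
# BigQuery-style storage tiering: tables unmodified for 90+ days are
# billed at half the active rate. ACTIVE_RATE_PER_GB is assumed for
# illustration, not an actual price.

ACTIVE_RATE_PER_GB = 0.02   # $/GB/month, hypothetical active-storage rate

def monthly_table_cost(gb: float, days_since_last_modified: int) -> float:
    rate = ACTIVE_RATE_PER_GB
    if days_since_last_modified >= 90:
        rate *= 0.5          # long-term storage: 50% cheaper
    return gb * rate

# Touching a dormant 1 TB table doubles its monthly storage bill:
dormant = monthly_table_cost(1000, 120)
touched = monthly_table_cost(1000, 0)
```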
AWS Redshift and Firebolt stick to a fixed storage price: cost does not depend on whether the storage is active, long-term, pre-purchased, or on-demand.
Be aware of batch vs. streaming
When loading data into your DWH, most providers will offer a choice between 'batch loading' and 'streaming'.
Batch loading loads data at regular intervals, so the data arrives in the DWH in chunks. This option is typically offered for free.
Streaming inserts new data in (near) real-time as it is created, and is often a premium feature with a fairly high cost.
Most of the time, batch is a perfectly adequate option, as you can set your sync frequency based on your needs. Unless you have a use case that requires real-time data, such as real-time transactional data, stick with batch.
What is the best option?
Unfortunately, there is no clear-cut answer, as the choice of DWH depends entirely on your company’s query-usage patterns. For example, if you have a low volume of data but run complex queries, you are better off paying per data processed. However, if you run simple queries on large data sets, a DWH that charges per processing time will be cheaper.
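To make this trade-off concrete, here is a back-of-the-envelope comparison of the two on-demand models for two workload profiles. All rates are hypothetical placeholders; plug in the numbers from your candidate providers’ price lists.

```python
# Comparing bytes-scanned vs compute-time pricing for a monthly workload.
# Both rates below are hypothetical, chosen only to illustrate the trade-off.

BYTES_RATE_PER_TB = 5.00      # $ per TB scanned (hypothetical)
TIME_RATE_PER_HOUR = 2.00     # $ per compute-hour (hypothetical)

def bytes_model_cost(tb_scanned_per_month: float) -> float:
    return tb_scanned_per_month * BYTES_RATE_PER_TB

def time_model_cost(compute_hours_per_month: float) -> float:
    return compute_hours_per_month * TIME_RATE_PER_HOUR

# Complex queries over little data: few TB scanned, many compute-hours,
# so paying per data scanned wins.
complex_small = (bytes_model_cost(2), time_model_cost(50))
# Simple queries over big data: many TB scanned, few compute-hours,
# so paying per processing time wins.
simple_large = (bytes_model_cost(100), time_model_cost(20))
```

Running your last few months of actual query statistics through a sketch like this is a cheap way to sanity-check a provider choice before committing.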
No matter which choice you make, it is a good idea to organize the costs on your provider’s billing page into categories, so you have visibility into your consumption. You can, for example, label costs by department and review them in billing reports, which will give you useful insight into which areas of your setup drive the most cost. Most providers offer custom cost dashboards where you can monitor these areas more closely and weigh the cost of each part of your DWH against the value it produces. While doing this, ask yourself: “Are we spending most of our DWH cost on the data sets with the most business impact?”