Companies on a mission to become data-driven face the challenge of having their data scattered across various SaaS tools. Today, an SMB uses on average 16–24 different tools, each producing and storing data in its own database. As you can imagine, extracting a single source of truth from this data can be quite challenging. However, once you implement the modern data stack and centralize your data in one place, the complexity of analytics drops significantly, allowing you to derive valuable insights from a centralized data set that can drive your business forward.

A popular option for centralizing your data is a data warehouse (DWH). Although data warehouses have been in development since the 1980s, we saw a surge in the 2010s thanks to advances in cloud storage. Today, a wide variety of excellent cloud-based warehouse solutions exist—with BigQuery, Amazon Redshift, Snowflake, and Firebolt being just a few examples. These cloud-based data warehouses are easy to set up and use, offering cheap storage and almost unlimited computing power.

But what’s the catch? While these powerful tools allow you to do nearly anything with your data, they offer a variety of settings and choices that, over time, can lead to an almost exponential increase in cost. In this article, we explain the cost structures of the most widely used data warehouses so you can figure out which DWH works best (and cheapest) for your setup.

There is no doubt that it was easier to monitor and limit data-related expenses before the advent of the cloud DWH. What the cloud DWH really did for data analysis was make it easy for engineering teams to add new resources, sometimes automatically, to match their computational needs. However, the ability to constantly increase your resources can quickly get out of hand, making it hard to know which resources are necessary and what purpose they serve. If you want to control the cost of your data warehouse, it is imperative to understand what the DWH bill consists of.

Understand Your Data Warehouse Bill: Compute and Storage

The main purpose of your DWH is to centralize your data storage and efficiently extract insights. Therefore, most DWH costs are related to compute (the ability to quickly extract insights) and storage (the amount of data you keep in your DWH). Additional services may be available, but they are provider-dependent and will not be covered here. In the following section, we dive deeper into the pricing models various vendors offer and how they quantify and measure the processing you do.

DWH pricing models related to compute fall into three categories: on-demand, flat-rate, and hybrid. You will encounter these options regardless of which DWH you choose, as they are common across providers.

Flat-rate pricing lets you buy a pre-specified amount of resources up front, which you then have at your disposal. This model is the most predictable when it comes to budgeting and scaling, provided you have a stable and consistent need for resources. Flat-rate pricing is usually cheaper per resource since the provider knows in advance how many resources to allocate.

On-demand means that your storage and compute resources scale automatically as needed. This is a great option if you plan to scale resources up and down quickly; however, it is also where you can rack up the most cost. One important thing to note is that DWH providers may charge you per amount of data scanned (measured in bytes) or per processing time (measured in seconds of compute). Billing intervals are often very short, but predefined minimum thresholds are common. For example, in BigQuery, you will be charged for a minimum of 10 MB of scanned data even if you run a 10 KB query. In Snowflake, compute is billed per second, with a 60-second minimum each time a warehouse starts. The on-demand model is more expensive because providers cannot plan resource needs in advance, so you pay extra for flexibility.
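
To make these minimum thresholds concrete, here is a minimal sketch of how on-demand query costs could be estimated. All rates below are illustrative assumptions rather than official price lists; only the 10 MB and 60-second minimums mirror the examples above.

```python
# Illustrative sketch: estimating on-demand compute cost under two common
# billing schemes. Rates are assumptions for illustration only.

BQ_PRICE_PER_TB_SCANNED = 5.0      # assumed on-demand rate per TB scanned
BQ_MIN_BYTES = 10 * 1024 ** 2      # 10 MB minimum billed per query
SF_CREDIT_PRICE = 3.0              # assumed price per Snowflake credit
SF_CREDITS_PER_HOUR = 1.0          # assumed credits/hour for a small warehouse
SF_MIN_SECONDS = 60                # 60-second minimum per warehouse start

def bigquery_on_demand_cost(bytes_scanned: int) -> float:
    """Cost of a query billed per data scanned, with a 10 MB minimum."""
    billed_bytes = max(bytes_scanned, BQ_MIN_BYTES)
    return billed_bytes / 1024 ** 4 * BQ_PRICE_PER_TB_SCANNED

def snowflake_on_demand_cost(compute_seconds: float) -> float:
    """Cost of a query billed per second of compute, with a 60-second minimum."""
    billed_seconds = max(compute_seconds, SF_MIN_SECONDS)
    return billed_seconds / 3600 * SF_CREDITS_PER_HOUR * SF_CREDIT_PRICE

# A tiny 10 KB query is still billed as 10 MB of scanned data:
print(f"{bigquery_on_demand_cost(10 * 1024):.8f} USD")
# A 5-second query is still billed as 60 seconds of compute:
print(f"{snowflake_on_demand_cost(5):.4f} USD")
```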

Hybrid models allow you to buy a base level of resources, which can then be supplemented with on-demand resources. This model is ideal for teams with consistent daily consumption that occasionally run larger loads.
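
As a rough illustration of the hybrid idea, the sketch below charges a fixed base capacity at a flat rate and bills only the overflow at a more expensive on-demand rate; every number in it is hypothetical.

```python
def hybrid_monthly_cost(used_slot_hours: float,
                        base_slot_hours: float = 1000,
                        flat_rate_per_slot_hour: float = 0.04,
                        on_demand_rate_per_slot_hour: float = 0.06) -> float:
    """Flat fee for the reserved base, on-demand pricing only for the overflow."""
    base_cost = base_slot_hours * flat_rate_per_slot_hour
    overflow = max(0.0, used_slot_hours - base_slot_hours)
    return base_cost + overflow * on_demand_rate_per_slot_hour

print(hybrid_monthly_cost(800))    # under the base: you still pay the full flat fee
print(hybrid_monthly_cost(1400))   # 400 overflow slot-hours billed on demand
```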

Whereas compute pricing falls into neatly defined categories, storage costs differ significantly between DWH providers. Let’s look at them individually.

Snowflake differentiates between two storage categories: ‘pre-purchased capacity’ and ‘on-demand storage’. You will receive a larger bill for on-demand storage; the rate increases from $24.5 per TB/month for pre-purchased capacity to $45 per TB/month for on-demand. For this reason, it is worthwhile to assess your needs beforehand and secure an adequate amount of pre-purchased storage to save cost.

BigQuery, on the other hand, charges differently for ‘long-term storage’ and ‘active storage’. Long-term storage refers to tables that have not been modified for 90 consecutive days; here, the cost drops by 50% without affecting performance, durability, or availability. Keep in mind that even the smallest change to such a table moves it back to active-storage pricing, so its storage rate doubles again for at least the next 90 days.

Amazon Redshift and Firebolt stick to a fixed storage price, meaning there is no cost difference whether the storage is active, long-term, pre-purchased, or on-demand.
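
To compare these storage models side by side, here is a small sketch that estimates a monthly storage bill. The Snowflake rates come from the text above; the BigQuery and flat-rate figures are rough assumptions for illustration only.

```python
# Rough monthly storage cost per TB. Snowflake rates are from the text above;
# the BigQuery and flat-rate figures are illustrative assumptions.
STORAGE_RATES_PER_TB_MONTH = {
    "snowflake_pre_purchased": 24.5,
    "snowflake_on_demand": 45.0,
    "bigquery_active": 20.0,        # assumed active-storage rate
    "bigquery_long_term": 10.0,     # roughly 50% of the active rate
    "fixed_rate_provider": 23.0,    # assumed flat price, Redshift/Firebolt style
}

def monthly_storage_cost(terabytes: float, plan: str) -> float:
    """Estimated monthly bill for storing `terabytes` of data under `plan`."""
    return terabytes * STORAGE_RATES_PER_TB_MONTH[plan]

# Storing 10 TB: pre-purchasing Snowflake capacity saves ~$205/month vs on-demand.
for plan in STORAGE_RATES_PER_TB_MONTH:
    print(f"{plan:>25}: ${monthly_storage_cost(10, plan):,.2f}/month")
```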

Be Aware of Batch vs. Streaming

When loading data into your DWH, most providers offer a choice between batch loading and streaming.

  • Batch loading loads data at regular intervals, bringing it into the DWH in chunks. This option is typically offered for free.
  • Streaming inserts new data (nearly) in real time as it is created and is often a premium feature associated with a fairly high cost.

Most of the time, batch loading is a perfectly adequate option since you can set your sync frequency based on your needs. Unless you have a use case where real-time data is required (such as real-time transactional data), stick with batch.
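
To get a rough feel for the trade-off, the sketch below compares a free batch load with a streaming insert billed per GB; the streaming rate is a hypothetical figure, not a quoted price from any provider.

```python
ASSUMED_STREAMING_RATE_PER_GB = 0.05   # hypothetical premium for streaming inserts

def monthly_ingest_cost(gb_per_day: float, streaming: bool) -> float:
    """Batch loading is assumed free; streaming is billed per GB ingested."""
    if not streaming:
        return 0.0
    return gb_per_day * 30 * ASSUMED_STREAMING_RATE_PER_GB

print(monthly_ingest_cost(50, streaming=False))  # 0.0  - hourly/daily batch syncs
print(monthly_ingest_cost(50, streaming=True))   # 75.0 - near real-time inserts
```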

What Is the Best Option?

Unfortunately, there is no clear-cut answer, since the choice of DWH depends entirely on your company’s query-usage patterns. For example, if you have a low volume of data but run complex queries, you may be better off paying per amount of data processed. However, if you run simple queries on large data sets, a DWH that bills per compute time might be cheaper.
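
One way to reason about this is to estimate your monthly bill under both models from your own query profile. The sketch below does exactly that with placeholder rates; plug in your actual volumes and the rates quoted by each provider.

```python
def cost_per_data_scanned(queries_per_month: int, avg_tb_scanned: float,
                          rate_per_tb: float = 5.0) -> float:
    """Monthly cost if the provider bills per amount of data scanned."""
    return queries_per_month * avg_tb_scanned * rate_per_tb

def cost_per_compute_time(queries_per_month: int, avg_compute_hours: float,
                          rate_per_hour: float = 3.0) -> float:
    """Monthly cost if the provider bills per hour of compute time."""
    return queries_per_month * avg_compute_hours * rate_per_hour

# Complex queries over little data: scanned-data pricing wins.
print(cost_per_data_scanned(1000, 0.01), cost_per_compute_time(1000, 0.5))
# Simple queries over huge data sets: compute-time pricing wins.
print(cost_per_data_scanned(1000, 2.0), cost_per_compute_time(1000, 0.05))
```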

No matter which option you choose, it is a good idea to organize the costs shown on your provider’s billing page into categories for better visibility into your consumption. For instance, you can label costs across departments and use billing reports to get insights into which areas of your setup drive the most expense. Many providers offer custom cost dashboards where you can monitor all areas more closely. This enables you to evaluate the cost of different parts of your DWH and weigh it against the value they produce. As you do this, ask yourself: “Are we spending most of our DWH budget on the data sets with the highest business impact?”
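
As a minimal sketch of that kind of labeling, assuming you can export your billing data as rows tagged with a department label (the file name and column names here are hypothetical), you could aggregate spend per department like this:

```python
import csv
from collections import defaultdict

def spend_by_department(billing_csv_path: str) -> dict[str, float]:
    """Sum exported billing rows by a 'department' label column (hypothetical schema)."""
    totals: dict[str, float] = defaultdict(float)
    with open(billing_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            label = row.get("department", "unlabeled") or "unlabeled"
            totals[label] += float(row["cost_usd"])
    return dict(totals)

# Example: which departments drive the largest share of the DWH bill?
for dept, cost in sorted(spend_by_department("billing_export.csv").items(),
                         key=lambda kv: kv[1], reverse=True):
    print(f"{dept}: ${cost:,.2f}")
```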
