BigQuery

BigQuery is a managed, serverless data warehouse product by Google, offering scalable analysis over large quantities of data.

Features

Feature nameSupported
Column HashingTrueColumn level
BlockingTrueColumn level
IncrementalTrue
Custom dataTrue
HistoryFalse
ReSyncTrueTable level
TemplatesFalse

πŸ”§ Setup Guide

1. Enable Cloud Resource Manager API in GCP Go to Cloud Resource Manager API and press Enable

This is required for us to verify that the service account provided to weld, has the correct permissions to operate. If there is no Enable button, it might be enabled already.

2. Create service account Go to Service Accounts, and press create new account.

Give it a name and press Create and continue. Grant the service account the following roles:

  • BigQuery Job User
  • BigQuery Data Viewer

now press Done

3. Download your service account key

In the Service accounts window, click the service account you just created, and then press Keys.

Now click Add Key, and then Create New Key. Select JSON, and then press Create.

4. Almost done!

With your freshly created service account key file downloaded, all that is left to do is to drop it into the box in the weld app here, select your GCP hosting location, and you are ready to start syncing data from your BigQuery!

Step 3 - Data To Sync

  1. Select the tables you wish to include in the sync.
    You can view the schema, remove columns or hash sensitive information.

Step 4 - Configure Sync

  1. Select how often you would like the data to sync.
  2. Provide a unique destination table name.

Weld will take over from here and commence syncing data from your BigQuery.

Optimizing Your Data with Clustering and Partitioning

In BigQuery, you can use clustering and partitioning to make your queries run faster and cost less. Think of it like organizing a huge library. Instead of searching every single book for information, you can go directly to the right section and find what you need quickly.

Partitioning: Dividing Your Table into Sections

Partitioning is like dividing your library into different sections, such as "Fiction," "History," and "Science." In BigQuery, you divide your data table into smaller, more manageable parts based on a specific column, usually a date.

When you run a query that asks for data from a specific time period, BigQuery only scans the relevant partitions (sections), ignoring the rest. This can dramatically speed up your queries and reduce costs, as you're not scanning the entire table.

In Weld, you can easily set up partitioning. You just need to tell it which column to use for organizing the data.

Examples:

  • To partition by a date column named created_at, you would enter: DATE(created_at)
  • If your table doesn't have a date column, BigQuery can automatically add a special timestamp when your data is loaded. To partition by this, you can use an expression like: TIMESTAMP_TRUNC(_PARTITIONTIME, DAY)

You can see more about how partitioning works here: Partitioning Documentation

Querying Partitioned Tables

When you query a partitioned table, you use a WHERE clause to tell BigQuery which "section" of the library to look in. This is how you take advantage of partitioning to save time and money.

Example: If your table is partitioned by a created_at date column, your query might look like this:

SELECT *
FROM my_dataset.my_table
WHERE DATE(created_at) = "2023-10-26"

This query tells BigQuery to only scan the partition for October 26, 2023, making it much more efficient.

You can find more details in the official documentation: Querying Partitioned Tables

Clustering: Sorting the Data Within Each Section

Clustering is like organizing the books within each section of the library alphabetically by title or author. It sorts your data based on the columns you choose, so BigQuery can find the exact rows it needs even faster.

When you run a query with a filter on a clustered column, BigQuery knows exactly where to look within the partition, avoiding a full scan of that section.

In Weld, you can specify which columns to cluster by simply by listing them.

Example: To cluster by customer_id and order_date, you would enter: customer_id,order_date

By using partitioning and clustering together, you can make your BigQuery queries extremely fast and cost-effective.

You can see more about how clustering works here: Clustering Documentation

Was this page helpful?