Amazon Redshift - Provisioned clusters

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is designed to handle large datasets and perform complex queries on structured data. Redshift is based on PostgreSQL, and it is optimized for high-performance analysis and reporting.

Weld supports Redshift Serverless and Redshift Provisioned clusters. For Redshift Serverless use the Redshift integration and follow this guide.

Prerequisites

An active AWS account with access to Redshift, S3 and IAM roles/policies
An active Redshift Provisioned cluster deployment
An Amazon S3 bucket used for data loading

Setup guide

Part 1 - Set up S3 bucket

Create an S3 bucket where Weld will upload the data in GZIP NDJSON format before loading it into Redshift

Make sure that the bucket is not publicly accessible.

Weld will not expire the items in the bucket by default. If you would like to set up lifecycle rules to automatically delete old data, you can do so via the S3 console. This will only remove raw data from the bucket, it will not delete the data from Redshift.

You can also use this S3 bucket as a data lake as it will always have the raw data available. An example Redshift query that can be used to copy data from the S3 bucket to Redshift is:

COPY "{{SCHEMA_NAME}}"."{{TABLE_NAME}}" FROM '{{S3_PATH}}' iam_role '{{IAM_ROLE_ARN}}' GZIP JSON as 'auto ignorecase' dateformat 'auto'

Where {{IAM_ROLE_ARN}} is the ARN of the IAM role that you will set up in the Part 3.

Part 2 - Redshift Setup

For the following steps, you will need to be able to run queries against Redshift with admin permissions. You can do this either by opening the built in Query Editor in the Redshift console, or via a 3rd party database management tool.

Step 1 - Create database

CREATE DATABASE "weld";

Step 2 - Create user for database:

CREATE USER "weld_user" PASSWORD DISABLE;

Step 3 - Grant permissions to user:

GRANT ALL PRIVILEGES ON DATABASE "weld" TO "weld_user";

Note: You can use any database name and username, but you will need to ensure that the IAM role is set up correctly in the next step.

Part 3 - IAM Setup

Step 1 - Create an IAM policy with the following permissions (edit in JSON mode)

These S3 permissions ensure that Weld is able to upload data to S3 and Redshift is able to read from S3. The Redshift Serverless permissions ensure that Weld can get temporary credentials for Redshift Serverless. Weld only supports temporary credentials for Redshift as this is the most secure way to access Redshift.

{
	"Version": "2012-10-17",
	"Statement": [
    {
  	  "Sid": "s3Access",
  	  "Effect": "Allow",
  	  "Action": [
  		  "s3:PutObject",
  		  "s3:GetObject",
  		  "s3:GetEncryptionConfiguration",
  		  "s3:ListBucketMultipartUploads",
  		  "s3:AbortMultipartUpload",
  		  "s3:ListBucket",
  		  "s3:DeleteObject",
  		  "s3:GetBucketLocation",
  		  "s3:ListMultipartUploadParts"
  	  ],
  	  "Resource": [
  		  "arn:aws:s3:::{{YOUR_S3_BUCKET_NAME}}/*",
  		  "arn:aws:s3:::{{YOUR_S3_BUCKET_NAME}}"
  	  ]
    },
    {
  	  "Sid": "redshiftAccess",
  	  "Effect": "Allow",
  	  "Action": [
  		  "redshift:GetClusterCredentials"
  	  ],
  	  "Resource": [
  		  "{{REDSHIFT_USER_ARN}}",
  		  "{{REDSHIFT_DB_ARN}}"
  	  ]
    }
  ]
}

Note: {{REDSHIFT_USER_ARN}} and {{REDSHIFT_DB_ARN}} are the ARNs of the Redshift user and database respectively. These cannot be found in the Redshift console, but can be easily constructed:

arn:aws:redshift:{{AWS_REGION}}:{{AWS_ACCOUNT_ID}}:dbuser:{{REDSHIFT_CLUSTER_NAME}}/{{REDSHIFT_USERNAME}}

arn:aws:redshift:{{AWS_REGION}}:{{AWS_ACCOUNT_ID}}:dbname:{{REDSHIFT_CLUSTER_NAME}}/{{REDSHIFT_DATABASE_NAME}}

Step 2 - Create an IAM role and add the following trust relationships to ensure that the Weld AWS account (887019001183) and Redshift can assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Effect": "Allow",
          "Principal": {
              "AWS": "arn:aws:iam::887019001183:root"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
              "StringEquals": {
                  "sts:ExternalId": "{{EXTERNAL_ID}}"
              }
          }
      },
      {
          "Effect": "Allow",
          "Principal": {
              "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
      }
  ]
}

For {{EXTERNAL_ID}} you can use any string. You will need to use the same string in the Weld UI when configuring the Redshift connection.

Step 3 - Attach IAM policy to IAM role

On the IAM role page, click on Add permissions then select Attach policies. Search for the policy that you created in Part 3 - Step 1.

Part 4 - IAM role association with Redshift

Open the Redshift cluster, and click on the Actions button.
Click on Manage IAM Roles
Select the previously created role from the dropdown and press Associate IAM Role. After this, save the changes.

Checklist

Ensure that you have completed the following steps before proceeding to the next part:

Create S3 bucket
Create Redshift database
Create Redshift user
Grant all privileges for database to user
Create IAM policy with S3 and Redshift permissions
Create IAM role with trust relationship
Attach IAM policy to IAM role
Associate IAM role with Redshift cluster

Part 4 - Weld Setup

Fill out the details in the Weld UI for the Redshift data warehouse connection
- For the host and port, use the Endpoint from your Redshift cluster details. Note that the Endpoint will be in the format {{HOST}}:{{PORT}}/{{DATABASE}}
- For the database, use the Database that you created in Part 2 - Step 1. If you didn't change the name, it will be called weld
- For the user, use the username that you created in Part 2 - Step 2. If you didn't change the name, it will be called weld_user
- For External ID you can use any string, but ensure that it is the same as the one in the IAM Role's trust relationship.
- For the S3 Bucket, input your the name of the S3 bucket that you created in Part 1.
If using SSH, enable it and configure the SSH credentials
Click Connect