Change Data Capture (CDC)

Change Data Capture (CDC) is a design pattern for continuously identifying inserts, updates, and deletes in a source system and propagating only those changes downstream.

Compared to batch replication, CDC reduces latency, minimizes load on source systems, and helps keep downstream systems consistent as data changes.

When to use CDC

CDC is a good fit when:

You want near real-time replication (seconds/minutes instead of hours)
You need reliable deletes and updates (not just new rows)
You want to reduce load compared to frequent full-table scans on large tables
You want downstream systems to react faster to operational changes

CDC can be overkill if your datasets are small, change slowly, or a few hours of delay is acceptable.

Common CDC approaches

Different systems implement CDC in different ways. The most common approaches are:

Timestamp or version polling

Query rows where last_modified > last_sync.

✅ Simple to set up
⚠️ Easy to miss deletes
⚠️ Requires careful handling of late updates / clock skew

Snapshot diffs

Take periodic snapshots and compare them.

✅ Works in most systems
❌ Can get expensive as tables grow
⚠️ Latency depends on snapshot frequency

Triggers / outbox tables

Write changes to a change table during writes.

✅ Captures inserts/updates/deletes reliably
⚠️ Adds write overhead and operational complexity

Log-based CDC

Read from the database’s transaction log or native change stream (e.g. WAL/binlog/change streams).

✅ Best coverage for inserts/updates/deletes
✅ Strong ordering guarantees
✅ Low read load on the source (no table scans)

In most production systems, log-based CDC is preferred when available.

CDC vs other sync modes

Approach	How it works	Captures deletes	Low latency	Low source load	Best fit
Full table sync	Re-copy entire tables on a schedule	❌	❌	❌	Small datasets
Incremental sync	Pull rows using timestamps or IDs	⚠️	⚠️	✅	Moderate freshness needs
Change Data Capture (CDC)	Stream changes from logs/streams	✅	✅	✅	Real-time systems

How CDC works in Weld (high level)

Connect your database in Weld using a dedicated user.
Enable CDC mode for the connector.
Select tables/collections to capture.
Weld consumes changes and applies them to your destination based on your latency/apply settings.

Supported databases

Best practices

Start with a small, high-value domain and expand gradually
Ensure each replicated table/collection has a stable identifier (primary key or equivalent)
Make downstream apply logic idempotent (safe on retries)
Plan for schema changes and backfills
Monitor freshness/lag, throughput, and errors to avoid silent drift

Housekeeping

If you stop or permanently delete a CDC sync, clean up any CDC-related resources created for that sync (where applicable) to avoid unnecessary retention or overhead.

FAQs

Choosing CDC

Does CDC replace ETL?
Not usually. CDC and batch ETL are complementary:

Use CDC for continuous freshness (inserts/updates/deletes) and low-latency sync.
Use batch ETL for initial loads, backfills, historical rebuilds, and recomputing derived models.
A common setup is: snapshot/backfill first, then CDC keeps tables current.

Is CDC always the right choice?
No. CDC adds operational complexity. If your data changes slowly, tables are small, or hourly/daily updates are fine, scheduled syncs can be simpler and cheaper.

What latency should I expect?
CDC is typically seconds to minutes, but depends on source write volume, network, and how changes are applied in the destination (per-event vs micro-batch).

Data correctness

Do I need primary keys (or unique IDs) for CDC?
For reliable updates/deletes, yes.

Relational tables should have a primary key (or stable unique key).
MongoDB typically uses _id.
Without a stable identifier, downstream systems can’t safely apply updates/deletes to the correct record.

Will CDC capture deletes?
With true CDC (log/stream-based), yes. How deletes appear downstream depends on your destination strategy (hard deletes vs soft deletes/flags).

How do schema changes work with CDC?
Adding columns is usually straightforward. Renames/drops/type changes often require coordination and sometimes a backfill. Treat schema evolution as a planned workflow.

Warehouses and apply patterns

How do warehouses apply CDC?
Most destinations apply CDC using one of these patterns:

Stage → MERGE: land change events in a staging/delta table, then MERGE into the target table.
Micro-batch apply: accumulate changes briefly and apply in batches (often better cost/perf).
Soft delete: represent deletes via a deleted flag/timestamp when physical deletes aren’t desired.

Operations and monitoring

Does CDC impact my production database?
CDC avoids table scans, but it’s not free:

It reads from a database’s log/stream and may require retaining logs long enough for the consumer to keep up.
If a CDC sync is paused or abandoned without cleanup, log retention can grow and consume disk.

What should I monitor?
At minimum:

Replication/apply lag
Error rate / retries
Log retention pressure (WAL/binlog/oplog growth)
Destination apply/merge duration

What happens if Weld goes down or the connection drops?
CDC pipelines typically resume from the last stored position (offset/checkpoint). If the source retention window is exceeded while offline, a resnapshot/backfill may be required.

Can I backfill without breaking CDC?
Yes. Treat backfills as a controlled operation:

Prefer staging/backfill tables and swap/merge strategies
Expect higher destination apply cost during large backfills
Keep CDC running where possible

Database-specific

How does PostgreSQL CDC work?
PostgreSQL CDC is typically WAL-based logical replication, using a replication slot and a publication to define which tables to capture.
See: PostgreSQL CDC

How does MySQL CDC work?
MySQL CDC is typically binlog-based, streaming row changes via a replication connection (commonly binlog_format=ROW).
See: MySQL CDC

How does Microsoft SQL Server CDC work?
SQL Server provides native CDC that reads the transaction log and records changes for enabled tables (commonly via CDC change tables).
See: Microsoft SQL Server CDC

How does MongoDB CDC work?
MongoDB CDC is typically implemented using Change Streams, which emit document-level change events.
See: MongoDB CDC