What Is Change Data Capture (CDC)
Learn what Change Data Capture (CDC) is, how it works, when to use it, pros and cons, patterns, tools, and best practices. Plus how CDC connects to a modern data stack.
Quick overview
Change Data Capture (CDC) is a design pattern for continuously identifying inserts, updates, and deletes in a source system and propagating only those changes downstream. Compared to batch ETL, CDC reduces latency, minimizes source load, and unlocks real-time analytics and event-driven architectures. Modern implementations favor log-based approaches that read database transaction logs for low overhead and full fidelity.
Why CDC is important today
Your company’s operational truth changes constantly. A user updates their profile, an order is amended, a subscription is canceled. If your analytics, automations, and customer touch points lag hours behind, you’re making decisions on stale data. CDC narrows that gap by streaming only what changed, not entire tables. In practice, that means fresher dashboards, leaner pipelines, and less strain on engineering teams as datasets scale.
CDC is more than a speed boost. It reduces load on primary systems, creates reliable change history, and enables downstream systems to react more quickly. Instead of thinking in large nightly data jobs, CDC encourages smaller, incremental data flows that keep everything continuously up to date.
What CDC actually is (and isn’t)
From our perspective at Weld, one important caveat is that not every team should rush to adopt CDC. If your datasets are small, change slowly, or your systems can tolerate a few hours of delay, then traditional batch processes may be simpler and more cost-effective.
CDC is not a single product. It’s an architectural approach that can be implemented in different ways, but the outcome is always the same: capture changes at the row level and move them downstream in a reliable and consistent fashion. Many databases and platforms expose CDC natively. Others rely on connectors that read the database’s transaction log or use triggers.
In short: CDC continuously captures inserts, updates, and deletes from a source system and delivers them to downstream systems so they can stay in sync in near real time.

How CDC works: core patterns and trade-offs
Here are the main approaches we see in practice, along with examples:
- Timestamp or version polling - Query rows where last_modified > last_sync. It’s simple to set up (a minimal sketch follows this list), but in one SaaS dataset we looked at, this method missed deletes entirely, which led to inconsistencies.
- Snapshot diffs - Take full or partitioned snapshots and compare them. This can work universally, but we’ve seen teams struggle with rising costs when table sizes grow into the billions of rows.
- Database triggers - Write changes to an outbox or change table. This ensures inserts, updates, and deletes are captured, but a fintech team we spoke with experienced latency spikes because every write fired additional logic.
- Log-based CDC - Read directly from transaction logs like Postgres WAL or MySQL binlog. This is the gold standard today: it captures every change with strong ordering guarantees and little extra strain. In our internal testing with Postgres, log-based CDC has consistently kept replication lag low and stable.
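To make the first pattern concrete, here is a minimal sketch of timestamp polling against a hypothetical orders table with a last_modified column; the :last_sync watermark would come from wherever the previous run’s high-water mark is stored.

```sql
-- Minimal sketch of timestamp polling (hypothetical orders table).
SELECT id, status, amount, last_modified
FROM orders
WHERE last_modified > :last_sync   -- watermark saved from the previous run
ORDER BY last_modified;

-- A hard-deleted row simply disappears from this result set,
-- which is why pure polling tends to miss deletes.
```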
Among these, log-based CDC has become the standard for production systems because it captures every change with strong ordering guarantees while putting little additional strain on the source database. The other approaches remain useful fallbacks when you can’t access transaction logs, but each carries the trade-offs noted above.
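As a rough illustration of what log-based capture looks like on Postgres, logical decoding can be exercised directly from SQL. This sketch assumes wal_level is set to logical and uses the built-in test_decoding plugin purely for demonstration; production tools typically use pgoutput or a similar output plugin.

```sql
-- Sketch: reading changes from the Postgres WAL via logical decoding.
-- Assumes wal_level = logical and sufficient replication privileges.
SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Peek at decoded inserts, updates, and deletes without consuming them:
SELECT * FROM pg_logical_slot_peek_changes('cdc_demo', NULL, NULL);
```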
The CDC pipeline, end to end
A typical CDC pipeline starts with an initial snapshot to provide downstream systems with a baseline. Once that snapshot is complete, the system begins capturing new changes as they happen. These changes are transported through a message bus or directly into the warehouse. Along the way, they may be filtered, enriched, or deduplicated before being applied as upserts or deletes. To ensure reliability, offsets and checkpoints are stored for recovery. Finally, observability is essential: monitoring lag, throughput, and error rates ensures pipelines run smoothly, even as schemas evolve.
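As a sketch of the apply step, assuming changes have been landed in a hypothetical staging table orders_changes that holds the latest event per primary key plus an op column, a warehouse merge might look like this:

```sql
-- Sketch: applying a deduplicated batch of change events as upserts/deletes.
MERGE INTO analytics.orders AS t
USING staging.orders_changes AS c
  ON t.id = c.id
WHEN MATCHED AND c.op = 'delete' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET status = c.status, amount = c.amount, updated_at = c.changed_at
WHEN NOT MATCHED AND c.op <> 'delete' THEN
  INSERT (id, status, amount, updated_at)
  VALUES (c.id, c.status, c.amount, c.changed_at);
```

Because the merge is keyed on the primary key and only the latest event per key is applied, replaying the same batch after a failure leaves the table unchanged.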
Where CDC shines and where it’s overkill
CDC is an excellent fit for scenarios where fresh data truly matters. Real-time analytics, customer 360 projects, and compliance use cases all benefit from data that reflects operational reality within seconds or minutes. It’s also a powerful enabler for event-driven architectures and microservices.
But CDC isn’t necessary everywhere. For small, slowly changing datasets, nightly batch updates may be more than enough. And in legacy systems where you can’t access logs or add triggers, implementing CDC may be too costly compared to the benefits.
CDC vs ETL vs app events
It’s important to understand what CDC is, and isn’t, in relation to other approaches. Traditional ETL pipelines move bulk data on a schedule, often requiring heavy merges to stay current. CDC continuously streams only the changes, reducing the load and improving freshness. Application events are useful for specific workflows, but they don’t always provide a complete picture of state. CDC is rooted in the database itself, making it a reliable source of truth. In many modern data stacks, CDC and ETL coexist, each serving different needs.
From our experience at Weld, there are even cases where teams think they need CDC but actually don’t. If the only requirement is to refresh a dashboard every few hours, implementing CDC might add unnecessary complexity without meaningful benefit.
Tooling landscape
There’s a wide ecosystem of CDC tools. Open-source projects like Debezium and Kafka Connect provide log-based CDC connectors for popular databases. Many cloud databases, including PostgreSQL and MySQL, now expose native CDC features. Cloud warehouses such as BigQuery and Snowflake support efficient upserts and deletes to make applying CDC changes easier. Commercial platforms also exist, offering turnkey connectors, monitoring, and enterprise support.
When choosing a tool, it’s important to evaluate which databases and platforms it supports, how it handles schema changes and backfills, the delivery guarantees it offers, and what kind of monitoring is built in.
How CDC works in Weld
Weld’s CDC connectors are designed to bring real-time database changes into your data warehouse with minimal operational overhead. Rather than scanning tables or relying on timestamp-based syncs, Weld reads directly from database logs, the source of truth for all committed transactions.
The diagram below illustrates how Weld CDC captures row-level changes from your operational databases, adds metadata for tracking, and keeps your data warehouse perfectly aligned in real time.

Figure: Weld’s CDC flow captures inserts, updates, and deletes directly from the source database, enriches them with metadata fields like _weld_updated_at and _weld_deleted_at, and keeps your warehouse tables continuously in sync, without manual snapshots or merges.
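As a hedged example of how those metadata fields can be used downstream, assuming deletes are surfaced by populating _weld_deleted_at rather than physically removing the row, a model that should only see live records might filter like this:

```sql
-- Illustrative only: exclude rows flagged as deleted in the source.
SELECT *
FROM raw.orders            -- hypothetical synced table
WHERE _weld_deleted_at IS NULL;
```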
MySQL CDC
Weld’s MySQL CDC connector streams row-level inserts, updates, and deletes by reading from MySQL’s binary log (binlog) through a replication connection. This approach delivers lower latency, reduces read pressure on the primary, and reliably captures deletes. Weld consumes row-based events from the binlog and applies them downstream in near real time.
To enable this, MySQL must have binary logging turned on and configured for row-based capture. Typical prerequisites include: log_bin=ON, binlog_format=ROW, binlog_row_image=FULL, a unique server_id, and a retention window (binlog_expire_logs_seconds=604800 recommended). Each CDC table must have a primary key or unique index to ensure updates and deletes are applied correctly.
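A quick way to verify those settings on an existing MySQL 8.0 instance is to inspect the server variables directly (names may differ slightly on older versions):

```sql
-- Check the binlog prerequisites before enabling CDC.
SHOW VARIABLES WHERE Variable_name IN
  ('log_bin', 'binlog_format', 'binlog_row_image',
   'server_id', 'binlog_expire_logs_seconds');
```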
For more details on MySQL CDC, see our documentation here.
Best practices
Start small, with a single high-value domain, and expand gradually. Favor log-based CDC where possible, since it offers the most reliable coverage with the least overhead. Make sure downstream logic is idempotent so it can handle retries. Build transformations that are aware of schema versions and document ownership of each pipeline. Finally, separate backfills from live streams to protect latency and reliability.
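For the idempotency point in particular, a common trick is to deduplicate the change feed so that only the latest event per key ever reaches the merge. Here is a sketch, assuming a hypothetical change table with id and changed_at columns (QUALIFY is supported by Snowflake and BigQuery; other engines need a subquery):

```sql
-- Keep only the most recent change per primary key before merging.
SELECT *
FROM staging.orders_changes
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY changed_at DESC) = 1;
```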
Where Weld fits today
Weld is built for modern data teams that want high-quality, timely data without brittle jobs. Today, customers use Weld to ingest from databases and SaaS tools, model that data with SQL, orchestrate reliable pipelines, and activate fresh data back into business tools. This foundation makes CDC-style workflows easier to adopt because the downstream modeling, orchestration, and activation are already solved.
[Coming soon] Change Data Capture (CDC) for PostgreSQL & MySQL
We’ve started internal testing for CDC support to enable real-time replication of row-level inserts, updates, and deletes. Weld will automatically detect eligible tables and capture change events seamlessly.
FAQs
Does CDC replace ETL?
No. CDC complements batch ETL. Use CDC for freshness and ETL for historical rebuilds.
How do warehouses handle CDC?
BigQuery and Snowflake can apply streamed upserts and deletes efficiently.
What about SQL Server?
SQL Server exposes native CDC features to record changes.
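For reference, SQL Server’s built-in CDC is enabled with system stored procedures; the table name below is illustrative:

```sql
-- Turn on CDC for the database, then for a specific table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = NULL;
```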