
What is change data capture (CDC)?


Change data capture (CDC) is a proven data integration pattern that tracks when and what changes occur in data, then alerts other systems and services that must respond to those changes. It helps maintain consistency and functionality across all systems that rely on that data.

Data is fundamental to every business, but it is constantly being updated and changed, and enterprises must keep up with those changes. Whether it's transactions, orders, inventory, or customers, having current, real-time data is vital to keeping your business running. When a purchase order is updated, a new customer is onboarded, or a payment is received, applications across the enterprise need to be informed in order to complete critical business processes.

When updating a source database (often a relational database such as Oracle, Microsoft SQL Server, PostgreSQL, or MySQL), you may need to update multiple related resources such as a cache and a search index. A simple approach would require changing your applications to update those resources at the same time. However, trying to consistently write this changed data to more than one target introduces many challenges and coordination overhead. CDC enables you to avoid issues like dual writes and, instead, update resources concurrently and accurately.

CDC accomplishes this by tracking row-level changes in database source tables (categorized as insert, update, and delete events) and then making those change notifications available to any other systems or services that rely on the same data. The change notifications are emitted in the same order in which the changes were made in the original database. In this way, CDC ensures that all interested parties of a particular data set are accurately informed of the change and can react accordingly, either by refreshing their own version of the data or by triggering business processes.
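As a rough illustration of the consumer side, the sketch below applies ordered insert, update, and delete events to a local copy of the data. The `ChangeEvent` shape and the `apply_event` function are hypothetical names chosen for this example, not the event format of any particular CDC product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical row-level change event; field names are illustrative."""
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the changed row
    row: Optional[dict]   # new row state, or None for a delete


def apply_event(replica: dict, event: ChangeEvent) -> None:
    """Apply one change event to a local copy keyed by primary key."""
    if event.op in ("insert", "update"):
        replica[event.key] = event.row
    elif event.op == "delete":
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# Events arrive in the order they were committed in the source database.
events = [
    ChangeEvent("insert", 1, {"status": "new"}),
    ChangeEvent("update", 1, {"status": "paid"}),
    ChangeEvent("insert", 2, {"status": "new"}),
    ChangeEvent("delete", 2, None),
]

replica = {}
for event in events:
    apply_event(replica, event)

print(replica)  # {1: {'status': 'paid'}}
```

Because the events are applied in commit order, the replica ends in the same state as the source. Applying the same events out of order could leave a deleted row in place or an older value on top of a newer one.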

In modern microservices-driven architectures, CDC has gained new importance by providing an indispensable bridge to connect traditional databases with cloud-native, event-driven architectures. Using CDC, enterprises can continue to use their legacy databases, while still making use of data through emerging technologies. For new deployments, CDC enables useful patterns such as the "outbox," which allows microservices to exchange the consolidated data from a database transaction.
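A minimal sketch of the outbox idea, using an in-memory SQLite database: the service writes its business row and an outbox row in one transaction, and a CDC tool would then capture inserts into the outbox table. The table names, column names, and the `place_order` function are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_id INTEGER, event_type TEXT, payload TEXT)"
)


def place_order(conn, order_id):
    """Write the order and its outbox event in a single transaction."""
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "new")
        )
        payload = json.dumps({"id": order_id, "status": "new"})
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", payload),
        )


place_order(conn, 1)
outbox_rows = conn.execute(
    "SELECT aggregate_id, event_type, payload FROM outbox"
).fetchall()
print(outbox_rows)
```

Since the application only writes to its own database, there is no dual write: either both rows are committed or neither is, and other services learn about the order from the captured outbox event.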

While CDC captures database changes, it still requires a messaging service to deliver those change notifications to the applicable systems and applications. The most efficient way to accomplish this is by treating the changes as events—as in an event-driven architecture (EDA)—and sending them asynchronously.

Apache Kafka is well suited to providing asynchronous communication between the database and data consumers that require a high-volume, replayable consumption pattern. Kafka is a distributed streaming platform that can publish, subscribe to, store, and process streams of events in real time. It's designed to handle data streams from multiple sources and deliver the data to multiple destinations, with high throughput and scalability.

Change data capture ensures the events transmitted by Kafka are consistent with the changes in the original source system, or database. Because Kafka messaging is asynchronous, events are decoupled from the consumers, allowing for more reliable delivery of all changes.

Change data capture platforms, like Debezium, track changes in the database by monitoring the transaction log as changes are committed. An alternative to this approach is a simple poll-based or query-based process that repeatedly queries the source tables for changes.
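To make the shape of a log-based change event more concrete, here is a sketch that reads a simplified Debezium-style event value. It assumes the commonly documented envelope fields `before`, `after`, `source`, `op`, and `ts_ms`, with `op` codes `c`, `u`, `d`, and `r`. Real events carry more metadata and may be wrapped differently depending on converter configuration, and the `describe_change` helper is a hypothetical name.

```python
import json

# Operation codes assumed from the Debezium event envelope.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}


def describe_change(raw_value):
    """Summarize one simplified Debezium-style change event value."""
    event = json.loads(raw_value)
    op = OP_NAMES.get(event["op"], "unknown")
    # Deletes carry the old row in "before"; other ops carry the new row in "after".
    row = event["before"] if event["op"] == "d" else event["after"]
    return op, row


sample = json.dumps({
    "before": {"id": 7, "status": "new"},
    "after": {"id": 7, "status": "paid"},
    "source": {"table": "orders"},
    "op": "u",
    "ts_ms": 1700000000000,
})

print(describe_change(sample))  # ('update', {'id': 7, 'status': 'paid'})
```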

Log-based CDC provides several advantages over poll-based and query-based approaches, including:

  • All changes are captured: Log-based CDC is designed to capture every change made to the database. With polling, intermediate changes that occur between two runs of the poll loop, such as several updates to the same row or a row that is inserted and then deleted, might be missed.
  • Low overhead: The combination of CDC and Kafka provides near real-time delivery of data changes without the increased CPU load caused by frequent polling.
  • No data model impact: Using CDC, timestamp columns are no longer needed to determine the last data update.
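The first point can be simulated in a few lines. In this sketch, a plain dictionary stands in for the source table and a list stands in for the transaction log; the poll-based capture compares two snapshots of the table. All names here are hypothetical, and the snapshot comparison is a simplification of what a real query-based process would do.

```python
table = {}        # current state of a simulated source table
change_log = []   # stand-in for the database transaction log


def write(op, key, value=None):
    """Apply a change to the table and record it in the simulated log."""
    if op == "delete":
        table.pop(key, None)
    else:
        table[key] = value
    change_log.append((op, key, value))


def poll_diff(previous, current):
    """Query-based capture: compare two snapshots of the table."""
    changes = []
    for key, value in current.items():
        if key not in previous:
            changes.append(("insert", key, value))
        elif previous[key] != value:
            changes.append(("update", key, value))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


snapshot = dict(table)                      # first poll: table is empty
write("insert", 1, "new")
write("update", 1, "shipped")
write("insert", 2, "new")
write("delete", 2)
polled = poll_diff(snapshot, dict(table))   # second poll

print(polled)      # [('insert', 1, 'shipped')]
print(change_log)  # all four changes, in commit order
```

The poll sees only the net result: the intermediate "new" state of row 1 and the entire lifetime of row 2 are invisible to it, while the log retains every change.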

The following examples represent some of the varied use cases for change data capture.

Microservices integration

CDC can be used to sync microservices with traditional, monolithic applications, enabling smooth transfer of data changes from legacy systems to microservices-based applications.

Data replication

CDC can be used for data replication to multiple databases, data lakes, or data warehouses, to ensure each resource has the latest version of the data. This way, CDC can provide multiple distributed (and even siloed) teams with access to the same up-to-date data. 

Analytics dashboards

CDC can be used to feed data changes to analytics dashboards—for purposes such as business intelligence—to support time-sensitive decision making.

Auditing and compliance

To meet strict data compliance requirements and avoid heavy penalties for noncompliance, it is essential to save a history of changes made to your data. CDC can be used to save data changes for auditing or archiving requirements.

Cache invalidation

CDC can be used for cache invalidation to ensure outdated entries in a cache are replaced or removed, so that applications always display the latest versions.
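A small sketch of the idea, assuming change events arrive as dictionaries with hypothetical `op`, `table`, and `key` fields and that cache entries are keyed as `table:key`:

```python
def invalidate(cache, event):
    """Evict the cached entry for a changed row; return True if one was removed."""
    if event["op"] not in ("update", "delete"):
        return False  # inserts have no stale entry to evict
    key = f"{event['table']}:{event['key']}"
    return cache.pop(key, None) is not None


cache = {"orders:1": {"status": "new"}, "orders:2": {"status": "new"}}
events = [
    {"op": "update", "table": "orders", "key": 1},
    {"op": "insert", "table": "orders", "key": 3},
    {"op": "delete", "table": "orders", "key": 2},
]
evicted = [invalidate(cache, e) for e in events]

print(evicted)  # [True, False, True]
print(cache)    # {}
```

Evicting rather than overwriting keeps the consumer simple: the next read repopulates the entry from the database, so the cache never needs to understand the full row format.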

CQRS model updates

CDC can be used to keep Command Query Responsibility Segregation (CQRS) read models in sync with primary models.

Full-text search

CDC can be used to automatically keep a full-text search index in sync with the database.

Change data capture can help your business make faster, data-driven decisions, reducing wasted time and effort and avoiding lost revenue.

Maximize data value

CDC helps companies maximize the value of data by enabling them to use the information for multiple purposes. By providing a method to consistently update the same data in various silos, CDC allows the organization to get the most out of the data while preserving data integrity.

Keep the business up to date

CDC allows multiple databases and applications to stay in sync with the latest data, giving business stakeholders the most up-to-date information.

Make better, faster decisions

CDC empowers business users to make faster, more accurate decisions based on the most current information. Since decision-making data often loses value rapidly, it's important to make it available to all stakeholders as quickly as possible, using CDC and Kafka. Providing access to accurate, near real-time analytics is vital to building and maintaining your competitive advantage.

Keep operations running without delays

When data in multiple systems isn't synchronized, those systems can have problems with tasks such as reconciling orders, processing transactions, serving customers, generating reports, or following production schedules. Any one of these situations can delay your business, and that means lost revenue. CDC enables your organization to keep data in sync, with low latency, across numerous systems to keep operations running smoothly.

Red Hat Integration delivers change data capture capabilities, through Debezium, in combination with Red Hat AMQ Streams and Apache Kafka. Debezium is a distributed, open source, log-based CDC platform that supports capturing changes from a variety of database systems. Debezium is fast and durable, so your applications can respond quickly without missing an event.
