What is change data capture (CDC)?

Change data capture (CDC) is a proven data integration pattern that tracks when and what changes occur in data, then alerts other systems and services that must respond to those changes. Change data capture helps maintain consistency and functionality across all systems that rely on the data.

Data is fundamental to every business. The challenge is that data is constantly being updated and changed, and enterprises must keep up with those changes. Whether it's transactions, orders, inventory, or customers, having current, real-time data is vital to keeping your business running. When a purchase order is updated, a new customer is onboarded, or a payment is received, applications across the enterprise need to be informed in order to complete critical business processes.

When updating a source database, often a relational database such as Oracle, Microsoft SQL Server, PostgreSQL, or MySQL, you may need to update multiple related resources, such as a cache and a search index. A simple approach would be to modify your applications to update each of those resources at the same time. However, trying to consistently write this changed data to more than one target introduces many challenges and coordination overhead. CDC enables you to avoid the pitfalls of such dual writes and instead update every resource concurrently and accurately.
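
To make the dual-write problem concrete, here is a minimal Java sketch, assuming an in-memory stand-in for each resource; the class and method names are hypothetical and not tied to any product API. Each write is an independent step, so a failure partway through leaves the targets out of sync.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a dual write: the application itself updates every
// dependent resource. The maps stand in for a real database, cache, and
// search index; all names here are hypothetical.
public class DualWriteExample {
    private final Map<String, String> database = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> searchIndex = new ConcurrentHashMap<>();

    public void saveCustomer(String id, String name) {
        database.put(id, name);    // 1. committed to the source database
        cache.put(id, name);       // 2. a crash or error after step 1 leaves
        searchIndex.put(id, name); // 3. the cache or index stale, with no rollback
    }
}
```

With CDC, the application writes only to the database, and the cache and search index are brought up to date by consumers of the change stream.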

CDC accomplishes this by tracking row-level changes in database source tables, categorized as insert, update, and delete events, and then making those change notifications available to any other systems or services that rely on the same data. The change notifications are emitted in the same order the changes were made in the original database. In this way, CDC ensures that all interested parties of a particular data set are accurately informed of the change and can react accordingly, either by refreshing their own version of the data or by triggering business processes.
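
As an illustration, the following sketch consumes Debezium-style change events from a Kafka topic and branches on the envelope's "op" field, which Debezium sets to "c" (create), "u" (update), or "d" (delete). The topic name, consumer group, and the crude string matching are simplifying assumptions for this example only.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of a consumer reacting to row-level change events. The topic name
// follows Debezium's server.schema.table convention but is hypothetical here.
public class ChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "inventory-sync"); // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(record -> {
                    String event = record.value();
                    // Crude matching for illustration; a real consumer would
                    // parse the JSON envelope with a proper library.
                    if (event.contains("\"op\":\"c\""))      { /* react to an insert */ }
                    else if (event.contains("\"op\":\"u\"")) { /* react to an update */ }
                    else if (event.contains("\"op\":\"d\"")) { /* react to a delete */ }
                });
            }
        }
    }
}
```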

In modern microservices-driven architectures, CDC has gained new importance as an indispensable bridge between traditional databases and cloud-native, event-driven architectures. Using CDC, enterprises can continue to use their legacy databases while still making the data available to emerging technologies. For new deployments, CDC enables useful patterns such as the outbox pattern, which allows microservices to reliably publish the data changed in a database transaction for other services to consume.
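
The following is a minimal sketch of the outbox pattern in plain JDBC, assuming hypothetical orders and outbox tables. The essential point is that the business row and the event row commit in one transaction; a CDC connector such as Debezium can then stream the committed outbox row to other services.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

// Outbox pattern sketch: write the business change and the event describing
// it in the same local transaction. Table and column names are hypothetical.
public class OutboxExample {
    public static void placeOrder(Connection conn, String customerId, long amountCents)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                 "INSERT INTO orders (id, customer_id, amount_cents) VALUES (?, ?, ?)");
             PreparedStatement outbox = conn.prepareStatement(
                 "INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload) "
                 + "VALUES (?, ?, ?, ?, ?)")) {
            String orderId = UUID.randomUUID().toString();
            order.setString(1, orderId);
            order.setString(2, customerId);
            order.setLong(3, amountCents);
            order.executeUpdate();

            outbox.setString(1, UUID.randomUUID().toString());
            outbox.setString(2, "Order");
            outbox.setString(3, orderId);
            outbox.setString(4, "OrderPlaced");
            outbox.setString(5, "{\"orderId\":\"" + orderId + "\"}");
            outbox.executeUpdate();

            conn.commit(); // both rows become visible atomically
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```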

While CDC captures database changes, it still requires a messaging service to deliver those change notifications to the systems and applications that need them. The most efficient way to accomplish this is by treating the changes as events, as in an event-driven architecture (EDA), and sending them asynchronously.

Apache Kafka is the ideal way to provide asynchronous communication between the database and the consumers of the data that require a high-volume, replayable consumption pattern. Kafka is a distributed streaming platform that can publish, subscribe to, store, and process streams of events in real time. It's designed to handle data streams from multiple sources and deliver the data to multiple destinations, with high throughput and scalability.

Change data capture ensures that the events propagated by Kafka are consistent with the changes in the original source system, or database. Because Kafka messaging is asynchronous, the source is decoupled from the consumers, allowing for more reliable delivery of all changes.

Change data capture platforms, like Debezium, track changes in the database by monitoring the transaction log as changes are committed. An alternative to this approach is a simpler poll-based or query-based process (a sketch of such a poll loop follows the list below).

Log-based CDC provides several advantages over these options, including:

  • All changes are captured: CDC is designed to capture every change made to the database. With a poll-based approach, intermediate changes, such as updates and deletes that occur between two runs of the poll loop, might be missed.
  • Low overhead: The combination of CDC and Kafka provides near real-time delivery of data changes, avoiding the increased CPU load caused by frequent polling.
  • No data model impact: With CDC, tables do not need timestamp columns to determine when data was last updated.
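
For contrast, here is the kind of query-based poll loop those points describe, as a rough sketch with hypothetical table and column names. It depends on a last_updated column in the data model, runs whether or not anything has changed, and cannot observe deletes or intermediate updates between polls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

// Query-based change detection: poll for rows newer than a remembered
// timestamp. Table and column names are hypothetical.
public class PollingExample {
    public static Timestamp pollChanges(Connection conn, Timestamp lastSeen)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, name, last_updated FROM customers WHERE last_updated > ?")) {
            stmt.setTimestamp(1, lastSeen);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Propagate the change. A row updated twice between polls
                    // shows up once, and deleted rows never show up at all.
                    lastSeen = rs.getTimestamp("last_updated");
                }
            }
        }
        return lastSeen; // remembered for the next poll
    }
}
```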

The following examples represent a sampling of the many varied use cases for change data capture.

Microservices integration

CDC can be used to sync microservices with monolithic applications, enabling the seamless transfer of data changes from legacy systems to microservices-based applications.

Data replication

CDC can be used for data replication to multiple databases, data lakes, or data warehouses, to ensure each resource has the latest version of the data. In this way, CDC can provide multiple distributed and even siloed teams with access to the same up-to-date data. 

Analytics dashboards

CDC can be used to feed data changes to analytics dashboards—for purposes such as business intelligence—to support time-sensitive decision making.

Auditing and compliance

Given today's strict data compliance requirements, and the heavy penalties for noncompliance, it is essential to keep a history of the changes made to your data. CDC can be used to save data changes for auditing or archiving requirements.

Cache invalidation

CDC can be used for cache invalidation, to ensure outdated entries in a cache are replaced or removed in order to display the latest version of a web page.
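
As a sketch, a cache-invalidation handler driven by change events could look like the following; the ChangeEvent shape is a hypothetical stand-in, with operation codes mirroring the Debezium convention used earlier.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Cache invalidation driven by change events: deletes evict, inserts and
// updates refresh. The ChangeEvent record is a hypothetical stand-in.
public class CacheInvalidator {
    public record ChangeEvent(String op, String key, String newValue) {}

    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    public void onChange(ChangeEvent event) {
        switch (event.op()) {
            case "d"      -> cache.remove(event.key());                // row deleted: evict
            case "c", "u" -> cache.put(event.key(), event.newValue()); // refresh entry
            default       -> { } // ignore snapshot reads and unknown ops
        }
    }
}
```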

CQRS model updates

CDC can be used to keep Command Query Responsibility Segregation (CQRS) read models in sync with the primary write models.

Full-text search

CDC can be used to automatically keep a full-text search index in sync with the database.

Change data capture can help your business make faster, data-driven decisions while reducing wasted time, effort, and lost revenue.

Maximize data value

CDC helps companies maximize the value of data by enabling them to leverage the information for multiple purposes. By providing a method to consistently update the same data across various silos, CDC allows the organization to get the most out of its data while preserving data integrity.

Keep the business up to date

CDC allows multiple databases and applications to stay in sync with the latest data, giving business stakeholders the most up-to-date information.

Make better, faster decisions

CDC empowers business users to make faster and more accurate decisions based on the most current information that impacts the business. Since data for decision-making often loses value over time, it is important to make that data available to all stakeholders as quickly as possible, using CDC and Kafka. Providing decision-makers with access to accurate, near real-time analytics is one of the keys to maintaining a competitive advantage.

Keep operations running without delays

When data in multiple systems is not in sync, those systems can have problems reconciling orders, processing transactions, serving customers, generating reports, and even following production schedules, to name a few potential issues. Any one of these situations can cause delays in doing business, which translates into lost revenue. CDC enables the organization to keep data in sync, with low latency, across numerous systems to keep operations running smoothly.

Red Hat Integration delivers change data capture capabilities via Debezium in combination with Red Hat AMQ Streams and Apache Kafka. Debezium is a distributed, open source, log-based CDC platform that supports capturing changes from a variety of database systems. Debezium is fast and durable, so your applications can respond quickly and never miss an event.

Bring change data capture to your enterprise

Red Hat AMQ

Red Hat AMQ Streams brings Apache Kafka to Red Hat OpenShift through the use of powerful operators that simplify its deployment, configuration, management, and use.