What is CDC?
Change data capture (CDC) is a proven data integration pattern that tracks when and what changes occur in data, and then alerts other systems and services that must respond to those changes. CDC helps maintain consistency and functionality across all systems that rely on the data.
Data is fundamental to every business, but it is constantly being updated, and enterprises must keep up with those changes. Whether it's transactions, orders, inventory, or customers, having current, real-time data is vital to keeping your business running. When a purchase order is updated, a new customer is onboarded, or a payment is received, applications across the enterprise need to be informed in order to complete critical business processes.
How change data capture works
When you update a source database (often a relational database such as Oracle, Microsoft SQL Server, PostgreSQL, or MySQL), you may need to update multiple related resources, such as a cache and a search index. A simple approach would require modifying your applications to update those resources at the same time. However, trying to consistently write the changed data to more than one target introduces many challenges and coordination overhead. CDC lets you avoid issues like dual writes and instead update those resources concurrently and accurately.
CDC accomplishes this by tracking row-level changes in database source tables, categorized as insert, update, and delete events, and then making those change notifications available to any other systems or services that rely on the same data. The change notifications are emitted in the same order in which the changes were made in the original database. In this way, CDC ensures that all interested parties of a particular data set are accurately informed of the change and can react accordingly, either by refreshing their own version of the data or by triggering business processes.
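The flow described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `ChangeEvent` shape (the fields `op`, `key`, and `after` are illustrative and not the format of any particular CDC tool) and shows a downstream system refreshing its own copy of the data by applying events in commit order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical row-level change event; field names are illustrative."""
    op: str                 # "insert", "update", or "delete"
    key: int                # primary key of the affected row
    after: Optional[dict]   # row state after the change (None for a delete)


def apply_event(replica: dict, event: ChangeEvent) -> None:
    """Apply one change event to a downstream copy of the table."""
    if event.op in ("insert", "update"):
        replica[event.key] = event.after
    elif event.op == "delete":
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# Events arrive in the order they were committed at the source.
events = [
    ChangeEvent("insert", 1, {"status": "new"}),
    ChangeEvent("update", 1, {"status": "paid"}),
    ChangeEvent("insert", 2, {"status": "new"}),
    ChangeEvent("delete", 2, None),
]

replica = {}
for event in events:
    apply_event(replica, event)

print(replica)  # {1: {'status': 'paid'}}
```

Because the events are applied in source order, the replica ends in the same state as the source table. Applying them out of order (for example, the delete of row 2 before its insert) would leave a stale row behind.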
In modern microservices-driven architectures, CDC has gained new importance by providing an indispensable bridge to connect traditional databases with cloud-native, event-driven architectures. Using CDC, enterprises can continue to use their legacy databases, while still making use of data through emerging technologies. For new deployments, CDC enables useful patterns such as the "outbox," which allows microservices to exchange the consolidated data from a database transaction.
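As a rough illustration of the outbox idea, the sketch below uses Python's built-in `sqlite3` module. It assumes the commonly described form of the pattern, in which a service writes its business change and an event row to an outbox table in the same transaction, so that a CDC tool can later capture the outbox inserts. The table names, column names, and the `place_order` helper are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " aggregate_id INTEGER, event_type TEXT, payload TEXT)"
)


def place_order(conn, order_id):
    """Write the order and its outbox event in one transaction."""
    # "with conn" commits both inserts together, or rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?)",
            (order_id, "placed"),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload)"
            " VALUES (?, ?, ?)",
            (order_id, "OrderPlaced",
             json.dumps({"id": order_id, "status": "placed"})),
        )


place_order(conn, 1)
try:
    # Duplicate primary key: the orders insert fails, the transaction is
    # rolled back, and nothing is written to either table.
    place_order(conn, 1)
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_rows = conn.execute(
    "SELECT aggregate_id, event_type FROM outbox"
).fetchall()
print(order_count, outbox_rows)  # 1 [(1, 'OrderPlaced')]
```

The point of the design is that the outbox never contains an event for a change that did not commit, and never misses one that did, which is the guarantee that separate writes to a database and a message broker cannot give.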
Real-time changes with CDC and Apache Kafka
While CDC captures database changes, it still requires a messaging service to deliver those change notifications to the applicable systems and applications. The most efficient way to accomplish this is by treating the changes as events—as in an event-driven architecture (EDA)—and sending them asynchronously.
Apache Kafka is well suited to providing asynchronous communication between the database and data consumers that require a high-volume, replayable consumption pattern. Kafka is a distributed streaming platform that can publish, subscribe to, store, and process streams of events in real time. It's designed to handle data streams from multiple sources and deliver the data to multiple destinations, with high throughput and scalability.
Change data capture ensures the events transmitted by Kafka are consistent with the changes in the original source system, or database. Because Kafka messaging is asynchronous, events are decoupled from the consumers, allowing for more reliable delivery of all changes.
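To make the "replayable" and "decoupled" properties concrete, here is a small in-memory Python sketch. It is not the Kafka API: `EventLog` and `Consumer` are hypothetical stand-ins that only model an append-only stream in which each consumer tracks its own read offset.

```python
class EventLog:
    """In-memory stand-in for a replayable event stream (not the Kafka API)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the stored event

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Reads from the log at its own pace, tracking its own offset."""

    def __init__(self, log, name):
        self.log = log
        self.name = name
        self.offset = 0
        self.seen = []

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.seen.extend(batch)
        self.offset += len(batch)
        return batch


log = EventLog()
cache = Consumer(log, "cache-updater")
log.append({"op": "insert", "key": 1})
log.append({"op": "update", "key": 1})
cache.poll()

# A consumer that starts later still reads every event, in order, from offset 0.
search = Consumer(log, "search-indexer")
log.append({"op": "delete", "key": 1})
cache.poll()
search.poll()

print(cache.offset, search.offset, search.seen == cache.seen)  # 3 3 True
```

The producer never waits for, or even knows about, the consumers. Each one catches up independently, which is what allows a new or recovering system to replay earlier changes.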
Why use change data capture?
Change data capture platforms, like Debezium, track changes in the database by monitoring the transaction log as changes are committed. An alternative to this approach is a simple poll-based or query-based process.
Log-based CDC provides several advantages over these alternatives, including:
- All changes are captured: CDC is designed to capture every change made to the database. With polling, intermediate changes, such as updates and deletes that occur between two runs of the poll loop, might be missed.
- Low overhead: Log-based CDC combined with Kafka delivers data changes in near real time while avoiding the increased CPU load caused by frequent polling.
- No data model impact: With CDC, tables do not need timestamp columns to determine when data was last updated.
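The first advantage can be seen in a toy Python comparison. The in-memory `table`, the `log` list standing in for a transaction log, and the `updated_at` column are all assumptions made for illustration, not a model of any specific database.

```python
table = {}  # current state of the source table
log = []    # stand-in for the database transaction log


def write(key, value, ts):
    """Insert or update a row and record the change in the log."""
    table[key] = {"value": value, "updated_at": ts}
    log.append(("upsert", key, value))


def delete(key):
    """Delete a row and record the change in the log."""
    del table[key]
    log.append(("delete", key, None))


def poll_changes(table, last_seen_ts):
    """Query-based capture: rows whose updated_at is newer than the last poll."""
    return {k: row for k, row in table.items()
            if row["updated_at"] > last_seen_ts}


# Four changes happen between two runs of the poll loop.
write(1, "draft", ts=1)
write(1, "submitted", ts=2)
write(2, "temp", ts=3)
delete(2)

polled = poll_changes(table, last_seen_ts=0)
print(polled)    # {1: {'value': 'submitted', 'updated_at': 2}}
print(len(log))  # 4
```

The poll sees only the final state of row 1. The intermediate "draft" value and the entire lifetime of row 2 are invisible to it, while the log retains all four changes. The poll also depends on the `updated_at` column, which the log-based approach does not need.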
Use cases
The following examples represent some of the varied use cases for change data capture.
Microservices integration
CDC can be used to sync microservices with traditional, monolithic applications, enabling smooth transfer of data changes from legacy systems to microservices-based applications.
Data replication
CDC can be used for data replication to multiple databases, data lakes, or data warehouses, to ensure each resource has the latest version of the data. This way, CDC can provide multiple distributed (and even siloed) teams with access to the same up-to-date data.
Analytics dashboards
CDC can be used to feed data changes to analytics dashboards—for purposes such as business intelligence—to support time-sensitive decision making.
Auditing and compliance
To meet strict data compliance requirements and avoid heavy penalties for noncompliance, it is essential to save a history of changes made to your data. CDC can be used to save data changes for auditing or archiving requirements.
Cache invalidation
CDC can be used for cache invalidation to ensure outdated entries in a cache are replaced or removed in order to display the latest versions.
CQRS model updates
CDC can be used to keep Command Query Responsibility Segregation (CQRS) read models in sync with primary models.
Full-text search
CDC can be used to automatically keep a full-text search index in sync with the database.
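A minimal Python sketch of this idea follows. It assumes each change event carries the row text both before and after the change (whether a "before" image is available depends on the CDC tool and database configuration), and it uses a toy inverted index rather than a real search engine. All names are hypothetical.

```python
from collections import defaultdict

index = defaultdict(set)  # token -> set of row ids containing that token


def tokenize(text):
    return set(text.lower().split())


def apply_to_index(index, event):
    """Update the index from one change event with before/after row text."""
    row_id = event["key"]
    if event.get("before") is not None:
        # Remove the row from every token of its old text.
        for token in tokenize(event["before"]):
            index[token].discard(row_id)
    if event.get("after") is not None:
        # Add the row under every token of its new text.
        for token in tokenize(event["after"]):
            index[token].add(row_id)


def search(index, word):
    return sorted(index.get(word.lower(), set()))


events = [
    {"op": "insert", "key": 1, "before": None, "after": "Red bicycle"},
    {"op": "insert", "key": 2, "before": None, "after": "Blue bicycle"},
    {"op": "update", "key": 1, "before": "Red bicycle", "after": "Red scooter"},
    {"op": "delete", "key": 2, "before": "Blue bicycle", "after": None},
]
for event in events:
    apply_to_index(index, event)

print(search(index, "red"), search(index, "bicycle"), search(index, "scooter"))
# [1] [] [1]
```

The application that writes to the database contains no indexing code. The index follows the database only through the stream of change events.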
Business benefits of CDC
Change data capture can help your business make faster, data-driven decisions that reduce wasted time and effort and prevent lost revenue.
Maximize data value
CDC helps companies maximize the value of data by enabling them to use the information for multiple purposes. By providing a method to consistently update the same data in various silos, CDC allows the organization to get the most out of the data while preserving data integrity.
Keep the business up to date
CDC allows multiple databases and applications to stay in sync with the latest data, giving business stakeholders the most up-to-date information.
Make better, faster decisions
CDC empowers business users to make faster, more accurate decisions based on the most current information. Because decision-making data often loses value rapidly, it's important to make it available to all stakeholders as quickly as possible, using CDC and Kafka. Providing access to accurate, near real-time analytics is vital to building and maintaining your competitive advantage.
Keep operations running without delays
When data in multiple systems isn't synchronized, those systems can have problems with tasks such as reconciling orders, processing transactions, serving customers, generating reports, or following production schedules. Any one of these situations can delay your business, and that means lost revenue. CDC enables your organization to keep data in sync, with low latency, across numerous systems to keep operations running smoothly.
Red Hat CDC: Debezium + Apache Kafka
Red Hat Integration delivers change data capture capabilities through Debezium, in combination with Red Hat AMQ Streams and Apache Kafka. Debezium is a distributed, open source, log-based CDC platform that supports capturing changes from a variety of database systems. Debezium is fast and durable, so your applications can respond quickly without missing an event.