3 Methods for Implementing Change Data Capture

If you are familiar with Change Data Capture AKA Change Data Tracking, you can just skip the introduction below and get straight to the implementation section. But, in case you’re not, let me introduce the concept first.

American Writer Mark Twain once said: “Data is like garbage”. Today some people say: data is the new oil. But I would say: data is a blooming flower.

Like a flower, data goes through different stages, and takes time to develop, or burst into bloom. Just as the time taken may differ from flower to flower, so different data sets develop at different rates. A single flower is less striking, but when it is part of a bouquet, it becomes really eye-catching. And so it is with datathat’s how companies like Google or Facebook are able to make a profit, even without charging for their services -they simply collect data ‘flowers’ and turn them into data ‘bouquets’.

However, this article is not about flowers ? so, let’s move right on to our topic. As I explained previously, the state of data is continuously changing over time. Change Data Capture (CDC) is a set of technologies that enable you to identify and capture the previous states of the data so that later, you have a snapshot of past data that you can refer to when taking necessary action. See the example below:

CDC in MsSQL

 

With the increasing demand for big-data technologies, Microsoft introduced CDC with MsSQL Server 2008. Today, CDC is available in almost all popular database servers, including MsSQL, Oracle, CockroachDB, MongoDB, etc. But, manual implementation is just one way to get the job done. We can also automate the CDC using triggers and short procedures. In this article, I will discuss how to implement CDC both ways.

Why Change Data Capture is important

Yes, you guessed right, this technology is mostly used in the big-data domain to keep timely snapshots of streaming data in data warehouses. Here, we go through a process called ELT/ETL to insert our data into the data warehouse. This process is efficient with historic data, but when it comes to real-time data it causes too much latency to run complex queries. The easiest way to deal with real-time data is CDC, which enables us to keep our data warehouse up to date and make business decisions faster.

Let’s Get our Hands Dirty with Some Practical Examples

1. Implementing CDC with MsSQL

First, we need to create a new database.

CREATE DATABASE cdcDB

Next, I select the created database from the available databases dropdown, and create a few simple tables – just to demonstrate CDC implementation.

CREATE TABLE Student

(

StudentID int NOT NULL PRIMARY KEY,

FirstName varchar(255) NOT NULL,

LastName varchar(255) NOT NULL,

Age int,

ContactNo char(10) NOT NULL

)

After creating tables, I enable CDC.

EXEC sys.sp_cdc_enable_db

After CDC is enabled for the database, I then enable it on each table.

EXEC sys.sp_cdc_enable_table

@source_schema = ‘dbo’,

@source_name = ‘Student’,

@role_name = NULL,

@supports_net_changes = 1

Finally, I insert some values to my table by the following query.

INSERT INTO Student(StudentID,FirstName,LastName,Age,ContactNo)

VALUES

(10638389, ‘Indrajith’, ‘Ekanayake’, 21, 0713101658),

(10637382, ‘Kamal’, ‘Suriyaarachchi’, 22, 0765432210),

(10622388, ‘Kasun’, ‘Chamara’, 28, 0708998123),

(10638812, ‘Chamara’, ‘Hettiarachchi’, 20, 0772134446);

To confirm whether CDC has been properly implemented, we can just check either at our hierarchy panel or Jobs tab, under the SQL Server Agent.

To understand how and what data we are storing in the CDC tables, we can just update a few rows.

UPDATE Student

SET FirstName = ‘Janaka’, Age= ’23’

WHERE StudentID = 10638389;

So, we are now looking at the CDC CT(Change Table) and we can identify that there are additional records, as seen below.

CDC CT

 

 

2. Implementing CDC with Oracle

First, I need to mention that I’m running Oracle 11g Enterprise Edition using docker image and I’m writing queries using Oracle SQL Developer’s latest version on the host machine. The server and Oracle SQL developer are connected via port: 1521.

To implement CDC, I begin by creating a tablespace called “ts_cdcindrajith” in my cdcAssignment folder, inside my F drive.

create tablespace ts_cdcindrajith datafile ‘F:\cdcAssignment’ size 300m;

Then, I create a new user called “cdcindrajith”, and grant all permissions to this user.

 

CREATE USER cdcindrajith IDENTIFIED by cdcindrajith DEFAULT
TABLESPACE ts_cdcindrajith QUOTA UNLIMITED ON SYSTEM QUOTA UNLIMITED
ON SYSAUX;

GRANT ALL PRIVILEGES TO cdcindrajith;

Next, I create a table called employees.

CREATE TABLE cdcindrajith.employees

(

EmpID int NOT NULL PRIMARY KEY,

EmpName varchar(255) NOT NULL,

Age int NOT NULL,

ContactNo char(10) NOT NULL

)

Then, I implement the CDC feature.

 

BEGIN

DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => ‘cdcindrajith.employees’);

END;

BEGIN

DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(

change_set_name    => ’employees_set’,

description        => ‘Change set for employees change info’,

change_source_name => ‘SYNC_SOURCE’);

END;

BEGIN

DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(

owner             => ‘cdcindrajith’,

change_table_name => ’employees_ct’,

change_set_name   => ’employees_set’,

source_schema     => ‘cdcindrajith’,

source_table      => ’employees’,

column_type_list  => ‘EmpID int,

EmpName varchar(255) ,

Age int,

ContactNo char(10)’,

capture_values    => ‘both’,

rs_id             => ‘y’,

row_id            => ‘n’,

user_id           => ‘n’,

timestamp         => ‘n’,

object_id         => ‘n’,

source_colmap     => ‘y’,

DDL_MARKERS =>       ‘n’,

target_colmap     => ‘y’,

options_string    => ‘TABLESPACE ts_cdcindrajith’);

END;

I create a subscription called “employees_sub”.

BEGIN

DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(

change_set_name   => ’employees_set’,

description       => ‘Change data for employees’,

subscription_name => ’employees_sub’);

END;

After that, I create a View called “employees_view” and then activated the above-created subscription.

 

BEGIN

DBMS_CDC_SUBSCRIBE.SUBSCRIBE(

subscription_name => ’employees_sub’,

source_schema     => ‘cdcindrajith’,

source_table      => ’employees’,

column_list       => ‘EmpID, EmpName, Age, ContactNo’,

subscriber_view   => ’employees_view’);

END;

BEGIN

DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(

subscription_name => ’employees_sub’);

END;

BEGIN

DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(

subscription_name => ’employees_sub’);

END;

Finally, I insert data and also update some data, to check that the CDC has been correctly implemented.

INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)

VALUES

(10638389, ‘Indrajith’, 21, 0713101658);

INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)

VALUES

(10638390, ‘Kumara’, 27, 0711226661);

UPDATE cdcindrajith.employees

SET EmpName = ‘Janaka’

WHERE EmpID = 10638389;


Towards the end, I check the results of my CDC table “employees_ct” to make sure that the CDC is correctly implemented.

SELECT EmpID, EmpName FROM cdcindrajith.employees_ct;

employees ct table

 

 

3. Implementing CDC using Trigger (MsSQL)

First, we need to create a new database.

CREATE DATABASE empDB

Then, I create two tables: one is to maintain employee salary records, the other is to update the salary log, which does the change data capturing.

CREATE TABLE Salary

(

ID int identity(1,1) primary key NOT NULL,

SalDate datetime default GETDATE() NOT NULL,

Task BIGINT NULL,

PaidPerTask BIGINT NULL,

EmpName NCHAR(100) NOT NULL

)

CREATE TABLE SalaryLogs

(

ID INT PRIMARY KEY IDENTITY(1,1) NOT NULL,

SalDate DATETIME DEFAULT GETDATE() NOT NULL,

Query NCHAR(6) NOT NULL,

OldTask BIGINT NULL,

NewTask BIGINT NULL,

OldPaidPerTask BIGINT NULL,

NewPaidPerTask BIGINT NULL,

EmpName NCHAR(100) NOT NULL

)

I implement a Trigger called “salary_change” to save the salary logs.

GO

CREATE TRIGGER salary_change

ON Salary

AFTER INSERT, UPDATE, DELETE

AS

BEGIN

DECLARE @operation CHAR(6)

SET @operation = CASE

WHEN EXISTS(SELECT * FROM inserted) AND EXISTS(SELECT * FROM deleted)

THEN ‘Update’

WHEN EXISTS(SELECT * FROM inserted)

THEN ‘Insert’

WHEN EXISTS(SELECT * FROM deleted)

THEN ‘Delete’

ELSE NULL

END

IF @operation = ‘Delete’

INSERT INTO SalaryLogs (Query, SalDate, OldTask, OldPaidPerTask, EmpName)

SELECT @operation, GETDATE(), d.Task, d.PaidPerTask, USER_Name()

FROM deleted d

IF @operation = ‘Insert’

INSERT INTO SalaryLogs (Query, SalDate, NewTask, NewPaidPerTask, EmpName)

SELECT @operation, GETDATE(), i.Task, i.PaidPerTask, USER_Name()

FROM inserted i

IF @operation = ‘Update’

INSERT INTO SalaryLogs (Query, SalDate, NewTask, OldTask, NewPaidPerTask,OldPaidPerTask, EmpName)

SELECT @operation, GETDATE(), d.Task, i.Task, d.PaidPerTask, i.PaidPerTask, USER_Name()

FROM deleted d, inserted i

END

GO

Finally, I insert some data into the salary table and then try to update and delete queries as well.

INSERT INTO Salary(SalDate,Task,PaidPerTask,EmpName)

VALUES

(2020-11-20, 4, 4000 , ‘Indrajith’),

(2020-10-20, 11, 10000 , ‘Kusum’);

UPDATE Salary

SET EmpName = ‘Janaka’

WHERE EmpName = ‘Indrajith’;

DELETE FROM Salary WHERE EmpName = ‘Kusum’;

 

Towards the end, let’s check the salary logs table to understand whether the log files are updated or not.

 

salary logs table

 

 

We see here that the log files are up to date.

Summary

In short, CDC identifies and captures data that has changed in tables of a source database as a result of CRUD operations. This is useful to people who need to export their data into a data warehouse or a business intelligence application. Changed data is maintained in CDC tables in the source database.

In this article, we briefly discussed what is Change Data Capture, why CDC is useful, and most importantly illustrated three methods for implementing CDC. I believe that the rapid evolution of big-data will have many more CDC implications in the future.