Implementing Change Data Capture to Power Near Real-time Analytics

25 January 2022

by Joao Rocha and Natalia Gavrilova

A priority we’ve had for a while is to reduce the time it takes for a commercial interaction with our products to be available for analysis; people want real-time analytics. We want to respond quickly to our customers whether this means new customer signups, upgrades, or downgrades — near real-time analytics changes should be reflected within the Leadfeeder platform.

We mostly use batch processing, and for some datasets, we narrowed the batch frequency to one hour (still not ideal). I could go on and on comparing batch versus stream processing, but I’ll leave that for some other time.

Let’s just say our users want to see what’s happening in the business in real-time, or as close to real-time as possible.

How does real-time analytics work?

As part of Leadfeeder’s AWS-enabled analytics serverless pipelines, our engineering team integrates data from relational databases into Redshift. Our analysts use this data for multiple purposes, including business and real-time analytics within the product.

Previously we were only using AWS Glue to bring our operational data into Redshift to drive analytics. The main disadvantages of our legacy process are:

We were integrating snapshots of data, so we don’t have visibility of the changes between snapshots.
Given the batch nature of the pipeline, data is available with a considerable delay. Any batch failures can cause delays to be further extended.

This legacy process also has its advantages, it was quite fast to develop and quite easy to extend.

Using AWS Glue and its crawlers we can easily identify tables and schema changes in the source. Bringing new tables into Redshift becomes a simple modification.

However, as our use cases get more complex (our data analysts and scientists tend to be quite creative), this pattern became insufficient mainly due to its snapshot nature.

CDC Pipeline for real-time analytics

CDC stands for Change Data Capture; it comprises software patterns to detect changes in a source system and propagate them downstream for further processing and/or storage.

MySQL has a feature called the binary log, which registers events that modify the database, whether DDL or DML statements and serves two main purposes, replication, and data recovery.

There are multiple ways to implement CDC, in our case we are using MySQL’s binary log to create a stream of events that we can then use to power our analytics.

Producer

To tap into the binary log and create a stream of data we implemented an ECS Service that continuously reads the binary log and acts as a producer for a Kinesis stream.

Main features of the ECS Service:

The data read from the database is transmitted without any transformation and serialized as a new line delimited JSON. The goal is to produce raw events, i.e. whatever DML operation is detected from the database, it will be sent to the Kinesis stream as-is, one event per modified row in the database.
Events are produced to Kinesis stream synchronously and ordered in shards per database table.
Allows restarting from a specific position in the binary log. For MySQL RDS instances the maximum retention for the binary log is 7 days.
Implements a graceful restart mechanism. Upon capturing SIGTERM, flushes data and saves a checkpoint. This is important for deployments.
Fargate is used to continue with our serverless approach to analytics and limit any operational activity to configure and maintain compute resources.
Monitoring using Cloudwatch metrics and alarms.

Consumers

In order to consume data from this stream of events, we decided to use Kinesis Firehose.

Firehose allows us to accelerate the implementation of the pipeline, as it provides data consumer features. All we have to do is define the configuration for our delivery streams. 🥊

From the diagram above, we can see we have two main consumers, one for Redshift and one for S3.

To power our near real-time analytics, we only needed to stream our data into Redshift, however, we quickly saw the value of also streaming our operational data into S3 so we could enable additional integration patterns into our Data Warehouse.

Not all datasets require real-time analysis but we wanted to track all changes to create the complete changelog of events in our operational databases. From this changelog, we are able to support more complex modeling.

Stay tuned for more engineering updates in the future! 👩🏻‍💻

By Joao Rocha

João is a self-motivated engineer, passionate about data warehousing and data engineering. He enjoys discussions about technology and creative ways to approach complex problems. If you favor the same topics, hit him up on LinkedIn.

and by Natalia Gavrilova

Natalia has worked with data for almost 10 years as engineer, analyst, and scientist. Talk numbers with her on LinkedIn.

Get more from your web analytics.

t’s time to turn your website traffic data numbers into something more meaningful. Website visitor analytics enable you to identify and qualify the companies visiting your website, even when they don’t fill out a form.

Show me how