60-Second Summary
Leadfeeder shifted from hourly batch snapshots to a Change Data Capture (CDC) pipeline to deliver near real-time analytics. The team streams MySQL binlog events via an ECS/Fargate producer into Kinesis, then uses Kinesis Firehose to deliver data to Redshift and S3.
Key takeaway: CDC replaces snapshot-based batches to provide near real-time visibility, full changelogs, and reduced analytics latency while preserving the ability to extend datasets.
Standout strategies & tactics: Use an ECS/Fargate service to read MySQL binlog and produce newline-delimited JSON events, keep per-table ordering in Kinesis shards, and rely on Firehose for fast delivery to Redshift and S3.
Real-world lessons & frameworks: Balance quick-to-build batch pipelines (AWS Glue) with CDC for richer, timely insights; adopt serverless components to minimize Ops and add monitoring, checkpointing, and graceful restarts.
Operational notes: Account for RDS binlog retention (7 days) when designing checkpoints and restarts, monitor via CloudWatch, and choose which datasets truly need real-time vs archival changelogs in S3.
*This summary was created with AI assistance, using our original content.
We want to respond quickly to our customers, whether that means new customer signups, upgrades, or downgrades — near-real-time analytics changes should be reflected in the Leadfeeder platform.
We mostly use batch processing, and for some datasets, we set the batch frequency to 1 hour (still not ideal). I could go on and on comparing batch and stream processing, but I’ll leave that for another time.
Let’s just say our users want to see what’s happening in the business in real time, or as close to it as possible.
How does real-time analytics work?
As part of Leadfeeder’s AWS-enabled analytics serverless pipelines, our engineering team integrates data from relational databases into Redshift. Our analysts use this data for multiple purposes, including business and real-time analytics within the product.
Previously, we used AWS Glue to load our operational data into Redshift to drive analytics. The main disadvantages of our legacy process are:
We were integrating snapshots of data, so we don’t have visibility of the changes between snapshots.
Given the batch nature of the pipeline, data is available with a considerable delay. Any batch failures can further delay the process.
This legacy process also has its advantages. It was quite fast to develop and quite easy to extend.
Using AWS Glue and its crawlers, we can easily identify tables and schema changes in the source. Bringing new tables into Redshift becomes a simple modification.
However, as our use cases become more complex (our data analysts and scientists tend to be quite creative), this pattern has become insufficient mainly because of its snapshot nature.
CDC Pipeline for real-time analytics
CDC stands for Change Data Capture; it comprises software patterns to detect changes in a source system and propagate them downstream for further processing and/or storage.
MySQL has a feature called the binary log, which registers events that modify the database, whether DDL or DML statements and serves two main purposes, replication, and data recovery.
There are multiple ways to implement CDC. In our case, we use MySQL’s binary log to create a stream of events that powers our analytics.
Producer
To tap into the binary log and create a data stream, we implemented an ECS Service that continuously reads the binary log and serves as a producer for a Kinesis stream.
Main features of the ECS Service:
The data read from the database is transmitted untransformed and serialized as new-line-delimited JSON. The goal is to produce raw events; i.e., for each DML operation detected in the database, send it to the Kinesis stream as-is, one event per modified row.
Events are produced to the Kinesis stream synchronously and ordered in shards per database table.
Allows restarting from a specific position in the binary log. For MySQL RDS instances, the maximum retention for the binary log is 7 days.
Implements a graceful restart mechanism. Upon capturing SIGTERM, it flushes data and saves a checkpoint. This is important for deployments.
Fargate is used to continue our serverless analytics approach and limit operational activity to configuring and maintaining compute resources.
Monitoring using Cloudwatch metrics and alarms.
Consumers
To consume data from this stream of events, we decided to use Amazon Kinesis Firehose.
Firehose allows us to accelerate the implementation of the pipeline by providing data-consumer features. All we have to do is define the configuration for our delivery streams.
From the diagram above, we can see we have two main consumers, one for Redshift and one for S3.
To power our near-real-time analytics, we only needed to stream our data into Redshift; however, we quickly saw the value of also streaming our operational data into S3 to enable additional integration patterns in our Data Warehouse.
Not all datasets require real-time analysis, but we wanted to track all changes to create the complete changelog of events in our operational databases. From this changelog, we can support more complex modeling.
Stay tuned for more engineering updates!