The idea of provisioning infrastructure on demand, when you need it and in the right size, has intrigued me from the moment I first heard about it.


EMR and Data Processing Power  —  Let's Be Clever About It


60-Second Summary

Apply the core ideas behind AWS Lambda to AWS EMR to get on-demand infrastructure and scalable compute for heavy data processing while avoiding idle costs and Lambda limits. Use dynamic cluster lifecycle control and EMR auto-scaling to create an "almost-Lambda" for large jobs.

  • Key takeaway: Combine "infrastructure on demand" and "scalability on demand" in EMR to run heavy workloads only when needed, cutting idle costs and overcoming Lambda memory and timeout constraints.

  • Standout tactic — Infrastructure on demand: Create clusters dynamically (triggered by S3 object events or CloudWatch schedules), submit work as EMR steps (run_job_flow or add_job_flow_steps), set ScaleDownBehavior='TERMINATE_AT_TASK_COMPLETION', and set ActionOnFailure='TERMINATE_CLUSTER' on each step to prevent cost creep.

  • Standout tactic — Scalability on demand: Attach EMR auto-scaling policies with at least two rules (scale-out and scale-in). Example: scale out when available memory <15% for 5 minutes; scale in when available memory >75% for 3 minutes, adjusting task instance groups by defined counts.

  • Real-world lesson: Treat EMR as a "slow Lambda"—it takes ~10 minutes to provision but offers near-unlimited compute, memory, and runtime. Start small, rely on auto-scaling, and weigh bootstrap time against cost and performance needs.

*This summary was created with AI assistance, using our original content.

It was when I was introduced to AWS Lambda functions. The design of the AWS Lambda service dictates certain limitations. Execution timeout and memory available are good examples. In my case, those most often rendered Lambda functions as an inappropriate tool for the task.

Still, the ideas behind the Lambda service are clever, generally applicable, and useful. Two are particularly interesting to me…

Infrastructure on demand, because there’s no need for infrastructure to exist if it’s not going to be used. To emphasize this principle: there is no need to pay for infrastructure if it’s idling.

Scalability on demand, because sometimes, there’s a need for more computing power. Now, off to making AWS EMR our super Lambda …

Data processing that demands high computational power isn’t cheap. The more data there is to process, the more it costs to store, scan, and derive meaningful insights from it. AWS EMR (Elastic MapReduce) is one of the tools I use most to tackle problems of this nature. I can tell you, the costs of using it can pile up quite fast.

About AWS EMR

The AWS EMR service provisions clusters of computers and provides us with their computational power. That also hints at what a proper use case for EMR could be. A very loose definition: EMR is a good choice when traditional data querying/processing tools are not delivering results in a reasonable time.

For example, if a SQL query fired against MySQL takes twelve hours to return results, an Apache Spark application or an Apache Hive query, both of which can run on AWS EMR, might be a better choice.

A single EMR cluster consists of several components:

EMR cluster diagram
  • Master node (or up to three master nodes): A master node manages the cluster and runs the cluster resource manager. It also, as AWS docs say, “runs the HDFS NameNode service, tracks the status of jobs submitted to the cluster, and monitors the health of the instance groups”.

  • Core nodes: Perform computational tasks and coordinate data storage in HDFS. They are managed by the master node. There can be only one core node instance group.

  • Task nodes: The backbone of the cluster’s computational power. They perform computational tasks only and do not store data in HDFS. There can be up to 48 task node instance groups, each with a uniform instance type.

A minimal EMR cluster would have a single master node and, let’s say, two core nodes. A reasonable master node could be an m5.xlarge instance. Core nodes can be, for example, r5.xlarge instances. 

This is already not a trivial setup, let alone its cost, compared to an AWS Lambda function.

We can add to that a task node instance group… r5.2xlarge or r5.4xlarge instances. That's getting very expensive quickly!
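To make the cluster shape concrete, here is a sketch of it expressed as the Instances parameter later passed to boto3's run_job_flow. The group names, the task instance count, and the keep-alive setting are illustrative assumptions, not prescriptions.

```python
# A sketch of the cluster described above, shaped as the "Instances"
# parameter of boto3's run_job_flow. Group names and the task group
# count are assumptions for illustration.
instances = {
    "InstanceGroups": [
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": "r5.xlarge", "InstanceCount": 2},
        {"Name": "task", "InstanceRole": "TASK",
         "InstanceType": "r5.2xlarge", "InstanceCount": 2},
    ],
    # Let the cluster shut down once it runs out of steps.
    "KeepJobFlowAliveWhenNoSteps": False,
}
```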

This is where we can “steal” from AWS Lambda. We can incorporate those two principles into our use of AWS EMR.

EMR infrastructure on demand

1. Cluster creation

We can create an EMR cluster only when needed and shut it down when it is no longer needed.

Dynamic cluster creation

As shown in the image above, an EMR Cluster can be created dynamically when needed. That can be triggered by any event or message from any AWS service.

For example, we may want to create an EMR cluster when data arrives in AWS S3, i.e., when an object is created in an S3 bucket.

Another use case would be to create an EMR cluster as part of scheduled processing. This can be triggered by an AWS CloudWatch cron event.

Regardless of which event we decide to react to, we need a tool to run our creation mechanism. It can be a Step function, a Batch job, or a Lambda function, for instance.
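As a sketch of the S3-triggered variant, the Lambda handler below reads the bucket and key from the event and creates a cluster. The release label, IAM roles, log location, and instance sizes are assumptions for illustration, not values mandated by the article.

```python
# Hypothetical sketch of a Lambda handler that creates an EMR cluster
# when an S3 ObjectCreated event arrives. Release label, roles, and
# log location are assumptions.

def build_run_job_flow_args(bucket, key):
    """Translate an S3 event into run_job_flow keyword arguments."""
    return {
        "Name": f"process-{key}",
        "ReleaseLabel": "emr-6.15.0",          # assumed EMR release
        "LogUri": f"s3://{bucket}/emr-logs/",  # assumed log location
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "r5.xlarge",
                 "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
        "ScaleDownBehavior": "TERMINATE_AT_TASK_COMPLETION",
    }

def handler(event, context):
    import boto3  # available in the Lambda runtime
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_run_job_flow_args(bucket, key))
    return {"ClusterId": response["JobFlowId"]}
```

The same build function could be reused from a Step Functions task or a Batch job; only the event-parsing part of the handler is S3-specific.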

2. Workload submission

Once an EMR cluster is created, the workload can be submitted as an array of EMR steps. Using the Python Boto3 library, for instance, it can be done either during cluster creation

emr_response = emr.run_job_flow(
    ...,
    Steps=steps_definition,
    ...
)

or through a separate call to AWS EMR service

emr.add_job_flow_steps(
    JobFlowId='string',
    Steps=[...]
)
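For illustration, a steps definition for a Spark job might look like the sketch below. The step name and script location are assumptions; command-runner.jar is EMR's standard mechanism for launching spark-submit on the cluster.

```python
# Hypothetical steps definition for a Spark job. The step name and the
# S3 script path are illustrative assumptions.
steps_definition = [
    {
        "Name": "daily-aggregation",
        # Shut the cluster down if this step fails (see "Step failure").
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://example-bucket/jobs/aggregate.py",
            ],
        },
    }
]
```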

3. Cluster destruction

What is going to happen with the cluster when it is done with the work? We might want to leave it up and running for possible future workload submission. But let’s assume that is not the case — we want to shut it down.

In that case, there is a detail that needs to be taken care of upon cluster creation:

emr_response = emr.run_job_flow(
    ...,
    ScaleDownBehavior="TERMINATE_AT_TASK_COMPLETION"
)

By specifying this scale-down behavior for EMR, the cluster will be destroyed when all the work is done.

4. Step failure

What happens if a single EMR step fails?

We don’t want to leave that cluster running idle just because it didn’t reach the last step and shut down gracefully.

What we do want is to define its behavior in case of a step failure by specifying for each step submitted:

"actionOnFailure": "TERMINATE_CLUSTER"

By doing all this, we can be sure we are safe from AWS cost creep. Also, the principle of “Infrastructure on demand” can be considered implemented.

Scalability on demand

How to implement the scalability on demand principle? By using EMR’s built-in auto-scaling feature. Attaching an auto-scaling policy to an EMR cluster enables it to scale up or down based on demand.

The auto-scaling policy should have at least two rules, but can have more: one rule should tell the cluster when to grow, and another when to shrink.

The scale-out rule could be something like: “if available cluster memory becomes less than 15% and stays like that for more than 5 minutes, this cluster should grow one of its task instance groups by 5 instances”.

A scale-in rule could be something like: “if available cluster memory becomes more than 75% and stays like that for more than 3 minutes, this cluster should shrink the same task instance group by 3 instances”.

A proper auto-scaling policy definition for an EMR cluster is shown below.

Auto-scaling EMR
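Those two rules can be sketched as an auto-scaling policy document attached via boto3's put_auto_scaling_policy. The rule names, capacity bounds, and cooldown values are assumptions; the thresholds and adjustments mirror the 15%/75% rules described above.

```python
# Hypothetical auto-scaling policy matching the rules above: scale out
# when available YARN memory is below 15% for 5 minutes, scale in when
# it is above 75% for 3 minutes. Capacity bounds and cooldowns are
# assumptions for illustration.
auto_scaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
    "Rules": [
        {
            "Name": "scale-out-on-low-memory",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 5,   # grow by 5 instances
                    "CoolDown": 300,
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": "LESS_THAN",
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Period": 300,            # sustained for 5 minutes
                    "Threshold": 15.0,
                    "Unit": "PERCENT",
                }
            },
        },
        {
            "Name": "scale-in-on-high-memory",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": -3,  # shrink by 3 instances
                    "CoolDown": 300,
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": "GREATER_THAN",
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Period": 180,            # sustained for 3 minutes
                    "Threshold": 75.0,
                    "Unit": "PERCENT",
                }
            },
        },
    ],
}
```

The policy would be attached to a specific task instance group with emr.put_auto_scaling_policy(ClusterId=..., InstanceGroupId=..., AutoScalingPolicy=auto_scaling_policy).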

Implementing this principle gives us the following workflow:

  • Start with a cluster of minimal size

  • When the incoming workload becomes too heavy for the cluster’s current size, the cluster grows

  • When the load on the cluster wears off, it shuts down some of the worker instances and shrinks in size

EMR in summary

These two principles applied together almost make an EMR cluster a kind of AWS Lambda function. A "Lambda" that is slow to start, but has almost limitless computational power, memory, and processing time.

As with all things, something’s gotta give: in this case, a significant bootstrap time. It takes around ten minutes to provision a cluster.

If the bootstrap time is not an issue, the EMR approach allows you to unleash automatically scaled and provisioned processing power while also bypassing the Lambda’s memory and timeout limits.


Director of Demand @ Leadfeeder

Jamie Pagan is Director of Demand at Leadfeeder, where he leads demand generation and pipeline growth initiatives. His work focuses on connecting marketing activity with revenue by combining intent signals, campaign performance data, and audience insights.

With experience building scalable demand engines and launching growth-focused campaigns, Jamie brings a practical perspective on how marketing teams generate and capture demand. His experience working with intent data and marketing analytics informs his approach to identifying high-intent buyers and converting interest into qualified opportunities.
