February 12, 2025

Autonomous Ephemeral Workload Optimization from PerfectScale

Ira Chernous
Technical PMM & Documentation Specialist

Ephemeral workloads, particularly those aimed at data processing and AI/ML, have significantly increased their share due to the rapid development of dynamic, workflow-based applications in which short-lived workloads play a key role. These processes demand highly scalable and flexible computing resources to handle temporary but often resource-intensive tasks. Kubernetes pods are ephemeral and immutable by design, which makes them a great fit for this type of application. However, once we move beyond standard Kubernetes workloads such as Deployments or DaemonSets, new challenges arise, and traditional Kubernetes models and strategies may fall short.

If you’re here, you’ve likely faced challenges managing dynamic K8s environments and perhaps been hit with questions like:

  1. Am I managing ephemeral workloads efficiently without over-provisioning or risking resource shortages?
  2. Is my monitoring properly configured to track performance and resource usage effectively?
  3. If yes to 1 and 2, are there ways to confirm it with trusted data points?

Let's dive deeper into ephemeral workloads, explore the challenges you might encounter, and find actionable strategies to solve them.

What do we call “ephemeral workloads”?

As already noted, all Kubernetes pods are ephemeral by nature (unless they are part of a StatefulSet, which gives them a stable identity). However, when we talk about ephemeral workloads, we refer to (relatively) short-lived, stateless tasks that use resources dynamically and are not owned by a standard API resource such as Job or CronJob. Essentially, these are standalone pods created by custom controllers such as Knative, the Airflow Kubernetes Executor, and the Spark Operator, as well as CI/CD execution pods triggered by GitLab or Jenkins. These pods run only as needed and shut down automatically once they finish, which perfectly suits dynamic environments.
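To make this concrete, here is a minimal sketch of such a standalone pod; the name, image, and command are purely illustrative:

```
# A minimal standalone pod of the kind a custom controller might create;
# it has no Deployment, Job, or other standard owner.
apiVersion: v1
kind: Pod
metadata:
  name: etl-task-7f3a9        # typically generated by the controller
  labels:
    app: etl-worker           # a shared label makes later grouping possible
spec:
  restartPolicy: Never        # run the task once, then terminate
  containers:
    - name: task
      image: python:3.12-slim
      command: ["python", "-c", "print('processing batch')"]
```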

Orchestration of this type of workload is essential for handling dynamic, short-lived processes in data and AI pipelines. They are well suited to big data processing and to machine learning tasks such as model training. In short, these workloads provide the flexibility and scalability needed to handle complex, temporary, and resource-intensive operations more efficiently.

But are your dynamic workloads actually efficient?

Why do ephemeral workloads drive my expenses up?

Unexpected cost spikes may occur while running your AI/ML applications, and they can catch you off guard, especially when ephemeral (short-lived) workloads are constantly scaling to meet demand. Whether or not the app scales up quickly, using resources effectively can be a headache. Understanding why ephemeral workloads can drive up costs is the first step toward controlling your cloud expenses and ultimately driving efficiency.

We took a closer look at these cases and uncovered several key reasons why a solution that seems efficient at first glance can quickly turn into a major driver of cloud waste.

We noticed that many customers struggle to track resource usage in real time because of the dynamic nature of ephemeral workloads, which are unpredictable in when they run and how many resources they consume. This lack of visibility often leads to over-provisioning during high-demand periods and under-utilization during low-demand periods, resulting in wasted spend. Since these workloads can be highly fragmented and spun up on demand, traditional monitoring strategies are no longer effective, and unfortunately, in many cases, teams only see the tip of the iceberg.

Our team also noted a sort of ‘guilty pleasure’ caused by an unclear understanding of dynamic workloads and their demands during peak traffic, which often leads to inefficiencies: over-provisioning to guarantee performance during busy periods, even when the workload never fully utilizes those resources, is a well-known example.

Last but not least, there is the situation where ephemeral workloads finish their tasks but continue to occupy resources such as memory or CPU that are not fully utilized, leading to unnecessary expenses. Without automation that quickly scales down unused resources, costs can grow without delivering any value.

Boost efficiency, not K8s costs!

Naming the issue clearly creates focus for addressing it and underlines the need for a solution that covers all of the aspects above. Inspired by these challenges, we're excited to introduce a unique approach to autonomously right-sizing ephemeral workloads from PerfectScale by DoiT.

PerfectScale's automation enables effortless optimization of K8s workloads, even in dynamic and complex environments. It eliminates unused capacity, significantly reduces cloud costs, and ensures peak performance without latency or bottlenecks.

Two simple steps are needed to start optimizing ephemeral workloads like SparkJobs, Airflow, Temporal, and others. 

Step 1: Group ephemeral workloads

PerfectScale provides advanced intelligence for grouping dynamic workloads, which aggregates data across transient entities and enables you to unlock better visibility and analysis, focus on what matters with actionable data points, and ultimately reduce cloud expenses. All you need to do is add the necessary labels to your workloads; once they are in place, PerfectScale automatically takes care of the rest of the process for you.

Labels:

```
app: spark-job-over-hri
automation.perfectscale.io/generatedFrom: 53836ca7
perfectscale.io/workload-grouping-honor-image: "true"
perfectscale.io/workload-grouping-honor-spec: "true"
perfectscale.io/workload-grouping-workload-name: spark-job-over-hri
perfectscale.io/workload-grouping-workload-type: CustomSparkJob
```
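For illustration, here is a hedged sketch of how these labels might appear on a pod created by a custom controller; the pod name and image are hypothetical:

```
apiVersion: v1
kind: Pod
metadata:
  name: spark-job-over-hri-exec-1      # illustrative name
  labels:
    app: spark-job-over-hri
    perfectscale.io/workload-grouping-workload-name: spark-job-over-hri
    perfectscale.io/workload-grouping-workload-type: CustomSparkJob
    perfectscale.io/workload-grouping-honor-image: "true"
    perfectscale.io/workload-grouping-honor-spec: "true"
spec:
  restartPolicy: Never
  containers:
    - name: executor
      image: spark:3.5.1               # illustrative image
```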

When a workload is submitted, the API server receives the request and stores its specification in etcd. For a Deployment, the Deployment controller creates a ReplicaSet, the ReplicaSet controller then creates the pods, and the pods sit in a Pending state in the scheduling queue. Standalone ephemeral pods follow the same creation path, which is the moment at which their resource requests and limits can still be adjusted.

Once your workloads are grouped into a single workload of, say, the CustomSparkJob type, you can easily cut through the noise of chaotic data, mitigate their unpredictable nature, and manage ephemeral workloads more effectively, or move one step further and fully automate the process for instant results.

Step 2: Configure flexible automation

You're now one step away from fully automating your optimization flow. The only missing piece is configuring automation for your CustomSparkJob workload. At this stage, PerfectScale provides a range of highly customizable automation configuration options that you can apply by setting up a Custom Resource (CR).
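As a purely illustrative sketch, such a CR might look something like the following; the apiVersion, kind, and field names below are assumptions rather than the actual PerfectScale schema, so refer to the PerfectScale documentation for the real format:

```
# Hypothetical sketch only: every field here is an assumption, not the
# actual PerfectScale CR schema; see the PerfectScale documentation.
apiVersion: perfectscale.io/v1        # assumed API group/version
kind: AutomationPolicy                # assumed kind
metadata:
  name: custom-sparkjob-automation
spec:
  workloadType: CustomSparkJob        # the grouped workload from Step 1
  mode: automatic                     # e.g., recommend-only vs. automatic
```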


This approach provides flexibility and enables you to configure the automation precisely, making it easy to adjust to your specific use cases and application operational needs.

Following this, our AI-powered algorithms and sophisticated analytics predict workload patterns and adjust resource allocation based on historical data.

How can this feature work for me?

Use case 1: Airflow Kubernetes Executor

Apache Airflow® is the leading open-source platform for orchestrating data pipelines and ML workflows. Airflow manages workflows as directed acyclic graphs (DAGs) consisting of discrete tasks. The Kubernetes executor for Airflow runs each task instance in its own pod on a Kubernetes cluster.

When a DAG submits a task, the Kubernetes executor requests a worker pod from the Kubernetes API. The worker pod then runs the task, reports the result, and terminates.


Just as for any other pod, resource allocation for Python-based Airflow tasks isn’t trivial. Some tasks need more memory to load big chunks of data, while others are CPU-intensive because of complex calculations. Defining container resource requests and limits in advance is hard due to the dynamic nature of data processing tasks; in fact, most data engineers will probably skip defining them altogether. If you look at the official examples in the Airflow Kubernetes Executor documentation, none of them features resource specifications. Measuring usage after the fact is just as hard because of the ephemeral nature of Airflow worker pods.

The only way to manage resource allocation for Airflow workers is to group them by some common property (e.g., a pod label) and automatically set their resource requests and limits when they are created in the cluster. This way, recommendations based on the metrics of previous pods can be applied to the new pods Airflow creates. Luckily, that’s exactly what PerfectScale's new grouped-workloads automation feature enables you to do effortlessly, as sketched below.
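With the Kubernetes executor, one convenient place to attach such a label is Airflow's pod_template_file; the sketch below assumes that mechanism, and the label values and image are illustrative:

```
# A hedged sketch of an Airflow pod_template_file: worker pods carry the
# grouping labels and deliberately omit resource requests/limits so they
# can be set automatically at creation time.
apiVersion: v1
kind: Pod
metadata:
  labels:
    perfectscale.io/workload-grouping-workload-name: etl-dag-workers
    perfectscale.io/workload-grouping-workload-type: AirflowTask
spec:
  containers:
    - name: base                 # the Kubernetes executor expects this container name
      image: apache/airflow:2.9.0
```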

Use case 2: Spark Driver and Executor Pods for Spark Operator

The Kubernetes Operator for Apache Spark aims to make specifying and running Spark applications as easy and idiomatic as running other workloads on Kubernetes. It uses Kubernetes custom resources to specify, run, and surface the status of Spark applications.

Whenever a SparkApplication custom resource is created or updated, the Spark controller acts on the watch events, creating the Spark driver and executor pods.

Let’s explore the following diagram, which shows how different components interact and work together.

[Diagram: Kubernetes Cluster]

While Spark driver pods are pretty uniform in their resource consumption, the executor pods process dynamic payloads and can be very inefficient if they run with a static configuration. Here’s an example of how the Spark operator configures executor pod resource allocation:

```
spec:
  executor:
    cores: 1
    instances: 3
    memory: 512m
    labels:
      version: 3.1.1
    serviceAccount: spark
```

One can see that this only allows setting requests, not limits, and that the definitions are static. To allocate the correct amount of resources for ephemeral Spark workloads, we can initially skip resource allocation for the executor pods and group them in PerfectScale by labels that we apply, e.g.:

```
perfectscale.io/workload-grouping-workload-name=SparkJobName
perfectscale.io/workload-grouping-workload-type=SparkJob
perfectscale.io/workload-grouping-honor-image=true
perfectscale.io/workload-grouping-honor-spec=true
```

Then, we can seamlessly enable automation to set resources for the executors whenever the operator creates them, as in the sketch below.
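Putting it together, the executor section of the SparkApplication could carry the grouping labels while leaving sizing to the automation. This is a hedged sketch; the workload name is illustrative:

```
# Hedged sketch: cores and memory are intentionally omitted so they can be
# set automatically at pod creation; the workload name is illustrative.
spec:
  executor:
    instances: 3
    labels:
      version: 3.1.1
      perfectscale.io/workload-grouping-workload-name: SparkJobName
      perfectscale.io/workload-grouping-workload-type: SparkJob
      perfectscale.io/workload-grouping-honor-image: "true"
      perfectscale.io/workload-grouping-honor-spec: "true"
    serviceAccount: spark
```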

Use case 3: Jenkins Jobs

The Jenkins Kubernetes plugin scales Jenkins agents running in Kubernetes. The plugin creates a Kubernetes pod for each agent it starts and stops the pod after each build.


While Jenkins pod templates allow setting resource allocations per container, they are very hard to get right. What we can do instead is set pod labels, group all the relevant Jenkins agent pods in PerfectScale, and automate their resource allocation according to actual utilization, as in the sketch below.
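For example, the raw pod YAML that the Jenkins Kubernetes plugin accepts for an agent template could carry the grouping labels; this is a hedged sketch, with illustrative label values and image:

```
# Hedged sketch of agent pod YAML for the Jenkins Kubernetes plugin;
# no resources are set, so they can be assigned automatically at creation.
apiVersion: v1
kind: Pod
metadata:
  labels:
    perfectscale.io/workload-grouping-workload-name: jenkins-build-agents
    perfectscale.io/workload-grouping-workload-type: JenkinsJob
spec:
  containers:
    - name: jnlp                          # the plugin's default agent container
      image: jenkins/inbound-agent:latest
```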

In conclusion, managing, scaling, and optimizing dynamic AI/ML applications is not trivial; it can challenge teams and still fail to deliver the desired results. Our team believes the right tools and strategies can take these processes to new heights and help you achieve your goals effectively.

Discover more in our documentation, or schedule a technical session with our team to see how this feature could help you.

Still not with PerfectScale by DoiT? Start today: it is free!
