It’s Karpenter Time!
If you’re reading this post, it means one of two things: either you’re considering a migration to Karpenter, or you’re already using Karpenter and want to optimize it further.
In both cases - you’re on the right path! In 2024 it makes little sense to keep using the good old cluster-autoscaler. It has served us well in the past, but it has downsides: it depends on predefined auto-scaling groups and can only scale out within groups of similarly sized nodes. It did the job, but it wasn’t fast, smart or cost-effective.
But now we have Karpenter!
Karpenter gives us cost-aware node provisioning that’s based on actual workload resource requests. And it also has the ability to consolidate nodes whenever it detects idle capacity across the cluster. All in all - a great, smart piece of software.
Organizations that migrated to Karpenter report up to a 40% reduction in cluster costs, along with improved autoscaling performance. (Link at the bottom of the post).
And on top of all that - Karpenter is totally free and open-source. The project has recently been adopted by Kubernetes SIG Autoscaling and transitioned to the beta phase. It already has almost 5K GitHub stars and contributions from 200+ developers. Originally it only worked with AWS EKS, but last month Microsoft announced the preview version of their NAP (Node Auto Provisioning) solution, which is based on Karpenter and their own infrastructure provider for Azure. (Link at the bottom of the post).
By the way, if you’re running on GKE - Google has its own NAP (Node Auto Provisioning) solution. It’s not Karpenter, but it works in much the same way - so the benefits described further in this article apply to GKE too (i.e. you can further reduce your costs and improve autoscaling performance).
In the past some organizations have turned to 3rd party commercial autoscalers (such as CAST.AI or Spot by NetApp) in order to address cluster-autoscaler shortcomings. But now that Karpenter has reached maturity - there’s no practical reason to pay for an autoscaler license and risk getting vendor-locked in a custom solution.
In other words - if you’re still using cluster-autoscaler (or any other autoscaler for that matter) - stop!
It’s time to migrate to Karpenter. Especially as doing so is pretty straightforward and can be done without disrupting your users or your engineers - as described in the excellent blog post by Grafana Labs. (Link at the bottom of the page)
Optimizing Karpenter ROI
Yes, Karpenter is awesome, it gives us smart, just-in-time node provisioning.
But what it doesn’t have is a deep understanding of your workloads and their actual and historical resource utilization and reliability needs. And that’s where PerfectScale shines.
In fact - we’ve been able to provide our customers with an additional 30 to 50% cost reduction on top of what they had already achieved with Karpenter.
Yes - you heard it right - Karpenter can be much more cost-effective when it runs on optimized clusters.
As an example - when a number of small pods are pending, Karpenter can get over-eager and provision a number of relatively small nodes (one per pending pod), which together add up to substantial idle capacity. Yes, at a certain point Karpenter’s consolidation will kick in and densify the nodes - but by then we’ve already paid for the extra resources, and the pod evictions that consolidation requires will impact our application reliability. PerfectScale, on the other hand, helps you identify such scenarios and reconfigure Karpenter to prevent them from happening altogether.
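To make the arithmetic behind this scenario concrete, here is a minimal sketch. All numbers (pod requests, node sizes) are made up for illustration, and `idle_capacity` is a hypothetical helper, not part of Karpenter or PerfectScale.

```python
def idle_capacity(pod_requests, node_sizes):
    """Idle vCPU = total node capacity minus total requested vCPU."""
    return sum(node_sizes) - sum(pod_requests)

# Assumption: six pending pods, each requesting 0.5 vCPU.
pending_pods = [0.5] * 6

# Scenario A: one small 2-vCPU node is provisioned per pending pod.
one_node_per_pod = [2.0] * len(pending_pods)

# Scenario B: a single 4-vCPU node hosts all six pods.
single_larger_node = [4.0]

idle_a = idle_capacity(pending_pods, one_node_per_pod)
idle_b = idle_capacity(pending_pods, single_larger_node)

print(f"one node per pod: {idle_a} idle vCPU")
print(f"single node:      {idle_b} idle vCPU")
```

With these assumed numbers the one-node-per-pod layout leaves 9 vCPU idle versus 1 vCPU on the single node. Real clusters also reserve capacity for system daemons, so treat this as a rough picture only.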
Or as in the example below - a customer’s workloads were overprovisioned on CPU across the board, with CPU requests on average 2x higher than actual utilization. After PerfectScale had analyzed the workloads for a while, its automation was activated on April 15. From that point on we can see requests gradually getting much closer to actual utilization, which in turn causes Karpenter to provision smaller nodes with less wasted resources.
And when we correlate this with the graph at the bottom, we can also see that it translates into a lower overall node count.
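A rough back-of-the-envelope sketch of why tighter requests translate into fewer nodes. Only the 2x over-provisioning factor comes from the example above; the usage figure, the 25% headroom and the node size are assumptions picked for illustration.

```python
from math import ceil

def nodes_needed(total_cpu_requests, node_cpu):
    """Nodes required to fit the summed CPU requests (ignores bin-packing losses)."""
    return ceil(total_cpu_requests / node_cpu)

actual_usage = 40.0   # vCPU actually consumed across workloads (assumed)
node_cpu = 8.0        # vCPU per node (assumed)

requests_before = actual_usage * 2.0    # requests 2x utilization, as in the example
requests_after = actual_usage * 1.25    # assumed 25% headroom after right-sizing

nodes_before = nodes_needed(requests_before, node_cpu)
nodes_after = nodes_needed(requests_after, node_cpu)

print(f"before: {nodes_before} nodes, after: {nodes_after} nodes")
```

Under these assumptions the cluster goes from 10 nodes to 7 for the same actual load. Karpenter sizes nodes from requests, so the saving comes entirely from the requests moving closer to utilization.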
Here’s how PerfectScale augments Karpenter’s autoscaling:
- PerfectScale’s PodFit container right-sizing automation updates your workload resource requests based on actual utilization. In this way - when Karpenter provisions new nodes their sizes are based on what containers actually need, not on what the developers have guesstimated.
- With the newly released InfraFit - you’re now able to see which nodes Karpenter provisions and how efficient its bin packing is, and to improve on it by recomposing your Karpenter NodePools. All of this is based on PerfectScale timeline data that correlates Karpenter autoscaling events with node utilization percentages.
- With InfraFit it also becomes very straightforward to monitor your actual Spot/OnDemand ratios and their cost effectiveness and update NodePools accordingly.
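To illustrate the general idea of utilization-based right-sizing from the first bullet, here is a minimal sketch. It is not PerfectScale’s actual algorithm: the percentile-plus-headroom rule, the `recommend_request` helper and the sample values are all assumptions made for this example.

```python
from math import ceil

def recommend_request(usage_millicores, percentile_pct=95, headroom_pct=15):
    """Suggest a CPU request: nearest-rank percentile of observed usage plus headroom.

    Uses integer millicores throughout and rounds the result up.
    """
    ordered = sorted(usage_millicores)
    rank = max(1, ceil(percentile_pct * len(ordered) / 100))  # nearest-rank method
    baseline = ordered[rank - 1]
    return (baseline * (100 + headroom_pct) + 99) // 100      # integer ceiling

# Assumed CPU usage samples for one container, in millicores.
samples = [120, 130, 110, 150, 140, 135, 125, 160, 145, 155]
current_request = 300  # what the developers guesstimated (assumed)

recommended = recommend_request(samples)
print(f"current: {current_request}m, recommended: {recommended}m")
```

A higher percentile or more headroom trades cost for safety. A production-grade tool would also account for memory, traffic seasonality and reliability requirements, which this sketch leaves out.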
By the way - these benefits can be achieved when combining PerfectScale with any of the cluster autoscalers mentioned above. But the Karpenter+PerfectScale duo is the most cost-effective of them all, as the only license you’re paying for is PerfectScale’s.
So yes - PerfectScale helps boost your Karpenter ROI. You don’t need to choose one or, god forbid, migrate. You can keep your Karpenter investment and improve its performance and cost-effectiveness. Karpenter and PerfectScale are better together.
Karpenter is a Community
And yes - we are also running Karpenter in-house. In fact we’ve gained significant operational expertise with the tool. Scroll further for the link to the blog post describing how we monitor Karpenter and to download the Karpenter-specific Grafana dashboards we’ve created for the community.
Contact us today for a free cluster health check. Our Kubernetes pros will happily help you optimize your Karpenter configuration to perfection.
Links:
- Anthropic slashed 40% of their K8s costs with Karpenter: https://www.thestack.technology/aws-anthropic-cloud-bill-eks-karpenter/
- Azure NAP (Karpenter) in public preview: https://azure.microsoft.com/en-us/updates/public-preview-node-autoprovision-support-in-aks/
- How Grafana Labs switched to Karpenter from cluster-autoscaler: https://grafana.com/blog/2023/11/09/how-grafana-labs-switched-to-karpenter-to-reduce-costs-and-complexities-in-amazon-eks/
- How we monitor Karpenter at PerfectScale: https://www.perfectscale.io/blog/karpenter-monitoring-with-prometheus