July 17, 2024

Kubernetes Horizontal Pod Autoscaler (HPA)

Tania Duggal
Technical Writer

The Kubernetes Horizontal Pod Autoscaler (HPA), along with the Vertical Pod Autoscaler (VPA) and Cluster Autoscaler (CA), is one of the three main autoscaling strategies in Kubernetes.

a. Horizontal Pod Autoscaler: scaling the number of replicas of an application

b. Vertical Pod Autoscaler: adjust the resource settings (requests and limits) of a container

c. Cluster Autoscaler: scaling the number of nodes in a cluster

In this article, we will take a look at the first chapter of Kubernetes Autoscaling, i.e., Horizontal Pod Autoscaler. We will explore what HPA is, how it works, the implementation of HPA, best practices, and limitations.


What is Kubernetes Horizontal Pod Autoscaler (HPA)?


Kubernetes Horizontal Pod Autoscaler (HPA) is a controller that automatically adjusts the number of pods based on observed CPU utilization or other metrics. The goal of Kubernetes HPA is to ensure that applications can handle varying loads efficiently by scaling out (increasing the number of pods) during high demand and scaling in (decreasing the number of pods) during low demand. This dynamic scaling helps maintain efficient resource utilization and application performance. Kubernetes HPA continuously monitors the specified metrics and adjusts the replica count to match the desired state, ensuring that the application remains responsive.


Figure: Kubernetes Horizontal Pod Autoscaler



How Horizontal Pod Autoscaler (HPA) Works

Figure: How the Horizontal Pod Autoscaler Works

Kubernetes HPA works as a control loop that runs at regular intervals, every 15 seconds by default. This interval can be adjusted using the --horizontal-pod-autoscaler-sync-period parameter in the kube-controller-manager. During each interval, the K8s HPA controller queries the resource utilization metrics specified in the HPA configuration. The controller identifies the target resource defined by the scaleTargetRef field in the HPA configuration, selects the pods based on the target resource's selector labels, and fetches the relevant metrics. For CPU and memory metrics, the controller uses the resource metrics API; for custom metrics, it uses the custom metrics API. If you want to create your own custom metrics adapter, take a look at the starter template.
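On a kubeadm-provisioned cluster, for example, this flag can be added to the kube-controller-manager static pod manifest (a sketch, assuming the default kubeadm file layout; only the relevant fields are shown):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm default path)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=30s  # query metrics every 30s instead of the 15s default
    # ...existing flags remain unchanged
```

The kubelet picks up the change automatically and restarts the controller manager when the manifest file is saved.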

For HPA metrics of individual pod resources, such as CPU usage, the Kubernetes Horizontal Pod Autoscaler controller collects data for each pod that it targets. When a target utilization value is specified, the controller calculates CPU utilization as a percentage of the resource request defined for each container within the pod. It then averages the metric values across all targeted pods over a default time frame of 5 minutes (configurable) to generate a ratio, which is used to determine the appropriate number of replicas. If any container within a pod lacks the necessary resource request settings, the CPU utilization for that pod will not be defined, and the autoscaler will not take any action based on that metric.

HPA Kubernetes generally retrieves metrics from aggregated APIs like `metrics.k8s.io`, `custom.metrics.k8s.io`, or `external.metrics.k8s.io`. The `metrics.k8s.io` API is generally provided by an add-on called Metrics Server, which must be deployed separately.

The Kubernetes Horizontal Pod Autoscaler (HPA) uses a straightforward algorithm to adjust the number of pod replicas based on the ratio of the current metric value to the desired metric value. The core formula is:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

For example, if the current CPU usage is 300m and the target is 150m, the number of replicas will double because 300.0 / 150.0 = 2.0. Conversely, if the current usage is 75m, the replicas will be halved because 75.0 / 150.0 = 0.5.
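As a quick sanity check, the formula can be evaluated directly (a toy sketch in Python, not the controller's actual code; the function name is ours):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float) -> int:
    # desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
    return math.ceil(current_replicas * (current_metric / desired_metric))

# Target CPU is 150m in both cases from the text:
print(desired_replicas(2, 300, 150))  # usage 300m -> replicas double from 2 to 4
print(desired_replicas(2, 75, 150))   # usage 75m  -> replicas halve from 2 to 1
```

Note the ceiling: the controller always rounds up, so a ratio of 1.07 on 3 replicas still adds a pod.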

Kubernetes HPA metrics:

From Kubernetes v1.30, the Horizontal Pod Autoscaler (HPA) supports container resource metrics. This means you can scale your applications based on the resource usage of individual containers within your pods. Previously, Kubernetes HPA could scale based on pod-level CPU, memory, and custom metrics, but now you can target specific containers for more granular control.

For example: You can configure K8s HPA to scale based on the memory usage of a database container within a pod, ignoring the backup sidecar container.

This is how you can define container resource metrics:

type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60

In this example, the K8s HPA will scale the pods to maintain an average CPU utilization of 60% for the application container.

Note: If you change the name of a container that Kubernetes HPA is tracking, update the HPA to include both the old and new container names before rolling out the change. This ensures continuous and effective scaling during the update process. Once the update is complete, you can remove the old container name from the K8s HPA specification.
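For example, if a container named `application` is being renamed to `application-v2`, the HPA metrics list can temporarily carry both entries during the rollout (a sketch; both container names are illustrative):

```yaml
metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    container: application      # old name, still present on pods from the old ReplicaSet
    target:
      type: Utilization
      averageUtilization: 60
- type: ContainerResource
  containerResource:
    name: cpu
    container: application-v2   # new name; remove the old entry once the rollout finishes
    target:
      type: Utilization
      averageUtilization: 60
```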

Horizontal Pod Autoscaler Configuration:

You can configure HPA in your Kubernetes cluster in two ways. These are:

a. By the kubectl autoscale command

b. By HPA YAML files

Let's discuss the first way to configure the Horizontal Pod Autoscaler in your K8s cluster:

kubectl autoscale deploy foo --min=3 --max=10 --cpu-percent=80

This command creates an autoscaler for the deployment foo, sets the minimum and maximum number of pods, and sets the target average CPU utilization to 80%. The autoscaler ensures there are always at least 3 replicas running and, if needed, can scale up to 10 replicas. It adjusts the number of replicas to maintain an average CPU usage of 80% across all pods in the deployment.
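The same autoscaler can also be expressed declaratively. A rough equivalent using the autoscaling/v2 API (a sketch based on the command above; namespace omitted) would be:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80  # same threshold as --cpu-percent=80
```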

Note: The scaling threshold, such as 80%, is a critical parameter that defines when the autoscaler should add or remove pods. Setting this threshold involves balancing performance and cost. A lower threshold (e.g., 70%) may lead to more frequent scaling actions, potentially increasing resource costs but ensuring better performance under varying loads. Conversely, a higher threshold (e.g., 90%) might reduce costs by scaling less frequently but could risk performance degradation during sudden spikes in load. The optimal threshold depends on the specific requirements and behavior of your application.

Now let's discuss the second way.

In production environments, it is generally preferred to use HPA YAML files over the kubectl autoscale command, because YAML files can be stored in version control systems for better tracking, auditing, collaboration, and rollback capabilities.

Before implementing HPA, make sure your cluster has a Metrics Server deployed and configured. Check whether your metrics-server is running and exposing metrics properly:

kubectl get apiservices v1beta1.metrics.k8s.io -o yaml

Look for the status section in the output to ensure it is available and the conditions are met.

 Or 

kubectl top pods -A

If the metrics-server is running properly, you should see the CPU and memory usage for each pod.

Create a deployment and expose it as a Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busyhttp-deployment
  namespace: hpa-test
spec:
  selector:
    matchLabels:
      app: busyhttp
  replicas: 1
  template:
    metadata:
      labels:
        app: busyhttp
    spec:
      containers:
      - name: busyhttp
        image: otomato/busyhttp
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: busyhttp-service
  namespace: hpa-test
  labels:
    app: busyhttp
spec:
  ports:
  - port: 80
  selector:
    app: busyhttp

Apply the deployment and service:

kubectl create -f deployment.yaml

deployment.apps/busyhttp-deployment created
service/busyhttp-service created

Verify the deployment:

kubectl get deploy -n hpa-test

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
busyhttp-deployment   1/1     1            1           88s

Create the Kubernetes Horizontal Pod Autoscaler:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: busyhttp-hpa
  namespace: hpa-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: busyhttp-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

Or you can even generate the yaml file with this command:

kubectl autoscale deployment busyhttp-deployment --min=1 --max=10 --cpu-percent=50 --namespace=hpa-test --name=busyhttp-hpa --dry-run=client -o yaml

Apply the Kubernetes HPA:

kubectl create -f hpa.yaml

Check the HPA Status:

kubectl -n hpa-test get hpa

NAME           REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
busyhttp-hpa   Deployment/busyhttp-deployment   0%/50%    1         10        1          112s

Increase the Load:

kubectl -n hpa-test run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://busyhttp-service; done"

Watch the load:

kubectl get hpa busyhttp-hpa --watch -n hpa-test
NAME           REFERENCE                        TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
busyhttp-hpa   Deployment/busyhttp-deployment   305%/50%   1         10        7          10m

Here, CPU consumption has increased to 305% of the request. As a result, the Deployment was resized to 7 replicas.

Press CTRL-C to stop the load.

Check the Kubernetes HPA Status Again:

kubectl -n hpa-test get hpa

NAME           REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
busyhttp-hpa   Deployment/busyhttp-deployment   0%/50%    1         10        1          20m

The load is normal now, so the replicas are back down to 1. This doesn't happen immediately: the downscale stabilization window (5 minutes by default) delays scale-in to stabilize the scaling behavior.

Clean Up the Resources:

kubectl delete ns hpa-test

namespace "hpa-test" deleted

Kubernetes Horizontal Pod Autoscaler Best Practices

1. Attach the K8s HPA to a Deployment object rather than directly to a ReplicaSet or ReplicationController. Deployments provide a higher level of abstraction and manage ReplicaSets for you, making it easier to handle updates and rollbacks.

2. The Kubernetes Horizontal Pod Autoscaler also works for StatefulSets. If you have configured an HPA for a workload, it is recommended to let the HPA manage the replica count automatically rather than adjusting it manually. For more details, read here.

3. Use declarative YAML files to define your Kubernetes HPA resources. This approach allows you to version-control your configurations, making it easier to track changes and maintain consistency across environments.

4. Always specify resource requests for your pods. Resource requests inform the Kubernetes scheduler about the minimum resources required for your pods to run efficiently. This information is necessary for the K8s HPA to make informed scaling decisions.

5. Select metrics that accurately reflect the load on your application, such as CPU or memory usage. You can also use custom metrics if they provide better insights into your application's performance.

6. Use monitoring tools like Prometheus and Grafana to keep an eye on your K8s HPA's performance.

7. Use stabilization windows to prevent rapid scaling up and down, which can lead to instability. Cooldown periods give the system time to stabilize before making further scaling decisions. To learn more about the stabilization window, read here.
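With the autoscaling/v2 API, the stabilization window and scaling rate can be tuned per direction through the `behavior` field of the HPA spec (a sketch; the specific values are illustrative, not recommendations):

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes of sustained low load before scaling in
      policies:
      - type: Pods
        value: 2                       # then remove at most 2 pods per minute
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # scale out immediately when load rises
```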

Kubernetes Horizontal Pod Autoscaler Limitations


1. HPA in Kubernetes cannot be used with VPA when both are based on CPU or memory metrics. This is because VPA adjusts resource requests and limits, which can conflict with K8s HPA's scaling decisions. To avoid this conflict, HPA must rely on custom metrics if VPA is enabled.

2. Kubernetes Horizontal Pod Autoscaler does not consider IOPS (Input/Output Operations Per Second), network bandwidth, or storage usage in its scaling decisions. This limitation can expose applications to performance bottlenecks or outages if these resources become constrained.

3. HPA in Kubernetes does not address the issue of resource waste within the Kubernetes cluster. Administrators are still responsible for identifying and managing unused or over-provisioned resources at the container level, which can lead to inefficiencies and increased costs.

4. Kubernetes Horizontal Pod Autoscaler scales at the pod level, which means it adjusts the number of pod replicas based on resource utilization metrics like CPU and memory. While this approach works well for many applications, it may not provide the fine-grained control needed for certain applications that require more precise resource management. For more granular control, consider using custom metrics or other autoscaling mechanisms.

5. Without proper cooldown periods, HPA in Kubernetes can cause rapid scaling up and down, leading to instability. Administrators need to carefully configure stabilization windows to prevent "flapping" and ensure stable scaling behavior.
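As noted in limitation 1, when VPA manages CPU and memory, HPA can scale on a custom metric instead. A metrics entry for a per-pod request rate might look like this (a sketch using the autoscaling/v2 API; the metric name `http_requests_per_second` is illustrative and must be exposed by a custom metrics adapter such as the Prometheus adapter):

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"  # scale to keep roughly 100 req/s per pod on average
```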


Auto-scaling is a proactive strategy that allows applications to meet unpredictable demands and maintain reliability. By distributing the load, it minimizes the risk of a single point of failure. It ensures the application remains available and operational. For example, during peak traffic periods, additional instances can be spun up to handle the increased load, preventing performance degradation and ensuring a seamless user experience. Conversely, during off-peak times, scaling down reduces unnecessary resource consumption, leading to cost savings. This balance between performance and cost is crucial for businesses aiming to deliver consistent service levels while managing operational expenses effectively.

So, why wait?

Start exploring Kubernetes HPA and take your containerized applications to new heights of scalability and performance!

Ready to elevate your Kubernetes management to the next level?

With PerfectScale, you can harness the full potential of Kubernetes Horizontal Pod Autoscaling while significantly reducing your cloud costs and enhancing system resilience. Our advanced algorithms and machine learning techniques ensure your services are precisely tuned to meet demand, cutting down on waste and optimizing every layer of your K8s stack. Join industry leaders like Paramount Pictures and Creditas who have already optimized their Kubernetes environments with PerfectScale.

Start a free trial now and experience the immediate benefits of automated Kubernetes cost optimization and management, ensuring your environment is always perfectly scalable.
