Load balancers are a staple of scalable, high-throughput, high-availability architectures, and they work great for scaling web services. When requests take longer, though, things get complicated. Requests can pile up on some backends; bursts of traffic can send latency through the roof; and by the time autoscaling kicks in, it may be too late, too expensive, or both.
Asynchronous architectures and message queues can help a lot here, especially when combined with event-driven autoscaling.
Session Overview
We're going to see how to implement that pattern on Kubernetes, leveraging:
- A popular LLM to generate thousands of completions;
- RabbitMQ and PostgreSQL to store requests and responses;
- Bento to implement API servers, producers, and consumers without writing code;
- Prometheus, Grafana, and KEDA for observability, dashboards, and autoscaling;
- Helm and Helmfile to automate deployment as much as possible.
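The autoscaling piece of this stack boils down to sizing the consumer fleet from queue depth: a queue-length trigger asks for enough replicas so that each consumer handles at most a target number of pending messages. Here is a minimal Python sketch of that logic; the function name, parameters, and bounds are illustrative, not KEDA's actual API.

```python
import math

def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replica count a queue-length trigger would request:
    enough consumers so each handles at most target_per_replica
    pending messages, clamped to the [min, max] bounds."""
    if queue_length <= 0:
        # Scale to zero (or the configured floor) when the queue is empty.
        return min_replicas
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# A burst of 4200 queued completion requests, target of 100 per consumer:
print(desired_replicas(4200, 100))  # capped at max_replicas -> 10
print(desired_replicas(250, 100))   # -> 3
print(desired_replicas(0, 100))     # queue drained -> 0
```

Because the signal is queue depth rather than CPU or request rate, consumers scale up as soon as a burst lands in the queue and can scale back to zero once it drains.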
Who should watch:
- DevOps, Platform, and SRE professionals looking for ways to improve their autoscaling practices.
- Data engineers who want a better understanding of running their workloads on Kubernetes.
Meet our experts
Jerome Petazzoni
Tinkerer Extraordinaire
Part of the Docker founding team. Docker Community Advocate between 2013 and 2018. These days he teaches Kubernetes at Enix, a French Cloud Native shop.
When he's not busy with computers, he collects musical instruments, and can arguably play the theme of Zelda on a dozen of them.
Anton Weiss
Chief Storyteller, PerfectScale
Anton has a storied career in creating engaging and informative content that helps practitioners navigate the complexities of ongoing Kubernetes operations. With previous experience as a CD Unit Leader, Head of DevOps, CTO, and CEO, he has worn many hats as a consultant, instructor, and public speaker. He is passionate about leveraging his expertise to support the needs of the DevOps, Platform Engineering, and Kubernetes communities.