With the rise of LLMs and inference workloads, a new type of routing has gained traction. In this post, I will share some insights on how this works and why it’s needed. It will be a technical deep dive, so let’s get started!

Load balancing in LLMs

Load balancing always needs a set of targets, and a load balancing policy decides which target to choose for a given request. In traditional web services, these targets are web servers, which are often stateless themselves or backed by some shared state, so a simple round-robin or load-based routing policy works nicely.

LLMs present a unique set of challenges for load balancing. To understand why, we need to delve a bit deeper into how LLMs work (a full LLM deep dive could be a blog post of its own, and someday I might write one; here we will keep things brief and simple).

In LLM serving, each request is processed in two steps:

  1. Prefill: The prompt processing
  2. Decode: The output generation

Prefill produces a set of key and value (KV) tensors, which the inference engine usually caches: in a multi-turn LLM conversation, they let the engine skip the prefill stage for the shared prefix of each subsequent request. Inference engines are smart about this and have many optimizations, including offloading, sharing, and efficiently reusing these caches across concurrent conversations.

The goal of our load balancing policy, then, is to route each request to the engine instance with the highest expected KV cache hit rate. The KV cache depends only on the body of the request, i.e., the prompt, so our load balancing policy needs to inspect the request body to find the optimal target.
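
To make this concrete, here is a minimal sketch of that inspection step, assuming an OpenAI-style chat completions body; the prompt_text helper and the field layout are illustrative rather than any particular router's API.

use serde_json::Value;

// Pull the text that the KV cache depends on out of the request body.
// Assumes an OpenAI-style /v1/chat/completions payload (illustrative).
fn prompt_text(body: &[u8]) -> Option<String> {
    let json: Value = serde_json::from_slice(body).ok()?;
    let messages = json.get("messages")?.as_array()?;
    let parts: Vec<&str> = messages
        .iter()
        .filter_map(|m| m.get("content")?.as_str())
        .collect();
    Some(parts.join("\n"))
}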

Routing data structure - Radix tree

Now that we have a better understanding of the kind of processing our load balancing policy needs to do, it's time to see how it's done. As mentioned previously, the policy inspects the body of each request and routes it to the engine with the longest matching request prefix.

For this, we use a data structure called a radix tree (also known as a compact prefix tree), which lets us efficiently find the longest matching prefix for a given request. For every target behind our load balancer, we maintain such a radix tree and calculate a match_rate based on it:

// Find the longest prefix of this prompt already present in the
// target's radix tree, with character counts for both sides.
let result = tree.prefix_match_with_counts(prompt_text);
// match_rate is the fraction of the incoming prompt covered by the
// longest prefix this target has already seen.
let match_rate = if result.input_char_count == 0 {
    0.0
} else {
    result.matched_char_count as f32 / result.input_char_count as f32
};
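
For reference, here is one shape the matcher's interface could take. This is a naive stand-in that linearly scans previously inserted prompts rather than a real radix tree; only the prefix_match_with_counts signature and its two counters mirror the snippet above, and the internals are purely illustrative.

/// Counts returned by a prefix match, mirroring the fields used above.
struct MatchCounts {
    matched_char_count: usize,
    input_char_count: usize,
}

/// Naive stand-in for the per-target radix tree: it remembers prompts
/// seen by that target and reports the longest shared prefix. A real
/// radix tree gives the same answer without scanning every entry.
struct PrefixIndex {
    seen: Vec<String>,
}

impl PrefixIndex {
    fn insert(&mut self, prompt: &str) {
        self.seen.push(prompt.to_string());
    }

    fn prefix_match_with_counts(&self, input: &str) -> MatchCounts {
        let input_chars: Vec<char> = input.chars().collect();
        let matched_char_count = self
            .seen
            .iter()
            .map(|s| {
                s.chars()
                    .zip(input_chars.iter())
                    .take_while(|&(a, b)| a == *b)
                    .count()
            })
            .max()
            .unwrap_or(0);
        MatchCounts {
            matched_char_count,
            input_char_count: input_chars.len(),
        }
    }
}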

We also generally run a background thread that periodically evicts the least recently used leaf nodes from each radix tree to keep its memory footprint bounded.
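
A minimal sketch of such an eviction loop might look like the following; the tree type and the LRU pass are abstracted behind a closure, since that bookkeeping lives inside the tree itself.

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Spawn a background thread that periodically runs an eviction pass
/// over the shared tree. The `evict` closure stands in for the tree's
/// LRU leaf removal, which is not shown here.
fn spawn_evictor<T, F>(tree: Arc<Mutex<T>>, every: Duration, mut evict: F)
where
    T: Send + 'static,
    F: FnMut(&mut T) + Send + 'static,
{
    thread::spawn(move || loop {
        thread::sleep(every);
        // Hold the lock only for the duration of one eviction pass.
        evict(&mut tree.lock().unwrap());
    });
}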

Prefix Cache Aware Routing

Apart from routing requests to the targets with the highest prefix-cache hit, it is also important to distribute requests so that load stays balanced and every target is utilized. For this purpose, we also route requests to targets with no prefix-cache match (such as new targets) or a low one. We control this with a routing parameter called cache_threshold, which sets the minimum prefix match required to route to a target on cache affinity alone. If a target's match_rate is below this threshold, it is more optimal to choose the target with the maximum available KV cache capacity, properly distributing the load across all targets.
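
Sketching that decision with a hypothetical TargetStats snapshot (both fields are stand-ins for whatever signals the balancer actually tracks):

/// Hypothetical per-target snapshot used by the routing decision.
struct TargetStats {
    match_rate: f32,          // from the radix tree, as computed above
    free_kv_capacity: usize,  // available KV cache capacity (illustrative)
}

/// Prefer the best prefix match; below `cache_threshold`, fall back to
/// the target with the most free KV cache capacity so that new prompts
/// spread across the fleet.
fn pick_target(stats: &[TargetStats], cache_threshold: f32) -> Option<usize> {
    let (best, s) = stats
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.match_rate.total_cmp(&b.1.match_rate))?;
    if s.match_rate >= cache_threshold {
        Some(best)
    } else {
        stats
            .iter()
            .enumerate()
            .max_by_key(|&(_, s)| s.free_kv_capacity)
            .map(|(i, _)| i)
    }
}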

In addition to cache_threshold, we also have balance_threshold, which, regardless of prefix cache, first checks whether the load across targets is balanced. Imbalance is usually determined as the absolute or relative difference between the minimum and maximum load across all targets. If the system is deemed imbalanced, and prefix-cache routing would skew the load further, we route to the most underutilized target instead of following the prefix cache-based policy.

// The fleet is imbalanced when the spread between the busiest and
// least busy target exceeds the threshold.
let is_imbalanced = max_load.saturating_sub(min_load) > balance_threshold;

In short, prefix cache-aware routing is an amalgamation of two load balancing strategies, dynamically switching between them as load conditions change.

  1. Uses load balancing when the system is imbalanced.
  2. Uses cache-aware routing when the system is balanced.
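
Put together, the dispatch reduces to something like the following sketch, where the per-target loads and match rates are assumed inputs.

/// Combined policy: plain load balancing when the fleet is imbalanced,
/// prefix-cache-aware routing otherwise. Inputs are illustrative
/// per-target snapshots, indexed the same way.
fn route(loads: &[usize], match_rates: &[f32], balance_threshold: usize) -> Option<usize> {
    let min_load = *loads.iter().min()?;
    let max_load = *loads.iter().max()?;
    if max_load.saturating_sub(min_load) > balance_threshold {
        // 1. Imbalanced: send the request to the least loaded target.
        loads.iter().enumerate().min_by_key(|&(_, l)| *l).map(|(i, _)| i)
    } else {
        // 2. Balanced: follow the prefix-cache-aware choice.
        match_rates
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
    }
}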

Paving the way for Precise-Prefix Cache Aware Routing

You might have noticed that our current load balancing strategy doesn't take into account the actual state of the KV caches on the targets. Our policy evicts entries from its radix trees on an LRU basis, but the engines may behave differently: depending on its memory resources, an LLM engine might evict its caches much earlier, or retain them for much longer. We don't get this feedback from the engines themselves, and that is the main drawback of this policy. It is solved by a more effective load balancing strategy called Precise-Prefix Cache Aware Routing, but that would be a post for another day!