From Weights to Token - Qwen3 Implementation

Hugging Face is the GitHub of AI models. It uses git with LFS (and xet) to serve large model weights, but at the end of the day, you can’t use those weights directly. The code required to run them is typically implemented in popular Python packages like transformers (for LLMs) or diffusers (for diffusion models). In this blog post, I will explain how to take a model weights file and use it to generate your very first token using your own custom inference implementation. Let’s get started—this will be quite technical and will likely involve some math! While it is always helpful to have a background in AI and ML, you can still follow along without it. ...
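The full post builds the inference loop itself; as a taste of the starting point it describes, here is a minimal sketch (not taken from the post) that pulls a Qwen3 weights file from Hugging Face and lists the tensors inside it. The repo id and filename are assumptions for illustration.

```python
# Minimal sketch: download a Qwen3 safetensors file and inspect its tensors.
# "Qwen/Qwen3-0.6B" and "model.safetensors" are assumed values, not from the post.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

path = hf_hub_download(repo_id="Qwen/Qwen3-0.6B", filename="model.safetensors")

with safe_open(path, framework="pt") as weights:
    # Print each tensor name and its shape; these raw arrays are what a
    # custom inference implementation has to wire into actual computation.
    for name in weights.keys():
        print(name, weights.get_slice(name).get_shape())
```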

January 26, 2026 · 13 min · 2640 words · coder3101

Understanding Load balancing in LLMs

With the rise of LLMs and inference workloads, a new type of routing has gained traction. In this post, I will share some insights on how it works and why it's needed. It will be a technical deep dive, so let's get started!

Load balancing in LLMs

Load balancing always needs a set of targets, and a load-balancing policy decides which target to choose for a given request. In traditional web services, these targets are web servers, which are often stateless themselves or backed by some shared state, so a simple round-robin or load-based routing policy works nicely. ...
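To make the baseline concrete, here is a minimal sketch (not from the post) of the round-robin policy that works well for stateless web servers; the backend URLs are hypothetical.

```python
from itertools import cycle

# Hypothetical backends; a real deployment would discover targets dynamically.
targets = ["http://backend-1:8000", "http://backend-2:8000", "http://backend-3:8000"]

# Classic stateless policy: hand out targets in a fixed rotation,
# ignoring how loaded each backend currently is.
round_robin = cycle(targets)

def pick_target() -> str:
    """Return the next backend for an incoming request."""
    return next(round_robin)

for request_id in range(5):
    print(request_id, pick_target())
```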

January 14, 2026 · 4 min · 796 words · coder3101