Understanding CUDA Kernels

At work, we focus on optimizing LLM serving, and one topic that comes up repeatedly is kernel optimization. I want to share what a kernel actually is and where it fits in the stack because, believe it or not, every modern LLM and diffusion model is ultimately powered by kernels running on a GPU. I have some familiarity with CUDA (Compute Unified Device Architecture) and happen to have an NVIDIA Blackwell GPU in my workstation, so in this post I will explain what a kernel is and walk through writing one from scratch in CUDA. ...

April 5, 2026 · 13 min · 2606 words · coder3101