Understanding CUDA Kernels

Sun, 05 Apr 2026 00:00:00 +0500

At work, we focus on optimizing LLM serving, and one topic that comes up repeatedly is kernel optimization. I want to share some insights into what a kernel actually is and where it fits in the stack, because, believe it or not, virtually every modern LLM and diffusion model is ultimately powered by kernels running on a GPU or a similar accelerator.

I have some familiarity with CUDA (Compute Unified Device Architecture), and I happen to have an NVIDIA Blackwell GPU in my workstation, so in this post I will explain what a kernel is and walk through writing one from scratch in CUDA.
