What makes Inference inevitable
Serving large language models is fundamentally different from running traditional software. The core challenge is autoregressive generation — models produce output one token at a time, with no way to know in advance how long the response will be. This makes memory and compute costs highly variable and difficult to predict, particularly because the KV (key-value) cache — which stores intermediate attention state — grows continuously throughout a response.
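To make that cost profile concrete, here is a minimal decoding-loop sketch. The `model.prefill` / `model.decode_step` interface is a placeholder, not any particular library's API; the point is simply that the output length is only discovered at the end, while the KV cache grows one entry per generated token.

```python
# Toy sketch of autoregressive decoding (hypothetical `model` interface).
# Output length is unknown up front, and the KV cache grows every step,
# so memory cost is open-ended until the model happens to emit EOS.

def generate(model, prompt_ids, eos_id, max_new_tokens=512):
    # Prefill: process the whole prompt at once and build the initial KV cache.
    logits, kv_cache = model.prefill(prompt_ids)
    output_ids = []
    next_id = int(logits[-1].argmax())          # greedy pick, for simplicity

    for _ in range(max_new_tokens):
        output_ids.append(next_id)
        if next_id == eos_id:                   # we only learn the length here
            break
        # Decode step: one token in, one token out; the cache grows each step.
        logits, kv_cache = model.decode_step(next_id, kv_cache)
        next_id = int(logits.argmax())

    return output_ids
```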
Practitioners measure inference quality through three key metrics: TTFT (Time to First Token), TBT (Time Between Tokens), and Goodput (the proportion of requests that meet service-level objectives). Meeting these constraints requires a combination of specialized scheduling, intelligent batching, kernel-level optimization, and careful memory management.
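As a rough illustration, all three metrics can be derived from per-request timestamps. The field names and SLO thresholds below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestTrace:
    arrival: float            # request received (seconds)
    first_token: float        # first output token emitted
    token_times: List[float]  # timestamp of every emitted token

def ttft(r: RequestTrace) -> float:
    return r.first_token - r.arrival

def mean_tbt(r: RequestTrace) -> float:
    gaps = [b - a for a, b in zip(r.token_times, r.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def goodput(traces: List[RequestTrace], ttft_slo=0.5, tbt_slo=0.05) -> float:
    # Fraction of requests that meet both latency SLOs.
    ok = [r for r in traces if ttft(r) <= ttft_slo and mean_tbt(r) <= tbt_slo]
    return len(ok) / len(traces) if traces else 0.0
```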
Key Components of LLM Inference
Request Processing covers the mechanics of token generation. This includes attention mechanisms (multi-head, grouped/multi-query, sparse, and shared variants), feed-forward networks and Mixture-of-Experts (MoE) layers, and token sampling strategies like top-k, nucleus sampling, and temperature scaling. More advanced techniques — speculative decoding, beam search, chain-of-thought, and self-consistency — can improve generation speed or output quality, but each has downstream implications for system design.
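For instance, the common sampling strategies compose as a short pipeline over the logits. The sketch below (plain NumPy, with assumed default values) applies temperature scaling, then top-k, then nucleus filtering:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Temperature scaling, then top-k, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)

    # Top-k: keep only the k highest-scoring tokens.
    if top_k and top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    # Softmax over the surviving tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Nucleus: keep the smallest set of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p) + 1)]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()

    return int(rng.choice(probs.size, p=mask))
```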
Model Execution Optimization
Specialized Kernels are critical for maximizing GPU utilization. Techniques like FlashAttention and FlashDecoding use blockwise fused kernels to reduce memory bandwidth overhead. Kernel fusion minimizes memory reads and writes, tiled matrix multiplication keeps GPU streaming multiprocessors busy, and Ring Attention distributes attention computation across GPUs to support very long contexts.
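The core idea behind these blockwise kernels is the online softmax: attention can be computed one key/value block at a time while carrying a running max, normalizer, and weighted sum, so the full score matrix is never materialized. The sketch below shows the idea for a single query in NumPy; real kernels do the same per tile in on-chip SRAM, not in Python loops:

```python
import numpy as np

def blockwise_attention(q, K, V, block_size=128):
    """Attention for one query vector, streaming over K/V in blocks."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    running_max = -np.inf
    running_sum = 0.0
    acc = np.zeros(V.shape[-1])

    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = (k_blk @ q) * scale                 # (block,)
        new_max = max(running_max, scores.max())
        # Rescale previously accumulated state to the new running max.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum()
        acc = acc * correction + p @ v_blk
        running_max = new_max

    return acc / running_sum
```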
Batching strategies vary widely in complexity and efficiency. Static batching is simple but can create bottlenecks when requests finish at different times. Continuous (dynamic) batching re-evaluates batch composition at every decode step, dramatically improving throughput. Chunked prefills allow long prompts to be processed in segments without blocking the rest of the system.
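A toy event loop makes the contrast with static batching clear. Here `step_batch` and the request objects are placeholders for a real engine's fused decode step and request state:

```python
from collections import deque

def serve(waiting: deque, step_batch, max_batch_size=8):
    """Continuous (iteration-level) batching: re-form the batch every step."""
    active = []
    while waiting or active:
        # Admit new requests into any free slots before the next step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # One decode iteration for every active request (a single fused
        # kernel launch in a real engine).
        step_batch(active)

        # Retire requests that produced EOS or hit their length limit,
        # freeing their slots immediately instead of waiting for the batch.
        active = [r for r in active if not r.finished]
```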
Scheduling and Load Balancing are complicated by the fact that request lengths are inherently unpredictable. Classic algorithms like FCFS (first-come, first-served) and SJF (shortest job first) have been adapted for LLMs, alongside multi-level queue strategies, cache-aware routing, dynamic replica rebalancing, and even learned models (MLP- or LLM-based) that predict request length ahead of time.
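As one example, a length-predicted shortest-job-first queue might look like the sketch below; `predict_output_len` stands in for a learned estimator, and a real scheduler would also add anti-starvation measures such as aging:

```python
import heapq

class PredictiveSJFQueue:
    """Admit the request predicted to finish soonest first."""

    def __init__(self, predict_output_len):
        self.predict = predict_output_len
        self.heap = []
        self.counter = 0  # tie-breaker so heapq never compares request objects

    def push(self, request):
        est = self.predict(request.prompt)
        heapq.heappush(self.heap, (est, self.counter, request))
        self.counter += 1

    def pop_shortest(self):
        return heapq.heappop(self.heap)[2] if self.heap else None
```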
Memory Management
Memory is the dominant bottleneck in LLM serving. Several innovations address this:
Paged Attention (pioneered by vLLM) allocates KV cache in small fixed-size blocks rather than large contiguous reservations, enabling cache sharing, on-demand offloading, and persistence across requests.
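Conceptually, the bookkeeping reduces to a per-request block table plus a shared free pool, as in this illustrative sketch (the block size and data structures are assumptions, not vLLM's internals):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache exhausted; evict or preempt")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)

class RequestCache:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical position -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grab a new block only when the current one fills up,
        # so nothing is reserved for output that may never be generated.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```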
Eviction and Offloading strategies are used when context windows grow very long or when requests are preempted. Cache blocks are evicted based on token position, attention scores, or accumulated importance, and can be offloaded to CPU or SSD with asynchronous recovery to hide latency.
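A minimal importance-based eviction pass might look like the following; the accumulated-score representation and the synchronous offload call are simplifications of what a real system does asynchronously:

```python
def evict_blocks(blocks, num_needed, offload_to_cpu):
    """`blocks` is a list of (block_id, accumulated_attention_score) pairs."""
    by_importance = sorted(blocks, key=lambda b: b[1])   # least important first
    victims = by_importance[:num_needed]
    for block_id, _ in victims:
        offload_to_cpu(block_id)   # asynchronous copy in practice, to hide latency
    return [b for b in blocks if b not in victims]
```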
Quantization reduces the memory footprint of model weights and activations. Approaches range from tensor-wise and vector-wise to dimension-wise quantization, with mixed precision and smoothing techniques to protect numerical outliers. The result is the ability to run larger models on hardware that would otherwise fall short.
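The simplest variant, symmetric per-tensor int8 quantization of the weights, fits in a few lines; production schemes layer per-channel or per-group scales, outlier smoothing, and mixed precision on top of this idea:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # One scale for the whole tensor, mapping the largest magnitude to 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)              # ~4x smaller in memory than float32
w_hat = dequantize_int8(q, s)        # approximate reconstruction for compute
```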
Cache Persistence is particularly valuable for shared system prompts, RAG document chunks, and multi-turn agent pipelines. Prefix matching and selective KV reconstruction allow previously computed state to be reused, reducing redundant computation.
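A sketch of the prefix-matching step, assuming a simple in-memory index of cached token sequences (the index structure and names are illustrative):

```python
def longest_shared_prefix(new_tokens, cached_tokens):
    n = 0
    for a, b in zip(new_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(new_tokens, cache_index):
    """`cache_index` maps a cache entry id to its cached token ids."""
    best_id, best_len = None, 0
    for entry_id, cached in cache_index.items():
        shared = longest_shared_prefix(new_tokens, cached)
        if shared > best_len:
            best_id, best_len = entry_id, shared
    # Reuse the KV state for the first `best_len` tokens; prefill only the rest.
    return best_id, new_tokens[best_len:]
```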
LLM Inference Systems
Systems can be organized into three tiers:
Single-replica systems — such as vLLM, FasterTransformer, and LightSeq — are optimized for deployment on a single GPU or machine. Their strengths lie in paged attention, fused kernels, and efficient batching.
Multi-replica and distributed systems — including SGLang, Mooncake, DeepFlow, Orca, and Sarathi-Serve — are designed for large-scale serving. Key capabilities include disaggregated prefill and decode workers, load-aware and cache-aware routing, and serverless execution models.
Frontends like LMQL, DSPy, SGLang, and LangChain provide structured programming interfaces for building LLM-powered applications. They support structured outputs (e.g., JSON), templated prompt completion, prompt optimization, and multi-step LLM workflows.