Inference is Underrated

Scan the AI headlines and you’d be forgiven for thinking the only thing that matters is the next training run. $5 billion clusters. Millions of GPU-hours. A new benchmark score that edges past the old one by a fraction of a percent.

For the engineers, architects, and operators actually *building* with AI, this narrative feels increasingly disconnected from reality. Training is a spectacle. But inference is the substance.

The industry’s obsession with training overlooks a fundamental truth: “training happens once. Inference happens forever.” Every user interaction, every API call, every automated task—it’s all inference. And as the recent spending spree on frontier model tokens proves, this is where the real economic action is.

The Market Has Spoken: It’s About Tokens, Not Titans

While the research labs compete for the title of “smartest model,” the market has already voted with its wallet. And it’s voting for tokens.

Recent data on OpenAI’s largest customers reveals a fascinating picture. A leaked list from OpenAI’s 2025 DevDay showcased more than 30 companies each consuming over **one trillion tokens annually**. At GPT-5-level pricing, that translates to over **$56 million per customer, per year**. These aren’t companies running weekend experiments. They’ve built their entire business model on inference.

Who’s buying all these tokens? A cross-section of the AI economy that tells you everything about where value is being created:

  • AI-Native Applications: Perplexity is reinventing search through complex inference chains. Duolingo personalizes lessons and generates interactive content for millions of daily users.
  • Developer Platforms: Cursor, the AI code editor that hit a $1B+ ARR run rate in record time, is a pure inference play. Its growth wasn’t fueled by a sales team, but by a magical product experience—powered entirely by tokens.
  • Vertical AI Agents: Harvey (legal) and Abridge (healthcare) are embedding inference directly into high-stakes professional workflows. A single legal brief or medical transcript can consume tens of thousands of tokens.
  • Established SaaS: Salesforce, Shopify, and Canva have integrated AI so deeply that their collective customer bases generate astronomical token volumes, making inference a core part of their product.

The Operational Challenge No One Talks About

Here’s the problem: while everyone was focused on model intelligence, we collectively under-invested in inference operability.

Running inference at scale is fundamentally different from running training jobs. Training is a batch process with clear start and end points. Inference is a continuous, stateful, latency-sensitive production workload. And the tools and practices are still catching up.

Consider what happens when you scale to billions of tokens:

Latency becomes a product feature. A 100ms difference in time-to-first-token can be the difference between a user feeling like they’re talking to an AI and feeling like they’re waiting on a database query. And as you scale, maintaining that latency requires increasingly sophisticated batching strategies, model-parallelism techniques, and hardware placement decisions.
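
To make that concrete, here’s a minimal sketch that measures time-to-first-token against any OpenAI-compatible streaming endpoint. The base_url, api_key, and model name are placeholders for whatever you’re running, not a specific deployment:

```python
import time

from openai import OpenAI  # pip install openai

# Works against any OpenAI-compatible endpoint (vLLM, TGI, etc.).
# base_url, api_key, and the model name below are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def time_to_first_token(prompt: str, model: str = "my-model") -> float:
    """Seconds from request start until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying actual content marks first-token arrival.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # stream ended without producing content

print(f"TTFT: {time_to_first_token('Explain InferenceOps in one sentence.'):.3f}s")
```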

Cost-per-token becomes a competitive moat. The difference between a well-optimized inference stack and a naive implementation isn’t 10% or 20%—it’s often an order of magnitude. Companies like Together AI and Fireworks AI have built their entire business on this gap, offering faster and cheaper inference by obsessing over continuous batching, kernel fusion, and quantization.
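
The gap is easy to see with back-of-the-envelope math. Every figure below is an illustrative assumption rather than a benchmark, but it shows how throughput differences translate directly into cost-per-token:

```python
# Back-of-the-envelope cost-per-token math. Every number here is an
# illustrative assumption, not a measured benchmark.
GPU_COST_PER_HOUR = 2.50  # assumed hourly rate for a single GPU

def dollars_per_million_tokens(tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

# A naive request-at-a-time loop vs. a stack with continuous batching
# and quantization can differ by 10-50x in sustained throughput.
naive = dollars_per_million_tokens(40)
optimized = dollars_per_million_tokens(1200)
print(f"naive:     ${naive:.2f} per 1M tokens")     # ~$17.36
print(f"optimized: ${optimized:.2f} per 1M tokens") # ~$0.58
print(f"gap:       {naive / optimized:.0f}x")       # ~30x
```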

Reliability becomes non-negotiable. When inference powers your core product, downtime isn’t just an operations issue—it’s a revenue issue. This demands real-time observability, automated failover, and sophisticated canary deployment strategies that don’t exist in the training world.
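
The routing logic a canary strategy implies isn’t exotic. Here’s a toy sketch of a weighted canary with health-based failover; the endpoint URLs and the 95/5 split are invented for illustration, and in production this lives in a gateway or service mesh rather than application code:

```python
import random

# Two model endpoints: a stable deployment taking most traffic and a
# canary taking a small slice. URLs and weights are made up.
ENDPOINTS = [
    {"url": "http://stable.inference.internal/v1", "weight": 0.95, "healthy": True},
    {"url": "http://canary.inference.internal/v1", "weight": 0.05, "healthy": True},
]

def pick_endpoint() -> str:
    """Weighted random pick among healthy endpoints; unhealthy ones get no traffic."""
    healthy = [e for e in ENDPOINTS if e["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy inference endpoints")
    r = random.uniform(0, sum(e["weight"] for e in healthy))
    for endpoint in healthy:
        r -= endpoint["weight"]
        if r <= 0:
            return endpoint["url"]
    return healthy[-1]["url"]

# Mark the canary unhealthy (e.g., after a failed health check) and all
# traffic automatically falls back to the stable endpoint.
ENDPOINTS[1]["healthy"] = False
assert pick_endpoint().startswith("http://stable")
```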

The New Discipline: InferenceOps

This is why InferenceOps needs to emerge as its own discipline. We can no longer treat inference as a black box that just “runs.” Every token has a direct financial signature tied to hardware utilization, model architecture, and serving stack efficiency.

The community is already moving in this direction. Projects like vLLM and TensorRT-LLM are pushing the boundaries of serving efficiency. The rise of KServe and the new “llm-d” initiative shows the intense focus on standardizing inference deployment. And research into architectures like BitNet promises to fundamentally change the cost equation by operating at 1-2 bits instead of 16 or 32.
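
As a taste of how approachable these tools have become, here’s vLLM’s offline generation API in a few lines. The model name is just an example; any compatible Hugging Face model works:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# The model name is an example; swap in any supported causal LM.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules these prompts with continuous batching under the hood,
# exactly the kind of optimization that moves cost-per-token.
outputs = llm.generate(
    ["What is InferenceOps?", "Why does time-to-first-token matter?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```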

As NVIDIA’s Dion Harris put it, a modern AI data center isn’t a cost center—it’s an **“AI factory” that produces intelligence in the form of tokens**. A well-architected inference stack can turn a $5 million hardware investment into $75 million in token revenue by maximizing asset productivity. The goal isn’t just to buy cheaper GPUs, but to drive down the cost-per-token to unlock entirely new applications.
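
A rough sanity check of that math is instructive. Only the $5 million and $75 million figures come from the example above; the blended token price, hardware lifetime, and utilization below are assumptions supplied purely for illustration:

```python
# Rough sanity check of the "AI factory" math. Only hardware_cost and
# target_revenue come from the example above; everything else is assumed.
hardware_cost = 5_000_000      # dollars
target_revenue = 75_000_000    # dollars
price_per_m_tokens = 2.00      # assumed blended price per million tokens
lifetime_years = 4             # assumed useful life of the fleet
utilization = 0.70             # assumed fraction of time spent serving

tokens_to_sell = target_revenue / price_per_m_tokens * 1_000_000
serving_seconds = lifetime_years * 365 * 24 * 3600 * utilization
print(f"return multiple:  {target_revenue / hardware_cost:.0f}x")
print(f"tokens to sell:   {tokens_to_sell:.2e}")  # 3.75e13
print(f"fleet throughput: {tokens_to_sell / serving_seconds:,.0f} tokens/sec")
```

Whatever the exact assumptions, sustaining hundreds of thousands of tokens per second for years is an operations problem, not a training problem.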

The Real AI Economy Has Begun

The training arms race will continue. There will always be another model that’s slightly smarter than the last. But for the companies actually delivering value to users, inference is where the work—and the revenue—happens.

The next decade of AI won’t be defined by who built the smartest model. It will be defined by who can run it most reliably, most efficiently, and at the lowest cost-per-token. The ones who treat inference as a first-class operational discipline, not an afterthought.

The spectacle is over. The real work has begun.

Author

Huzaifa Sidhpurwala

I work in Red Hat's Product Security AI team, mainly doing research in the field of AI security, safety and trustworthiness. I have studied AI security from Stanford and am a certified Trusted AI Safety Expert from Cloud…