Modern large language models (LLMs) are defined by their scale. GPT-3 introduced 175 billion parameters in 2020, and today, production-grade models routinely operate in the hundreds of billions, with some architectures exceeding one trillion. Each parameter represents a learned weight, collectively encoding the language, reasoning, and knowledge that make these systems capable.

This scale is not incidental. Empirical research has consistently demonstrated that model capability scales predictably with size—larger models exhibit stronger reasoning, broader factual recall, and greater generalization. For many production use cases, these properties are non-negotiable, but scale comes at a cost.

The inference problem

Training a large model is a one-time expenditure. Inference—generating responses from a trained model—is a recurring cost paid for with every request, at every hour, across every user interaction. At enterprise scale, inference is where the economics of AI are made or broken.                                                                    

In particular, LLMs impose steep operational demands across 3 dimensions:

  1. Hardware availability: Billions of parameters require high-end accelerators (often multiple GPUs) just to fit the weights in memory, locking organizations into expensive, specialized infrastructure.
  2. Latency: Generation is autoregressive and memory-bandwidth-bound, so response time grows with the number of tokens generated, introducing friction in any interactive application.
  3. Cost: Every request consumes accelerator time, so expenditure recurs and scales with traffic. Organizations serving millions of daily interactions routinely see inference bills dwarf their original training costs.

The result is a structural tension at the heart of modern AI deployment: the models that perform best are precisely the ones that are most expensive and slowest to run.

Solving the problem through speculative decoding

Speculative decoding is a technique designed to help alleviate this tension—delivering the output quality of a large model while substantially reducing latency and cost.

Rather than generating every token through the LLM alone, speculative decoding uses a smaller, faster draft model (the "speculator") to propose candidate tokens in advance. The LLM (the "verifier") then checks those candidates together in a single forward pass, which takes a fraction of the time it would need to generate them one at a time, effectively producing multiple tokens for the cost of one. Critically, this verification is not an approximation: any candidate the verifier rejects is discarded and replaced with the verifier's own token. The speedup is therefore near lossless, and the output essentially matches what the target model would have produced on its own, with minimal degradation in quality and little change in behavior. Published research has demonstrated end-to-end speedups of 2 to 3 times in real-world settings.
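The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the Speculators or vLLM implementation: both models are hypothetical toy functions that return a greedy next token, acceptance uses exact greedy matching rather than the probabilistic acceptance rule used with sampled decoding, and the verifier is called in a loop where a real server would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (toy stand-in for a model)


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The speculator proposes k candidate tokens, one after another.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. The verifier checks each candidate in order. A real server does this
    #    in a single batched forward pass; the loop here is for readability.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the verifier's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every candidate matched, so the verifier pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def speculative_generate(prefix: List[Token], draft_next: NextToken,
                         target_next: NextToken, k: int,
                         max_new_tokens: int) -> Tuple[List[Token], int]:
    """Generate max_new_tokens tokens; also return the number of verifier rounds."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[: len(prefix) + max_new_tokens], rounds


# Hypothetical toy models: the target counts upward modulo 10, and the draft
# agrees with it except right after the token 4.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_generate(prefix: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the target model alone, one token per call."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


tokens, rounds = speculative_generate([0], draft_next, target_next, k=3, max_new_tokens=8)
print(tokens, rounds)
```

In this toy setup the speculated output is identical to the baseline, while the verifier runs for fewer rounds than the number of tokens produced. The one draft mistake costs only the rejected candidates, not correctness.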

Introducing Speculators

This is precisely the problem that Speculators was built to solve.

Speculators is an open source library that brings speculative decoding from research into production. It provides a unified framework for training speculative decoding algorithms, designed to integrate seamlessly with inference infrastructure including vLLM. Rather than requiring organizations to implement and validate these techniques from scratch, Speculators packages the hard work—algorithm definitions, draft model tooling, and the serialization formats needed to move from experimentation to deployment with confidence.

The result is a clear path for any team running LLMs at scale to meaningfully reduce inference latency and cost—without compromising on model quality.

Why this matters

For organizations deploying frontier models at scale, speculative decoding offers a path to lower latency and cost without the risk of model regression—a trade-off that would otherwise require selecting a smaller, less capable model. Applications that depend on real-time responsiveness become significantly more viable when generation time is materially reduced.

More broadly, speculative decoding represents a maturing in how the field approaches inference. The early era of LLM deployment was defined by making models capable. The current era is defined by making capable models practical. Speculators is built for exactly that challenge.

 

This demo runs 2 vLLM-hosted endpoints side by side: a baseline Qwen3 8B model and a Qwen3 8B model accelerated with a DFlash speculator. By having a lightweight speculator model propose candidate tokens that the verifier accepts or rejects, speculative decoding reduces inter-token latency and increases throughput without altering the underlying generation distribution. As shown across the provided prompts, the speculated model maintains output quality and accuracy identical to the baseline, demonstrating that speculative decoding is a production-ready strategy for serving LLMs more efficiently.
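For readers who want to make a similar side-by-side comparison, the latency metrics can be derived from the arrival times of streamed tokens. The helper below is a small sketch, not part of the demo: the function name and the input format (one timestamp in seconds per token, plus the request start time) are assumptions, and collecting the timestamps from a streaming client is left out.

```python
from statistics import mean
from typing import Dict, List


def latency_summary(token_times: List[float], request_start: float) -> Dict[str, float]:
    """Summarize the streamed token arrival times of a single request."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Gaps between consecutive tokens. With speculative decoding, tokens
    # accepted in the same round arrive together, so some gaps are near zero.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "time_to_first_token": token_times[0] - request_start,
        "mean_itl": mean(gaps) if gaps else 0.0,
        "tokens_per_second": len(token_times) / total,
    }


# Made-up timestamps for illustration only, not measurements.
baseline = latency_summary([0.5, 0.6, 0.7, 0.8, 0.9], request_start=0.0)
speculated = latency_summary([0.5, 0.5, 0.5, 0.62, 0.62], request_start=0.0)
print(baseline["mean_itl"], speculated["mean_itl"])
```

Because accepted tokens arrive in bursts, the mean gap is a more useful summary for a speculated endpoint than any single gap.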

Performance

Speculative decoding is extremely effective in certain situations. Its effectiveness is determined by 2 primary factors: the cost incurred to draft and verify the candidate tokens, and the likelihood that the verifier accepts those tokens, which is what ultimately saves time. Drafting is worthwhile when the tokens are likely to be accepted, because we get multiple tokens for the cost of running the verifier once. Conversely, if the drafted tokens are unlikely to be accepted, the drafting cost is not justified and can actually slow generation down.
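This trade-off can be made concrete with a simplified cost model of the kind commonly used in the speculative decoding literature. It is a back-of-the-envelope sketch, not a benchmark: it assumes each drafted token is accepted independently with the same probability, and that verifying all candidates costs the same as one ordinary verifier step. The names `alpha`, `k`, and `draft_cost_ratio` are illustrative and are not parameters of Speculators or vLLM.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per verifier pass.

    alpha: probability that a single drafted token is accepted (assumed i.i.d.)
    k:     number of tokens drafted per round
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the verifier's own token
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost_ratio: cost of one draft step relative to one verifier step.
    Assumes one verifier pass over k candidates costs the same as one
    ordinary decoding step, which is most plausible when decoding is
    memory-bound.
    """
    round_cost = k * draft_cost_ratio + 1.0  # k draft steps + 1 verifier pass
    return expected_tokens_per_round(alpha, k) / round_cost


# High acceptance, cheap drafts: worthwhile (roughly 2.8x under this model).
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost_ratio=0.05), 2))
# Low acceptance, expensive drafts: slower than not speculating (below 1x).
print(round(estimated_speedup(alpha=0.2, k=4, draft_cost_ratio=0.3), 2))
```

The second case shows why a poorly matched speculator can hurt: the drafting cost is paid every round, but few of the drafted tokens survive verification.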

Token acceptance is influenced by multiple factors. One is how well the task matches the data the speculator was trained on, so matching chat templates and training data distribution helps. For example, speculators trained almost exclusively on English text are likely to perform poorly on other languages unless fine-tuned. Our general-purpose models can be used for many tasks, and Speculators allows you to fine-tune them to improve acceptance rates or to create your own models from scratch. Context length is another factor, and work is underway to improve performance at longer context lengths. Additionally, some tasks are inherently easier to speculate for than others: code is highly predictable, while creative writing is less so. To quantify the potential benefit for a specific use case, examine the acceptance rate at each draft token position, as a higher rate translates to a larger expected speedup.
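As a sketch of how such acceptance statistics might be computed from your own measurements, the helper below takes the number of drafted tokens accepted in each verification round. The input format and function name are hypothetical. The mean acceptance length here counts the one token the verifier always contributes per round, which is an assumption about how the metric is defined rather than something stated in this article.

```python
from typing import List, Tuple


def acceptance_stats(accepted_per_round: List[int], k: int) -> Tuple[List[float], float]:
    """Return (per-position acceptance rates, mean acceptance length).

    accepted_per_round: drafted tokens accepted in each round, each in 0..k
    k:                  number of tokens drafted per round
    """
    if not accepted_per_round:
        raise ValueError("need at least one round")
    if any(n < 0 or n > k for n in accepted_per_round):
        raise ValueError("accepted counts must be between 0 and k")

    rounds = len(accepted_per_round)
    # Draft position i (0-based) was accepted in a round if more than i
    # tokens were accepted, because acceptance stops at the first rejection.
    per_position = [
        sum(1 for n in accepted_per_round if n > i) / rounds for i in range(k)
    ]
    # Assumed definition: accepted drafts plus the verifier's own token.
    mean_length = 1.0 + sum(accepted_per_round) / rounds
    return per_position, mean_length


rates, mean_length = acceptance_stats([3, 0, 2, 3], k=3)
print(rates, mean_length)  # [0.75, 0.75, 0.5] 3.0
```

Per-position rates that fall off quickly suggest that drafting fewer tokens per round may be the better choice for that workload.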

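The relative cost to generate tokens depends on whether the generation bottleneck is compute or memory movement, which is affected by the context length and the number of requests running through the server. At low request rates, decoding is typically memory-bound, so verifying several drafted tokens at once costs little more than generating one. At high request rates, the server becomes compute-bound, and work spent on rejected tokens competes with other requests. If a server is handling many requests and the acceptance length is low, speculators can therefore slow down the server. However, a high acceptance rate combined with a moderate request rate can deliver large speedups, as shown below.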

Performance case study: Gemma 4 

[Figure 1 panel annotations: mean DFlash acceptance length 4.91 (first panel) and 2.53 (second panel).]

Figure 1: Performance results for 2 speculative models (DFlash and Eagle 3) trained using Speculators to speed up Gemma 4 31B. In the first case, we look at the HumanEval coding dataset. Our DFlash speculator provides a 4x reduction in inter-token latency (ITL) fairly consistently across different request rates, which is expected because the acceptance length is quite long. For summarization tasks, however, the speedup depends on the request rate: because the acceptance length is lower, the speedup from speculation is much smaller at higher request rates. With the full-sized large model as the verifier, the speedup comes at no cost in quality, because the output is guaranteed to be the same as that of the large model by itself. Our speculator models can also be used with quantized models created with llm-compressor as the verifier, for large, nearly lossless speedups.

How do I use Speculators? 

Speculators is generally available (GA) across the Red Hat AI platform as of the 3.4 release. You can deploy it in production through Red Hat AI Inference, or work with it interactively through the Red Hat OpenShift AI workbench image, giving teams a path from experimentation to production without leaving the platform.

For organizations that want acceleration immediately, Red Hat publishes a growing collection of pre-trained speculator models on Hugging Face under the RedHatAI organization. These cover the model families most commonly deployed in production today, including Llama 3.1 and 3.3, the full Qwen3 family (dense, MoE, and vision-language variants up to 235B parameters), gpt-oss at both 20B and 120B, and Gemma 4 31B and 26B. All are trained end-to-end with the Eagle 3 algorithm and ready to serve with a single vLLM command.

For teams whose workloads sit outside the set of pre-trained models, or whose domain-specific data warrants a custom draft model, Speculators provides the full end-to-end training pipeline as an open source library under Apache 2.0. This includes offline data generation, draft model training for both dense and mixture-of-experts (MoE) architectures, and serialization into a Hugging Face-compatible format that deploys directly into vLLM. Native integration of these training and fine-tuning flows into Red Hat OpenShift AI is on the roadmap for later this year, bringing the same workflow into a managed platform experience.

Get started

The fastest path to measurable latency improvements is to try the pre-trained speculators on the Red Hat AI platform today. The speculative decoding documentation walks through deployment, and the Speculators GitHub repository is the home for community discussion, issue tracking, and contributions to the library itself.

For organizations interested in a proof-of-concept engagement around training and serving custom speculators—whether to target an in-house base model, optimize for a specific domain, or quantify potential GPU savings against current inference costs—Red Hat offers a structured services engagement that takes teams from initial benchmarking through production deployment. The models that perform best no longer have to be the ones that are slowest to run.

Reach out to your Red Hat account team to scope an engagement and learn more about Speculators. 


About the authors

My name is Rob Greenberg, Principal Product Manager for Red Hat AI, and I came over to Red Hat with the Neural Magic acquisition in January 2025. Prior to joining Red Hat, I spent 3 years at Neural Magic building and delivering tools that accelerate AI inference with optimized, open-source models. I've also had stints as a Digital Product Manager at Rocketbook and as a Technology Consultant at Accenture.

I am a scientist specialized in the development of numerical models and computational algorithms. My expertise is bridging the gap between academic research and industry innovation.

I am currently the Manager of Machine Learning Research at Red Hat, joining Red Hat as part of the acquisition of Neural Magic. I am proud to lead a world-class team of researchers and engineers that focuses on algorithms for AI inference optimization. We specialize in sparsity, quantization, knowledge distillation and speculative decoding. We work hand-in-hand with the vLLM developers to deliver optimizations that lead to real-world performance gains.

Dipika is a Principal Software Engineer at Red Hat, working on LLM Compressor, compressed-tensors, and their integration into vLLM.
