University of Maryland team embeds 3x LLM inference speed into model weights
Key result
A research group led by the University of Maryland published a training and decoding recipe that bakes multi-token emission ability directly into existing language model weights, producing roughly a 3x speedup on inference in experiments while keeping accuracy losses small.
Mechanics at a glance
The method pairs a student that emits token blocks in parallel with a strong next-token teacher that scores those blocks, effectively turning generation into an on-policy, self-distillation loop rather than static supervised regression.
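To make the loop concrete, here is a toy sketch of the on-policy distillation signal (our illustration, not the paper's exact objective): the student emits a token block, a next-token teacher scores each position, and the negative teacher log-likelihood of the student's own emission becomes the training loss. The teacher is modeled as a tiny hand-built lookup table; all names are assumptions.

```python
import math

def teacher_logprob(context, token, table):
    """Next-token teacher: log-probability of `token` given `context`."""
    dist = table[tuple(context)]
    return math.log(dist[token])

def block_distill_loss(prompt, student_block, table):
    """Mean teacher NLL over a student-proposed block (on-policy:
    the teacher conditions on the student's own emissions so far)."""
    loss, context = 0.0, list(prompt)
    for tok in student_block:
        loss -= teacher_logprob(context, tok, table)
        context.append(tok)
    return loss / len(student_block)

# Tiny hand-built teacher distribution over a toy vocabulary.
table = {
    ("a",):     {"b": 0.9, "c": 0.1},
    ("a", "b"): {"b": 0.2, "c": 0.8},
    ("a", "c"): {"b": 0.5, "c": 0.5},
}
good = block_distill_loss(["a"], ["b", "c"], table)  # teacher-favored block
bad  = block_distill_loss(["a"], ["c", "b"], table)  # teacher-disfavored block
print(good < bad)  # → True: the loss pushes the student toward teacher-approved blocks
```

Because the blocks being scored are the student's own outputs rather than fixed reference continuations, the gradient signal stays on-policy, which is the key difference from static supervised regression.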
Decoding control
An adaptive decoder, branded ConfAdapt, keeps only high-confidence subsequences (the paper cites roughly 90% as an example threshold), emitting large multi-token blocks where entropy is low and reverting to single-token passes on uncertain spans.
Practical testbed and results
Applied to instruction-tuned open models, the approach gave Llama-3.1-8B a ~3x speedup with under a 3% accuracy drop on math benchmarks, and Qwen3-4B the same throughput gain with roughly a 7% drop; more aggressive settings approached 5x at greater quality cost.
Minimal integration friction
Engineers can adapt production models by repurposing one unused embedding slot as an MTP mask token, requiring only one-time changes to batching and KV cache handling in serving stacks rather than new auxiliary drafting models or complex inference pipelines.
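A minimal sketch of that serving-side change, assuming `MASK_ID` is the repurposed unused embedding slot (the id and helper names are hypothetical; a real stack would also adjust attention masking and KV-cache handling for the block):

```python
MASK_ID = 128255  # hypothetical unused slot in the model's vocabulary

def pack_mtp_request(prompt_ids, block_size):
    """Append `block_size` mask tokens so a single forward pass can
    fill all of them in parallel."""
    return list(prompt_ids) + [MASK_ID] * block_size

def unpack_mtp_response(packed_len, block_size, predicted_ids):
    """Read the model's predictions at the mask positions."""
    start = packed_len - block_size
    return predicted_ids[start:packed_len]

ids = pack_mtp_request([101, 2023, 2003], block_size=4)
print(ids)  # → [101, 2023, 2003, 128255, 128255, 128255, 128255]
```

The point of the design is that everything else in the serving stack is unchanged: no auxiliary draft model, no second network, just a one-time adjustment to how requests are packed and how the cache treats the mask positions.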
Domain sensitivity and transfer
Although speed benefits transferred to out-of-training domains like summarization and creative writing, the authors recommend MTP fine-tuning on deployment-specific prompts to regain lost accuracy for specialized industrial tasks.
Why this matters now
As agentic workflows and ultra-long reasoning traces make latency a first-order cost, converting some inference complexity into model parameters offers a complementary path to existing inference hacks and speculative decoders.
Operational caveats
Teams should expect a trade-off surface: faster throughput on easier subsequences, one-time engineering around KV-cache and batch handling, and domain-adaptation work to avoid degenerate repetition or grammatical mismatches on low-confidence stretches.
Availability
The group published models on Hugging Face and will open-source the MTP framework code, lowering the barrier to experimentation inside vLLM-style serving stacks.