
Nvidia’s Dynamic Memory Sparsification slashes LLM reasoning memory costs by up to 8x
Recommended for you
MIT’s Attention Matching Compresses KV Cache 50×
Attention Matching compresses KV working-memory by about 50× using fast algebraic fits that preserve attention behavior, running in seconds rather than hours. Complementary approaches—Nvidia's Dynamic Memory Sparsification (up to ~8× via a lightweight retrofit) and observational-memory patterns at the orchestration layer—offer different trade-offs in integration cost, compatibility, and worst-case fidelity.
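The summary does not say how the algebraic fits are computed; as a rough illustration of the general idea, the minimal sketch below compresses a cached key matrix with a plain truncated SVD and measures how much the attention weights drift. The rank, tensor shapes, and the SVD choice are illustrative assumptions, not details of Attention Matching or Dynamic Memory Sparsification.

```python
# Minimal sketch of algebraic KV-cache compression: replace the cached key
# matrix with a low-rank factorization and check how closely the attention
# distribution is preserved. Illustrative only; rank, shapes, and the use of
# a plain truncated SVD are assumptions, not the published method.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head, rank = 1024, 128, 16          # assumed shapes and target rank

K = rng.standard_normal((seq_len, d_head))     # cached keys for one head
q = rng.standard_normal(d_head)                # a single query vector

# Low-rank "algebraic fit": keep only the top singular directions of K.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_lowrank = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # rank-limited reconstruction

def attention_weights(q, K):
    scores = K @ q / np.sqrt(K.shape[1])
    scores -= scores.max()                     # numerical stability
    w = np.exp(scores)
    return w / w.sum()

w_full = attention_weights(q, K)
w_compressed = attention_weights(q, K_lowrank)

# Storage drops from seq_len*d_head floats to rank*(seq_len + d_head).
# Note: random keys are nearly full-rank, so drift is large here; real key
# matrices are far more structured, which is what such fits exploit.
compression = (seq_len * d_head) / (rank * (seq_len + d_head))
print(f"compression ~{compression:.1f}x, "
      f"L1 drift in attention weights: {np.abs(w_full - w_compressed).sum():.4f}")
```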

Memory, Not Just GPUs: DRAM Spike Forces New AI Cost Playbook
A roughly 7x surge in DRAM spot prices has pushed memory from a secondary expense to a primary cost lever for AI inference. Hardware allocation shifts by chipmakers, combined with emerging software patterns such as prompt-cache tiers, observational memory, and Nvidia’s Dynamic Memory Sparsification, mean teams must pair procurement strategy with cache orchestration to control per-inference spend.
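To see why DRAM pricing now feeds straight into per-inference spend, a back-of-the-envelope sketch of KV-cache memory per request is enough; every dimension below is an illustrative assumption rather than a figure from the article.

```python
# Back-of-the-envelope KV-cache sizing: why memory is a first-order inference
# cost. All model dimensions and the context length are illustrative
# assumptions, not numbers from the article.
n_layers   = 32        # decoder layers (assumed)
n_kv_heads = 8         # KV heads (assumed, grouped-query attention)
d_head     = 128       # per-head dimension (assumed)
seq_len    = 32_000    # cached context tokens (assumed)
bytes_per  = 2         # fp16/bf16 cache entries

# Keys and values are both cached, hence the factor of 2.
kv_bytes = 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per
print(f"KV cache per request: {kv_bytes / 2**30:.1f} GiB")   # ~3.9 GiB here

# Concurrency multiplies this linearly, which is why cache tiers, observational
# memory, and sparsification techniques translate directly into spend.
print(f"64 concurrent requests: {64 * kv_bytes / 2**30:.0f} GiB of cache")
```
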
Observational memory rethinks agent context: dramatic cost cuts and stronger long-term recall
A text-first, append-only memory design compresses agent histories into dated observations, enabling stable prompt caching and large token-cost reductions. Benchmarks and compression figures suggest this approach can preserve decision-level detail for long-running, tool-centric agents while reducing runtime variability and costs.
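The article describes the pattern only at a high level; the sketch below is a minimal, assumed interpretation of an append-only, dated observation log whose rendered prefix stays byte-stable across turns, which is what makes provider-side prompt caching effective. Class and method names are invented for illustration.

```python
# Minimal sketch of a text-first, append-only observation log. Older entries
# are never rewritten, so the rendered prefix stays byte-stable and can be
# prompt-cached; only new observations extend it. Names are illustrative
# assumptions, not an API from the article.
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ObservationMemory:
    observations: list[str] = field(default_factory=list)

    def observe(self, text: str, day: date | None = None) -> None:
        """Append a dated, one-line observation; existing lines never change."""
        stamp = (day or date.today()).isoformat()
        self.observations.append(f"[{stamp}] {text}")

    def prompt_prefix(self) -> str:
        """Stable, append-only text block to place at the top of the prompt."""
        return "Observations so far:\n" + "\n".join(self.observations)


memory = ObservationMemory()
memory.observe("User prefers responses in French.", date(2025, 3, 1))
memory.observe("Deploy script lives in infra/deploy.sh; requires VPN.", date(2025, 3, 4))

# Because earlier lines are immutable, successive prompts share an identical
# prefix across turns, keeping token costs and cache behavior predictable.
print(memory.prompt_prefix())
```
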
Nvidia Nemotron-Cascade 2: Post‑Training Playbook Upsets Size Orthodoxy
Nvidia’s Nemotron-Cascade 2 uses a sequential post-training recipe to deliver top-tier math and coding performance while activating only 3B parameters at inference. The Cascade RL pipeline plus MOPD token-level distillation signals a shift toward intelligence-density strategies that cut serving cost and raise the value of training orchestration. Public materials across the Nemotron family sometimes report divergent headline sizes, a difference that likely reflects measurement conventions rather than an architectural contradiction.
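Details of MOPD are not given here; purely to illustrate what a token-level distillation objective looks like in general (not Nvidia's specific recipe), a short PyTorch-style sketch of a per-token KL loss between teacher and student logits follows. Shapes, temperature, and masking are assumptions.

```python
# Generic token-level knowledge distillation, shown only to illustrate the kind
# of objective such a stage uses. This is NOT a description of Nvidia's MOPD;
# shapes, temperature, and masking are assumptions.
import torch
import torch.nn.functional as F

def token_level_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  mask: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) per token, averaged over unmasked positions.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len] with 1.0 on tokens that count toward the loss.
    """
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # Per-token KL divergence, summed over the vocabulary dimension.
    kl = (teacher_p * (teacher_p.clamp_min(1e-9).log() - student_logp)).sum(-1)
    return (kl * mask).sum() / mask.sum().clamp_min(1.0) * (t * t)

# Toy shapes just to show the call; real runs would use model outputs.
b, s, v = 2, 16, 1000
loss = token_level_distillation_loss(torch.randn(b, s, v),
                                     torch.randn(b, s, v),
                                     torch.ones(b, s))
print(loss.item())
```
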
OpenAI’s Reasoning-Focused Model Rewrites Cloud and Chip Economics
OpenAI is moving a new reasoning-optimized foundation model into product timelines, privileging memory-resident, low-latency inference that changes instance economics and supplier leverage. Hardware exclusives (reported Cerebras arrangements), a sharp DRAM price shock, and retrofittable software levers (e.g., Dynamic Memory Sparsification) together create a bifurcated market where hyperscalers, specialized accelerators, and neoclouds each capture different slices of growing inference value.
Blackwell delivers up to 10x inference cost cuts — but software and precision formats drive the gains
Nvidia-backed production data shows that pairing Blackwell GPUs with tuned software stacks and open-source models can lower inference costs by roughly 4x–10x. The largest savings come from adopting low-precision formats and model architectures that exploit high-throughput interconnects, rather than from hardware improvements alone.
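A rough way to see where the precision-format share of these gains comes from: weight (and KV-cache) bytes scale with format width, so moving from FP16 to FP8 or FP4 roughly halves or quarters the memory moved per token. The parameter count below is an assumption for illustration.

```python
# Rough arithmetic on precision formats: memory footprint (and, to first order,
# memory-bandwidth-bound decode cost) scales with bytes per parameter. The 70B
# parameter count is an illustrative assumption, not a figure from the article.
params = 70e9
for name, bytes_per_param in [("FP16/BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name:>9}: ~{gib:,.0f} GiB of weights")
# FP16 ~130 GiB, FP8 ~65 GiB, FP4 ~33 GiB: fewer bytes moved per token accounts
# for a large share of the software/precision side of the reported savings.
```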

Nvidia unveils DLSS 5 and pushes generative rendering beyond games
Nvidia announced DLSS 5, a hybrid rendering pipeline that uses structured 3D inputs plus generative models to cut per-frame raster work while boosting apparent fidelity. The move fits into a broader industry split between 'engine-less' generative stacks, where startups claim dramatic cost savings but face perceptual and continuity limits, and Nvidia’s platform play, which pairs new inference-optimized silicon with an agent-tooling roadmap to capture recurring inference revenue.

Google and NVIDIA Back New Memory Fabric That Reconfigures Servers
Google and NVIDIA have moved a coherent, pooled memory fabric from prototype toward productization, prompting hyperscalers to redesign node roles and procurement specs. Upstream supply shocks—large DRAM price moves, HBM prioritization and tooling partnerships—both accelerate the rationale for fabrics and complicate near‑term deployment and component availability.