Revolutionary 'Compute It Once' Paradigm Slashes AI Costs by 50x with Precomputed Caches

June 14, 2026
  • The paper envisions an agent-native CDN for prefill, while acknowledging unresolved issues such as lossless KV compression and a cross-party payment/royalty mechanism.

  • Loadable KV caches face a practical hurdle: KV data is nearly incompressible, so the egress cost of each load can exceed the prefill compute it saves. Hosting caches provider-side mitigates this by removing egress altogether.

  • The work lays groundwork for an agent-native CDN architecture that reduces redundant computation and scales, yet faces challenges including compression and cross-party monetization.

  • One proposal is to host KV caches on the provider side (as prompt caching already does) rather than ship them, which avoids egress costs and gives vendors a workable model: an API tariff on cache reads.

  • Publishers would precompute a document’s KV cache and let agents load it on demand, producing output that is token-for-token identical to on-the-fly prefill.

  • Experiments with Qwen3-4B show reuse is 9-50x cheaper in compute than redoing prefill, with bigger savings for longer documents due to prefill’s O(L^2) attention cost.

  • The Compute It Once paradigm proposes precomputing a document’s key-value caches once and licensing their use to multiple AI agents, eliminating repeated prefill steps and enabling token-exact results.

  • The approach aims to cut redundant computation across agents by letting publishers monetize precomputed KV caches.

  • Reusing precomputed KV caches can deliver token-exact results with no accuracy loss, enabling substantial compute savings and better scalability as document length increases.

  • In one concrete example, serving a 3,774-token document to 80 million agents costs about $1.5 million with re-prefill versus about $0.03 million (roughly $30,000) with reuse, a saving of about 50x.

  • On models like Qwen3-4B, reuse yields substantial compute savings, with the efficiency gap widening as document length grows because of quadratic attention scaling in prefill.

  • The arXiv paper details the concept and notes related startup activity, along with potential implications for AI agent deployment and cost efficiency.
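
The cost example and the egress concern in the bullets above can be checked with some quick arithmetic. The Python sketch below is illustrative only. The token count, agent count, and dollar totals come from the summary. The cache shape (36 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumed Qwen3-4B-like configuration, and the $0.09/GB egress price is a hypothetical placeholder; neither is taken from the paper. The summary also does not say what the reuse figure includes, so the comparison is indicative at best.

```python
# Figures quoted in the summary.
DOC_TOKENS = 3_774
NUM_AGENTS = 80_000_000
REPREFILL_TOTAL_USD = 1.5e6
REUSE_TOTAL_USD = 0.03e6

# ASSUMPTIONS (not from the summary): a Qwen3-4B-like KV cache shape
# and a hypothetical cloud egress price.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # fp16 / bf16 storage
EGRESS_USD_PER_GB = 0.09     # hypothetical list price


def per_load_cost(total_usd: float, loads: int) -> float:
    """Average cost of serving the document to one agent."""
    return total_usd / loads


def kv_cache_bytes(tokens: int) -> int:
    """Uncompressed KV cache size: keys and values for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens


def egress_cost_usd(tokens: int) -> float:
    """Cost of shipping one uncompressed cache at the assumed egress price."""
    return kv_cache_bytes(tokens) / 1e9 * EGRESS_USD_PER_GB


if __name__ == "__main__":
    reprefill = per_load_cost(REPREFILL_TOTAL_USD, NUM_AGENTS)
    reuse = per_load_cost(REUSE_TOTAL_USD, NUM_AGENTS)
    print(f"re-prefill per load: ${reprefill:.6f}")
    print(f"reuse per load:      ${reuse:.6f}")
    print(f"savings ratio:       {REPREFILL_TOTAL_USD / REUSE_TOTAL_USD:.0f}x")
    print(f"cache size:          {kv_cache_bytes(DOC_TOKENS) / 1e9:.3f} GB")
    print(f"egress per load:     ${egress_cost_usd(DOC_TOKENS):.6f}")
```

Under these assumed numbers, one 3,774-token cache is roughly 0.56 GB and costs about $0.05 to ship, more than the roughly $0.019 per-load cost of simply redoing prefill. That is the situation the egress bullet describes, and it is why provider-side hosting matters to the proposal. Different cache shapes, precisions, or egress prices would shift the break-even point.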

Summary based on 2 sources



Sources

Can I Buy Your KV Cache?

arXiv.org • Jun 11, 2026


Compute Once: Unlocking AI Agent Efficiency

StartupHub.ai • Jun 13, 2026

