Revolutionary AI Model Runs on MacBook Pro, Challenging Data-Center Norms

March 22, 2026
  • A frontier AI approach demonstrates on-device execution of a large model by streaming only the four active experts per layer from a 209GB weight file, using SSD-based loading, an FMA-optimized 4-bit dequantization kernel, hand-written Metal shaders, fused activations, RMS normalization, and batched GPU attention, with GPU expert computation deferred to overlap with CPU preparation of the next layer.

  • Dan Woods ran the 397B-parameter Qwen3.5 model on a 48GB MacBook Pro with an M3 Max, achieving over 5.5 tokens per second and challenging the notion that models of this size require data-center hardware.

  • The effort relies on Apple’s 2023 research showing how to stream parameters from flash into RAM, aided by NVMe storage and the unified memory architecture of Apple silicon.

  • Rather than coding everything from scratch, Claude Code was used to translate Apple’s research into optimized Objective-C and Metal, with 90 automated experiments producing the final implementation, now open-sourced on GitHub.

  • A detailed quick-start guide and project structure outline what was kept and discarded, highlighting decisions like fused GPU operations, 64-head BLAS optimization, and avoiding bespoke caches to maximize performance.

  • Safety notes specify a memory budget of about 5.5GB for non-expert weights, ~200MB of scratch space, and roughly 6GB in total; because expert weights are streamed and caching is left to the OS rather than a custom cache, there is no out-of-memory risk.

  • Experts caution that quality tradeoffs exist and evaluation is ongoing, though the core achievement demonstrates a foundational technique for efficient on-device AI with limited memory.

  • The project, flash-moe, demonstrates running a 397B Mixture-of-Experts model on a MacBook Pro with 48GB RAM using pure C/Metal without Python or other frameworks.

  • flash-moe includes both code and an AI-written technical paper detailing the experiments, offering a practical blueprint for running large models on consumer hardware.

  • Per-layer latency is around 4.28ms at 4-bit precision, and the production setup achieves 4-bit inference with tool calling on the 209GB on-disk model; 2-bit configurations are faster but disable tool calling.
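
As a rough sanity check on these figures, per-token time is approximately the per-layer latency times the layer count. The 4.28ms and >5.5 tokens/s numbers are from the source; the layer count below is a placeholder assumption, since the source does not state Qwen3.5's depth.

```c
/* Rough throughput model: per-token time ~ num_layers * per-layer latency.
 * num_layers is an assumed placeholder, not a figure from the source. */
static double tokens_per_sec(double per_layer_ms, int num_layers)
{
    return 1000.0 / (per_layer_ms * (double)num_layers);
}
```

At 4.28ms per layer, sustaining more than 5.5 tokens/s leaves a per-token budget of under roughly 182ms, i.e. room for about 42 layers of compute overlapped with the SSD reads.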

  • The system leverages unified memory and SSD DMA on Apple Silicon, relying on the OS page cache for expert weight caching with about a 71% hit rate, rather than a custom cache.

  • Woods’s approach exploits the Mixture-of-Experts architecture, which activates only a small subset of parameters per token while the rest are streamed from storage; reducing the active experts per token from 10 to 4 trades a little quality for a much smaller memory footprint.
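
The routing step behind that tradeoff amounts to a top-k selection over the router's per-expert scores. The sketch below is illustrative only, assuming a dense score vector per token; the project's actual router is not described in the source.

```c
#include <stddef.h>

/* Number of active experts per token (the source's reduced setting). */
#define TOP_K 4

/* Pick the TOP_K highest-scoring experts by repeated selection.
 * For the handful of experts per MoE layer, this simple O(k*n)
 * scan is cheap relative to the expert matmuls themselves. */
static void pick_top_experts(const float *scores, size_t n_experts,
                             int chosen[TOP_K])
{
    for (int k = 0; k < TOP_K; k++) {
        int best = -1;
        for (size_t e = 0; e < n_experts; e++) {
            int taken = 0;
            for (int j = 0; j < k; j++)
                if (chosen[j] == (int)e) taken = 1;
            if (taken) continue;
            if (best < 0 || scores[e] > scores[best]) best = (int)e;
        }
        chosen[k] = best;
    }
}
```

Only the chosen experts' weights then need to be resident, which is what makes streaming the remainder from SSD viable.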

Summary based on 2 sources
