Tell me the architectural details of the on-device speculative decoding draft model

Asked on 06/11/2025

The on-device speculative decoding draft model was covered in the "Platforms State of the Union" session at WWDC 2024. The session presented speculative decoding as one of the techniques used to optimize inference performance and efficiency for on-device models. Together with context pruning and grouped-query attention, it was tuned to make the most of the Neural Engine, so that prompts and responses can be processed efficiently on-device. These optimizations are part of Apple's broader strategy to deliver powerful, intuitive, and integrated language and diffusion models that run efficiently on devices.
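To make the technique concrete, here is a minimal sketch of greedy speculative decoding in general. It is not Apple's implementation: the session does not specify one, and every name below (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`, the parameter `k`) is hypothetical and chosen for illustration. A small draft model proposes a few tokens, the larger target model checks them, and the longest agreeing prefix is kept, so the output matches what the target model alone would have produced.

```python
from typing import Callable, List

Token = int
# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[Token]], Token]


def greedy_decode(model: Model, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       max_new_tokens: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: the draft proposes up to k tokens, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. A real system would do
        #    this in a single batched forward pass; the loop is for clarity only.
        accepted: List[Token] = []
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess == expected:
                accepted.append(guess)
            else:
                # First disagreement: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every guess was accepted, so the target adds one more token if there is room.
            if len(tokens) + len(accepted) < limit:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens


# Toy stand-ins (assumptions for illustration only): the target counts upward
# modulo 10; the draft agrees with it except after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12, k=3))
```

Because each draft token is accepted only when the target agrees with it, the draft model affects speed, not the result. The more often the draft guesses correctly, the more tokens are confirmed per target pass.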

For more details, you can refer to the Platforms State of the Union (00:04:37) session.