Tell me the architectural details about the on device speculative decoding draft model

Asked on 2025-06-11

Speculative decoding for the on-device models was discussed in the "Platforms State of the Union" session at WWDC 2024, in the context of optimizing inference performance and efficiency. According to the session, speculative decoding, context pruning, and grouped-query attention were all tuned to get the most out of the Neural Engine, so that prompts and responses can be processed efficiently on-device. These optimizations are part of Apple's broader strategy to deliver powerful, intuitive, and integrated language and diffusion models that run efficiently on devices.

For more details, see the Platforms State of the Union session (00:04:37).
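The answer above names speculative decoding without showing how a draft model fits into generation. As a rough illustration of the general technique (not Apple's implementation, whose draft model architecture is not described here), the sketch below shows the greedy variant: a small draft model proposes a few tokens, and the larger target model checks them and keeps the longest matching prefix. The names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token list to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: given the tokens so far,
# return the single most likely next token (greedy decoding).
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding sketch; output matches target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model verifies the proposal position by position.
        #    (A real system would score all positions in one batched pass;
        #    this sketch calls the target once per position for clarity.)
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target contributes one
            # extra token if there is still room.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens


# Toy models (assumptions for illustration only).
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the last token is a multiple of 3.
    guess = (seq[-1] * 3 + 1) % 7
    return guess if seq[-1] % 3 else (guess + 1) % 7


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, prompt=[1], max_new_tokens=10))
```

In this greedy form the result is identical to decoding with the target model alone; the potential speedup comes from the target verifying several draft tokens per step instead of generating one token per step. Sampling-based variants use a probabilistic accept/reject rule, which this sketch does not cover.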

Meet the Foundation Models framework

Learn how to tap into the on-device large language model behind Apple Intelligence! This high-level overview covers everything from guided generation for generating Swift data structures and streaming for responsive experiences, to tool calling for integrating data sources and sessions for context management. This session has no prerequisites.

Platforms State of the Union

Discover the newest advancements on Apple platforms.

Get to know the ManagedApp Framework

Discover how the ManagedApp framework helps your app adapt to managed environments. We’ll show you how to receive configuration data, manage app secrets securely, and tailor your app’s behavior based on organization-provided settings. We’ll also walk through real-world examples to show how you can build more flexible, manageable apps for enterprise and education environments.