BF16 and Image Generation Models
Draw Things maintains its own local inference and training stack for image generation models. We support diffusion transformer models ranging from small (Stable Diffusion 3.5 Medium, a 2.5B parameter model) to medium (FLUX.1, a 12B parameter model) to large-scale models like HiDream I1 (17B parameters).
One area that has received little attention is the activation dynamics of diffusion transformers as they get deeper. These architectures, particularly FLUX.1 (an MMDiT variant), tend to produce progressively larger activations in later blocks. A common solution is to run them in BF16, which has a much larger dynamic range than FP16. This is one reason BF16 has gained popularity in image generation models.
However, BF16 brings its own challenges. Its shorter mantissa (7 bits, versus FP16's 10) can lead to accuracy issues, and support for BF16 on older Apple Silicon (M1/M2) is limited: software emulation only matured in macOS 15, and even then it is roughly 50% slower than FP16 on these platforms.
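To make the trade-off concrete, here is a quick look at the two formats' range and precision. This is purely illustrative and not part of our stack; it uses NumPy plus the ml_dtypes package, since NumPy has no native bfloat16 type.

```python
import numpy as np
import ml_dtypes  # provides a NumPy-compatible bfloat16 dtype

fp16 = np.finfo(np.float16)
bf16 = np.finfo(ml_dtypes.bfloat16)

# Range: BF16 reaches ~3.4e38, while FP16 tops out at 65,504.
print("max:", fp16.max, "vs", bf16.max)
# Precision: FP16 carries 10 explicit mantissa bits, BF16 only 7,
# so FP16's machine epsilon is ~8x smaller.
print("eps:", fp16.eps, "vs", bf16.eps)
```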
Over the past year, Draw Things has refined its FP16 support to enable efficient execution of large diffusion transformer models on M1/M2, often achieving performance comparable to M3/M4. In this post, we share our general approach and model-specific tuning strategies for making FP16 viable. Our hope is to help extend support for cutting-edge models to a wider range of edge devices, especially for users whose hardware lacks accelerated BF16 or who find BF16 accuracy unsatisfactory.
FP32
In diffusion transformers, a key challenge lies in the final layer normalization prior to projecting back into the latent space. Because this layernorm renormalizes whatever it receives, nothing constrains the magnitude of the preceding activations, which often grow beyond FP16's representable range (about ±65,504).
To address this, we upcast the main activation accumulation path to FP32. This sidesteps dynamic range limitations without significant performance cost — since element-wise operations (including layernorm) are not the main bottleneck in image or video generation.
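As a rough sketch of what this looks like (NumPy, with hypothetical names and shapes; not our Metal kernels), the accumulation path is kept in FP32 through the final adaptive layernorm and the projection back to latent channels:

```python
import numpy as np

def final_layer_fp32(x, shift, scale, w_proj, eps=1e-6):
    """Final adaptive layernorm + latent projection, normalized in FP32.

    x:      (tokens, hidden) activations from the last block (any float dtype)
    shift:  (hidden,) AdaLN shift; scale: (hidden,) AdaLN scale
    w_proj: (hidden, latent_channels) output projection weight
    """
    x = x.astype(np.float32)                       # keep the accumulation path in FP32
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x = (x - mu) / np.sqrt(var + eps)              # pre-norm values may exceed FP16 range
    x = x * (1.0 + scale.astype(np.float32)) + shift.astype(np.float32)
    return x @ w_proj.astype(np.float32)           # post-norm values are well-behaved
```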
FP16 & Transformer Block
For all MMDiT / DiT variants we’ve encountered, the main activation is routed through an adaptive layernorm before entering the transformer block. This presents a clean boundary where we can safely convert activations to FP16 and run the rest of the block in FP16.
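Sketched in NumPy (hypothetical names; `block_fp16` stands in for the attention and MLP sublayers), the boundary looks like this: the FP32 residual stream passes through the adaptive layernorm, the bounded post-norm activations are cast to FP16 for the block body, and the block output is accumulated back in FP32.

```python
import numpy as np

def adaptive_layernorm(x, shift, scale, eps=1e-6):
    """Adaptive layernorm in FP32; its output is roughly unit-scale, so FP16-safe."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * (1.0 + scale) + shift

def run_block_fp16(residual, shift, scale, block_fp16):
    """residual: FP32 accumulation path; block_fp16: the block body, run in FP16."""
    h = adaptive_layernorm(residual, shift, scale)      # FP32 in, bounded out
    out16 = block_fp16(h.astype(np.float16))            # attention + MLP in FP16
    return residual + out16.astype(np.float32)          # accumulate back in FP32
```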
For many models — such as the Wan 2.1 series — this is sufficient. But in some MMDiT variants with large MLP intermediate dimensions, additional care is needed.
Scaling in MLP Layers
The MLP layers project up to a wide intermediate dimension before collapsing back to the hidden dimension, and the down-projection of this wide intermediate can overflow FP16. To mitigate this, we apply conservative power-of-two scaling factors (typically ⅛ or ¼).
These factors are conservative by design: scaling by a power of two only shifts the exponent, so no mantissa bits are lost, and FP16's 10-bit mantissa offers ~3 bits more precision than BF16's 7. A scaled FP16 activation (e.g., by ⅛) therefore often retains more numerical fidelity than an unscaled BF16 value.
In these paths, we upcast to FP32 after the final GEMM in the MLP and perform the adaptive layernorm gating in FP32.
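A sketch of this path, again in NumPy with made-up names (⅛ is used as the example factor, and GELU stands in for whichever activation the model uses). Because the scale is applied to the input of the linear down-projection, it commutes with the GEMM and can be undone exactly after the upcast:

```python
import numpy as np

def gelu(x16):
    """tanh-approximation GELU, computed in FP32 and cast back to FP16."""
    x = x16.astype(np.float32)
    y = 0.5 * x * (1.0 + np.tanh(0.7978845608 * (x + 0.044715 * x**3)))
    return y.astype(np.float16)

def scaled_mlp(h16, w_up16, w_down16, gate, s=0.125):
    """FP16 MLP with a conservative power-of-two scale on the down-projection.

    h16:      (tokens, hidden) FP16 post-AdaLN input
    w_up16:   (hidden, intermediate) FP16 up-projection weight
    w_down16: (intermediate, hidden) FP16 down-projection weight
    gate:     (hidden,) FP32 adaptive layernorm gate
    """
    a16 = gelu(h16 @ w_up16)                              # wide intermediate, FP16
    out16 = (a16 * np.float16(s)) @ w_down16              # scaled so the GEMM stays in range
    out = out16.astype(np.float32) * np.float32(1.0 / s)  # upcast after the final GEMM, undo scale
    return out * gate                                     # adaptive layernorm gating in FP32
```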
Attention Scaling Strategy
Attention operations typically apply a scaling factor of 1/√d_k, where d_k is the per-head dimension. In our FlashAttention implementation, accumulation occurs in FP16, which can still cause range issues.
We’ve found that applying the scaling factor before the attention (rather than fusing it inside the kernel) helps mitigate overflow and preserves numerical stability for our particular implementation.
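For illustration only, here is a naive single-head reference (not our FlashAttention kernel) with the 1/√d_k factor folded into Q before the matmuls, so the logits that a fused FP16 accumulator would see are already scaled down:

```python
import numpy as np

def attention_prescaled(q16, k16, v16):
    """Naive attention with the softmax scale applied to Q up front.

    q16, k16: (tokens, d_k) in FP16; v16: (tokens, d_v) in FP16.
    Pre-scaling Q keeps the Q @ K^T products small, which is what matters
    when a fused kernel accumulates them in FP16.
    """
    d_k = q16.shape[-1]
    q16 = q16 * np.float16(1.0 / np.sqrt(d_k))     # scale before attention
    logits = (q16 @ k16.T).astype(np.float32)      # reference softmax in FP32
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights.astype(np.float16) @ v16        # FP16 weighted sum of values
```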
Exact Configurations
Below are the exact FP16 tuning configurations we use in Draw Things for various models. These adjustments allow for stable FP16 inference without requiring full BF16 support:
FLUX.1
Activation scaling factor: 8
Scaled layers: Double stream blocks 17, 18 (0-indexed)
Hunyuan
Activation scaling factor: 8
Scaled layers: All double stream blocks
Wan 2.1 14B
Activation scaling: Not needed
Pre-scaling: Applied for attention
HiDream
Activation scaling factor: 4
Scaled layers: Double stream blocks 13, 14, 15 (0-indexed)
Pre-scaling: Applied for attention
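For reference, the same settings expressed as a hypothetical lookup table (the keys and field names are ours, not identifiers from the Draw Things source; pre-scaling is listed above only for Wan 2.1 14B and HiDream):

```python
# Hypothetical FP16 tuning table mirroring the configurations listed above.
# An activation_scale of 1 means no scaling is applied.
FP16_TUNING = {
    "flux.1":      {"activation_scale": 8, "scaled_double_blocks": [17, 18],     "prescale_attention": False},
    "hunyuan":     {"activation_scale": 8, "scaled_double_blocks": "all",        "prescale_attention": False},
    "wan-2.1-14b": {"activation_scale": 1, "scaled_double_blocks": [],           "prescale_attention": True},
    "hidream":     {"activation_scale": 4, "scaled_double_blocks": [13, 14, 15], "prescale_attention": True},
}
```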