Draw Things Democratizes Local Large Model Fine-tuning on iPhone, iPad and Mac
Training large language models and image generation models from scratch currently requires hundreds of thousands to tens of millions of dollars in hardware and compute. As a result, fine-tuning these pre-trained models, whether to introduce new concepts or to adapt them to specific tasks, has become an active research area over the past two years. Techniques like LoRA, QLoRA, and 8-bit Adam have made it feasible to fine-tune large models on consumer PCs. CUDA-based software such as bitsandbytes has been integrated into numerous open-source packages, facilitating fine-tuning for those with NVIDIA hardware.
However, since these advancements were primarily within the CUDA ecosystem, model fine-tuning was largely exclusive to powerful NVIDIA servers and PCs.
With the release of Draw Things version 1.20231004.1, we’ve extended the capability to fine-tune large image generation models like Stable Diffusion v1, v2, and XL to iPhone, iPad, and Mac.
Benefits of On-device Fine-tuning
On-device large model fine-tuning opens up even more possibilities for AI-assisted creative workflows. Whether you’re using 3 to 4 photos to introduce a new identity to the model or hundreds of your artworks to teach it a new style, on-device fine-tuning ensures privacy and offers limitless customization. Fine-tuning on personal hardware also provides a broader range of choices, from base model selection and image captioning to denoising schedules and learning rates. These training recipes are essential components of the creative process, not just tools for experimentation.
The Path to On-device Fine-tuning
Draw Things adopts a specific fine-tuning strategy known as LoRA (Low-Rank Adaptation). Our approach overlays the low-rank network on both linear and convolution layers, a variant known as LoCon in the Stable Diffusion community.
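To make the idea concrete, here is a minimal PyTorch sketch of what such an overlay can look like. This illustrates the general LoRA/LoCon technique, not Draw Things’ actual Swift/Metal implementation; the class names and the rank/alpha defaults are our own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank overlay: y = Wx + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A: project down to rank
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B: project back up
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)  # zero delta at the start, so training begins from the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class LoRAConv2d(nn.Module):
    """The LoCon extension: the same low-rank idea applied to a convolution layer."""
    def __init__(self, base: nn.Conv2d, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        # The down-projection keeps the spatial kernel; the up-projection is a cheap 1x1 conv.
        self.down = nn.Conv2d(base.in_channels, rank, base.kernel_size,
                              stride=base.stride, padding=base.padding, bias=False)
        self.up = nn.Conv2d(rank, base.out_channels, 1, bias=False)
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```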
Our LoRA method also builds on our Metal FlashAttention and JIT weight dequantization work, enabling us to train LoRA directly on top of quantized model weights, an approach known as QLoRA.
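The core of QLoRA-style training can be sketched in a few lines: the base weights stay quantized and frozen, are dequantized just-in-time inside the forward pass, and only the small adapter receives gradients. Again, this is a conceptual sketch using simple symmetric 8-bit quantization, not the app’s Metal implementation.

```python
import torch
import torch.nn as nn

class QuantizedLinear(nn.Module):
    """Stores weights as int8 plus per-row scales; dequantizes just-in-time in forward()."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("q_weight", torch.round(weight / scale).to(torch.int8))
        self.register_buffer("scale", scale)

    def forward(self, x):
        # JIT dequantization: no full-precision copy of the base weights is kept in memory.
        w = self.q_weight.to(x.dtype) * self.scale.to(x.dtype)
        return x @ w.t()
```

Wrapping such a layer with the low-rank overlay above yields the QLoRA arrangement: a quantized, frozen base plus a full-precision, trainable adapter.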
For the first time ever, this setup allows the SD v1 model to be fine-tuned at 512x512 resolution on an iPhone 15 Pro, with peak memory consumption of around 6 GiB, including model weights. For the 3.5B-parameter SDXL, peak memory is approximately 10.3 GiB, making it possible to fine-tune such a large model on an iPad.
While the main network operates at FP16, the LoRA network runs at FP32 during training. This separation stabilizes the training process, even at higher learning rates (up to 1e-3). Test users have reported that as few as 500 steps at learning rates of 1e-4 or 1e-3 are sufficient to introduce a new concept to the model. A lower learning rate combined with more training steps allows the model to absorb finer details from the training samples.
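Building on the hypothetical LoRALinear sketch above, this mixed-precision arrangement can be expressed by keeping the base path in FP16 while routing the adapter path through FP32; the variant below is our illustration, not the app’s code.

```python
class LoRALinearMixed(LoRALinear):
    """Variant of the earlier sketch: FP16 base path, FP32 adapter path."""
    def forward(self, x):
        base_out = self.base(x)  # frozen base runs in FP16
        # The adapter runs in FP32 for numerically stable gradients, then casts back.
        delta = self.up(self.down(x.float()))
        return base_out + (self.scale * delta).to(base_out.dtype)
```

After calling `model.half()`, one would call `.float()` on the `up` and `down` submodules so the adapter parameters, and hence the optimizer state, stay in full precision.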
Our method is also efficient. For instance, 500 steps at 512x512 resolution with SD v1 take about an hour on an iPhone, and only around 20 minutes on an iPad M2 or a Mac Mini. Fine-tuning SDXL at the same resolution and step count on an M2 Ultra takes 14 minutes. This speed makes it not only feasible but practical to fine-tune models on personal devices for both professional and recreational purposes.
Over the past two weeks, we’ve collaborated closely with our community to test the LoRA training feature. The results have been impressive, with consistent character creation and “helper” LoRAs for hand and finger fixes. We’re eager to see what our users will come up with next.
The Future
At the beginning of our “Stable Diffusion on an iPhone” project, we estimated that about 50% of the performance potential of our inference code remained untapped. That gap has since been closed with our Metal FlashAttention work.
Our training code is far from optimal. We believe a speed increase of at least 50% is attainable, and that RAM usage can be reduced by another 30% without additional quantization and with minimal impact on speed.
I’m also excited about the prospect of more creative control over the fine-tuning process, including features like ControlNet signal injection, combining the base model with existing LoRAs during training, and co-training with textual inversion.
On iPhone, our implementation supports one particular configuration: network dim = 8, an 8-bit base SD model, and no text-encoder co-training.
Fine-tuning uses the AdamW optimizer with betas of (0.9, 0.999), an epsilon of 1e-8, and a weight decay of 0.001, as sketched below.
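In PyTorch terms, that optimizer configuration would look like the following; `model` is assumed to be a network whose only trainable parameters are the LoRA adapters, as in the earlier sketches.

```python
from torch.optim import AdamW

# Only the adapter weights are trainable; everything else was frozen above.
lora_params = [p for p in model.parameters() if p.requires_grad]

optimizer = AdamW(
    lora_params,
    lr=1e-4,               # 1e-4 to 1e-3, per the training notes above
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.001,
)
```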
With this release, we’ve also enabled LoRA export, allowing users to share their trained LoRAs on model-sharing sites.
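As a final illustration, exporting trained adapters boils down to collecting the low-rank tensors into a flat state dict and saving them as a .safetensors file. The key naming below is illustrative only; widely shared LoRA files typically follow the kohya-ss convention, and Draw Things’ exact export format may differ.

```python
import torch
from safetensors.torch import save_file

# Hypothetical export: gather adapter tensors from the sketched modules above.
state_dict = {}
for name, module in model.named_modules():
    if isinstance(module, (LoRALinear, LoRAConv2d)):
        key = "lora_unet_" + name.replace(".", "_")  # illustrative key naming
        state_dict[key + ".lora_up.weight"] = module.up.weight.detach().to(torch.float16).contiguous()
        state_dict[key + ".lora_down.weight"] = module.down.weight.detach().to(torch.float16).contiguous()

save_file(state_dict, "my_lora.safetensors")
```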