<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Engineering @ Draw Things]]></title><description><![CDATA[Talks about all engineering work @ Draw Things.]]></description><link>https://engineering.drawthings.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!Y6tX!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85591798-29cd-4ff6-8a6a-8f3caa61cb4b_1024x1024.png</url><title>Engineering @ Draw Things</title><link>https://engineering.drawthings.ai</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 00:07:15 GMT</lastBuildDate><atom:link href="https://engineering.drawthings.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Engineering @ Draw Things]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[drawthingsapp@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[drawthingsapp@substack.com]]></itunes:email><itunes:name><![CDATA[Authors of Draw Things]]></itunes:name></itunes:owner><itunes:author><![CDATA[Authors of Draw Things]]></itunes:author><googleplay:owner><![CDATA[drawthingsapp@substack.com]]></googleplay:owner><googleplay:email><![CDATA[drawthingsapp@substack.com]]></googleplay:email><googleplay:author><![CDATA[Authors of Draw Things]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[2 Days to Ship: Codex-authored Metal Compute Shaders in Draw Things]]></title><description><![CDATA[In the 1.20260314.0 release, we shipped our first two Codex-authored Metal compute shaders, which improved LTX-2.3 video VAE decoding speed by about 2.4x on M1 through M4, and 4.7x on 
M5.]]></description><link>https://engineering.drawthings.ai/p/2-days-to-ship-codex-authored-metal</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/2-days-to-ship-codex-authored-metal</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Mon, 16 Mar 2026 18:53:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!W3bH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is not a post about the nitty-gritty details of the journey. The whole episode took 2 days. This is a short post, more along the lines of: &#8220;It is here. Get used to it.&#8221;</p><p>In <a href="https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820">the 1.20260314.0 release</a>, we shipped our first two Codex-authored Metal compute shaders, which improved LTX-2.3 video VAE decoding speed by about 2.4x on M1 through M4, and 4.7x on M5. 
On their own, the shaders delivered 4x and 8.3x speedups over the MPSGraph baseline, respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h2W8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h2W8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 424w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 848w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 1272w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h2W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png" width="1456" height="533" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:533,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h2W8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 424w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 848w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 1272w, https://substackcdn.com/image/fetch/$s_!h2W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59aa08f5-3d25-469c-adc0-4285296f10b4_1492x546.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>3D convolution is one of those operations that had its day in the past. These days, much of its usefulness has been diminished by the rise of transformer architectures. But for video generation, 3D convolution is still a handy little tool for mixing information across time. There are other ways to do it, but it is handy, readily available, and well optimized on NVIDIA platforms.</p><p>It is also supported on Apple platforms, at least since 2022, through the MPSGraph API. In Draw Things, we use that API to support 3D convolutions for the video VAE decoders of video diffusion models such as Wan 2.x and LTX-2.x. We always knew we could probably do better ourselves, but it was never a priority.</p><p>Until LTX-2.3.</p><p>LTX-2.3 updated its video VAE to improve reconstruction fidelity. The new decoder is deeper, and therefore requires more FLOPs to complete. 
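</p><p>As a refresher on why 3D convolution is useful here: each output frame is a weighted sum over a small temporal window of input frames. A minimal sketch in plain Python (toy shapes, single channel; illustrative only, not the shaders discussed in this post):</p>

```python
# Illustrative causal 3D convolution (single channel, stride 1):
# zero padding in space, and kt - 1 frames of zero history in time,
# so output frame t mixes input frames t-kt+1 .. t.
def conv3d(x, w):
    T, H, W = len(x), len(x[0]), len(x[0][0])
    kt, kh, kw = len(w), len(w[0]), len(w[0][0])
    pt, ph, pw = kt - 1, kh // 2, kw // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            tt, ii, jj = t + dt - pt, i + di - ph, j + dj - pw
                            if 0 <= tt < T and 0 <= ii < H and 0 <= jj < W:
                                s += w[dt][di][dj] * x[tt][ii][jj]
                out[t][i][j] = s
    return out
```

<p>With the causal temporal pad of kt - 1, frame t never reads future frames, which is the property video VAE decoders rely on.</p><p>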
On M5, our old LTX-2.3 video decoding pipeline could take as long as the entire diffusion process combined.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q_MH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q_MH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 424w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 848w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 1272w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q_MH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png" width="1194" height="340" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://engineering.drawthings.ai/i/191163796?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q_MH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 424w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 848w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 1272w, https://substackcdn.com/image/fetch/$s_!q_MH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3e24ad-80ee-4bb1-9887-858d9f90f632_1194x340.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Started the work when this was posted.</figcaption></figure></div><p>In the <a href="https://github.com/liuliu/example_matmul_metal4">experimental repo</a>, Codex quickly gave us some initial numbers. The MPSGraph 3D convolution API could only reach about 1.1 TFLOPs, far below the roughly 12 TFLOPs we can achieve with our GEMM kernel on M5. 
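</p><p>For reference, these TFLOPs figures follow the standard convolution cost model of two FLOPs per multiply-accumulate. A back-of-the-envelope sketch (illustrative shapes, not the actual benchmark harness):</p>

```python
# Each 3D-convolution output element costs Cin * kT * kH * kW
# multiply-adds, i.e. twice that many FLOPs.
def conv3d_flops(cin, cout, kt, kh, kw, t, h, w):
    macs = cout * t * h * w * cin * kt * kh * kw
    return 2 * macs

# Hypothetical layer shape; divide by measured wall time for TFLOP/s.
flops = conv3d_flops(cin=256, cout=256, kt=3, kh=3, kw=3, t=16, h=128, w=128)
tflops = flops / 0.25 / 1e12  # e.g. achieved TFLOP/s if it ran in 0.25 s
```

<p>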
Further analysis with Metal Frame Capture showed that the MPSGraph 3D convolution code path was not using the neural accelerator.</p><p>After that, the breakdown of how Codex should approach the problem was straightforward:</p><ol><li><p>Give it access to the Metal Shading Language Specification PDF, and ask it to implement 3D convolution using the new 2D convolution tensor ops API.</p></li><li><p>Ask it to write unit tests to verify the results.</p></li><li><p>Give it development access to an M5 iPad so that it can benchmark the initial kernel.</p></li><li><p>Give it access to our old GEMM kernel so that it can understand the optimization target, namely reaching FLOPs parity.</p></li></ol><p>There was some back-and-forth on how padding should be supported, but by the end of March 12, we had shipped a build with a Codex-authored, neural-accelerator-enabled 3D convolution shader to TestFlight testers.</p><p>I used that time to do a bit more research on the topic and found the <a href="http://github.com/ml-explore/mlx/pull/3147">MLX PR</a> discussing the use of implicit GEMM to implement 3D convolution. The concept is similar to how we handle 2D convolution. The next day, I gave Codex the PR link and our GEMM kernel implementation, and asked it to implement 3D convolution on an M3 Pro using implicit GEMM. Since we already had pretty good test coverage, Codex more or less auto-piloted to the finish line with a shader that quickly came within 10% of FLOPs parity. That was already 4x the performance of the MPSGraph implementation.</p><p>From there, I migrated the session to an M2 Ultra and asked Codex to benchmark it there and continue optimizing if needed. 
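</p><p>As an aside, the implicit-GEMM idea can be sketched in a few lines: each output voxel is one row of a virtual matrix product, and the row is gathered from the input on the fly instead of being materialized im2col-style. A toy single-channel version in plain Python (stride 1, no padding; illustrative only, not the shipped Metal kernel):</p>

```python
# Implicit GEMM for a toy 3D convolution (one input/output channel):
# rows of the "A matrix" are generated from output coordinates on the
# fly, so no im2col buffer is ever materialized.
def conv3d_implicit_gemm(x, w):
    T, H, W = len(x), len(x[0]), len(x[0][0])
    kt, kh, kw = len(w), len(w[0]), len(w[0][0])
    To, Ho, Wo = T - kt + 1, H - kh + 1, W - kw + 1
    wf = [w[a][b][c] for a in range(kt) for b in range(kh) for c in range(kw)]
    out = []
    for m in range(To * Ho * Wo):      # one GEMM row per output voxel
        t, rem = divmod(m, Ho * Wo)
        i, j = divmod(rem, Wo)
        row = [x[t + a][i + b][j + c]  # lazy im2col gather
               for a in range(kt) for b in range(kh) for c in range(kw)]
        out.append(sum(r * f for r, f in zip(row, wf)))  # dot = GEMM row
    return out
```

<p>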
It initially showed some inefficiencies on M2 Ultra, which has more GPU cores, but Codex was able to ramp performance up quickly from about 10 TFLOPs to about 20 TFLOPs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W3bH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W3bH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 424w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 848w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 1272w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W3bH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png" width="1456" height="669" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W3bH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 424w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 848w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 1272w, https://substackcdn.com/image/fetch/$s_!W3bH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6fb8a1-d7ac-4e33-be07-ecb46f8a2c7a_1492x686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The build with the new optimized 3D convolution compute shader for pre-M5 devices went out to TestFlight on Friday. 
Today, we released it to the public.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YrOQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YrOQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 424w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 848w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 1272w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YrOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png" width="1456" height="900" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YrOQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 424w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 848w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 1272w, https://substackcdn.com/image/fetch/$s_!YrOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4e011a-85dd-4c3b-954b-ecddd0cb4e49_1492x922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While I still think we can do better with hand-rolled shaders, this kind of work usually takes us weeks, if not months. With Codex, it took 2 days.</p><p>It is here. Get used to it.</p>]]></content:encoded></item><item><title><![CDATA[Optimizing Qwen Image for edge devices]]></title><description><![CDATA[Qwen Image is the largest open-weight state-of-the-art image generation model. 
With its 20B parameters, it poses unique challenges for Draw Things to support across iPhone, iPad, and Mac.]]></description><link>https://engineering.drawthings.ai/p/optimizing-qwen-image-for-edge-devices</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/optimizing-qwen-image-for-edge-devices</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Fri, 05 Sep 2025 17:20:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JLaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Draw Things supports local inference on iPhone, iPad, and Mac&#8212;including devices released as far back as five years ago.</p><p>Qwen Image is a 20B-parameter image generation model that delivers next-level prompt adherence. <strong>Qwen Image Edit</strong> is a fine-tuned variant with editing capabilities. At its core, the model is a 60-layer MMDiT transformer combined with a fine-tuned Wan 2.x video VAE (from Alibaba&#8217;s video generation model).</p><p>As discussed in <a href="https://engineering.drawthings.ai/p/bf16-and-image-generation-models-803cf0515bee">BF16 and Image Generation Models</a>, MMDiT tends to gradually increase the activation scale during training. This was already an issue for Hunyuan and FLUX.1, which contain only 20 and 18 layers of MMDiT blocks respectively. For Qwen Image, activations can reach magnitudes on the order of ~20 million.</p><p>On Apple hardware from the M1/M2 era, it is generally better to keep most computation in FP16 to avoid BF16 emulation. 
However, due to Qwen&#8217;s drastic increase in activation range, far more FP16 activations need scaling than in earlier models, even though the main activations can still accumulate in FP32.</p><h3>Activation Dynamics</h3><p>In each MMDiT block, two pathways feed activations back into the FP32 main path:</p><ul><li><p>the <code>out_proj</code> result after attention,</p></li><li><p>and the <code>FFN</code> result.</p></li></ul><p>For earlier models, only the <code>FFN</code> computation required scaling. In Qwen, later layers must produce sufficiently large activations in both pathways to influence the FP32 path&#8212;causing FP16 overflow unless scaling is applied.</p><p>To address this, we adopted a more aggressive down-scaling strategy:</p><ul><li><p>The input to the <code>q/k/v</code> projection is down-scaled by 8 (with the RMS norm epsilon adjusted accordingly).</p></li><li><p>The attention output is further down-scaled by 2, then up-scaled back after <code>out_proj</code> into the FP32 main path.</p></li><li><p>For the FFN, we use a 32&#215; down-scale factor for layers 0&#8211;58, and an even more aggressive 512&#215; factor for layer 59.</p></li></ul><p>With these strategies, we can run both Qwen Image and Qwen Image Edit in FP16 with minimal accuracy loss. 
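</p><p>To see why these factors are needed: FP16&#8217;s largest finite value is 65504, while Qwen&#8217;s late-layer activations reach magnitudes around 20 million. A small numeric sketch (the 512 factor is the layer-59 FFN factor above; the activation magnitude is illustrative, and the emulation models only overflow, not rounding):</p>

```python
FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16(v):
    # Crude float16 emulation: models overflow only, not precision loss.
    return float("inf") if abs(v) > FP16_MAX else v

activation = 2.0e7                          # Qwen-style late-layer magnitude
assert to_fp16(activation) == float("inf")  # raw FP16 overflows
scaled = to_fp16(activation / 512.0)        # layer-59 FFN down-scale
assert scaled == 39062.5                    # now representable
assert scaled * 512.0 == activation         # up-scaled back in FP32
```

<p>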
A BF16 version is also provided, in which only the critical layers that would otherwise require scaling run in BF16&#8212;minimizing the impact of BF16 emulation on older devices.</p><p>Read this post for a comparison of the activation scaling impact.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:170984339,&quot;url&quot;:&quot;https://releases.drawthings.ai/p/introducing-qwen-image-support&quot;,&quot;publication_id&quot;:5952909,&quot;publication_name&quot;:&quot;Releases @ Draw Things&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!z_ZD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f0328e-c1ab-411f-9649-a4dfd7c935b3_1024x1024.png&quot;,&quot;title&quot;:&quot;Introducing Qwen Image Support&quot;,&quot;truncated_body_text&quot;:&quot;Qwen Image is the most powerful open-source image generation model to date, released by the Qwen team at Alibaba. We&#8217;ve been working hard to support it across the Apple ecosystem, and we&#8217;re happy to announce that it is now broadly available through the Draw Things app. 
From iPhone to Mac, Apple devices released within the past five years can run this st&#8230;&quot;,&quot;date&quot;:&quot;2025-08-15T17:45:29.322Z&quot;,&quot;like_count&quot;:10,&quot;comment_count&quot;:2,&quot;bylines&quot;:[{&quot;id&quot;:349217964,&quot;name&quot;:&quot;Authors of Draw Things&quot;,&quot;handle&quot;:&quot;drawthingsapp&quot;,&quot;previous_name&quot;:&quot;Engineering @ Draw Things&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88f0328e-c1ab-411f-9649-a4dfd7c935b3_1024x1024.png&quot;,&quot;bio&quot;:&quot;Talks about all work @ Draw Things.&quot;,&quot;profile_set_up_at&quot;:&quot;2025-05-28T17:03:48.950Z&quot;,&quot;reader_installed_at&quot;:null,&quot;publicationUsers&quot;:[{&quot;id&quot;:5261570,&quot;user_id&quot;:349217964,&quot;publication_id&quot;:5158040,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:5158040,&quot;name&quot;:&quot;Engineering @ Draw Things&quot;,&quot;subdomain&quot;:&quot;drawthingsapp&quot;,&quot;custom_domain&quot;:&quot;engineering.drawthings.ai&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Talks about all engineering work @ Draw Things.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85591798-29cd-4ff6-8a6a-8f3caa61cb4b_1024x1024.png&quot;,&quot;author_id&quot;:349217964,&quot;primary_user_id&quot;:349217964,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-05-28T17:04:34.855Z&quot;,&quot;email_from_name&quot;:&quot;Engineering @ Draw Things&quot;,&quot;copyright&quot;:&quot;Engineering @ Draw 
Things&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}},{&quot;id&quot;:6072185,&quot;user_id&quot;:349217964,&quot;publication_id&quot;:5952909,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:5952909,&quot;name&quot;:&quot;Releases @ Draw Things&quot;,&quot;subdomain&quot;:&quot;releasesdrawthings&quot;,&quot;custom_domain&quot;:&quot;releases.drawthings.ai&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Talks about Draw Things releases.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88f0328e-c1ab-411f-9649-a4dfd7c935b3_1024x1024.png&quot;,&quot;author_id&quot;:349217964,&quot;primary_user_id&quot;:null,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-08-11T02:35:49.024Z&quot;,&quot;email_from_name&quot;:&quot;Releases @ Draw Things&quot;,&quot;copyright&quot;:&quot;Engineering @ Draw Things&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;unverified&quot;}}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" 
native="true" href="https://releases.drawthings.ai/p/introducing-qwen-image-support?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!z_ZD!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f0328e-c1ab-411f-9649-a4dfd7c935b3_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">Releases @ Draw Things</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Introducing Qwen Image Support</div></div><div class="embedded-post-body">Qwen Image is the most powerful open-source image generation model to date, released by the Qwen team at Alibaba. We&#8217;ve been working hard to support it across the Apple ecosystem, and we&#8217;re happy to announce that it is now broadly available through the Draw Things app. From iPhone to Mac, Apple devices released within the past five years can run this st&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">8 months ago &#183; 10 likes &#183; 2 comments &#183; Authors of Draw Things</div></a></div><h3>Video VAE</h3><p>Qwen Image uses Wan 2.x&#8217;s video VAE to encode and decode latent space. Like FLUX.1&#8217;s VAE, it has a similar parameter count, but it employs causal 3D convolution for many operations. Naively applying the video VAE for first-frame decoding makes image generation slow: decoding a 1024&#215;1024 image can take 5&#8211;6 seconds on an M3 Pro.</p><p>Looking deeper, however, Wan&#8217;s video VAE applies zero padding for previous frames during first-frame decoding. In these cases, full 3D convolution is unnecessary. 
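A minimal NumPy sketch of this reduction (toy sizes and naive loops for clarity; not our Swift implementation): when all earlier frames are zero padding, a causal 3D convolution evaluated at the first frame produces exactly the same result as a 2D convolution using only the last temporal slice of the 3D kernel.

```python
import numpy as np

def causal_conv3d_first_frame(x, w, b):
    """Naive causal 3D convolution evaluated at the first frame only.

    x: (C_in, H, W) first frame; all earlier frames are zero padding.
    w: (C_out, C_in, T, KH, KW) causal kernel (T taps into the past).
    b: (C_out,) bias. Spatial padding is 'same' via zeros.
    """
    c_out, c_in, T, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for t in range(T):
        # Taps t < T-1 read zero-padded (nonexistent) earlier frames,
        # so only the last tap, t == T-1, contributes anything.
        frame = xp if t == T - 1 else np.zeros_like(xp)
        for i in range(H):
            for j in range(W):
                out[:, i, j] += np.einsum(
                    "oihw,ihw->o", w[:, :, t], frame[:, i:i+kh, j:j+kw])
    return out + b[:, None, None]

def conv2d(x, w2d, b):
    """Plain 2D convolution (same spatial padding convention as above)."""
    c_out, _, kh, kw = w2d.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.einsum("oihw,ihw->o", w2d, xp[:, i:i+kh, j:j+kw])
    return out + b[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))          # first frame, 4 channels
w = rng.standard_normal((6, 4, 3, 3, 3))    # causal kernel, 3 temporal taps
b = rng.standard_normal(6)
ref = causal_conv3d_first_frame(x, w, b)
fast = conv2d(x, w[:, :, -1], b)            # 2D conv, last temporal slice
assert np.allclose(ref, fast)
```

The 2D path skips the wasted multiplies against zero frames entirely, which is where the decode-time savings come from.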
By adjusting convolution weights/biases and switching to 2D convolution, we reduce decoding time to under a second for the same resolution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JLaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JLaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JLaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/078d2ace-918f-4031-935e-589b166bf105_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JLaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!JLaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F078d2ace-918f-4031-935e-589b166bf105_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Timestep-based Adaptive Layer-Norm</h3><p>Of Qwen Image&#8217;s 20B parameters, about 7B are allocated to adaptive layer norm. In our <a href="https://engineering.drawthings.ai/p/from-iphone-ipad-to-mac-enabling-rapid-local-deployment-of-sd3-medium-with-s4nnc-324bd5e81cd5">SD3 optimization article</a>, we described splitting the model to save VRAM. Unlike FLUX.1 or Hunyuan (which also use adaptive layer norm with MMDiT), Qwen Image&#8217;s adaptive layer norm depends only on the timestep.</p><p>This leads to an interesting implication: if we discretize timesteps between 0 and 1000, we only need to store 1001&#215;718&#215;3072 possible values, a lower number than the ~7B parameters required to generate them. In reality, timesteps in flow-matching models are not strictly discrete, but practitioners often fix the number of steps and shift values. 
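A hypothetical sketch of such a cache (names, sizes, and the embedding are illustrative stand-ins, not the app's API): because the modulation vectors depend only on the discretized timestep, they can be memoized, so the projection weights need not stay resident across denoising steps.

```python
import numpy as np
from functools import lru_cache

HIDDEN = 64   # toy width; the real vectors are 3072 wide
N_MOD = 6     # a few shift/scale/gate vectors per block (718 in total)

rng = np.random.default_rng(0)
# Stand-in for the adaptive layer-norm projection weights (~7B in Qwen Image).
W = rng.standard_normal((N_MOD * HIDDEN, HIDDEN)).astype(np.float32)

def timestep_embedding(t, dim=HIDDEN):
    """Sinusoidal embedding of an integer timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.cos(ang), np.sin(ang)]).astype(np.float32)

@lru_cache(maxsize=1001)  # one entry per discretized timestep 0..1000
def modulation(t: int):
    """Shift/scale/gate vectors for timestep t: computed once, then reused."""
    return tuple((W @ timestep_embedding(t)).reshape(N_MOD, HIDDEN))

# Repeated queries at the same discretized timestep never touch W again.
a = modulation(500)
b = modulation(500)
assert all(np.array_equal(u, v) for u, v in zip(a, b))
assert modulation.cache_info().hits >= 1
```

In the toy version the cache merely saves a matrix multiply; at real scale, the point is that the cached vectors replace loading the ~7B projection parameters each step.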
By caching these projected conditions, we can avoid loading ~7B parameters into memory.</p><p><strong>Note:</strong> This optimization isn&#8217;t necessary when the weights already reside in VRAM&#8212;the incremental FLOPs are minimal. It primarily helps when loading ~7B parameters into RAM is the bottleneck.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5rbg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5rbg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5rbg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5rbg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!5rbg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f564681-88af-45d8-a952-a4501adecd4c_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[BF16 and image generation models]]></title><description><![CDATA[Draw Things maintains its own local inference and training stack for image generation models.]]></description><link>https://engineering.drawthings.ai/p/bf16-and-image-generation-models-803cf0515bee</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/bf16-and-image-generation-models-803cf0515bee</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Wed, 30 Apr 2025 20:04:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HJ9s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://drawthings.ai">Draw Things</a> maintains its own <a href="https://github.com/drawthingsai/draw-things-community/tree/main/Libraries/SwiftDiffusion">local inference and training stack</a> for image generation 
models. We support diffusion transformer models ranging from small (<a href="https://huggingface.co/stabilityai/stable-diffusion-3.5-medium">SD Medium 3.5</a>, a 2.5B parameter model), to medium (<a href="https://huggingface.co/black-forest-labs/FLUX.1-dev">FLUX.1</a>, a 12B parameter model), to large-scale models like <a href="https://huggingface.co/HiDream-ai/HiDream-I1-Full">HiDream I1</a> (17B parameters).</p><p>One area that has received little attention is the activation dynamics of diffusion transformers as they deepen. Architectures&#8202;&#8212;&#8202;particularly FLUX.1, a variant of MMDiT&#8202;&#8212;&#8202;tend to progressively increase activation scale. A common solution is to use BF16, which has a much larger dynamic range. This is one reason BF16 has gained popularity in image generation models.</p><p>However, BF16 brings its own challenges. Its reduced mantissa (compared to FP16) can lead to <a href="https://arxiv.org/abs/2411.13476">accuracy</a> <a href="https://blog.comfy.org/p/updates-for-wan-21-and-hunyuan-image#%C2%A7wan-in-fp">issues</a>, and support for BF16 on older Apple Silicon (M1/M2) is limited. Software emulation only matured in macOS 15&#8202;&#8212;&#8202;and even then, it&#8217;s roughly 50% slower than FP16 on these platforms.</p><p>Over the past year, Draw Things has refined its FP16 support to enable efficient execution of large diffusion transformer models on M1/M2, often achieving performance comparable to M3/M4. In this post, we&#8217;re sharing our general approach and model-specific tuning strategies to make FP16 viable. Our hope is to help extend support for cutting-edge models to a wider range of edge devices&#8202;&#8212;&#8202;especially for users without hardware-accelerated BF16 or who find BF16 accuracy unsatisfactory.</p><h4>FP32</h4><p>In diffusion transformers, a key challenge lies in the final layer normalization prior to projecting back into the latent space. 
This layernorm allows the preceding activations to scale freely, often beyond FP16&#8217;s representable range.</p><p>To address this, we upcast the main activation accumulation path to FP32. This sidesteps dynamic range limitations without significant performance cost&#8202;&#8212;&#8202;since element-wise operations (including layernorm) are not the main bottleneck in image or video generation.</p><h4>FP16 &amp; Transformer Block</h4><p>For all MMDiT / DiT variants we&#8217;ve encountered, the main activation is routed through an adaptive layernorm before entering the transformer block. This presents a clean boundary where we can safely convert activations to FP16 and run the rest of the block in FP16.</p><p>For many models&#8202;&#8212;&#8202;such as the Wan 2.1 series&#8202;&#8212;&#8202;this is sufficient. But in some MMDiT variants with large MLP intermediate dimensions, additional care is needed.</p><h4>Scaling in MLP Layers</h4><p>The MLP layers often include a wide projection before collapsing back to the hidden dimension. The down-projection can lead to activation overflow in FP16. To mitigate this, we apply conservative scaling factors (typically &#8539; or &#188;).</p><p>These factors are conservative by design: FP16&#8217;s 10-bit mantissa offers ~3 bits more precision than BF16, so a scaled FP16 activation (e.g., by &#8539;) often retains more numerical fidelity than an unscaled BF16 value.</p><p>In these paths, we upcast to FP32 after the final GEMM in the MLP and perform the adaptive layernorm gating in float32.</p><h4>Attention Scaling Strategy</h4><p>Attention operations typically apply a scaling factor of 1/&#8730;k. 
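As a numerical illustration of why the placement of this factor matters in FP16 (a NumPy sketch with contrived values, not our Metal kernel; d here plays the role of k in the text):

```python
import numpy as np

d = 64                                 # head dimension; scale is 1/sqrt(d)
scale = np.float16(1.0 / np.sqrt(d))   # 0.125, exact in FP16

# A deterministic worst case: large, same-sign activations, as can occur
# deep in a DiT. The values are illustrative, not from a real model.
q = np.full((1, d), 80.0, dtype=np.float16)
k = np.full((1, d), 80.0, dtype=np.float16)

# Scale applied after the FP16 Q @ K^T: the raw logit is 64 * 80 * 80
# = 409600, past FP16's max finite value (65504), so it is already inf.
post = (q @ k.T) * scale

# Scale applied to Q before the matmul: the logit is 409600 / 8 = 51200,
# which FP16 represents exactly.
pre = (q * scale) @ k.T

assert np.isinf(post).all()
assert np.isfinite(pre).all() and float(pre[0, 0]) == 51200.0
```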
In our FlashAttention implementation, accumulation occurs in FP16, which may still result in range issues.</p><p>We&#8217;ve found that applying the scaling factor <em>before</em> the attention (rather than fusing it inside the kernel) helps mitigate overflow and preserves numerical stability <em>for our particular implementation</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HJ9s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HJ9s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 424w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 848w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HJ9s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png" width="1456" height="2184" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Annotated DiT block with FP32 / FP16 mixed precision.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Annotated DiT block with FP32 / FP16 mixed precision." title="Annotated DiT block with FP32 / FP16 mixed precision." srcset="https://substackcdn.com/image/fetch/$s_!HJ9s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 424w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 848w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!HJ9s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d424b18-ad7d-47d3-b900-995747c0d876_1600x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Annotated DiT block with FP32 / FP16 mixed precision.</figcaption></figure></div><h4>Exact Configurations</h4><p>Below are the exact FP16 tuning configurations we use in Draw Things for various models. 
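Before the per-model details, a toy NumPy illustration of what an activation scaling factor buys in an FP16 down-projection (contrived one-unit sizes and values; the shipped implementations are the Swift files linked for each model below):

```python
import numpy as np

n = 256
# Toy down-projection: a wide MLP activation collapsing to one unit.
# Values chosen so the true output, 256 * 16 * 16 = 65536, sits just
# past FP16's max finite value (65504).
x = np.full(n, 16.0, dtype=np.float16)   # wide intermediate activation
w = np.full(n, 16.0, dtype=np.float16)   # down-projection weights

naive = x @ w                            # overflows to inf in FP16
assert np.isinf(naive)

s = np.float16(0.125)                    # conservative 1/8 scaling factor
scaled = (x * s) @ w                     # stays in range: 65536 / 8 = 8192
restored = np.float32(scaled) / np.float32(s)   # upcast, undo the scale
assert restored == np.float32(65536.0)
```

The pre-scaled GEMM stays finite, and upcasting before undoing the scale recovers the full-range value, which is the pattern the scaled layers below rely on.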
These adjustments allow for stable FP16 inference without requiring full BF16 support:</p><p>FLUX.1</p><ul><li><p><strong>Activation scaling factor</strong>: 8</p></li><li><p><strong>Scaled layers</strong>: Double stream blocks 17, 18 (0-indexed)</p></li><li><p><a href="https://github.com/drawthingsai/draw-things-community/blob/main/Libraries/SwiftDiffusion/Sources/Models/Flux1.swift#L497">Implementation Details</a></p></li></ul><p>Hunyuan</p><ul><li><p><strong>Activation scaling factor</strong>: 8</p></li><li><p><strong>Scaled layers</strong>: All double stream blocks</p></li><li><p><a href="https://github.com/drawthingsai/draw-things-community/blob/main/Libraries/SwiftDiffusion/Sources/Models/Hunyuan.swift#L489">Implementation Details</a></p></li></ul><p>Wan 2.1 14B</p><ul><li><p><strong>Activation scaling</strong>: Not needed</p></li><li><p><strong>Pre-scaling</strong>: Applied for attention</p></li><li><p><a href="https://github.com/drawthingsai/draw-things-community/blob/main/Libraries/SwiftDiffusion/Sources/Models/Wan.swift#L93">Implementation Details</a></p></li></ul><p>HiDream</p><ul><li><p><strong>Activation scaling factor</strong>: 4</p></li><li><p><strong>Scaled layers</strong>: Double stream blocks 13, 14, 15 (0-indexed)</p></li><li><p><strong>Pre-scaling</strong>: Applied for attention</p></li><li><p><a href="https://github.com/drawthingsai/draw-things-community/blob/main/Libraries/SwiftDiffusion/Sources/Models/HiDream.swift#L403">Implementation Details</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Metal FlashAttention 2.0: pushing forward on-device inference & training on Apple silicon]]></title><description><![CDATA[Metal FlashAttention underpins Draw Things&#8217; claim of fastest image generation inside the Apple ecosystem.]]></description><link>https://engineering.drawthings.ai/p/metal-flashattention-2-0-pushing-forward-on-device-inference-training-on-apple-silicon-fe8aac1ab23c</link><guid 
isPermaLink="false">https://engineering.drawthings.ai/p/metal-flashattention-2-0-pushing-forward-on-device-inference-training-on-apple-silicon-fe8aac1ab23c</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Tue, 07 Jan 2025 21:03:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jUL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><a href="https://engineering.drawthings.ai/integrating-metal-flashattention-accelerating-the-heart-of-image-generation-in-the-apple-ecosystem-16a86142eb18">Metal FlashAttention</a> underpins <a href="https://drawthings.ai">Draw Things</a>&#8217; claim of fastest image generation inside the Apple ecosystem. It conserves system memory, it is fast, and it supports a wide array of devices, the oldest being the iPhone 12, released more than 4 years ago.</p><p>Back in September, <a href="https://x.com/philipturnerar">Philip Turner</a> and I released Draw Things with Metal FlashAttention 2.0. Since then, we&#8217;ve integrated not only the forward pass (useful for inference) but also the experimental backward pass (useful for training). Combined, these make Draw Things the only efficient application on macOS / iOS that supports both inference and <a href="https://www.youtube.com/watch?v=6UNNcmbWxGc">fine-tuning</a> for FLUX.1 [dev], an 11B-parameter, state-of-the-art image generation model. 
This major version upgrade delivers:</p><ul><li><p><strong>Up to 20%</strong> faster inference on newer hardware such as the M3 / M4 / A17 Pro;</p></li><li><p>Carefully tuned memory-precision / register-precision choices that make FP16 inference more accurate and less prone to NaN errors;</p></li><li><p>A backward pass implementation that is <strong>up to 19%</strong> faster than naive implementations, to support efficient training on Apple devices;</p></li><li><p>Better-tuned parameters that deliver efficient inference and training for larger head dimensions;</p></li><li><p>A switch to runtime code generation for better compiler compatibility and ease of integration;</p></li><li><p>Support for BFloat16 emulation, with a slight deviation from certain rounding rules to run more efficiently on older devices;</p></li><li><p>Consistent performance across a wide array of sequence lengths and head dimensions (minimal performance cliffs).</p></li></ul><p>Translating these gains into real-world numbers, we see <strong>up to 20%</strong> faster inference for FLUX.1 and SD3 / AuraFlow models on M3 / M4 devices, similar improvements for SD3 / AuraFlow on older hardware, and <strong>around 2%</strong> improvement for FLUX.1 models on older hardware.</p><p>Compared to other implementations, FLUX.1 integrated inside Draw Things is <strong>up to 25%</strong> faster per iteration than the mflux implementation on M2 Ultra, and more in end-to-end times; it is <strong>up to 94%</strong> faster than ggml implementations (also known as the gguf format). 
SD Large 3.5 integrated inside Draw Things is <strong>up to 163%</strong> faster than DiffusionKit implementation for each iteration (on M2 Ultra).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jUL4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jUL4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jUL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png" width="1456" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jUL4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!jUL4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09060103-7c71-432f-937e-1d99013f4fa0_1600x728.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">M3 Pro (18 GPU cores, 18GiB RAM) runs at 5-bit quantization for Draw Things, 4-bit quantization for mflux and DiffusionKit. M4 Pro (20 GPU cores, 24GiB RAM) runs at 5-bit quantization for Draw Things, 8-bit quantization for mflux, 4-bit quantization for DiffusionKit, and 8-bit quantization for ComfyUI + gguf. 
M2 Ultra (76 GPU cores, 192GiB RAM) runs 8-bit quantization for Draw Things, no quantization for mflux, DiffusionKit, ComfyUI + PyTorch, and 8-bit quantization for ComfyUI + gguf.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mMAN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mMAN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mMAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png" width="1456" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mMAN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!mMAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aa37424-9577-4553-a736-7002ac36baf2_1600x728.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">M3 Pro (18 GPU cores, 18GiB RAM) runs at 5-bit quantization for Draw Things, and 4-bit quantization for DiffusionKit. M4 Pro (20 GPU cores, 24GiB RAM) runs at 5-bit quantization for Draw Things, and 4-bit quantization for DiffusionKit. M2 Ultra (76 GPU cores, 192GiB RAM) runs at 8-bit quantization for Draw Things, and no quantization for DiffusionKit.</figcaption></figure></div><blockquote><p>mflux: 0.5.1, DiffusionKit: 0.5.2, mlx: 0.21.1, ComfyUI: v0.3.8+PyTorch v2.6.0.dev20241218</p></blockquote><p>On the training side, training an SDXL LoRA at 1024x1024 is now <strong>2%</strong> faster than our previous implementation in Balanced mode. There is no macOS baseline to compare against for training FLUX.1 LoRAs; our implementation runs at <strong>9s per step per image</strong> at 1024x1024 resolution on M2 Ultra.</p><p>With the release of Metal FlashAttention 2.0, we invite the community to collaborate and extend this implementation to more downstream frameworks. 
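</p><p>For reference, the SDPA operation these kernels compute can be written naively in a few lines of numpy (a sketch of the math only, not the tiled Metal implementation):</p>

```python
import numpy as np

def sdpa(q, k, v):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (seq_q, seq_k) logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                              # (seq_q, d_v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out = sdpa(q, k, v)
print(out.shape)  # (4, 8)
```

<p>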
Our reference Swift implementation is available at: <a href="https://github.com/philipturner/metal-flash-attention">https://github.com/philipturner/metal-flash-attention</a>. Our C++ implementation is available as part of ccv: <a href="https://github.com/liuliu/ccv/tree/unstable/lib/nnc/mfa">https://github.com/liuliu/ccv/tree/unstable/lib/nnc/mfa</a>.</p><div><hr></div><h4>Appendix</h4><p>Comparison with other SDPA (Scaled Dot-Product Attention) kernel implementations (MLX, Apple MPSGraph). See raw data at <a href="https://docs.google.com/spreadsheets/d/1NHzYHcqtH5xb18trn9NyTc1EeSfXZen9C7E7vsPb_lI/edit?usp=sharing">https://docs.google.com/spreadsheets/d/1NHzYHcqtH5xb18trn9NyTc1EeSfXZen9C7E7vsPb_lI/edit?usp=sharing</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!09wj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!09wj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!09wj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!09wj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 1272w, 
https://substackcdn.com/image/fetch/$s_!09wj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!09wj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png" width="1456" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!09wj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!09wj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!09wj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 1272w, 
https://substackcdn.com/image/fetch/$s_!09wj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b0f6c5-4eba-4bf5-b674-58792abdaae4_1600x728.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">MLX: 92ab6bdeb862625d18136d459d0792c2edd4569d, MPSGraph: macOS 15.1.1</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!An5p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!An5p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!An5p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!An5p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!An5p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!An5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png" width="1456" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!An5p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 424w, https://substackcdn.com/image/fetch/$s_!An5p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 848w, https://substackcdn.com/image/fetch/$s_!An5p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 1272w, https://substackcdn.com/image/fetch/$s_!An5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a2415e-a015-418d-9351-d9fb4691a46f_1600x728.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MLX: 92ab6bdeb862625d18136d459d0792c2edd4569d, MPSGraph: macOS 15.1.1</figcaption></figure></div><p>End-to-end benchmark data is available at <a href="https://docs.google.com/spreadsheets/d/1A8xC2_wh_Nwc5p2uvNMnKMtN4kkJac1E764XHrADpBs/edit?usp=sharing">https://docs.google.com/spreadsheets/d/1A8xC2_wh_Nwc5p2uvNMnKMtN4kkJac1E764XHrADpBs/edit?usp=sharing</a></p>]]></content:encoded></item><item><title><![CDATA[From iPhone, iPad to Mac - enabling rapid local deployment of SD3 Medium with s4nnc]]></title><description><![CDATA[SD3 Medium was released on June 12th, 2024.]]></description><link>https://engineering.drawthings.ai/p/from-iphone-ipad-to-mac-enabling-rapid-local-deployment-of-sd3-medium-with-s4nnc-324bd5e81cd5</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/from-iphone-ipad-to-mac-enabling-rapid-local-deployment-of-sd3-medium-with-s4nnc-324bd5e81cd5</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Mon, 24 Jun 2024 20:02:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!DhLO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://huggingface.co/stabilityai/stable-diffusion-3-medium">SD3 Medium</a> was released on June 12th, 2024. Like everyone else, we gained access to the model on the same day. From then on, it was a race to deploy the model to <a href="https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820">Draw Things</a> users on iPhone, iPad and Mac. In this post, I will outline the tools we used, the lessons we learned, and the unique optimizations we applied to ensure best-in-class performance across a broad range of Apple devices.</p><h3>Model Conversion</h3><p>Over the past year, we&#8217;ve significantly streamlined our model conversion workflow. What used to take weeks with Stable Diffusion 1.4 now takes about a day. For example, we implemented our FP16 version of SD3 Medium on June 13th, 24 hours after the release.</p><p>To deploy cutting-edge image/text generative models to local devices, we use Swift implementations that compile natively on these platforms. This involves translating Python code, typically written in PyTorch, into Swift. We begin by setting up the correct Python environment, writing minimal viable inference code that calls the model correctly, inspecting the results, and then implementing the Swift code.</p><p><a href="https://github.com/pvieito/PythonKit">PythonKit</a> has been essential for our conversion work, allowing us to run Python reference code directly alongside our Swift reimplementation. 
First-class support for <a href="https://github.com/liuliu/s4nnc">s4nnc</a> on CUDA also enables us to run our Swift reimplementation on Linux systems, which is often the most hassle-free environment for running PyTorch inference code.</p><p>Our reimplementation generally involves rewriting the PyTorch model into a more declarative Swift model and comparing outputs layer by layer. This is particularly straightforward with transformer models, where each layer follows the same architecture.</p><p>Our implementation: <a href="https://github.com/liuliu/swift-diffusion/blob/main/examples/sd3/main.swift#L502-L661">https://github.com/liuliu/swift-diffusion/blob/main/examples/sd3/main.swift#L502-L661</a></p><p>SD3 Ref: <a href="https://github.com/Stability-AI/sd3-ref/blob/master/mmdit.py#L11-L619">https://github.com/Stability-AI/sd3-ref/blob/master/mmdit.py#L11-L619</a></p><h3>Model Quantization</h3><p>Deploying large models to local devices often requires weight quantization. For image generative models, we carefully balance quality and size trade-offs. With Draw Things, we ensure all our quantized models are practically &#8220;lossless.&#8221; We focus on sensible reductions that maintain compatibility across a wide range of devices rather than pushing for the smallest possible model size.</p><p>Currently, s4nnc supports a limited set of quantization options; we use 4-bit, 6-bit, and 8-bit block palettization as our main schemes. For diffusion models, we use the mean squared error of the final image between the quantized and non-quantized models to guide our decisions. 
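</p><p>As a toy sketch of that metric (synthetic arrays standing in for decoded images; the threshold below is illustrative, not the one we use), the decision reduces to comparing the MSE of the final images against a quality bar:</p>

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two decoded images (float arrays in [0, 1])."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

# Toy stand-ins: 'quantized' perturbs the reference decode with small noise.
rng = np.random.default_rng(7)
reference = rng.random((64, 64, 3))
quantized = np.clip(reference + rng.normal(0.0, 0.01, reference.shape), 0.0, 1.0)

error = mse(reference, quantized)
acceptable = error < 1e-3  # example threshold for "practically lossless"
print(error, acceptable)
```

<p>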
We selected 8-bit quantization for SD3 Medium and 6-bit for the T5 encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DhLO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DhLO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 424w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 848w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 1272w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DhLO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png" width="1152" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;8-bit quantized model.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="8-bit quantized model." title="8-bit quantized model." srcset="https://substackcdn.com/image/fetch/$s_!DhLO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 424w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 848w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 1272w, https://substackcdn.com/image/fetch/$s_!DhLO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c5ba35-0ca7-4130-a4b1-5afad8ff8158_1152x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image from 8-bit quantized model.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K2YP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K2YP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 424w, https://substackcdn.com/image/fetch/$s_!K2YP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!K2YP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 1272w, https://substackcdn.com/image/fetch/$s_!K2YP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K2YP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png" width="1152" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;non-quantized model.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="non-quantized model." title="non-quantized model." 
srcset="https://substackcdn.com/image/fetch/$s_!K2YP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 424w, https://substackcdn.com/image/fetch/$s_!K2YP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 848w, https://substackcdn.com/image/fetch/$s_!K2YP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 1272w, https://substackcdn.com/image/fetch/$s_!K2YP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda3fae26-0343-4612-80eb-c2cda79ce97d_1152x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image from FP16 non-quantized model.</figcaption></figure></div><h3>Model Optimization</h3><p>Unlike the UNet in SDXL/SD v1.5, SD3 Medium uses straightforward transformer blocks, limiting optimization opportunities&#8202;&#8212;&#8202;especially regarding FLOPs. However, we managed to split the model to reduce peak RAM usage during the diffusion sampling process to approximately 2.2 GiB for the quantized model (around 3.3 GiB for the non-quantized model).</p><p>This is possible by observing that while adaptive layer norm blocks are minimal in FLOPs, they have a high parameter count, around 670M. Since the input for the adaptive layer norm includes timestep conditioning, we cannot reduce FLOP computation. 
However, since the adaptive layer norm does not depend on intermediate model activations, we can batch its computation for every timestep at the start of diffusion sampling, converting many matrix-vector multiplications into a single matrix-matrix multiplication, which is slightly more efficient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1dJV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1dJV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1dJV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png" width="1200" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1dJV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 424w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 848w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 1272w, https://substackcdn.com/image/fetch/$s_!1dJV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c2adc7-f5a4-4c80-8f40-a78674fc861e_1200x742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">App Memory Usage measured from within Xcode</figcaption></figure></div><p>Thanks to these optimizations, we implemented the fastest SD3 Medium model inference on macOS, iOS, and iPadOS systems with minimal RAM usage and successfully shipped it to real users within a practical app.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0VDp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0VDp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 424w, 
https://substackcdn.com/image/fetch/$s_!0VDp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 848w, https://substackcdn.com/image/fetch/$s_!0VDp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 1272w, https://substackcdn.com/image/fetch/$s_!0VDp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0VDp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png" width="1200" height="822" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0VDp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 424w, 
https://substackcdn.com/image/fetch/$s_!0VDp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 848w, https://substackcdn.com/image/fetch/$s_!0VDp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 1272w, https://substackcdn.com/image/fetch/$s_!0VDp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ce2f58-e80b-4f53-afad-2e5394725ba0_1200x822.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ComfyUI and Draw Things both load the model from disk during generation. The CoreML-based Diffusers&#8217; macOS app (from the macOS App Store) can only generate at 512x512; the 1024x1024 + T5 configuration is available in its GitHub repository but cannot run effectively on an 18GiB RAM MacBook.</figcaption></figure></div><h3>Future Directions</h3><p>Our implementation can provide valuable feedback to the training process. Moving forward, we aim to conduct more research and ablation studies to explore:</p><p>1. Optimal parameter count distribution for adaptive layer norm&#8202;&#8212;&#8202;could we allocate fewer parameters here, and more to the MLP/QKV projections?</p><p>2. Comparing more quantization schemes to identify per-layer improvements, and establishing an unbiased prompt dataset for future data-free fine-tuning.</p><p>3. Leveraging torch.compile to rewrite the PyTorch model in Swift, all from within Swift using PythonKit.</p><p>We are excited to continue our research and share our development work in the future.</p>]]></content:encoded></item><item><title><![CDATA[Draw Things democratizes local large model fine-tuning on iPhone, iPad and Mac]]></title><description><![CDATA[Large language models and image generation models currently require hundreds of thousands to tens of millions of dollars for hardware acquisition and training.]]></description><link>https://engineering.drawthings.ai/p/draw-things-democratizes-local-large-model-fine-tuning-on-iphone-ipad-and-mac-2ceb60b5b462</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/draw-things-democratizes-local-large-model-fine-tuning-on-iphone-ipad-and-mac-2ceb60b5b462</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Thu, 05 Oct 2023 20:02:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!OUgj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large language models and image generation models currently require hundreds of thousands to tens of millions of dollars for hardware acquisition and training. As a result, fine-tuning these pre-trained models&#8202;&#8212;&#8202;whether to introduce new concepts or adapt them to specific tasks&#8202;&#8212;&#8202;has become an active research area over the past two years. Techniques like <a href="https://arxiv.org/abs/2106.09685">LoRA</a>, <a href="https://arxiv.org/abs/2305.14314">QLoRA</a>, and <a href="https://arxiv.org/abs/2110.02861">8Bit-Adam</a> have made it feasible to fine-tune large models on consumer PCs. CUDA-based software, such as <a href="https://github.com/TimDettmers/bitsandbytes">bitsandbytes</a>, has been integrated into numerous open-source packages, facilitating fine-tuning for those with NVIDIA hardware.</p><p>However, since these advancements were primarily within the CUDA ecosystem, model fine-tuning was largely exclusive to powerful NVIDIA servers and PCs.</p><p>With the release of <a href="https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820">Draw Things</a> version 1.20231004.1, we&#8217;ve extended the capability to fine-tune large image generation models like Stable Diffusion v1, v2, and XL to iPhone, iPad, and Mac.</p><h3>Benefits of On-device Fine-tuning</h3><p>On-device large model fine-tuning offers even more possibilities for AI-assisted creative workflows. Whether you&#8217;re using 3 to 4 photos to introduce a new identity to the model or hundreds of your artworks to teach it a new style, on-device fine-tuning ensures privacy and offers limitless customization. 
Fine-tuning on personal hardware also provides a broader range of choices, from base model selection and image captioning to denoising schedules and learning rates. These training recipes are essential components of the creative process, not just for experimentation.</p><h3>The Path to On-device Fine-tuning</h3><p>Draw Things adopted the fine-tuning strategy known as LoRA. Our approach overlays the LoRA network on both linear and convolution layers (known as LoCon in the Stable Diffusion community).</p><p>Our LoRA method also builds upon our <a href="https://engineering.drawthings.ai/integrating-metal-flashattention-accelerating-the-heart-of-image-generation-in-the-apple-ecosystem-16a86142eb18">Metal FlashAttention</a> and JIT weight dequantization work, enabling us to train LoRA directly on quantized model weights, an approach known as QLoRA.</p><p>For the first time ever, this setup allows the SD v1 model to be fine-tuned (at a 512x512 resolution) on an iPhone 15 Pro, with peak memory consumption of ~6GiB, including model weights. For the 3.5B parameter SDXL, our approach uses approximately 10.3GiB of peak memory, making it possible to fine-tune such a large model on an iPad.</p><p>While the main network operates at FP16, the LoRA network runs at FP32 during training. This distinction stabilizes the training process, even with higher learning rates (up to 1e-3). Test users have reported that as few as 500 steps at learning rates of 1e-4 or 1e-3 are sufficient to introduce a new concept to the model. A lower learning rate combined with extended training steps allows the model to absorb more details from training samples.</p><p>Our method is also efficient. For instance, 500 steps at a 512x512 resolution with SD v1 on an iPhone takes about an hour, while on an M2 iPad or Mac Mini, it&#8217;s just ~20 minutes. Fine-tuning SDXL at the same resolution and step count on an M2 Ultra takes 14 minutes. 
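The FP32-LoRA-over-quantized-weights arrangement can be sketched in a few lines. The following is a hypothetical numpy toy (per-tensor 8-bit quantization, made-up shapes and names), not the app's actual Swift implementation, which uses blockwise quantization and convolution layers as well:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_8bit(w):
    # Per-tensor absmax 8-bit quantization (a simplification of the
    # blockwise schemes used in practice).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

class QLoRALinear:
    """Frozen quantized base weight plus a trainable FP32 LoRA adapter."""

    def __init__(self, w, rank=4, alpha=4.0):
        self.q, self.scale = quantize_8bit(w.astype(np.float32))
        out_f, in_f = w.shape
        # LoRA factors are kept in FP32 for training stability; B starts
        # at zero so the adapter is a no-op before any training step.
        self.a = (rng.standard_normal((rank, in_f)) * 0.01).astype(np.float32)
        self.b = np.zeros((out_f, rank), dtype=np.float32)
        self.scaling = alpha / rank

    def forward(self, x):
        # JIT-dequantize the frozen base weight to FP16 for the main matmul.
        w16 = self.q.astype(np.float16) * np.float16(self.scale)
        base = x.astype(np.float16) @ w16.T
        # The low-rank update runs entirely in FP32.
        delta = (x.astype(np.float32) @ self.a.T) @ self.b.T * self.scaling
        return base.astype(np.float32) + delta
```

Only `a` and `b` receive gradients during training, which is why peak memory stays close to the size of the quantized base weights plus activations.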
This speed makes it not only feasible but also practical to fine-tune models on personal devices for both professional and recreational purposes.</p><p>Over the past two weeks, we&#8217;ve collaborated closely with our community to test the LoRA training feature. The results have been impressive, with consistent character creation and &#8220;helper&#8221; LoRAs for hand and finger fixes. We&#8217;re eager to see what our users will come up with next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OUgj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OUgj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OUgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png" width="800" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OUgj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 424w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 848w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 1272w, https://substackcdn.com/image/fetch/$s_!OUgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e4044-e94b-4b69-971a-27afa63839b2_800x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rKhV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rKhV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 424w, 
https://substackcdn.com/image/fetch/$s_!rKhV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 848w, https://substackcdn.com/image/fetch/$s_!rKhV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 1272w, https://substackcdn.com/image/fetch/$s_!rKhV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rKhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png" width="768" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efc633be-8027-4384-9f24-e18554e02d26_768x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rKhV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 424w, 
https://substackcdn.com/image/fetch/$s_!rKhV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 848w, https://substackcdn.com/image/fetch/$s_!rKhV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 1272w, https://substackcdn.com/image/fetch/$s_!rKhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc633be-8027-4384-9f24-e18554e02d26_768x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4t_V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4t_V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4t_V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png" width="512" height="512" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e96a7d51-1901-412d-b412-af175c99baeb_512x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4t_V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 424w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 848w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 1272w, https://substackcdn.com/image/fetch/$s_!4t_V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96a7d51-1901-412d-b412-af175c99baeb_512x512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-ZMU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-ZMU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-ZMU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!-ZMU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-ZMU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-ZMU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-ZMU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-ZMU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!-ZMU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-ZMU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75f0841-edef-49e1-82aa-3ae63d79e1cc_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!jVu0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jVu0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jVu0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/863148d6-45e6-4256-b6d3-763459006060_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jVu0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jVu0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F863148d6-45e6-4256-b6d3-763459006060_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">1, 4 by @wetcircuit, 2 by @&#20154;&#29983;&#21322;&#30334;&#22914;&#29017;&#36942;</figcaption></figure></div><h3>The Future</h3><p>At the beginning of our &#8220;<a href="https://liuliu.me/eyes/stretch-iphone-to-its-limit-a-2gib-model-that-can-draw-everything-in-your-pocket/">Stable Diffusion on an iPhone</a>&#8221; project, we estimated that about 50% of performance potential remained untapped for the inference code. This gap has since been closed with our <a href="https://engineering.drawthings.ai/integrating-metal-flashattention-accelerating-the-heart-of-image-generation-in-the-apple-ecosystem-16a86142eb18">Metal FlashAttention</a> work.</p><p>Our training code is far from optimal. 
We believe there&#8217;s at least a 50% speed increase attainable, and we can reduce RAM usage by another 30% without additional quantization and with minimal impact on speed.</p><p>I&#8217;m also excited about the prospect of more creative control over the fine-tuning process, including features like ControlNet signal injection, combining the base model with existing LoRAs, and co-training with textual inversion.</p><blockquote><p>Our iPhone implementation supports a particular configuration: network dim = 8, an 8-bit base SD model, and no text-model co-training.</p><p>Fine-tuning uses the AdamW optimizer with betas of 0.9 and 0.999, epsilon of 1e-8, and weight decay of 0.001.</p><p>With this release, we&#8217;ve also enabled LoRA export, allowing users to share their trained LoRAs on model-sharing sites.</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Integrating Metal FlashAttention: accelerating the heart of image generation in the Apple ecosystem]]></title><description><![CDATA[Draw Things was the first practical app to run full-blown image generation models at the &#8220;edge&#8221; &#8212; directly on your mobile phone.]]></description><link>https://engineering.drawthings.ai/p/integrating-metal-flashattention-accelerating-the-heart-of-image-generation-in-the-apple-ecosystem-16a86142eb18</link><guid isPermaLink="false">https://engineering.drawthings.ai/p/integrating-metal-flashattention-accelerating-the-heart-of-image-generation-in-the-apple-ecosystem-16a86142eb18</guid><dc:creator><![CDATA[Authors of Draw Things]]></dc:creator><pubDate>Wed, 09 Aug 2023 20:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://drawthings.ai">Draw Things</a> was the first practical app to run full-blown image
generation models at the &#8220;edge&#8221;&#8202;&#8212;&#8202;directly on your mobile phone. Since its introduction, there&#8217;s been growing interest in locally-run open-source large models. <a href="https://github.com/ggerganov/llama.cpp">LLaMA.cpp</a> brought large language models to the laptop; <a href="https://mlc.ai/mlc-llm/">MLC LLM</a> executed language and image generation models in web browsers. What began as an academic exercise has evolved into a movement: just let me run my model, as powerful as the cloud ones, locally and free (&#8220;as in freedom&#8221;)!</p><p>Until now, most algorithmic innovations and improvements have occurred on NVIDIA CUDA hardware. That makes sense while most AI-related computing happens server-side. However, as compute moves closer to the edge, bringing those same algorithmic innovations to one of the most widely used hardware platforms captured our imagination.</p><p>Over the past few months, <a href="https://twitter.com/philipturnerar">Philip Turner</a> and I worked closely to integrate his <a href="https://github.com/philipturner/metal-flash-attention">Metal FlashAttention</a> into the Draw Things app. With <a href="https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820">version 1.20230807.0 of the app</a>, it generally cuts image generation time in half, often delivering a better experience than the cloud, with the added benefits of privacy and freedom.</p><h3>Metal FlashAttention</h3><p>Metal FlashAttention comprises Metal compute shaders optimized for operations commonly found in large image generation and language models. That includes thin matrix multiplications (e.g. [4096, 320] x [320, 320]), scaled dot product attention (the heart of multi-head attention or transformers) and layer normalization. 
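As a back-of-the-envelope illustration of why such thin matrix multiplications are hard to run fast (our own arithmetic, not from the MFA benchmarks; assumes fp16 operands each read or written exactly once):

```python
def gemm_intensity(M, K, N, bytes_per_elem=2):
    """FLOPs per byte of memory traffic for C[M,N] = A[M,K] @ B[K,N],
    assuming fp16 operands and a single pass over each matrix."""
    flops = 2 * M * K * N  # one multiply-add per (m, k, n) triple
    traffic = bytes_per_elem * (M * K + K * N + M * N)
    return flops / traffic

# The thin SD-style GEMM from the text vs. a square GEMM of the same M.
thin = gemm_intensity(4096, 320, 320)      # ~154 FLOPs/byte
square = gemm_intensity(4096, 4096, 4096)  # ~1365 FLOPs/byte
```

The thin shape has roughly 9x lower arithmetic intensity, so it leans on memory bandwidth rather than raw compute, which is where generic GEMM kernels tend to lose efficiency.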
It stands as an open-source alternative to <a href="https://developer.apple.com/documentation/metalperformanceshaders">Metal Performance Shaders</a> (<a href="https://developer.apple.com/documentation/metalperformanceshadersgraph?language=objc">MPS</a>).</p><h4>GEMM</h4><p>GEMM computations, typically found in the Stable Diffusion variant of models (v1, v2, XL), don&#8217;t hit the sweet spot of Apple&#8217;s Metal Performance Shaders or MPSGraph implementation. Metal FlashAttention leverages the <code>simdgroup_async_copy</code> API (since A14), an undocumented hardware feature that overlaps compute and load instructions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLQF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLQF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 424w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 848w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 1272w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YLQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png" width="1456" height="639" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YLQF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 424w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 848w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 1272w, https://substackcdn.com/image/fetch/$s_!YLQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75074be9-6c2b-4da5-9e8e-a0f52e5788d3_1600x702.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>FlashAttention</h4><p>Inspired by the <a href="https://github.com/Dao-AILab/flash-attention">FlashAttention</a> project, Metal FlashAttention aimed to improve both latency and memory footprint. At its core, the idea is easy to understand: scaled dot product attention is concisely captured by two lines of code from the PyTorch documentation:</p><pre><code>attn_weight = torch.softmax(scale * (Q @ K.transpose(-2, -1)), dim=-1)
return attn_weight @ V</code></pre><p>There&#8217;s no need to materialize the full <code>Q @ K.transpose(-2, -1)</code> matrix before computing the final result. Naively, you only need one row of <code>Q @ K.transpose(-2, -1)</code> at a time to compute its softmax and then perform the final matrix multiplication. This approach, often referred to as attention slicing, has been credited with further performance improvements in <a href="https://github.com/apple/ml-stable-diffusion">apple/ml-stable-diffusion</a> (named <code>SPLIT_EINSUM_V2</code>).</p><p>The original <a href="https://github.com/Dao-AILab/flash-attention">FlashAttention in CUDA</a> (by Dao AI Lab) covers both the forward and backward passes. Metal FlashAttention pays particular attention to the forward pass (inference). We made several optimizations to the original FlashAttention on the inference path, some of which were concurrently adopted in the FlashAttention v2 release. These optimizations decreased the total number of computations and increased numerical stability. We also built a block-sparse algorithm that automatically detects sparsity in the attention matrix. 
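The row-block idea behind attention slicing can be sketched in NumPy (a toy illustration under our own naming, not the actual Metal kernel):

```python
import numpy as np

def sliced_attention(Q, K, V, scale, block=64):
    """Attention without materializing the full Q @ K.T score matrix:
    process `block` query rows at a time, so peak extra memory is
    O(block * seq) instead of O(seq * seq)."""
    out = np.empty((Q.shape[0], V.shape[1]))
    for i in range(0, Q.shape[0], block):
        s = scale * (Q[i:i + block] @ K.T)             # [block, seq] scores
        s = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable softmax
        out[i:i + block] = (s / s.sum(axis=-1, keepdims=True)) @ V
    return out
```

FlashAttention goes one step further, streaming over K/V blocks with a running softmax so not even one full row of scores needs to be resident at once.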
This approach allowed a single shader to handle sparse, causal, or irregular masks, especially masks that change dynamically at runtime.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7CtT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7CtT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 424w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 848w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 1272w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7CtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png" width="1276" height="956" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:956,&quot;width&quot;:1276,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7CtT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 424w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 848w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 1272w, https://substackcdn.com/image/fetch/$s_!7CtT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecf88627-6955-4e09-9efb-cc5c4a134953_1276x956.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RfJP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RfJP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 424w, https://substackcdn.com/image/fetch/$s_!RfJP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 848w, 
https://substackcdn.com/image/fetch/$s_!RfJP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 1272w, https://substackcdn.com/image/fetch/$s_!RfJP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RfJP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png" width="1276" height="958" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:958,&quot;width&quot;:1276,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RfJP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 424w, https://substackcdn.com/image/fetch/$s_!RfJP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 848w, 
https://substackcdn.com/image/fetch/$s_!RfJP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 1272w, https://substackcdn.com/image/fetch/$s_!RfJP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf14fe7c-260e-4c91-8d92-dacd40a5cdce_1276x958.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The resulting speedup is not just measurable in percentages&#8202;&#8212;&#8202;it represents orders of magnitude. 
In the bottom graph, MPS performance was excluded as it couldn&#8217;t complete the benchmark in a reasonable time frame. It maxed out at 2000 GFLOPS (top), while MFA soared an order of magnitude higher (bottom).</p><h3>Real-world Impact</h3><p>The GEMM kernel of Metal FlashAttention has been integrated into the 1.20230726.0 release of the Draw Things app. The community has confirmed our claim of 10-30% performance improvements across many devices.</p><p>The full Metal FlashAttention integration, including GEMM with fused bias, scaled dot product attention with fused multi-head output projection, and custom layer normalization, has gone through extensive testing &amp; benchmarking.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-yh7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-yh7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 424w, https://substackcdn.com/image/fetch/$s_!-yh7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 848w, https://substackcdn.com/image/fetch/$s_!-yh7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-yh7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-yh7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png" width="1452" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-yh7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 424w, https://substackcdn.com/image/fetch/$s_!-yh7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 848w, https://substackcdn.com/image/fetch/$s_!-yh7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-yh7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3728f4a-a147-4b22-806d-8796ced6c3a1_1452x742.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zbzl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3873bb-5638-451a-a1fc-f2e02ae9db29_1452x742.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!Zbzl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3873bb-5638-451a-a1fc-f2e02ae9db29_1452x742.png" width="1452" height="742" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!drQV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c4f9c-cdb0-43d4-a75e-8c6c0f5814b3_1452x742.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!drQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c4f9c-cdb0-43d4-a75e-8c6c0f5814b3_1452x742.png" width="1452" height="742" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p>Compared with running without Metal FlashAttention, image generation is 43&#8211;120% faster, roughly halving latencies in many cases. This speedup holds across a wide range of Stable Diffusion architectures, several device families (iPhone 12 and above, M1 and above), and every image resolution.</p><p>Compared with the CoreML implementations of Stable Diffusion v1.x and v2.x on the Apple Neural Engine, GPU-accelerated Metal FlashAttention trails ANE performance slightly on M1 / M2 base models (~12.8s with ANE vs. ~15.2s with MFA, at 25 steps, 512x512) while outperforming it on M1 Pro / M2 Pro and above.
With the v2.0 model on the A16 chip (Stable Diffusion v1.x with CoreML cannot run on iPhones without quantization), MFA trails ANE performance by ~22% (~26.3s with ANE vs. ~34s with MFA, at 25 steps, 512x512).</p><p>Compared with the CoreML implementations of Stable Diffusion v1.x, v2.x, and XL on the GPU, Metal FlashAttention wins by a wider margin on M1 Pro / M2 Pro and above (usually around 20% to 40% faster than CoreML GPU in the <code>ORIGINAL</code> configuration). On these devices, the CoreML GPU implementation outperforms the ANE implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0OVK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffb3587-4e15-44bf-ae4a-d0f9d8aa40c5_1200x742.png"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!0OVK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffb3587-4e15-44bf-ae4a-d0f9d8aa40c5_1200x742.png" width="1200" height="742" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p>Metal FlashAttention is the kind of optimization that raises the tide for us all. Unlike our CoreML integration, Metal FlashAttention improves performance at every image generation resolution and with any prompt length: whether you use the standard 77-token prompt or prompts thousands of tokens long, every workflow benefits from its integration. Because it is integrated directly into our low-level framework, there is no first-time model-loading cost or first-time model-conversion cost of the kind associated with CoreML and other alternative runtimes.</p><blockquote><p>Above comparisons were done with the Swift CoreML Diffusers app from the macOS App Store (v1.1), with the exception of A16 performance (done with the CoreML implementation inside the Draw Things app).
SDXL comparison was done with the Diffusers app built from source (commit: 4eab4767) with Xcode 15 beta 5 on macOS Sonoma beta 5, Release configuration, with the FP16 SDXL Base CoreML model. Diffusers on iPad was built from source (commit: 4eab4767) with Xcode 14.3.1, Release configuration. Preview and Safety Checker were disabled. Both are measured as the minimum of the 2nd and 3rd runs, with Preload On in Draw Things (except on the A16 / iPhone 14 Pro, which has too little RAM to preload the models; the iPad Pro M2 is an 8GiB configuration, so Preload is automatically off for SDXL). All measurements are done with 25 steps.</p><p>Above comparisons were done with the following hardware: Mac M2 refers to a Mac Mini M2 with 16GiB RAM and 10 GPU cores. M1 Pro refers to a MacBook Pro M1 Pro with 16GiB RAM and 14 GPU cores. M1 Max refers to a Mac Studio M1 Max with 32GiB RAM and 24 GPU cores. M2 Ultra refers to a Mac Studio M2 Ultra with 192GiB RAM and 76 GPU cores.</p></blockquote><h3>Sprinting Forward</h3><p>With Metal FlashAttention integrated into the Draw Things app and our community enjoying faster image generation times, we&#8217;re eager to see Metal FlashAttention integrated into many other applications and frameworks, powering local inference for everything from image generation models to large language models in the Apple ecosystem.</p><p>The Metal FlashAttention project, authored by Philip Turner and sponsored by Draw Things, is open-source and can be found at <a href="https://github.com/philipturner/metal-flash-attention">https://github.com/philipturner/metal-flash-attention</a> under the MIT license.</p>]]></content:encoded></item></channel></rss>