
Commit 6b18f27

Bugfix: Offload of GGML-quantized model in torch.inference_mode() cm (#7525)
## Summary

This PR contains a bugfix for an edge case with model unloading (from VRAM to RAM). Thanks to @JPPhoto for finding it.

The bug was triggered under the following conditions:

- A GGML-quantized model is loaded in VRAM.
- We run a Spandrel image-to-image invocation (which is wrapped in a `torch.inference_mode()` context manager).
- The model cache attempts to unload the GGML-quantized model from VRAM to RAM.
- Doing this inside of the `torch.inference_mode()` cm results in the following error:

```
[2025-01-07 15:48:17,744]::[InvokeAI]::ERROR --> Error while invoking session 98a07259-0c03-4111-a8d8-107041cb86f9, invocation d8daa90b-7e4c-4fc4-807c-50ba9be1a4ed (spandrel_image_to_image): Cannot set version_counter for inference tensor
[2025-01-07 15:48:17,744]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/ryan/src/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 129, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/baseinvocation.py", line 300, in invoke_internal
    output = self.invoke(context)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/app/invocations/spandrel_image_to_image.py", line 167, in invoke
    with context.models.load(self.image_to_image_model) as spandrel_model:
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/load_base.py", line 60, in __enter__
    self._cache.lock(self._cache_record, None)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 224, in lock
    self._load_locked_model(cache_entry, working_mem_bytes)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 272, in _load_locked_model
    vram_bytes_freed = self._offload_unlocked_models(model_vram_needed, working_mem_bytes)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 458, in _offload_unlocked_models
    cache_entry_bytes_freed = self._move_model_to_ram(cache_entry, vram_bytes_to_free)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/model_cache.py", line 330, in _move_model_to_ram
    return cache_entry.cached_model.partial_unload_from_vram(
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ryan/src/InvokeAI/invokeai/backend/model_manager/load/model_cache/cached_model/cached_model_with_partial_load.py", line 182, in partial_unload_from_vram
    cur_state_dict = self._model.state_dict()
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1939, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1936, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ryan/.pyenv/versions/3.10.14/envs/InvokeAI_3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1843, in _save_to_state_dict
    destination[prefix + name] = param if keep_vars else param.detach()
RuntimeError: Cannot set version_counter for inference tensor
```
### Explanation

From the `torch.inference_mode()` docs:

> Code run under this mode gets better performance by disabling view tracking and version counter bumps.

Disabling version counter bumps results in the aforementioned error when saving `GGMLTensor`s to a state_dict. This incompatibility between `GGMLTensor`s and `torch.inference_mode()` is likely caused by the custom tensor type implementation. There may very well be a way to get these to cooperate, but for now it is much simpler to remove the `torch.inference_mode()` contexts. (A minimal sketch of the inference-tensor behavior follows the checklist below.)

Note that there are several other uses of `torch.inference_mode()` in the Invoke codebase, but they are all tight wrappers around the inference forward pass and do not contain the model load/unload process.

## Related Issues / Discussions

Original discussion: https://discord.com/channels/1020123559063990373/1149506274971631688/1326180753159094303

## QA Instructions

Find a sequence of operations that triggers the condition. For me, this was:

- Reserve VRAM in a separate process so that there was ~12GB left.
- Do a fresh start of Invoke.
- Run FLUX inference with a GGML 8K model.
- Run Spandrel upscaling.

Tests:

- [x] Confirmed that I can reproduce the error and that it is no longer hit after the change.
- [x] Confirmed that there is no speed regression from switching from `torch.inference_mode()` to `torch.no_grad()`. Before: `50.354s`; after: `51.536s`.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
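As a standalone illustration of the behavior described in the Explanation above, here is a minimal sketch using plain tensors (no GGML or Invoke code involved): tensors created under `torch.inference_mode()` are permanently marked as inference tensors with no version counter, whereas `torch.no_grad()` merely disables gradient tracking.

```python
import torch

x = torch.ones(3)

with torch.inference_mode():
    y = x * 2  # y is an "inference tensor": view tracking and version counting are disabled

with torch.no_grad():
    z = x * 2  # z is an ordinary tensor with requires_grad=False

print(y.is_inference())  # True
print(z.is_inference())  # False

# Operations that would need to bump the version counter fail on inference
# tensors once we are outside the inference_mode context:
try:
    y.add_(1)
except RuntimeError as err:
    print(err)  # e.g. "Inplace update to inference tensor outside InferenceMode is not allowed..."
```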
2 parents 67e948b + 85eb4f0 commit 6b18f27

File tree

1 file changed: +2 additions, -2 deletions

invokeai/app/invocations/spandrel_image_to_image.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -151,7 +151,7 @@ def upscale_image(

         return pil_image

-    @torch.inference_mode()
+    @torch.no_grad()
     def invoke(self, context: InvocationContext) -> ImageOutput:
         # Images are converted to RGB, because most models don't support an alpha channel. In the future, we may want to
         # revisit this.
@@ -197,7 +197,7 @@ class SpandrelImageToImageAutoscaleInvocation(SpandrelImageToImageInvocation):
         description="If true, the output image will be resized to the nearest multiple of 8 in both dimensions.",
     )

-    @torch.inference_mode()
+    @torch.no_grad()
     def invoke(self, context: InvocationContext) -> ImageOutput:
         # Images are converted to RGB, because most models don't support an alpha channel. In the future, we may want to
         # revisit this.
```
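As a rough sketch of why this swap is safe for inference (hypothetical stand-in code, using a plain `nn.Linear` rather than a Spandrel model): `torch.no_grad()` used as a decorator still disables gradient tracking for the forward pass, while leaving model-management operations such as `state_dict()` usable inside it.

```python
import torch

model = torch.nn.Linear(4, 4)

@torch.no_grad()  # same decorator form as the patched invoke() methods
def run(m: torch.nn.Module) -> torch.Tensor:
    out = m(torch.randn(1, 4))  # gradient tracking is disabled here...
    _ = m.state_dict()          # ...and state_dict() (which calls param.detach()) still works
    return out

print(run(model).requires_grad)  # False: no autograd graph was built
```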
