Description
Describe the feature request
If you currently load a model of, say, 5 GB, it is first read into RAM (5 GB), then some sort of duplication takes place, consuming another 5 GB and spiking total RAM usage at 10 GB. The 5 GB is then transferred to the GPU and the full 10 GB is released from RAM. (I am using C# and DirectML.)
This is extremely wasteful and unnecessary. The short spike (notice the 'church spire' in the attached RAM-usage graph) means you need double the RAM actually required to run certain models.
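For illustration, here is a minimal sketch of how the doubling can arise, assuming the ONNX Runtime C# API (Microsoft.ML.OnnxRuntime), which this report appears to target; the file name is a placeholder:

```csharp
using System.IO;
using Microsoft.ML.OnnxRuntime;

// Copy #1: the entire 5 GB model lands in a managed byte[].
byte[] modelBytes = File.ReadAllBytes("model.onnx");

// Copy #2: the runtime parses the bytes into its own native structures,
// so two full copies coexist (~10 GB) until the weights reach the GPU
// and both CPU-side buffers are released.
using var session = new InferenceSession(modelBytes);
```

Loading by path (`new InferenceSession("model.onnx")`) can show a similar transient, since the file contents and the parsed graph are both held in RAM during session construction.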
I'm sure this can be overcome by loading the model into RAM piecemeal instead of inefficiently loading the whole model at once, performing a wasteful duplication, and then deleting the entire thing; see the sketch below.
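A minimal sketch of what piecemeal loading could look like, assuming a pre-allocated GPU destination. `UploadChunkToGpu` is a hypothetical placeholder for a DirectML/D3D12 upload-heap copy, and the 64 MB chunk size and file name are arbitrary:

```csharp
using System.IO;

const int ChunkSize = 64 * 1024 * 1024; // 64 MB staging buffer

using var stream = File.OpenRead("model.bin");
var chunk = new byte[ChunkSize];
long offset = 0;
int read;
while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
{
    UploadChunkToGpu(chunk, read, offset); // hypothetical GPU copy
    offset += read;
}

// Hypothetical stand-in: a real implementation would map an upload-heap
// region and copy 'count' bytes into the GPU resource at 'destOffset'.
static void UploadChunkToGpu(byte[] buffer, int count, long destOffset) { }
```

With this pattern, peak CPU memory is bounded by one chunk rather than by 2x the model size.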
Alternatively, some of that work could be shifted to VRAM.
Either way, this spike in RAM is just a symptom of very inefficient model loading.
In short, model loading could be made efficient enough to avoid this RAM spike, for example by promptly freeing buffers that are no longer needed and by loading the model sequentially.
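One such trick (a sketch of a possibility, not how the runtime currently works) is to memory-map the model file, so the OS pages weights in on demand and can evict clean pages as soon as they have been consumed, bounding resident RAM without ever holding an explicit second copy:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

using var mmf = MemoryMappedFile.CreateFromFile(
    "model.bin", FileMode.Open, mapName: null, capacity: 0,
    MemoryMappedFileAccess.Read);

// A length of 0 maps the whole file; the view can then be consumed
// chunk-by-chunk exactly like the FileStream sketch above.
using var view = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read);
```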
Describe scenario use case
To load large models without having to buy twice the RAM you should actually need. (Remember that the average amount of RAM in a typical user's PC is 8 GB, or even 4 GB.)