
[Feature Request] Efficient model loading #15080

Open
@elephantpanda

Description


Describe the feature request

If you currently load a model of, say, 5 GB, it will first be read into RAM (5 GB), then some sort of duplication happens, consuming another 5 GB of RAM and spiking at 10 GB. The 5 GB is then transferred to the GPU and the 10 GB is freed from RAM. (I am using C# with DirectML.)
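For reference, a minimal C# sketch of the load path described above, assuming the Microsoft.ML.OnnxRuntime DirectML package and a placeholder `model.onnx` path; the peak working set is read via System.Diagnostics so the spike can be measured:

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;

class LoadRepro
{
    static void Main()
    {
        var proc = Process.GetCurrentProcess();

        // Configure DirectML as the execution provider (device 0).
        using var options = new SessionOptions();
        options.AppendExecutionProvider_DML(0);

        // Loading a ~5 GB model here briefly drives the working set
        // to roughly double the model size before it settles.
        using var session = new InferenceSession("model.onnx", options);

        proc.Refresh();
        Console.WriteLine(
            $"Peak working set: {proc.PeakWorkingSet64 / (1024.0 * 1024 * 1024):F1} GB");
    }
}
```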

This is extremely wasteful and unnecessary, as the brief spike shows (notice the 'church spire'):

[Screenshot: task manager RAM graph with a brief spike to roughly twice the model size during loading]

It means you need double the RAM that should actually be required to run certain models.

I'm sure this can easily be overcome by loading the model into RAM piecemeal, instead of inefficiently loading the whole model into RAM at once, doing some wasteful duplication, and then deleting the entire thing.
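For illustration only, here is a sketch of what piecemeal loading could look like in C# (this is not ONNX Runtime's actual load path): memory-map the file and copy it through a small fixed-size staging buffer, so resident memory stays bounded by the chunk size rather than the whole model. `CopyChunkToDevice` and the `model.onnx` path are hypothetical stand-ins:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class PiecemealLoad
{
    const int ChunkSize = 64 * 1024 * 1024; // 64 MB staging buffer

    static void Main()
    {
        string path = "model.onnx"; // placeholder path
        long fileSize = new FileInfo(path).Length;
        var buffer = new byte[ChunkSize];

        // Map the file instead of reading all 5 GB into RAM at once;
        // only the pages currently being copied need to be resident.
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var stream = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read);

        long remaining = fileSize;
        while (remaining > 0)
        {
            int read = stream.Read(buffer, 0, (int)Math.Min(ChunkSize, remaining));
            if (read <= 0) break;
            CopyChunkToDevice(buffer, read); // hypothetical GPU upload
            remaining -= read;
        }
    }

    // Stand-in for whatever uploads a chunk into device (VRAM) memory.
    static void CopyChunkToDevice(byte[] chunk, int length)
    {
        // Real code would hand the chunk to the GPU here.
    }
}
```

With a 64 MB staging buffer, the peak extra RAM during load would be on the order of the buffer size instead of a second full copy of the model.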

Alternatively, some of that work could be shifted to VRAM.

Either way, this RAM spike is just a symptom of very inefficient model loading. I'm sure it could be avoided through clever optimisation: promptly freeing buffers that are no longer needed and loading the model sequentially.

Describe scenario use case

To load large models without having to buy 2x the RAM you should actually require. (Remember that the average amount of RAM in a typical user's PC is 8 GB, or even 4 GB.)

Metadata


Labels

feature request (request for unsupported feature or enhancement)
