Description
Hello everyone!
I'm currently enhancing the GGML implementation of an LSTM network.
My main focus is avoiding scalability issues with the computational graph.
Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38
This is because the LSTM is unrolled over time, so the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36 .
I wanted a quick fix in order to get a first POC of bark.cpp working. Now that we want to clean things up, I'm wondering what the best solution is.
I was wondering whether we should create one ggml context and computational graph per time step to avoid these scalability issues. Each subgraph would run the forward pass for a single time step and produce the output of one cell (see the sketch below).
This feels hacky and quite costly, though, given the overhead of rebuilding the graph at every step, copying tensors from one context to another, etc.
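To make the idea concrete, here is a rough sketch of the per-time-step approach. It assumes a hypothetical `lstm_cell_forward()` helper that builds the ops for a single LSTM cell (not something that exists in the repo today), and the illustrative context size is arbitrary; the exact graph-building/compute calls (`ggml_build_forward`, `ggml_graph_compute_with_ctx`) also depend on the ggml version we pin:

```cpp
#include <cstring>
#include <vector>

#include "ggml.h"

// Hypothetical helper (for illustration only): builds the ops for one LSTM
// cell inside ctx, returns the new hidden state and writes the new cell
// state to *c_out.
struct ggml_tensor * lstm_cell_forward(
        struct ggml_context * ctx,
        struct ggml_tensor  * x_t,    // input at time t
        struct ggml_tensor  * h_in,   // previous hidden state
        struct ggml_tensor  * c_in,   // previous cell state
        struct ggml_tensor ** c_out);

void lstm_forward_per_step(const std::vector<float> & inputs,
                           int n_inp, int n_hid, int seq_len, int n_threads) {
    // Persistent host-side buffers for the recurrent state; they survive
    // across contexts so each per-step context can stay small.
    std::vector<float> h_state(n_hid, 0.0f);
    std::vector<float> c_state(n_hid, 0.0f);

    for (int t = 0; t < seq_len; ++t) {
        // Scratch context per time step: it only has to fit one cell's
        // worth of tensors, independent of seq_len (size is illustrative).
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16u*1024*1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * x_t  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_inp);
        struct ggml_tensor * h_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_hid);
        struct ggml_tensor * c_in = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_hid);

        memcpy(x_t->data,  inputs.data() + (size_t) t*n_inp, n_inp*sizeof(float));
        memcpy(h_in->data, h_state.data(), n_hid*sizeof(float));
        memcpy(c_in->data, c_state.data(), n_hid*sizeof(float));

        struct ggml_tensor * c_out = NULL;
        struct ggml_tensor * h_out = lstm_cell_forward(ctx, x_t, h_in, c_in, &c_out);

        // Build and run a graph that contains only this one cell.
        struct ggml_cgraph gf = ggml_build_forward(h_out);
        ggml_build_forward_expand(&gf, c_out);
        ggml_graph_compute_with_ctx(ctx, &gf, n_threads);

        // Copy the state back out before discarding the context.
        memcpy(h_state.data(), ggml_get_data_f32(h_out), n_hid*sizeof(float));
        memcpy(c_state.data(), ggml_get_data_f32(c_out), n_hid*sizeof(float));

        ggml_free(ctx);
    }
}
```

The upside would be that the per-step context and node count no longer depend on the sequence length, so GGML_MAX_NODES could go back to a sane default; the downside is exactly the per-step graph-construction and memcpy overhead mentioned above.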
What do you think would be the best solution?