Description
Hi! I was attempting to see whether llama.cpp could be supported in LLMLingua (prompt compression) via llama-cpp-python, but LLMLingua requires per-token attention masks, which llama-cpp-python does not currently expose. Attention masks are supported in transformers, and adding them would seem to enable more projects to work with llama.cpp.
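To make the gap concrete, here is a minimal sketch (gpt2 is used here purely for illustration; LLMLingua loads its own scoring model): the transformers forward call accepts a per-token `attention_mask`, while llama-cpp-python's `Llama.eval()` takes only a flat token list, with no mask parameter to forward.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# transformers path: a per-token attention mask is a first-class argument,
# which is what LLMLingua relies on when scoring tokens for compression.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(
    ["short prompt", "a somewhat longer prompt"],
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # padded positions are ignored
    )
logits = out.logits

# llama-cpp-python path: Llama.eval() accepts only a flat token sequence,
# so there is no equivalent mask to pass today.
# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf")
# llm.eval(llm.tokenize(b"short prompt"))
```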
I think this might be worth pursuing in order to use LLMLingua in downstream projects, since running prompt compression on CPU or with only partial GPU offload is quite slow, and that cost adds up for longer passages. Additionally, perhaps implementing LLMLingua's methods directly in llama.cpp is worth considering?