ggml : become thread-safe

ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602

We should be able to run inference on multiple graphs, backends and devices in parallel.
Currently, singletons in the CUDA backend break this requirement, and there may be other problems elsewhere.
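
To illustrate the goal, here is a minimal sketch of per-context state replacing a process-wide singleton. It is not ggml's actual API: `backend_ctx`, `compute_graph` and `run_parallel` are hypothetical names, and the counter is only a stand-in for whatever the backend keeps globally today (pools, handles, streams).

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-context backend state (assumption, not ggml's real struct).
// Everything a singleton would hold lives here instead, one copy per context.
struct backend_ctx {
    int  device       = 0; // device this context is bound to
    long scratch_used = 0; // stand-in for pools, handles, streams
};

// Stand-in for graph evaluation: it only touches the context it is given,
// so two calls on different contexts never share mutable state.
static void compute_graph(backend_ctx & ctx, int n_nodes) {
    for (int i = 0; i < n_nodes; ++i) {
        ctx.scratch_used += 1;
    }
}

// Evaluate one graph per thread, each thread with its own context.
static std::vector<backend_ctx> run_parallel(int n_threads, int n_nodes) {
    assert(n_threads >= 0);
    std::vector<backend_ctx> ctxs(n_threads); // sized up front, never resized
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int t = 0; t < n_threads; ++t) {
        ctxs[t].device = t;
        workers.emplace_back(compute_graph, std::ref(ctxs[t]), n_nodes);
    }
    for (auto & w : workers) {
        w.join();
    }
    return ctxs;
}
```

If `scratch_used` were a single global instead, the threads would race on it and would need a lock that serializes the graphs. Keeping the state in the context avoids both.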