Scalability with high cardinality of subscriptions and published message topics #710
Comments
First off, thank you for such a detailed report, much appreciated! Can you tell me a bit about what version you were using and the hardware setup? Size of machine, memory, CPU(s), etc.? Thanks, and I will dig in a bit on this.
Could you email me the pprof trees so I can have a hi-res version? [email protected]. Thanks!
It would still be good to know what version of Go and what version of NATS we are talking about. Hardware specs can be helpful as well. Thanks.
go 1.10.3
Thanks again for the info. Do you have a test that can reproduce this? Or could you describe to me (you may have already; if so, apologies) how to reproduce it so I could write a test? Thanks.
I see the description above, so no need to repeat it; apologies for the request. If you would be willing to share your test program, that would be great.
What is a good example of a subject in your system? Looking for number of tokens, total length, etc.
Sorry, I was off. My example program uses
That would be great. Thanks!
There you go: https://github.com/znly/natsworkout
Thanks, will take a look.
Been looking at making the cache more efficient and avoiding the contention issues you are seeing. We could just add a flag to disable it, but I would prefer a better solution. I think I am on the right track.
I believe this is addressed with the merge of #726. Feel free to reopen if the issue comes back up with the next release.
We use NATS primarily as a message routing system with a very high cardinality of published message topics and subscriptions. Most published topics won't match any subscription, and our subscription/unsubscription rate is pretty high (8-11K/s inserts and the same number of removes in the sublist), for a total of 200-300k subscriptions. Our message rate, though, is pretty modest compared to what NATS can handle (8K/s).
Over the last few weeks we had a series of production issues that were related to NATS. The symptoms were always the same: after a peak in the subscription rate the server becomes unstable, refuses new clients, and the message rate drops. Slow consumers are detected, clients are kicked and reconnect, sending all their subscriptions back...
On my local machine I had a hard time reproducing it: with the production message and subscription rates the NATS server handled the load, and the subscription insert/remove rate had to be much higher than in production to show the issue. Long story short, I finally found that publishing to a high cardinality of topics reproduced the issue we had in production, and I was able to profile the server. Looking at contention profiles, I saw that contention on the sublist lock was leading the profile, which was not the case previously. The end-to-end publisher-to-subscriber latency starts to grow almost immediately.
Low cardinality (contention profile):
High cardinality (contention profile):
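For reference, here is a minimal sketch of the kind of workload that reproduces this, using the Go client. The subject shape, cardinality, and rates below are placeholders I picked for illustration; the actual test program is the one linked above (znly/natsworkout). Publishers hit mostly-unique subjects that match nothing, while other goroutines churn subscriptions.

```go
// Minimal load-generator sketch (assumed subject shape and rates, not the
// original natsworkout program). Publishes to a large space of unique
// subjects while subscriptions are continuously added and removed, which is
// the pattern that hammers the sublist under high cardinality.
package main

import (
	"fmt"
	"log"
	"math/rand"
	"time"

	nats "github.com/nats-io/go-nats"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Subscription churn: subscribe to a fresh subject, hold it briefly,
	// then unsubscribe (insert + remove in the sublist on every iteration).
	go func() {
		for {
			subj := fmt.Sprintf("evt.%d.state", rand.Intn(1000000))
			sub, err := nc.Subscribe(subj, func(m *nats.Msg) {})
			if err != nil {
				continue
			}
			time.Sleep(10 * time.Millisecond)
			sub.Unsubscribe()
		}
	}()

	// Publish to mostly-unique subjects; most match nothing, but each one
	// still goes through the sublist (and its match cache).
	for {
		subj := fmt.Sprintf("evt.%d.state", rand.Intn(1000000))
		nc.Publish(subj, []byte("payload"))
	}
}
```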
After playing with the sublist I found that removing the sublist cache hash map completely solves my issue. Since the cache has a limited size of 1024 entries, it is useless with a high cardinality of message topics, and grabbing the write lock to maintain the cache creates too much contention for the server to handle the load.
https://github.com/nats-io/gnatsd/blob/master/server/sublist.go#L265-L277
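To make the contention concrete, here is an illustrative sketch of the pattern in that code path; the types and eviction logic are simplified stand-ins, not the actual gnatsd implementation. A cache hit only needs the read lock, but every miss takes the write lock to populate the bounded cache, and with far more unique subjects than cache slots nearly every publish takes the miss path.

```go
// Illustrative sketch of a bounded match cache guarded by one RWMutex
// (simplified; not the actual gnatsd code).
package sublist

import "sync"

const cacheMax = 1024 // same bound as the server's sublist cache

type result struct{ subs []string } // stand-in for the real match result

type Sublist struct {
	slMu  sync.RWMutex
	cache map[string]*result
}

func NewSublist() *Sublist {
	return &Sublist{cache: make(map[string]*result)}
}

func (s *Sublist) Match(subject string) *result {
	// Fast path: cache hit under the read lock only.
	s.slMu.RLock()
	r, ok := s.cache[subject]
	s.slMu.RUnlock()
	if ok {
		return r
	}

	// Miss: take the write lock to compute the result and insert it into
	// the cache. With high-cardinality subjects this branch is taken on
	// almost every publish, so publishers serialize on slMu.
	s.slMu.Lock()
	defer s.slMu.Unlock()
	r = &result{ /* ... walk the subscription trie ... */ }
	if len(s.cache) >= cacheMax {
		// Evict an arbitrary entry to stay within the bound (details elided).
		for k := range s.cache {
			delete(s.cache, k)
			break
		}
	}
	s.cache[subject] = r
	return r
}
```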
With the cache removed the server has no problems and scales to a subscription rate more than 10x higher.
I believe this cache was added to increase throughput, but it hurts us in our use case.
We are now running NATS with the cache completely removed; we no longer have issues, and CPU usage dropped by about 30%.
I don't want to maintain a fork, though, so I'd like to submit a patch adding a command line flag that allows disabling the cache. This would support both the throughput-oriented use case and the high-cardinality-topics / high-subscription-rate use case. Unless you have a better idea?
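As a rough sketch of what such a patch could look like; the NoSublistCache option and the -no_sublist_cache flag are hypothetical names I'm using for illustration, not existing gnatsd options:

```go
// Hypothetical sketch of the proposed opt-out; option and flag names are
// illustrative only.
package server

import "flag"

// Options holds server configuration; only the hypothetical field is shown.
type Options struct {
	// ...existing options elided...
	NoSublistCache bool // if true, the sublist keeps no match-result cache
}

// parseFlags wires up the hypothetical flag.
func parseFlags(opts *Options, args []string) error {
	fs := flag.NewFlagSet("gnatsd", flag.ContinueOnError)
	fs.BoolVar(&opts.NoSublistCache, "no_sublist_cache", false,
		"Disable the sublist match cache (for high subject cardinality workloads)")
	return fs.Parse(args)
}
```

With that option set the sublist would simply never build its cache map, so Match would only ever take the read lock, which is effectively what we run today with the cache stripped out.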