Skip to content

Fixed deadlock in producer.send_async #87

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 1, 2023

Conversation

merlimat
Copy link
Contributor

Fix #84

Release the GIL while calling producer.sendAsync() to avoid a deadlock when PyBind is triggering the Python callback.

  • Main Thread

    1. Holds the Python GIL
    2. Call producer.send_async()
    3. Tries to acquire internal ClientConnetion lock
  • Pulsar client internal thread

    1. Holds lock on ClientConnection
    2. Receives ack from the broker
    3. Triggers callback
    4. PyBind11 acquires GIL <---- Deadlock

The problem is the different behavior in PyBind from Boost::Python.

We always need to make sure we release the GIL before making any call to C++ that potentially acquires any mutexes

@merlimat merlimat added the bug Something isn't working label Jan 31, 2023
@merlimat merlimat added this to the 3.1.0 milestone Jan 31, 2023
@BewareMyPower BewareMyPower merged commit cff12ea into apache:main Feb 1, 2023
@merlimat merlimat deleted the fix-send-async-deadlock branch February 1, 2023 02:07
producer.sendAsync(msg, callback);
Py_END_ALLOW_THREADS

if (PyErr_CheckSignals() == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if an exception was already set in the interpreter global before the callback fired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the main thread, irrespective of the callback.

send_async() is a potentially blocking operation (when block_if_queue_full=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Something I've seen before is issues around an error already being "pre-raise" in a Python thread when some external code (the pulsar client C++ in this case) decides to invoke an unrelated Python callback. Does pybind handle situations like this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All set per conversation on #84

@@ -34,6 +34,16 @@ MessageId Producer_send(Producer& producer, const Message& message) {
return messageId;
}

void Producer_sendAsync(Producer& producer, const Message& msg, SendCallback callback) {
Py_BEGIN_ALLOW_THREADS
producer.sendAsync(msg, callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is something inside this line handling getting the GIL around callback? Since callback is python code, running it outside of the GIL can break all sorts of stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zbentley PyBind is already acquiring the GIL before it enters the Python callback code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All set per conversation on #84

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock in concurrent send_async calls w/ pulsar-client-3.0
3 participants