Skip to content

Don't execute the proxy queue while adding to it from same thread. #24565

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

brendandahl
Copy link
Collaborator

test_pthread_cancel was intermittently failing with a deadlock. From the backtrace, I found we were adding to the queue which can sometimes call malloc->emscripten_yield->emscripten_proxy_execute_queue which also tries to lock the same queue on the same thread.

@brendandahl
Copy link
Collaborator Author

Still working on a more reliable test, but I wanted to try this out...

@brendandahl brendandahl marked this pull request as draft June 13, 2025 21:32
// lock if the current thread is not adding to the queue. We then hope the
// queue is executed later when it is unlocked.
// XXX: This could leave to starvation if we never process the queue.
if (q->active_thread == pthread_self()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if (q->active_thread && q->active_thread == pthread_self()) { to avoid calling pthread_self at all in the non-contended case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could combine with the above executing_system_queue check?

At least we could use the same technique of using a TLS variable instead of calling pthread_self? I guess that only works if this issue only applies to the system queue (which I think it does).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least maybe wrap this whole thing in the is_system_queue check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a few different ways to do this and simplify it. I just wanted to try this out with minimal changes first to make sure it's going down the right path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to reusing the existing thread-local variable once this is confirmed to be the right direction. It can be hoisted to the top level for use from multiple functions if necessary.

test_pthread_cancel was intermittently failing with a deadlock. From the
backtrace, I found we were adding to the queue which can sometimes call
malloc->emscripten_yield->emscripten_proxy_execute_queue which
also tries to lock the same queue on the same thread.
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Hopefully we can also figure out some way to improve the testing ?

@@ -35,6 +35,8 @@ static em_proxying_queue system_proxying_queue = {
.capacity = 0,
};

static _Thread_local int system_queue_in_use = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use bool here along with true/false but maybe thats a bigger/unnecessary change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For whatever reason I decided to use int instead of bool when I wrote all this, so this is at least consistent with the rest of the file.

"a.html": 17199,
"a.html.gz": 7525,
"total": 17199,
"total_gz": 7525
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this is a codesize win!?

Is it perhaps move the TLS init code into the global scope? If so that is kind of crazy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange it looks like the code size test is still failing. Maybe this isn't actually a code size win. Will re-try again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Function-scope TLS has to do all sorts of implicitly thread-safe lazy initialization that global-scope TLS does not have to do.

This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (4) test expectation files were updated by
running the tests with `--rebaseline`:

```
test/other/codesize/test_codesize_minimal_pthreads.funcs updated
other/codesize/test_codesize_minimal_pthreads.size: 19396 => 19446 [+50 bytes / +0.26%]
test/other/codesize/test_codesize_minimal_pthreads_memgrowth.funcs updated
other/codesize/test_codesize_minimal_pthreads_memgrowth.size: 19397 => 19447 [+50 bytes / +0.26%]

Average change: +0.26% (+0.26% - +0.26%)
```
@@ -35,6 +35,8 @@ static em_proxying_queue system_proxying_queue = {
.capacity = 0,
};

static _Thread_local int system_queue_in_use = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For whatever reason I decided to use int instead of bool when I wrote all this, so this is at least consistent with the rest of the file.

Comment on lines +123 to +124
// potentially try to execute the queue during the add (from
// emscripten_yield). This will deadlock the thread, so only try to take the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// potentially try to execute the queue during the add (from
// emscripten_yield). This will deadlock the thread, so only try to take the
// potentially try to execute the queue during the add (from
// emscripten_yield when malloc takes a lock). This will deadlock the thread, so only try to take the

"a.html": 17199,
"a.html.gz": 7525,
"total": 17199,
"total_gz": 7525
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Function-scope TLS has to do all sorts of implicitly thread-safe lazy initialization that global-scope TLS does not have to do.

@@ -72,6 +72,7 @@ $emscripten_futex_wake
$emscripten_stack_get_current
$emscripten_stack_set_limits
$free_ctx
$get_or_add_tasks_for_thread
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what happened here.

@tlively
Copy link
Member

tlively commented Jun 16, 2025

Solution LGTM, but I forgot about the testing.

@brendandahl
Copy link
Collaborator Author

Changing to using a thread local causes failures in test_pthread_run_on_main_thread_minimal_runtime. I didn't get very far debugging this. I won't be able to work on this until June ~25th, so if anyone wants to pick it up feel free to.

sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 17, 2025
Also use `bool` rather than `int` to better convey intent.

Split out from emscripten-core#24565
sbc100 added a commit to sbc100/emscripten that referenced this pull request Jun 17, 2025
Also use `bool` rather than `int` to better convey intent.

Split out from emscripten-core#24565
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants