Description
We're seeing some odd behavior on OSX in rust-lang/rust right now. We're using sccache to compile LLVM, which executes a whole lot of tokio + mio on each run. We've also got a lot of spurious failures; one of the most notorious right now is "broken pipe", which arises during this spawn.
From here on out, a lot of this is guesswork, unfortunately. I've never been able to reproduce this personally, but here are my hunches. The main error happening here is that spawning a process somehow returns EPIPE. Now that's pretty odd, because after reviewing the syscalls that Command::spawn makes, it turns out none of them can return EPIPE! I would also find it quite odd if this were the first time anyone hit this error in process spawning; surely it would have happened before now.
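For reference, the "broken pipe" text in the failure is how Rust's std reports EPIPE. The following is a minimal sketch, assuming the raw errno value 32 for EPIPE (its value on macOS and Linux); the helper names `is_epipe` and `epipe_error` are hypothetical and only for illustration.

```rust
use std::io;

/// Raw errno for EPIPE on macOS and Linux.
/// Assumption: hard-coded here instead of being taken from the libc crate.
const EPIPE: i32 = 32;

/// Hypothetical helper: does this error carry EPIPE from the OS?
fn is_epipe(err: &io::Error) -> bool {
    err.raw_os_error() == Some(EPIPE)
}

/// Build the same kind of error a failing syscall would produce for EPIPE.
fn epipe_error() -> io::Error {
    io::Error::from_raw_os_error(EPIPE)
}
```

On Unix targets, `epipe_error().kind()` is `io::ErrorKind::BrokenPipe`, which is where the "broken pipe" message comes from.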
So digging more, the spawn here actually translates to CommandExt::spawn_async. Note that the function there can fail for more than one reason, not just Command::spawn. The register_* functions are also fallible. They can either fail in fcntl (which I don't think can return EPIPE) or they can fail in PollEvented::new, which internally bottoms out in registering an object.
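To illustrate why an error reported by "spawn" need not come from the spawn itself, here is a rough sketch of a function with the same shape: three fallible steps run in order, and the first failure is returned. This is not the actual spawn_async code. The `Step` labels and the closure parameters are hypothetical stand-ins, and tagging the error with its step is just one way to find out which step produced it.

```rust
use std::io;

/// Which step failed (hypothetical labels for illustration).
#[derive(Debug, PartialEq)]
enum Step {
    Spawn,
    Fcntl,
    Register,
}

/// Run the three fallible steps in order and report the first failure
/// together with the step that produced it.
fn spawn_then_register(
    spawn: impl FnOnce() -> io::Result<()>,
    set_nonblocking: impl FnOnce() -> io::Result<()>,
    register: impl FnOnce() -> io::Result<()>,
) -> Result<(), (Step, io::Error)> {
    spawn().map_err(|e| (Step::Spawn, e))?;
    set_nonblocking().map_err(|e| (Step::Fcntl, e))?;
    register().map_err(|e| (Step::Register, e))?;
    Ok(())
}
```

Without the step tag, a caller only sees an `io::Error` and cannot tell a failed spawn from a failed registration.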
So at this point my best suspicion is that register is returning EPIPE, although I have no idea why. One interesting observation is that we've only witnessed this on OSX, not on Linux or Windows. Trawling around the internet for information related to this, I stumbled across a commit in libevent which translates EPIPE from kevent to a writable notification (i.e. it doesn't punt it back to the user). That commit claims that if you register half of a broken pipe you'll get an EPIPE. Sure enough, it looks like xnu does return EPIPE on a broken pipe.
So that sounds great and all until it turns out I can't even reproduce this with a broken pipe. That program works successfully! (no errors)
So all in all there's something that's:
- OSX specific
- Likely related to tokio/mio, but not 100% certain
- Likely related to Poll::register, but not certain
- Likely related to broken pipes (as this only happens under high load)
Where this leaves me, I'm not entirely sure. I wanted to jot down thoughts and ideas to make sure they're written, but I'm curious what you think as well @carllerche. I could add some code to handle EPIPE and somehow translate it to writable/readable (as appropriate), but I have no way of testing such a change. We could deploy it to rust-lang/rust and pray it fixes things, but we wouldn't really know for a few weeks whether it actually worked.
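The EPIPE translation idea could look roughly like the sketch below. It is untested against the real failure and is not mio's API: `Registered` and `tolerate_epipe` are hypothetical names, the registration result is passed in as a plain `io::Result<()>`, and EPIPE is assumed to be errno 32 (its value on macOS and Linux).

```rust
use std::io;

/// Assumption: raw errno for EPIPE on macOS and Linux.
const EPIPE: i32 = 32;

/// Outcome of a registration attempt (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Registered {
    /// The kernel accepted the registration; wait for events as usual.
    Pending,
    /// Registration hit EPIPE; treat the pipe as ready right away so the
    /// next read or write reports the pipe's real state.
    ReadyNow,
}

/// Swallow EPIPE from a registration result and pass every other error through.
fn tolerate_epipe(result: io::Result<()>) -> io::Result<Registered> {
    match result {
        Ok(()) => Ok(Registered::Pending),
        Err(ref e) if e.raw_os_error() == Some(EPIPE) => Ok(Registered::ReadyNow),
        Err(e) => Err(e),
    }
}
```

Treating the pipe as immediately ready mirrors what the libevent commit is described as doing: the caller is not handed EPIPE directly and finds out about the broken pipe on its next I/O call.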