Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196
Comments
According to the API, is the timeout supposed to be given in milliseconds or seconds? If milliseconds, the error is at gearmand/libgearman-server/connection.cc, line 720 in 7d6f757, where the value in milliseconds is improperly converted to a struct timeval. It needs to do something like this instead:

    int milliseconds = worker->timeout;
    struct timeval timeout_tv;
    timeout_tv.tv_sec = milliseconds / 1000;
    timeout_tv.tv_usec = (milliseconds % 1000) * 1000;

Or, if we go with a floating-point number of seconds, something like this:

    #include <math.h>
    // ...
    double whole_seconds;
    double fraction = modf(worker->timeout, &whole_seconds);
    timeout_tv.tv_sec = (time_t)whole_seconds;
    timeout_tv.tv_usec = (suseconds_t)(fraction * 1000000.0 + 0.5);
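For anyone who wants to check the arithmetic in isolation, here is a minimal sketch of the two conversions proposed above, wrapped as standalone functions. The function names are hypothetical and not part of gearmand; the sketch assumes non-negative input and splits the fractional value by truncation rather than with modf, carrying into tv_sec when rounding reaches a full second.

```c
#include <sys/time.h>

/* Hypothetical helper: convert whole milliseconds to a struct timeval. */
static struct timeval timeval_from_ms(long milliseconds)
{
    struct timeval tv;
    tv.tv_sec = milliseconds / 1000;            /* whole seconds */
    tv.tv_usec = (milliseconds % 1000) * 1000;  /* remainder, in microseconds */
    return tv;
}

/* Hypothetical helper: convert non-negative fractional seconds to a
 * struct timeval, rounding to the nearest microsecond. */
static struct timeval timeval_from_seconds(double seconds)
{
    struct timeval tv;
    time_t whole = (time_t)seconds;                  /* truncates toward zero */
    double fraction = seconds - (double)whole;
    long usec = (long)(fraction * 1000000.0 + 0.5);  /* round to nearest */

    if (usec >= 1000000) {   /* rounding reached a full second: carry it */
        whole += 1;
        usec -= 1000000;
    }
    tv.tv_sec = whole;
    tv.tv_usec = (suseconds_t)usec;
    return tv;
}
```

With these, 1500 ms becomes 1 s plus 500000 µs, and 0.1 s becomes 0 s plus 100000 µs.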
@esabol AFAIK the protocol, even back in the original Danga days, never stated what unit the timeout should be in. Perl gearmand uses Danga::Socket for time tracking and "fractional seconds" are acceptable input. A while back I tried to reach out to Brad (not @bradfitz - the other one) because IIRC @dormando thought he had been the one to incorporate

Personally, I'm 👍 for milliseconds. I see no reason to prefer a larger unit.

There are BC considerations here too, because some bindings may pass seconds while others pass milliseconds. I don't think it's a huge deal because the functionality is broken anyway, but thought should certainly be given to how to communicate the potential BC break for some bindings. Example: the PHP binding defines the input as seconds. I've spoken to the maintainer (@wcgallego), and his preference (a little thread here) is milliseconds as well; he would proceed with updating that particular binding when the underlying fix occurs in the C daemon.

Additional community consensus would be helpful too.
Milliseconds makes sense for a connection timeout, but not for a job timeout, IMHO. I would favor seconds. @p-alik, does Perl Gearman::Worker use seconds? |
@SpamapS, @p-alik, @BrianAker: Any thoughts on this? |
Given that the timeout is transmitted in ASCII (not as some big endian uint32 or uint64) in the protocol, and because of the original Danga implementation, my vote is to define it as fractional seconds. If you want milliseconds, you can write 100ms as |
I note the following:
So it seems like @BrianAker's intention was for the timeout to be in milliseconds?

Changing the timeout to be a double (in order to support decimal seconds, as implemented by @bradfitz in the Danga server) throughout the code would be a big change. I think the better solution is to parse it as seconds (into a double) and then convert that double value to a long in milliseconds. That would entail fewer changes to the code. Or we could keep it as milliseconds and just fix the bug in the conversion of milliseconds to struct timeval.

Either way, fixing it is easy, but deciding what to do is difficult. Anyone want to weigh in? @p-alik?
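To make the "parse as seconds, store as milliseconds" option concrete, here is a rough sketch of what such a parser could look like. The function name, the -1 error convention, and the upper bound are all assumptions made for illustration; none of them come from gearmand.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical upper bound (in seconds) so the result always fits in a long. */
#define MAX_TIMEOUT_SECONDS 2000000.0

/* Parse an ASCII timeout given in (possibly fractional) seconds and return
 * it as whole milliseconds, rounded to nearest. Returns -1 on bad input. */
static long timeout_ms_from_seconds_text(const char *text)
{
    char *end = NULL;
    double seconds;

    errno = 0;
    seconds = strtod(text, &end);

    /* Reject empty input, trailing garbage, and out-of-range values. */
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;

    /* Written this way so that NaN is rejected as well. */
    if (!(seconds >= 0.0 && seconds <= MAX_TIMEOUT_SECONDS))
        return -1;

    return (long)(seconds * 1000.0 + 0.5);
}
```

Under this scheme a worker that sends "0.1" gets 100 ms, and one that sends "300" gets 300000 ms, so only the parsing step needs to know about doubles.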
… struct timeval - untabify and update PROTOCOL
… struct timeval - minor tweaks
Almost a year later(!), I just submitted PR #250 which assumes the timeout is specified in milliseconds and fixes the conversion of milliseconds to a timeval. It seems the simplest course of action to take to fix this bug. |
Address Issue #196 by fixing the conversion of milliseconds to timeval
Unless you want to wait until the next version of gearmand is released, this issue can be closed. |
We had a working gearman setup with 1.1.18 running.
Then the version was increased to 1.1.19 because the system got patched by an admin. Afterwards most of the workers still ran successfully, but ... if a worker's job took longer than 1 second, the client read "work_failed" (14) from the socket. After days of debugging we found this thread.
I hope this response helps others who are affected find this thread via Google. |
I will take a look at the code later today but if we did that it is a
regression and we need to revert and release a fix. Thanks for reporting,
sorry for the trouble.
…On Wed, Dec 1, 2021, 6:00 AM Stefan Bek ***@***.***> wrote:
We had a working gearman setup with 1.1.18 running.
We use PHP https://github.com/brianlmoon/GearmanManager + /net_gearman
The configuration .ini inherited this line:
; Timeout n seconds for all jobs before work is reissued to another worker
timeout = 300
Then the version was increased to 1.1.19 because the system got patched by
an admin. Afterwards most of the workers still ran successfully, but ...
if a worker's job took longer than 1 second, the client read
"work_failed" (14) from the socket.
The worker, however, continued its work and completed it successfully. But
by the time of a successful result, the client was no longer listening.
Each job looked like a failure.
After days of debugging we found this thread.
You changed the "seconds" to milliseconds. So the actual fix was:
; Timeout n milliseconds for all jobs before work is reissued to another worker
timeout = 300000
I hope this response helps others who are affected find this thread via Google.
The units for the timeout value were never documented, so it can't be a regression, @SpamapS. The closest thing to documentation was an example, and the example program's options indicated milliseconds. The change was intentional and made after a long discussion period seeking input from major stakeholders as to what the units should be, and the pull request was approved by both you and p-alik. |
My memory of the last 2 years is challenged by multiple work/life balance
traumas. I'll take your word for it. Ugh.
…On Wed, Dec 1, 2021, 9:21 AM Ed Sabol ***@***.***> wrote:
The units for the timeout value were never documented or specified, so it
can't be a regression, @SpamapS <https://github.com/SpamapS>. The change
was intentional and made after a long discussion period seeking input and
was approved by both you and p-alik.
While it should not be a regression, I would have expected a minor point release. Seems like this kind of change should have gone into 1.2.x at the least (if not 2.x) and not in the 1.1.x release series. It is BC breaking. Clients will now have to check the server version before knowing the correct timeout value to send. That is a protocol change in this version of the server IMO. |
In gearmand version 1.1.19 and greater, the timeout is expected to be in milliseconds. Before that version, it is expected to be in seconds. gearman/gearmand#196
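For a client that has to talk to servers on both sides of that boundary, a minimal sketch of the version check might look like the following. It assumes the client already has the server's version as a "major.minor.patch" string (how it obtains that string is not shown), and both function names are hypothetical.

```c
#include <stdio.h>

/* Returns 1 if the version is 1.1.19 or newer, 0 if it is older,
 * and -1 if the string is not in "major.minor.patch" form. */
static int server_expects_milliseconds(const char *version)
{
    int major, minor, patch;

    if (sscanf(version, "%d.%d.%d", &major, &minor, &patch) != 3)
        return -1;
    if (major != 1)
        return major > 1;
    if (minor != 1)
        return minor > 1;
    return patch >= 19;
}

/* Scale a timeout configured in seconds to the unit the server expects.
 * If the version cannot be parsed, the value is passed through unchanged. */
static long timeout_for_server(long seconds, const char *version)
{
    if (server_expects_milliseconds(version) == 1)
        return seconds * 1000;
    return seconds;
}
```

A configuration value of 300 would then be sent as 300000 to a 1.1.19 server and as 300 to a 1.1.18 server.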
@brianlmoon: If you read the original issue at the top of this page, timeouts didn't work at all prior to this change. This was a bug fix. Workers always waited at least 1000 seconds before timing out because the code that set the timeout was wrong, regardless of whether the value as provided was considered to be seconds or milliseconds. In addition to @bmeynell who reported the issue, I tested it myself, and that's what I saw. If you experienced it actually working, then it was a fluke and not what anyone here has seen. The code was wrong both ways. To fix it properly, a decision had to be made as to whether to interpret the value as seconds or milliseconds. Maybe the minor version number should have been incremented, but that boat sailed two years ago. P.S. The timeout implementation is very buggy, and I wouldn't recommend using it at all. But that's a separate issue.... |
Sorry @esabol I had forgotten that whole saga. :) Indeed, I think bumping beyond patch level might have been a good idea given that those numbers were being ignored and had been documented ambiguously, and we made some opinionated decisions about what they meant. But what's done is done, and changing it now would likely do as much harm as we already did. Sorry for the trouble @Nibbels, and it's good to see you still chugging away with Gearman @brianlmoon! |
Current Behavior
1. Worker w registers that it can work function f for time t.
2. w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
3. w receives job j.
4. The time w takes to process j exceeds t.
5. w does not fail j.
6. w sits idle for exactly 1000 seconds.
7. w DOES fail j.

Expected Behavior
1. Worker w registers that it can work function f for time t (milliseconds).
2. w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
3. w receives job j.
4. The time w takes to process j exceeds t.
5. w fails j.

References
- ext-gearman (PHP binding) bug report.
- CAN_DO: gearmand/PROTOCOL, lines 318 to 325 in 7d6f757
- CAN_DO_TIMEOUT: gearmand/PROTOCOL, lines 327 to 335 in 7d6f757
- gearmand/libgearman-server/connection.cc, lines 690 to 693 in 7d6f757
- gearmand/libgearman-server/connection.cc, lines 688 to 728 in e2d76cf
- CAN_DO_TIMEOUT as handled by the Perl Server.
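As an illustration of the registration step described in this issue, here is a hedged sketch of how a worker might encode a CAN_DO_TIMEOUT request. The framing (a "\0REQ" magic, a big-endian 32-bit packet type of 23, a big-endian 32-bit body length, then the function name, a NUL byte, and the timeout as ASCII) reflects my reading of the PROTOCOL file and should be checked against it; the helper names and the job function name "reverse" are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Packet type for CAN_DO_TIMEOUT (assumption: 23, per the PROTOCOL file). */
enum { GEARMAN_CAN_DO_TIMEOUT = 23 };

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32(unsigned char *out, uint32_t value)
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Build a CAN_DO_TIMEOUT request into buf. The timeout is written as
 * ASCII digits. Returns the packet length, or 0 if buf is too small. */
static size_t build_can_do_timeout(unsigned char *buf, size_t cap,
                                   const char *function, long timeout_ms)
{
    char timeout_text[32];
    int tlen = snprintf(timeout_text, sizeof timeout_text, "%ld", timeout_ms);
    size_t flen = strlen(function);
    size_t body;

    if (tlen < 0 || (size_t)tlen >= sizeof timeout_text)
        return 0;
    body = flen + 1 + (size_t)tlen;      /* name + NUL separator + digits */
    if (cap < 12 + body)
        return 0;

    memcpy(buf, "\0REQ", 4);                    /* request magic */
    put_be32(buf + 4, GEARMAN_CAN_DO_TIMEOUT);  /* packet type */
    put_be32(buf + 8, (uint32_t)body);          /* body length */
    memcpy(buf + 12, function, flen);
    buf[12 + flen] = '\0';
    memcpy(buf + 13 + flen, timeout_text, (size_t)tlen);
    return 12 + body;
}
```

Calling build_can_do_timeout(buf, sizeof buf, "reverse", 1500) yields a 24-byte packet: a 12-byte header followed by "reverse", a NUL byte, and "1500". Because the timeout travels as text, the wire format itself does not pin down the unit, which is the ambiguity discussed in the comments.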