Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

bmeynell · 2018-11-08T17:32:08Z

Current Behavior

w worker registers that it can work f function for t time.
- In other words., w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
w receives j job.
Time w takes to process j exceeds t.
w does not fail j.
w sits idle for exactly 1000 seconds.
After 1000 seconds, w DOES fail j.

Expected Behavior

w worker registers that it can work f function for t time (milliseconds).
- In other words., w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
w receives j job.
Time w takes to process j exceeds t.
w fails j.

References

Launchpad bug report.
ext-gearman (PHP binding) bug report.

CAN_DO:

gearmand/PROTOCOL

Lines 318 to 325 in 7d6f757

    
           CAN_DO 
        
               This is sent to notify the server that the worker is able to 
        
               perform the given function. The worker is then put on a list to be 
        
               woken up whenever the job server receives a job for that function. 
        
               Arguments: 
        
               - Function name.

CAN_DO_TIMEOUT:

gearmand/PROTOCOL

Lines 327 to 335 in 7d6f757

    
           CAN_DO_TIMEOUT 
        
                Same as CAN_DO, but with a timeout value on how long the job 
        
                is allowed to run. After the timeout value, the job server will 
        
                mark the job as failed and notify any listening clients. 
        
                Arguments: 
        
                - NULL byte terminated Function name. 
        
                - Timeout value.

1000 Seconds:

gearmand/libgearman-server/connection.cc

Lines 690 to 693 in 7d6f757

    
           if (worker->timeout < 1000) 
        
           { 
        
             worker->timeout= 1000; 
        
           }

Timeout Handling:

gearmand/libgearman-server/connection.cc

Lines 688 to 728 in e2d76cf

    
                 if (worker->timeout > 0) 
        
                 { 
        
                   if (worker->timeout < 1000) 
        
                   { 
        
                     worker->timeout= 1000; 
        
                   } 
        
                   gearmand_log_debug(GEARMAN_DEFAULT_LOG_PARAM, "Adding timeout on %s for %s (%dl)", 
        
                                      job->function->function_name, 
        
                                      job->job_handle, 
        
                                      worker->timeout); 
        
                   if (con->timeout_event == NULL) 
        
                   { 
        
                     gearmand_con_st *dcon= con->con.context; 
        
                     con->timeout_event= (struct event *)malloc(sizeof(struct event)); // libevent POD 
        
                     if (con->timeout_event == NULL) 
        
                     { 
        
                       return gearmand_merror("malloc(sizeof(struct event)", struct event, 1); 
        
                     } 
        
                     timeout_set(con->timeout_event, _server_job_timeout, job); 
        
                     if (event_base_set(dcon->thread->base, con->timeout_event) == -1) 
        
                     { 
        
                       gearmand_perror(errno, "event_base_set"); 
        
                     } 
        
                   } 
        
                   /* XXX Right now, if a worker has diff timeouts for functions I think 
        
                     this will overwrite any existing timeouts on that event. One 
        
                     solution to that would be to record the timeout from last time, 
        
                     and only set this one if it is longer than that one. */ 
        
                   struct timeval timeout_tv = { 0 , 0 }; 
        
                   timeout_tv.tv_sec= worker->timeout; 
        
                   timeout_add(con->timeout_event, &timeout_tv); 
        
                 } 
        
                 else if (con->timeout_event) // Delete the timer if it exists 
        
                 { 
        
                   gearman_server_con_delete_timeout(con); 
        
                 } 
        
               } 
        
             }

CAN_DO_TIMEOUT as handled by the Perl Server.

Note: All timeouts alluded to above have NOTHING to do with SOCKET TIMEOUTS.

The text was updated successfully, but these errors were encountered:

esabol · 2018-11-08T18:09:31Z

According to the API, is the timeout supposed to be given in milliseconds or seconds?

If milliseconds, the error is at

gearmand/libgearman-server/connection.cc

Line 720 in 7d6f757

timeout_tv.tv_sec= worker->timeout;

where the value in milliseconds is improperly converted to a struct timeval. Need to do something like this instead:

int milliseconds = worker->timeout
struct timeval timeout_tv;
timeout_tv.tv_sec = milliseconds / 1000;
timeout_tv.tv_usec = (milliseconds % 1000) * 1000;

Or, if we go with a floating point number of seconds, something like this:

#include <math.h>
// ...
timeout_tv->tv_sec = (time_t)(worker->timeout);
timeout_tv->tv_usec = (suseconds_t)(modf(worker->timeout) * 1000000.0 + 0.5);

bmeynell · 2018-11-09T20:22:41Z

@esabol AFAIK the protocol, even back in the original Danga days, never stated what unit the timeout should be in.

Perl gearmand uses Danga::Socket for time tracking and "fractional seconds" are acceptable input.

A while back I tried to reach out to Brad (not @bradfitz - the other one) because IIRC @dormando thought he had been the one to incorporate CAN_DO_TIMEOUT originally, and could speak to the unit question. Unfortunately, I was never able to make contact.

Personally, I'm 👍 for milliseconds. I see no reason to prefer a larger unit.

There are BC considerations here too because some bindings may pass seconds while others milliseconds. I don't think it's a huge deal because the functionality is broken anyway, but certainly thought should be given as to how to communicate the potential BC break for some bindings. Example: the PHP binding defines the input as seconds. I've spoken to the maintainer (@wcgallego) as his preference (a little thread here) is milliseconds as well and would proceed with updating that particular binding when the underlying fix occurs in the C daemon.

Additional community consensus would be helpful too.

esabol · 2018-11-09T20:40:09Z

Milliseconds makes sense for a connection timeout, but not for a job timeout, IMHO. I would favor seconds.

@p-alik, does Perl Gearman::Worker use seconds?

esabol · 2018-11-13T14:11:30Z

@SpamapS, @p-alik, @BrianAker: Any thoughts on this?

bradfitz · 2018-11-13T15:55:35Z

Given that the timeout is transmitted in ASCII (not as some big endian uint32 or uint64) in the protocol, and because of the original Danga implementation, my vote is to define it as fractional seconds. If you want milliseconds, you can write 100ms as 0.1 on the wire.

esabol · 2018-11-27T17:31:13Z

I note the following:

All of the examples appear to indicate that the timeout is in milliseconds, such as this:

gearmand/examples/wc_worker.cc

Line 71 in 6d26236

("timeout,u", boost::program_options::value<int>(&timeout)->default_value(-1), "Timeout in milliseconds")
Throughout the code, the timeout is inconsistently represented by an integer-ish type, either int32_t or int or long, as in here:

gearmand/libgearman-server/server.cc

Line 546 in 6d26236

long timeout= strtol(strtol_buffer, &endptr, 10);

So it seems like @BrianAker's intention was for the timeout to be in milliseconds?

Changing the timeout to be a double (in order to support decimal seconds, as implemented by @bradfitz in the Danga server) throughout the code would be a big change to the code. I think the better solution is to parse it as seconds (into a double) and then convert that double value to a long in milliseconds. That would entail fewer changes to the code.

Or we could keep it as milliseconds and just fix the bug in the conversion of milliseconds to struct timeval.

Either way, fixing it is easy, but deciding what to do is difficult.

Anyone want to weigh in? @p-alik ?

… struct timeval

… struct timeval - untabify and update PROTOCOL

… struct timeval - minor tweaks

… struct timeval

esabol · 2019-10-29T04:44:02Z

Or we could keep it as milliseconds and just fix the bug in the conversion of milliseconds to struct timeval.

Almost a year later(!), I just submitted PR #250 which assumes the timeout is specified in milliseconds and fixes the conversion of milliseconds to a timeval. It seems the simplest course of action to take to fix this bug.

Address Issue #196 by fixing the conversion of milliseconds to timeval

esabol · 2019-11-28T07:21:10Z

Unless you want to wait until the next version of gearmand is released, this issue can be closed.

Nibbels · 2021-12-01T14:00:21Z

We had a working gearman setup with 1.1.18 running.
We use PHP https://github.com/brianlmoon/GearmanManager + /net_gearman
The configuration .ini inherited this line:

; Timeout n seconds for all jobs before work is reissued to another worker
timeout = 300

Then the version was increased to 1.1.19 because the system got patched by an admin. Afterwards most of the workers did still run successfully, but ...

If the work of a worker took longer than 1 second the client did read "work_failed" (14) from the socket.
The worker however continued its work and completed it successfully. But at the time of an successful result, the client was not listening anymore. Each job looked like a fail.

After days of debugging we found this thread.
You did change the "seconds" to milliseconds. So the actual fix was:

; Timeout n milliseconds for all jobs before work is reissued to another worker
timeout = 300000

I hope with this response other affecteds will find this thread via google.

SpamapS · 2021-12-01T16:33:29Z

I will take a look at the code later today but if we did that it is a regression and we need to revert and release a fix. Thanks for reporting, sorry for the trouble.

…

On Wed, Dec 1, 2021, 6:00 AM Stefan Bek ***@***.***> wrote: We had a working gearman setup with 1.1.18 running. We use PHP https://github.com/brianlmoon/GearmanManager + /net_gearman The configuration .ini inherited this line: ; Timeout n seconds for all jobs before work is reissued to another worker timeout = 300 Then the version was increased to 1.1.19 because the system got patched by an admin. Afterwards most of the workers did still run successfully, but ... If the work of a worker took longer than 1 second the client did read "work_failed" (14) from the socket. The worker however continued its work and completed it successfully. But at the time of an successful result, the client was not listening anymore. Each job looked like a fail. After days of debugging we found this thread. You did change the "seconds" to milliseconds. So the actual fix was: ; Timeout n milliseconds for all jobs before work is reissued to another worker timeout = 300000 I hope with this response other affecteds will find this thread via google. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#196 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADS6YBT5CDGNGJ57KXX4JTUOYTADANCNFSM4GCUSWAA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

esabol · 2021-12-01T17:21:07Z

The units for the timeout value were never documented, so it can't be a regression, @SpamapS. The closest thing to documentation was an example, and the example program's options indicated milliseconds. The change was intentional and made after a long discussion period seeking input from major stakeholders as to what the units should be, and the pull request was approved by both you and p-alik.

SpamapS · 2021-12-01T18:48:50Z

My memory of the last 2 years is challenged by multiple work/life balance traumas. I'll take your word for it. Ugh.

…

On Wed, Dec 1, 2021, 9:21 AM Ed Sabol ***@***.***> wrote: The units for the timeout value were never documented or specified, so it can't be a regression, @SpamapS <https://github.com/SpamapS>. The change was intentional and made after a long discussion period seeking input and was approved by both you and p-alik. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#196 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADS6YBBHFP4ZHNUOO7ZSTLUOZKQ5ANCNFSM4GCUSWAA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

brianlmoon · 2021-12-01T20:18:02Z

The units for the timeout value were never documented, so it can't be a regression, @SpamapS. The closest thing to documentation was an example, and the example program's options indicated milliseconds. The change was intentional and made after a long discussion period seeking input from major stakeholders as to what the units should be, and the pull request was approved by both you and p-alik.

While it should not be a regression, I would have expected a minor point release. Seems like this kind of change should have gone into 1.2.x at the least (if not 2.x) and not in the 1.1.x release series. It is BC breaking. Clients will now have to check the server version before knowing the correct timeout value to send. That is a protocol change in this version of the server IMO.

In gearmand version 1.1.19 and greater, the timeout is expected to be in milliseconds. Before that version, it is expected to be in seconds. gearman/gearmand#196

esabol · 2021-12-01T23:49:03Z

@brianlmoon: If you read the original issue at the top of this page, timeouts didn't work at all prior to this change. This was a bug fix. Workers always waited at least 1000 seconds before timing out because the code that set the timeout was wrong, regardless of whether the value as provided was considered to be seconds or milliseconds. In addition to @bmeynell who reported the issue, I tested it myself, and that's what I saw. If you experienced it actually working, then it was a fluke and not what anyone here has seen. The code was wrong both ways. To fix it properly, a decision had to be made as to whether to interpret the value as seconds or milliseconds. Maybe the minor version number should have been incremented, but that boat sailed two years ago.

P.S. The timeout implementation is very buggy, and I wouldn't recommend using it at all. But that's a separate issue....

SpamapS · 2021-12-04T22:04:05Z

Sorry @esabol I had forgotten that whole saga. :) Indeed, I think bumping beyond patch level might have been a good idea given that those numbers were being ignored and had been documented ambiguously, and we made some opinionated decisions about what they meant.

But what's done is done, and changing it now would likely do as much harm as we already did. Sorry for the trouble @Nibbels, and it's good to see you still chugging away with Gearman @brianlmoon!

bmeynell mentioned this issue Nov 8, 2018

Unexpected Job Retry Behavior After Worker Fails Job Due to CAN_DO_TIMEOUT Timeout #197

Open

p-alik added the bug label Nov 28, 2018

This was referenced Dec 11, 2018

Redis queue reload fails with spaces in unique ID #216

Closed

Per job retry count #116

Open

p-alik mentioned this issue Dec 16, 2018

The feature of persistence plugins #227

Open

p-alik added the core label Dec 27, 2018

esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019

Address Issue gearman#196 by fixing the conversion of milliseconds to…

22279bc

… struct timeval

esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019

Address Issue gearman#196 by fixing the conversion of milliseconds to…

bb89f78

… struct timeval - untabify and update PROTOCOL

esabol mentioned this issue Oct 29, 2019

Address Issue #196 by fixing the conversion of milliseconds to timeval #250

Merged

esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019

Address Issue gearman#196 by fixing the conversion of milliseconds to…

410b544

… struct timeval - minor tweaks

esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019

Address Issue gearman#196 by fixing the conversion of milliseconds to…

cfa0585

… struct timeval

SpamapS added a commit that referenced this issue Nov 5, 2019

Merge pull request #250 from esabol/fix-issue-196-timeout-milliseconds

a229661

Address Issue #196 by fixing the conversion of milliseconds to timeval

p-alik closed this as completed Nov 28, 2019

esabol mentioned this issue Aug 28, 2020

gearmand hangs intermittently when timeouts are used #301

Open

Nibbels mentioned this issue Dec 1, 2021

timeout is milliseconds in gearmand 1.1.19 brianlmoon/GearmanManager#159

Open

brianlmoon mentioned this issue Dec 1, 2021

Check server version for timeout lenn0x/net_gearman#13

Closed

brianlmoon mentioned this issue Dec 1, 2021

Check server version for timeout unit brianlmoon/net_gearman#49

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

bmeynell commented Nov 8, 2018

esabol commented Nov 8, 2018 •

edited

Loading

bmeynell commented Nov 9, 2018 •

edited

Loading

esabol commented Nov 9, 2018

esabol commented Nov 13, 2018

bradfitz commented Nov 13, 2018

esabol commented Nov 27, 2018 •

edited

Loading

esabol commented Oct 29, 2019 •

edited

Loading

esabol commented Nov 28, 2019 •

edited

Loading

Nibbels commented Dec 1, 2021

SpamapS commented Dec 1, 2021 via email

esabol commented Dec 1, 2021 •

edited

Loading

SpamapS commented Dec 1, 2021 via email

brianlmoon commented Dec 1, 2021 •

edited

Loading

esabol commented Dec 1, 2021

SpamapS commented Dec 4, 2021

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

Comments

bmeynell commented Nov 8, 2018

Current Behavior

Expected Behavior

References

esabol commented Nov 8, 2018 • edited Loading

bmeynell commented Nov 9, 2018 • edited Loading

esabol commented Nov 9, 2018

esabol commented Nov 13, 2018

bradfitz commented Nov 13, 2018

esabol commented Nov 27, 2018 • edited Loading

esabol commented Oct 29, 2019 • edited Loading

esabol commented Nov 28, 2019 • edited Loading

Nibbels commented Dec 1, 2021

SpamapS commented Dec 1, 2021 via email

esabol commented Dec 1, 2021 • edited Loading

SpamapS commented Dec 1, 2021 via email

brianlmoon commented Dec 1, 2021 • edited Loading

esabol commented Dec 1, 2021

SpamapS commented Dec 4, 2021

esabol commented Nov 8, 2018 •

edited

Loading

bmeynell commented Nov 9, 2018 •

edited

Loading

esabol commented Nov 27, 2018 •

edited

Loading

esabol commented Oct 29, 2019 •

edited

Loading

esabol commented Nov 28, 2019 •

edited

Loading

esabol commented Dec 1, 2021 •

edited

Loading

brianlmoon commented Dec 1, 2021 •

edited

Loading