Skip to content

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bmeynell opened this issue Nov 8, 2018 · 15 comments
Closed

Unexpected Worker Timeout (CAN_DO_TIMEOUT) Behavior #196

bmeynell opened this issue Nov 8, 2018 · 15 comments

Comments

@bmeynell
Copy link

bmeynell commented Nov 8, 2018

Current Behavior

  1. w worker registers that it can work f function for t time.
    • In other words., w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
  2. w receives j job.
  3. Time w takes to process j exceeds t.
  4. w does not fail j.
  5. w sits idle for exactly 1000 seconds.
  6. After 1000 seconds, w DOES fail j.

Expected Behavior

  1. w worker registers that it can work f function for t time (milliseconds).
    • In other words., w advertises its ability to CAN_DO_TIMEOUT instead of CAN_DO.
  2. w receives j job.
  3. Time w takes to process j exceeds t.
  4. w fails j.

References

  • Launchpad bug report.
  • ext-gearman (PHP binding) bug report.
  • CAN_DO:

    gearmand/PROTOCOL

    Lines 318 to 325 in 7d6f757

    CAN_DO
    This is sent to notify the server that the worker is able to
    perform the given function. The worker is then put on a list to be
    woken up whenever the job server receives a job for that function.
    Arguments:
    - Function name.
  • CAN_DO_TIMEOUT:

    gearmand/PROTOCOL

    Lines 327 to 335 in 7d6f757

    CAN_DO_TIMEOUT
    Same as CAN_DO, but with a timeout value on how long the job
    is allowed to run. After the timeout value, the job server will
    mark the job as failed and notify any listening clients.
    Arguments:
    - NULL byte terminated Function name.
    - Timeout value.
  • 1000 Seconds:
    if (worker->timeout < 1000)
    {
    worker->timeout= 1000;
    }
  • Timeout Handling:
    if (worker->timeout > 0)
    {
    if (worker->timeout < 1000)
    {
    worker->timeout= 1000;
    }
    gearmand_log_debug(GEARMAN_DEFAULT_LOG_PARAM, "Adding timeout on %s for %s (%dl)",
    job->function->function_name,
    job->job_handle,
    worker->timeout);
    if (con->timeout_event == NULL)
    {
    gearmand_con_st *dcon= con->con.context;
    con->timeout_event= (struct event *)malloc(sizeof(struct event)); // libevent POD
    if (con->timeout_event == NULL)
    {
    return gearmand_merror("malloc(sizeof(struct event)", struct event, 1);
    }
    timeout_set(con->timeout_event, _server_job_timeout, job);
    if (event_base_set(dcon->thread->base, con->timeout_event) == -1)
    {
    gearmand_perror(errno, "event_base_set");
    }
    }
    /* XXX Right now, if a worker has diff timeouts for functions I think
    this will overwrite any existing timeouts on that event. One
    solution to that would be to record the timeout from last time,
    and only set this one if it is longer than that one. */
    struct timeval timeout_tv = { 0 , 0 };
    timeout_tv.tv_sec= worker->timeout;
    timeout_add(con->timeout_event, &timeout_tv);
    }
    else if (con->timeout_event) // Delete the timer if it exists
    {
    gearman_server_con_delete_timeout(con);
    }
    }
    }
  • CAN_DO_TIMEOUT as handled by the Perl Server.

Note: All timeouts alluded to above have NOTHING to do with SOCKET TIMEOUTS.

@esabol
Copy link
Member

esabol commented Nov 8, 2018

According to the API, is the timeout supposed to be given in milliseconds or seconds?

If milliseconds, the error is at

timeout_tv.tv_sec= worker->timeout;

where the value in milliseconds is improperly converted to a struct timeval. Need to do something like this instead:

int milliseconds = worker->timeout
struct timeval timeout_tv;
timeout_tv.tv_sec = milliseconds / 1000;
timeout_tv.tv_usec = (milliseconds % 1000) * 1000;

Or, if we go with a floating point number of seconds, something like this:

#include <math.h>
// ...
timeout_tv->tv_sec = (time_t)(worker->timeout);
timeout_tv->tv_usec = (suseconds_t)(modf(worker->timeout) * 1000000.0 + 0.5);

@bmeynell
Copy link
Author

bmeynell commented Nov 9, 2018

@esabol AFAIK the protocol, even back in the original Danga days, never stated what unit the timeout should be in.

Perl gearmand uses Danga::Socket for time tracking and "fractional seconds" are acceptable input.

A while back I tried to reach out to Brad (not @bradfitz - the other one) because IIRC @dormando thought he had been the one to incorporate CAN_DO_TIMEOUT originally, and could speak to the unit question. Unfortunately, I was never able to make contact.

Personally, I'm 👍 for milliseconds. I see no reason to prefer a larger unit.

There are BC considerations here too because some bindings may pass seconds while others milliseconds. I don't think it's a huge deal because the functionality is broken anyway, but certainly thought should be given as to how to communicate the potential BC break for some bindings. Example: the PHP binding defines the input as seconds. I've spoken to the maintainer (@wcgallego) as his preference (a little thread here) is milliseconds as well and would proceed with updating that particular binding when the underlying fix occurs in the C daemon.

Additional community consensus would be helpful too.

@esabol
Copy link
Member

esabol commented Nov 9, 2018

Milliseconds makes sense for a connection timeout, but not for a job timeout, IMHO. I would favor seconds.

@p-alik, does Perl Gearman::Worker use seconds?

@esabol
Copy link
Member

esabol commented Nov 13, 2018

@SpamapS, @p-alik, @BrianAker: Any thoughts on this?

@bradfitz
Copy link

Given that the timeout is transmitted in ASCII (not as some big endian uint32 or uint64) in the protocol, and because of the original Danga implementation, my vote is to define it as fractional seconds. If you want milliseconds, you can write 100ms as 0.1 on the wire.

@esabol
Copy link
Member

esabol commented Nov 27, 2018

I note the following:

  • All of the examples appear to indicate that the timeout is in milliseconds, such as this:

    ("timeout,u", boost::program_options::value<int>(&timeout)->default_value(-1), "Timeout in milliseconds")

  • Throughout the code, the timeout is inconsistently represented by an integer-ish type, either int32_t or int or long, as in here:

    long timeout= strtol(strtol_buffer, &endptr, 10);

So it seems like @BrianAker's intention was for the timeout to be in milliseconds?

Changing the timeout to be a double (in order to support decimal seconds, as implemented by @bradfitz in the Danga server) throughout the code would be a big change to the code. I think the better solution is to parse it as seconds (into a double) and then convert that double value to a long in milliseconds. That would entail fewer changes to the code.

Or we could keep it as milliseconds and just fix the bug in the conversion of milliseconds to struct timeval.

Either way, fixing it is easy, but deciding what to do is difficult.

Anyone want to weigh in? @p-alik ?

@p-alik p-alik added the bug label Nov 28, 2018
@p-alik p-alik added the core label Dec 27, 2018
esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019
esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019
… struct timeval - untabify and update PROTOCOL
esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019
esabol added a commit to esabol/gearmand that referenced this issue Oct 29, 2019
@esabol
Copy link
Member

esabol commented Oct 29, 2019

Or we could keep it as milliseconds and just fix the bug in the conversion of milliseconds to struct timeval.

Almost a year later(!), I just submitted PR #250 which assumes the timeout is specified in milliseconds and fixes the conversion of milliseconds to a timeval. It seems the simplest course of action to take to fix this bug.

SpamapS added a commit that referenced this issue Nov 5, 2019
Address Issue #196 by fixing the conversion of milliseconds to timeval
@esabol
Copy link
Member

esabol commented Nov 28, 2019

Unless you want to wait until the next version of gearmand is released, this issue can be closed.

@Nibbels
Copy link

Nibbels commented Dec 1, 2021

We had a working gearman setup with 1.1.18 running.
We use PHP https://github.com/brianlmoon/GearmanManager + /net_gearman
The configuration .ini inherited this line:

; Timeout n seconds for all jobs before work is reissued to another worker
timeout = 300

Then the version was increased to 1.1.19 because the system got patched by an admin. Afterwards most of the workers did still run successfully, but ...

If the work of a worker took longer than 1 second the client did read "work_failed" (14) from the socket.
The worker however continued its work and completed it successfully. But at the time of an successful result, the client was not listening anymore. Each job looked like a fail.

After days of debugging we found this thread.
You did change the "seconds" to milliseconds. So the actual fix was:

; Timeout n milliseconds for all jobs before work is reissued to another worker
timeout = 300000

I hope with this response other affecteds will find this thread via google.

@SpamapS
Copy link
Member

SpamapS commented Dec 1, 2021 via email

@esabol
Copy link
Member

esabol commented Dec 1, 2021

The units for the timeout value were never documented, so it can't be a regression, @SpamapS. The closest thing to documentation was an example, and the example program's options indicated milliseconds. The change was intentional and made after a long discussion period seeking input from major stakeholders as to what the units should be, and the pull request was approved by both you and p-alik.

@SpamapS
Copy link
Member

SpamapS commented Dec 1, 2021 via email

@brianlmoon
Copy link

brianlmoon commented Dec 1, 2021

The units for the timeout value were never documented, so it can't be a regression, @SpamapS. The closest thing to documentation was an example, and the example program's options indicated milliseconds. The change was intentional and made after a long discussion period seeking input from major stakeholders as to what the units should be, and the pull request was approved by both you and p-alik.

While it should not be a regression, I would have expected a minor point release. Seems like this kind of change should have gone into 1.2.x at the least (if not 2.x) and not in the 1.1.x release series. It is BC breaking. Clients will now have to check the server version before knowing the correct timeout value to send. That is a protocol change in this version of the server IMO.

brianlmoon added a commit to brianlmoon/net_gearman that referenced this issue Dec 1, 2021
In gearmand version 1.1.19 and greater, the timeout is
expected to be in milliseconds. Before that version, it
is expected to be in seconds.
gearman/gearmand#196
@esabol
Copy link
Member

esabol commented Dec 1, 2021

@brianlmoon: If you read the original issue at the top of this page, timeouts didn't work at all prior to this change. This was a bug fix. Workers always waited at least 1000 seconds before timing out because the code that set the timeout was wrong, regardless of whether the value as provided was considered to be seconds or milliseconds. In addition to @bmeynell who reported the issue, I tested it myself, and that's what I saw. If you experienced it actually working, then it was a fluke and not what anyone here has seen. The code was wrong both ways. To fix it properly, a decision had to be made as to whether to interpret the value as seconds or milliseconds. Maybe the minor version number should have been incremented, but that boat sailed two years ago.

P.S. The timeout implementation is very buggy, and I wouldn't recommend using it at all. But that's a separate issue....

@SpamapS
Copy link
Member

SpamapS commented Dec 4, 2021

Sorry @esabol I had forgotten that whole saga. :) Indeed, I think bumping beyond patch level might have been a good idea given that those numbers were being ignored and had been documented ambiguously, and we made some opinionated decisions about what they meant.

But what's done is done, and changing it now would likely do as much harm as we already did. Sorry for the trouble @Nibbels, and it's good to see you still chugging away with Gearman @brianlmoon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants