async option will be dropped. #1522

Closed
@st0012

The async option is unique to the Ruby SDK. It was designed to help send events asynchronously through different backends (e.g. a Ruby thread, a Sidekiq worker, etc.). Depending on the backend, it can pose a threat to the system because of its extra memory consumption. So it's an option with some trade-offs.

But since version 4.1, the SDK has its own managed background worker (implemented with the famous concurrent-ruby library). It can handle most of the async

The async Option Approach

  1. The SDK serializes the event and event hint into json-compatible Ruby hashes.
  2. It passes the event payload and hint to the block.
  3. In general, the block would enqueue a background job with the above data.
    • Some earlier apps use a new Ruby thread to send the data. This is not recommended.
    • With background job libraries like Sidekiq or Resque, this means adding objects into Redis.
    • With delayed_job, this means adding a new delayed_job record.
  4. A background worker (e.g. a Sidekiq worker) then picks up the event and hint and sends them.
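
The four steps above can be sketched with the standard library alone. This is a minimal simulation, not the SDK's code: `FakeJobStore` and the top-level `send_event` are hypothetical stand-ins for a job backend (Redis, a delayed_job table) and for a worker that would call Sentry.send_event(event, hint).

```ruby
require "json"

# Hypothetical stand-in for Redis or a delayed_job table: it only stores strings.
class FakeJobStore
  def initialize
    @jobs = []
  end

  def push(payload)
    @jobs << JSON.generate(payload) # first copy: serialized into the store
  end

  def pop
    raw = @jobs.shift
    raw && JSON.parse(raw)          # second copy: re-allocated in the worker
  end

  def size
    @jobs.size
  end
end

# Hypothetical stand-in for the transport; a real worker would call
# Sentry.send_event(event, hint) here.
SENT = []
def send_event(event, hint)
  SENT << [event, hint]
end

store = FakeJobStore.new

# Steps 1-3: the SDK hands JSON-compatible hashes to the async block,
# which enqueues them as a job.
async = lambda { |event, hint| store.push("event" => event, "hint" => hint) }
async.call({ "message" => "boom", "level" => "error" }, {})

# Step 4: a background worker picks up the job and sends it.
job = store.pop
send_event(job["event"], job["hint"])
```

The payload exists once as the original hash, once as a string in the store, and once more as the parsed hash in the worker, which is where the extra memory cost of this approach comes from.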

Pros

Users can customize their event sending logic. But generally it's just a worker with Sentry.send_event(event, hint).

Cons

  • The event payload (usually dozens of KB) can be copied twice: first into the intermediate storage, and then again when it is allocated in the background worker process.
  • When there is an event spike, it can flood the intermediate storage (e.g. Redis) and take down the entire system.

The Background Worker

  1. The SDK passes the event and its hint to the background worker (a pool of threads managed by concurrent-ruby).
  2. A worker then picks up the event, serializes it, and sends it.

Pros

  • It doesn't allocate extra memory other than the original event payload.
  • It's faster.
  • It doesn't require any user code.
  • The background worker doesn't queue more than 30 events. So even when there's a spike, it's unlikely to consume all the memory.
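
To illustrate why a capped queue bounds memory during a spike, here is a minimal standard-library sketch. `BoundedEventQueue` is a hypothetical class, not the SDK's implementation; it assumes that events arriving while the queue is full are simply dropped, and it is only meant for single-threaded demonstration.

```ruby
require "thread"

# Hypothetical bounded queue: accepts up to `limit` events and drops the rest
# instead of blocking the caller or growing without bound.
class BoundedEventQueue
  attr_reader :dropped

  def initialize(limit)
    @queue = SizedQueue.new(limit)
    @dropped = 0
  end

  def push(event)
    @queue.push(event, true) # non-blocking push: raises ThreadError when full
    true
  rescue ThreadError
    @dropped += 1            # not thread-safe; fine for this single-threaded demo
    false
  end

  def size
    @queue.size
  end
end

# Simulate a spike of 100 events with no worker draining the queue.
queue = BoundedEventQueue.new(30)
100.times { |i| queue.push({ "message" => "error #{i}" }) }
```

After the loop, the queue holds 30 events and the other 70 have been dropped, so memory use stays proportional to the limit rather than to the size of the spike.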

Cons

  • Unsent events die with the process. Generally speaking, the queue time in the background worker is very low, so the chance of missing events for this reason is small in web apps. But in scripts, the process often exits before the worker is able to send the event. This is why hint: { background: false } is required in rake integrations.
    • However, I don't think this problem can be solved with the async option.

This drawback has been addressed in #1617.

Missing Events During A Spike Because of Queue Limit

I know many users are concerned that the background worker's 30-event queue limit will make them lose events during a spike. But as the maintainer and a user of this SDK, I don't worry about it because:

  1. The spike is likely to be an urgent case, and that'll probably be fixed in a short time. So not seeing a few instances of other errors should not affect the overall coverage.
  2. Given these characteristics of the SDK's background worker:
    • The default number of background workers is determined by the number of processor cores on your machine.
    • They're a lot faster than the async approach with a Sidekiq or Resque worker, for the reasons described in this issue.
    • The 30-event queue is shared only within a process or web instance, depending on your concurrency model, not at a global level.
      If there's a spike big enough to overflow the SDK's queue and drop some events, it'll probably overflow your background job queue with the async option too and/or do greater damage to your system.
  3. Sentry has a rate-limiting mechanism to prevent overflow on the platform side, which works by both rejecting new events and telling the SDK not to send new events with a 429 response. When the SDK receives a 429 response from Sentry during a spike, it'll stop sending "all events" for a given period of time.
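
The 429 behavior in point 3 can be sketched as follows. This is an illustration under assumptions, not the SDK's code: `RateLimitedSender` is a hypothetical wrapper, the transport is any callable that returns an HTTP status code, and the 60-second back-off is an arbitrary placeholder for whatever period the server requests.

```ruby
# Hypothetical client wrapper: after a 429 it skips sending until the
# back-off window ends. The 60-second default is an assumption.
class RateLimitedSender
  attr_reader :skipped

  def initialize(transport, clock: -> { Time.now }, backoff: 60)
    @transport = transport # callable returning an HTTP status code
    @clock = clock
    @backoff = backoff
    @disabled_until = nil
    @skipped = 0
  end

  def send_event(event)
    now = @clock.call
    if @disabled_until && now < @disabled_until
      @skipped += 1
      return :skipped
    end

    status = @transport.call(event)
    if status == 429
      @disabled_until = now + @backoff
      return :rate_limited
    end
    :sent
  end
end

# Usage with a fake clock and a fake server that rejects the second event.
now = Time.at(0)
statuses = [200, 429, 200]
sender = RateLimitedSender.new(->(_event) { statuses.shift }, clock: -> { now })

results = []
results << sender.send_event("a") # server accepts
results << sender.send_event("b") # server answers 429
results << sender.send_event("c") # still inside the window, not sent
now += 61
results << sender.send_event("d") # window over, sending resumes
```

Event "c" never reaches the server, whichever mechanism queued it. That is the sense in which no approach can deliver "all events" during a large spike.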

What I'm trying to say is that you can't expect Sentry to accept "all events" during a big spike, regardless of which approach you use. But when a spike happens, async is more likely to become another bottleneck and/or cause other problems in your system.

My Opinion

The async option seems redundant now and it could sometimes cause more harm. So I think we should drop it in version 5.0.

Questions

The above analysis is based only on my personal usage of the SDK and a few cases I helped debug. So if you're willing to share your experience, I'd like to know

Even though the decision has been made, we still would like to hear feedback about it:

  • Do you use the async option in your apps?
    • If you do, what's the motivation? Will you still use it after reading the above description?
    • If you don't, is it an intentional decision? If it is, what's the reason behind it?
  • Do you disable the background workers with the background_worker_threads config option?
    • If you do, why?
  • Or any feedback related to this topic.
