Overhaul event publication lifecycle #796
Comments
Does not being able to tell that an event is being processed also mean that multi-instance apps are currently not an option? I'm not a database expert, but I believe at least PostgreSQL supports row-level locking, which would allow concurrent processing of events by multiple instances, as opposed to a leader-election approach where only one instance does the work.
@breun The whole processing (sending event -> handler -> mark as finished) is not done by "submitting" the event to the table and having a "worker" pick it up. The processing always happens on the instance the event was sent from in the first place. The publication log is just there to keep track of which events have been processed, and the only information you currently have is whether there is a completion_date on the event and when it got published.

We've built our own retry mechanism around the log, in which we retry events that are at least n minutes old. But because it is a publication log, we have the issue that events sometimes get processed multiple times when an event takes a long time to be processed (either because one of the steps takes a long time, or because a lot of events get sent and they can't be processed while all threads in our thread pool are busy). And that's where this misconception comes into play: if you overlook the fact that the table is not used for processing at all, you may run into these issues when you build your own retry mechanism.

My thoughts around this topic

We would really wish for a way to distinguish between events that are currently being processed and events that have failed, but all implementations have edge cases which you may or may not want to support in spring-modulith.

If you have a dedicated status field (e.g. SUBMITTED, PROCESSING, FINISHED, FAILED) you can easily find out which events to retry based on the FAILED status and can skip the PROCESSING ones - unless you have events that are stuck because the instance went down while processing them. In order to identify those in a multi-instance setup, you would have to keep track of which instances are currently active.

If you handle it via a failedDate column, you have to identify the currently-processed ones via an offset (as described in the issue description) - but here you have to be careful with longer-running tasks, because, as I mentioned, it can happen that it takes a few minutes until an event is picked up (because all threads are being utilized). In that case it could make sense to also have a column for when the event got picked up and the handler was triggered...

Conclusion

Thinking more about it, a big problem with the event_publication table is the misconception I mentioned. For example, I expected that event_publication could be seen as a "light" version of event externalization, but it definitely does not work that way (and probably shouldn't be used that way). From my gut feeling (and from talking with colleagues about it), I am not the only one who stepped into that trap. Maybe I am drifting a little towards a different topic that this issue isn't about, but I think the docs should make clearer that the current event_publication mechanism should not be seen as an externalized event processing mechanism, and should call out its limitations - also regarding what users expect (as @breun mentioned).

Edit:
I just took a look into the docs, and event externalization means that you just publish events to other systems so that other applications can consume them - not that you consume events from those systems.
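For illustration, here is a minimal sketch of the kind of home-grown retry described above, assuming the default event_publication table, a PostgreSQL database, and Spring's JdbcTemplate; the cutoff value and the resubmission step are placeholders:

```java
import java.util.List;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class StalePublicationRetrier {

    private final JdbcTemplate jdbcTemplate;

    StalePublicationRetrier(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Resubmit publications without a completion date that are older than
    // five minutes. This exhibits exactly the pitfall described above: a
    // publication that is merely slow (busy thread pool, long-running
    // listener) also matches this query and gets processed a second time.
    @Scheduled(fixedDelay = 60_000)
    void retryStalePublications() {
        List<UUID> stale = jdbcTemplate.queryForList(
                "SELECT id FROM event_publication WHERE completion_date IS NULL "
                        + "AND publication_date < now() - interval '5 minutes'",
                UUID.class);

        stale.forEach(this::resubmit);
    }

    private void resubmit(UUID id) {
        // Application-specific: deserialize the stored event and republish it.
    }
}
```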
It is unclear to me what Modulith's responsibility should be: ensuring the delivery of events only, or also dealing with problems.

To ensure the delivery of events,

For handling problems,

Therefore, I think Modulith should focus on the delivery side of things, for example by better tracking the status (event queued, event processing, …) and by providing better docs, and perhaps some callbacks, on how to deal with event delivery problems, like event classes that have been removed or event listeners that no longer exist.
Event publication in Modulith is simple, and I think (my opinion) that expanding it will lead you into scope creep, and you will not know when to stop. Thinking about the problem differently: instead of an event handler that is recorded (and committed) as part of a transaction, what about scheduling a job? Now you are solving a different problem - running background jobs - which we have many solutions for, and they typically solve:
I'm experimenting with db-scheduler and it does exactly that.
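For reference, a minimal sketch of what that looks like with db-scheduler, following its documented API; the task name and payload are illustrative:

```java
import java.time.Instant;
import javax.sql.DataSource;

import com.github.kagkarlsson.scheduler.Scheduler;
import com.github.kagkarlsson.scheduler.task.helper.OneTimeTask;
import com.github.kagkarlsson.scheduler.task.helper.Tasks;

class DbSchedulerExample {

    // A persistent one-time task. db-scheduler stores the execution in its
    // scheduled_tasks table, and any instance in the cluster may pick it up,
    // which is exactly the "background job" framing described above.
    static final OneTimeTask<String> SEND_CONFIRMATION =
            Tasks.oneTime("send-confirmation", String.class)
                    .execute((instance, ctx) ->
                            System.out.println("Confirming order " + instance.getData()));

    static void run(DataSource dataSource) {
        Scheduler scheduler = Scheduler.create(dataSource, SEND_CONFIRMATION)
                .registerShutdownHook()
                .build();
        scheduler.start();

        // Schedule a job instead of publishing an application event.
        scheduler.schedule(SEND_CONFIRMATION.instance("order-42", "42"), Instant.now());
    }
}
```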
Does this mean you assume that only one instance in a multi-instance setup is actively doing work? I'm still trying to wrap my head around this concept and its implications.

I have the feeling I don't grasp it completely yet, but my main feeling now is that it feels 'dangerous' to use this for high-traffic multi-instance services, which is a typical use case for me. I've seen most Spring Modulith talks promote using events instead of direct method calls across module boundaries, and I see the benefits of decoupling via events, but I feel that this event publication mechanism could become a performance/availability risk when it becomes a dependency for every call to a high-traffic module method that publishes an event for other modules to listen for. In that sense a direct method call feels 'safer' than using events: no database that could have a full table or otherwise impact performance or availability.

Is this an irrational fear I have of the Spring Modulith event publication mechanism? Should I just configure Modulith to auto-remove completed events in a high-traffic scenario, to keep the event log from growing indefinitely, and otherwise not worry about it too much?
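Regarding the auto-removal question: Spring Modulith exposes CompletedEventPublications for exactly that kind of housekeeping. A sketch of a scheduled purge (interval and retention are arbitrary choices here):

```java
import java.time.Duration;

import org.springframework.modulith.events.CompletedEventPublications;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class CompletedPublicationPurge {

    private final CompletedEventPublications completed;

    CompletedPublicationPurge(CompletedEventPublications completed) {
        this.completed = completed;
    }

    // Nightly purge so the event_publication table cannot grow without bound.
    @Scheduled(cron = "0 0 3 * * *")
    void purge() {
        completed.deletePublicationsOlderThan(Duration.ofDays(1));
    }
}
```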
Including a
@odrotbohm I think the general proposal to store information about the event lifecycle is a very good one. In particular, I think it brings a brilliant opportunity to solve some significant problems with the current version of Spring Modulith. Based on your conversation with @breun above and other discussions (#402, #727), I've understood that:
This situation causes several problems:
On a side note, I should point out that I found none of the above information in the current Spring Modulith documentation, and the discussions I mentioned above seem to show that others also did not realise that Spring Modulith worked like this. So, my first suggestion is that the current Modulith documentation really needs revising to add these details. Perhaps (just a guess) this stuff is obvious to some people (I don't know, maybe that's always how Spring application events have worked...), but it wasn't obvious to me and it's not spelled out in the documentation. Given how common horizontal scaling is, I would say the documentation needs updating urgently, because otherwise the effects might be catastrophic.

But it seems to me that adding state information about the event lifecycle provides just the opportunity you need to solve ALL of the above problems - and to be able to do it out of the box, without needing anything like ShedLock. From @odrotbohm's draft state diagrams above, we see that events arrive into the database in state
@odrotbohm I'm sure I've oversimplified a little how easy this would be (e.g. maybe a Spring Modulith app would need to poll the database at intervals to see if there are any newly published events it can start processing...), but hopefully not by much. Does it sound feasible?
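A sketch of what such a polling claim loop could look like, assuming a hypothetical status column and claimed_by column and PostgreSQL's UPDATE ... RETURNING; nothing like this exists in Spring Modulith today:

```java
import java.util.List;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class PublishedEventPoller {

    private final JdbcTemplate jdbcTemplate;
    private final String instanceId = UUID.randomUUID().toString();

    PublishedEventPoller(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Periodically claim a small batch of newly published events for this
    // instance. The locked subquery makes competing instances skip rows that
    // are already being claimed, rather than blocking or double-claiming.
    @Scheduled(fixedDelay = 5_000)
    void claimAndProcess() {
        List<UUID> claimed = jdbcTemplate.queryForList(
                "UPDATE event_publication SET status = 'PROCESSING', claimed_by = ? "
                        + "WHERE id IN (SELECT id FROM event_publication "
                        + "  WHERE status = 'PUBLISHED' LIMIT 10 FOR UPDATE SKIP LOCKED) "
                        + "RETURNING id",
                UUID.class, instanceId);

        claimed.forEach(this::dispatch);
    }

    private void dispatch(UUID id) {
        // Load the serialized event and invoke the listener it was stored for.
    }
}
```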
Thanks for your feedback, that's appreciated! I would like this ticket to focus on the actual redesign of the lifecycle going forward, but here are a few closing remarks on the overall situation.

This ticket exists because we are aware of the limitations of the current approach. That said, that approach wasn't chosen accidentally. It's been the simplest thing we could get away with to deliver on the core of the task: providing a safety net for errors happening in transactional event listeners. We were accepting a few cumbersome areas for developers - for example, having to take care of distributed locking for multi-instance deployments - for multiple reasons. For one, we wanted to get feedback from real-world applications that allows us to make more informed decisions about a revision than we could have made in 1.0. That revision is what we're discussing here. For the other, none of the challenges you describe are unique to Spring Modulith ("potentially catastrophic consequences" 😉). Multi-instance deployments create those, even for a standard Spring application that uses

I think the fundamental EPR approach has proven useful, and from the production deployments I have seen, I'd say it's a valuable solution with obvious areas for improvement. That's what we're working on.

While the technical challenges are understood and some of them will be addressed with this ticket, there's a theme to be identified in some of your remarks. First, the problem the event publication registry is trying to solve is clearly described in the reference documentation: a safety net for transactional event listeners, essentially - but not only - to allow application modules to be integrated in an eventually consistent way. If we started to augment the reference documentation with all the things that the EPR is not, we'd never run out of work. The chapter of the docs clearly talks about the context first, describes how the EPR helps and how it is implemented, and only then goes on to talk about the interaction with external systems. There's no work distribution mentioned or going on, and no messaging involved until that last part. I feel that the confusion rather stems from already thinking about a message-broker-based system, when all we talk about is Spring application events. I can see that folks regularly exposed to messaging systems might arrive at the docs with certain assumptions, but again, it's reference documentation, and we clearly differentiate between events and messages. If someone thinks "Kafka" every time I write "event", repeating that we do not mean "Kafka" is not going to improve the documentation, but worsen it.

Now on to your technical feedback:
It's not as easy, unfortunately. Depending on the transaction isolation level, different instances might still read the same data and try to update the same rows. In the relational world, we can get around this problem with a
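Presumably the relational technique alluded to here is row locking along the lines of SELECT ... FOR UPDATE SKIP LOCKED (an assumption on my part, since the original wording is cut off above); a PostgreSQL-flavored sketch with a hypothetical status column:

```java
import java.util.List;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

class FailedPublicationClaimer {

    private final JdbcTemplate jdbcTemplate;

    FailedPublicationClaimer(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Locks up to ten failed publications for this transaction. Rows already
    // locked by another instance are skipped instead of blocking, so two
    // resubmitters never select the same batch. The locks only last as long
    // as the surrounding transaction, so the resubmission (or a status flip)
    // has to happen before it commits.
    @Transactional
    List<UUID> claimBatch() {
        return jdbcTemplate.queryForList(
                "SELECT id FROM event_publication WHERE status = 'FAILED' "
                        + "ORDER BY publication_date LIMIT 10 FOR UPDATE SKIP LOCKED",
                UUID.class);
    }
}
```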
I am sincerely considering deprecating and eventually removing that flag, as it both creates all the wrong expectations and is not really used in practice. Real-world applications would have to augment it with some kind of scheduled resubmission logic anyway, as you would rather not have to restart an instance just to trigger that. Once you have the scheduling in place, the flag becomes useless.
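For context, such scheduled resubmission logic can already be built on the IncompleteEventPublications API; a sketch (interval and age are arbitrary choices):

```java
import java.time.Duration;

import org.springframework.modulith.events.IncompleteEventPublications;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class PublicationResubmitter {

    private final IncompleteEventPublications incomplete;

    PublicationResubmitter(IncompleteEventPublications incomplete) {
        this.incomplete = incomplete;
    }

    // Once a schedule like this is in place, the restart-time republication
    // flag indeed adds nothing. A multi-instance deployment would still need
    // a distributed lock (e.g. ShedLock) around this method to avoid
    // duplicate resubmissions, as discussed in this thread.
    @Scheduled(fixedDelay = 60_000)
    void resubmit() {
        incomplete.resubmitIncompletePublicationsOlderThan(Duration.ofMinutes(10));
    }
}
```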
Again, I wish it were that simple. A large number of events to be resubmitted might still be an issue for the resubmission attempt. We're currently looking into techniques such as pagination to allow multiple instances to resubmit publications. But at the same time, we'd like to allow developers to restrict the resubmission to only one instance if needed, for example in cases in which resubmitting events in strict order is necessary.
because maybe the same can be achieved with an atomic update for claiming an event just before re-sending it to the event listener:

```java
// Claim the event: the UPDATE only succeeds on one instance, because the
// row no longer matches the WHERE clause once its status has changed.
int updated = jdbcTemplate.update(
    "UPDATE outbox SET status = 'IN_PROGRESS' WHERE id = ? AND status = 'FAILED'",
    id);

if (updated == 1) { // successfully claimed; otherwise another instance claimed it
    // re-send to listener
}
```

and in MongoDB you could achieve the same effect with a

This may avoid distributed locks for re-sending failed events while keeping the implementation simple.
Republishing events during startup remains challenging in multi-instance environments, as it’s unclear whether a particular instance is the initial one launching or simply an additional node scaling up. Although the approach mentioned above (using an atomic update to claim an event) can prevent resending duplicate failed events, if the application crashes and leaves events in an "IN_PROGRESS" state, we’d want to mark those as "FAILED" only when the first instance starts. To achieve this, we might need something like a heartbeat table or a similar mechanism...
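One possible shape of that heartbeat idea, with an invented instance_heartbeat table and a hypothetical claimed_by column added to the outbox schema from the earlier snippet, in PostgreSQL syntax:

```java
import java.time.Instant;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class InstanceHeartbeat {

    private final JdbcTemplate jdbcTemplate;
    private final String instanceId = UUID.randomUUID().toString();

    InstanceHeartbeat(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Every instance refreshes its own heartbeat row.
    @Scheduled(fixedDelay = 10_000)
    void beat() {
        jdbcTemplate.update(
                "INSERT INTO instance_heartbeat (instance_id, last_seen) VALUES (?, ?) "
                        + "ON CONFLICT (instance_id) DO UPDATE SET last_seen = EXCLUDED.last_seen",
                instanceId, java.sql.Timestamp.from(Instant.now()));
    }

    // Claims held by instances that have not been seen recently are released
    // back to FAILED, so any surviving instance can re-claim them. Beware of
    // clock skew between instances when tuning the threshold.
    @Scheduled(fixedDelay = 30_000)
    void releaseOrphanedClaims() {
        jdbcTemplate.update(
                "UPDATE outbox SET status = 'FAILED' WHERE status = 'IN_PROGRESS' "
                        + "AND claimed_by NOT IN (SELECT instance_id FROM instance_heartbeat "
                        + "  WHERE last_seen > now() - interval '60 seconds')");
    }
}
```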
@odrotbohm @ivangsa As ever, thanks for seriously considering my feedback - I appreciate you both taking the time and giving the benefit of your expertise.

I still think the current documentation needs expanding with more information about how events work and the limitations and potential pitfalls. Without that, I think the reference documentation is simply incomplete, and I think the other conversations I pointed to demonstrate that - when several people feed back that they read the documentation carefully and did not understand something very important, I think that's worth taking seriously. Naturally, I'm just one voice in the community, but if there were a vote, I'd say that adding the missing pieces to the current documentation is more important than building new features. And it only took a couple of paragraphs in my original comment...

My question/proposal about allowing Modulith events to be distributed across instances was a response to @odrotbohm's invitation to hear feedback from the community about the potential exciting benefits of the new lifecycle states. I appreciate you need to support multiple data storage types, which complicates the issue with transactions, and I appreciate that spreading workload across instances isn't the current target of these changes. I just thought it was worth exploring, and that the improved lifecycle management seemed to point naturally towards enabling workload to be shared across multiple instances. And if it could be done without too much extra work, there could be great benefits.

Thanks again, both, and keep up the great work :-)
hi @ghferrari, by distributing events across instances, do you mean that event A has N listeners, and some of those listeners are spread across different instances? I hadn't thought about this... I was talking more about resubmitting failed events.

When scheduling the resubmission on multiple instances, there is a chance that several instances process the same event for resubmission, resulting in duplicates. Currently, the 'only' solution is to use distributed scheduling locks (e.g., ShedLock), because the event publication registry just contains

However, with the new event lifecycle, if there's a

and this approach would work consistently for any SQL database or MongoDB. Even if it doesn't become part of the Spring Modulith library, you could still implement it yourself. There would be no need for distributed locks.

But spreading listeners across multiple instances is a totally different beast...
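For completeness, the MongoDB counterpart of the atomic SQL claim shown earlier would presumably rely on findAndModify; a sketch using Spring Data's MongoTemplate, with invented collection and field names:

```java
import org.bson.Document;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

class MongoEventClaimer {

    private final MongoTemplate mongoTemplate;

    MongoEventClaimer(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // findAndModify is atomic per document, so exactly one instance wins the
    // claim; the losers get null back and skip the event.
    boolean claim(Object id) {
        Document claimed = mongoTemplate.findAndModify(
                Query.query(Criteria.where("_id").is(id).and("status").is("FAILED")),
                Update.update("status", "IN_PROGRESS"),
                Document.class,
                "event_publication");

        return claimed != null;
    }
}
```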
I’d like to +1 this. When I initially learned about Spring Modulith and went through the documentation, I thought that everything would Just Work™️ in a distributed setup with multiple application instances sharing a database. How to use these events in a multi-instance setup is still not clear to me, but even if there is no (simple) answer to that, a warning that it isn’t supported out of the box would have avoided some misunderstandings for me and my colleagues.
@ivangsa As a quick reply to your last comment: I don't see any important difference between publishing events and republishing them. Suppose you have 10 app instances running with identical configuration and without using anything like ShedLock. If there's a reliable mechanism for allowing failed events to be resubmitted/re-adopted by just one of these instances, then you can use the same mechanism for just-published events, allowing them to be spread across app instances. The basic idea would be that the instance that initially publishes an event stores it in the database with status

If we want to discuss this further, I suggest moving to a different discussion thread, unless @odrotbohm specifically wants to keep the discussion here. It's not unrelated to the lifecycle proposal, but separate enough to move to another thread if anyone wants to pursue it.
The persistent structure of an event publication currently effectively represents two states. The default state captures the fact that a transactional event listener will have to be invoked eventually. The publication also stays in that state while the listener processes the event. Once the listener succeeds, the event publication is marked as completed. If the listener fails, the event publication stays in its original state.
This basic lifecycle is easy to work with, but has a couple of downsides. First and foremost, we cannot differentiate between publications that are about to be processed, ones that are currently being processed, and ones that have failed. Especially the latter is problematic, as a primary use case supported by the registry is the ability to recover from erroneous situations by resubmitting failed event publications. Developers usually resort to rather fuzzy approaches, like considering events that have not been completed within a given time frame to have failed.
To improve on this, we’d like to move to a more sophisticated event publication lifecycle that makes it easier to detect failed publications. One possible way to achieve this would be to introduce a dedicated status field, or, consistent with the current approach of setting a completion date, a failed date field which would need to be set in case an event listener fails. That step, however, might fail as well, for the same erroneous situation that caused the event listener to fail in the first place. That’s why it might make sense to introduce a duration configuration property, after which incomplete event publications are considered failed as well.
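To make the proposed detection concrete, a sketch of the query a resubmission job could run under that model; the failed_date column and the ten-minute cutoff follow the proposal above and do not exist in Spring Modulith today:

```java
import java.util.List;
import java.util.UUID;

import org.springframework.jdbc.core.JdbcTemplate;

class FailedPublicationQuery {

    private final JdbcTemplate jdbcTemplate;

    FailedPublicationQuery(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Resubmission candidates under the proposed model: explicitly failed
    // publications, plus incomplete ones older than the configured duration.
    // The second clause covers listeners that crashed before the failed date
    // could be written.
    List<UUID> findResubmissionCandidates() {
        return jdbcTemplate.queryForList(
                "SELECT id FROM event_publication WHERE completion_date IS NULL "
                        + "AND (failed_date IS NOT NULL "
                        + "  OR publication_date < now() - interval '10 minutes')",
                UUID.class);
    }
}
```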
The feature bears a bit of risk, as we will have to think about the upgrade process for Spring Modulith applications. Existing apps might still contain entries for incomplete event publications in the database.
Ideas / Action Items

- A failedDate column.
- publishedDate before now() minus duration.
- CompletionRegisteringMethodInterceptor would need to issue the marking as failed on exception.
- IncompleteEventPublications would have to get a getFailedPublications().

Related tickets