
Overhaul event publication lifecycle #796


Open
odrotbohm opened this issue Sep 3, 2024 · 15 comments
Labels
in: event publication registry · type: enhancement

Comments

@odrotbohm
Member

odrotbohm commented Sep 3, 2024

The persistent structure of an event publication currently represents effectively two states. The default state captures the fact that a transactional event listener will eventually have to be invoked. The publication also stays in that state while the listener processes the event. Once the listener succeeds, the event publication is marked as completed. If the listener fails, the event publication stays in its original state.

This basic lifecycle is easy to work with, but has a couple of downsides. First and foremost, we cannot differentiate between publications that are about to be processed, ones that are currently being processed, and ones that have failed. The latter is especially problematic, as a primary use case supported by the registry is recovering from erroneous situations by resubmitting failed event publications. Developers usually resort to rather fuzzy approaches, such as considering events that have not been completed within a given time frame to have failed.

To improve on this, we’d like to move to a more sophisticated event publication lifecycle that makes failed publications easier to detect. One possible way to achieve this would be to introduce a dedicated status field, or — consistent with the current approach of setting a completion date — a failed date field that would be set in case an event listener fails. That step, however, might fail as well, for the very reason that caused the event listener to fail in the first place. That’s why it might make sense to also introduce a duration configuration property after which incomplete event publications are considered failed.

The feature bears a bit of risk, as we will have to think about the upgrade process for Spring Modulith applications. Existing apps might still contain entries for incomplete event publications in their database.

Ideas / Action Items

  • Expand the database table with a failedDate column.
  • Introduce a configuration property defining the duration after which a publication is considered failed.
  • Queries would need to be augmented with a publishedDate before now() minus that duration (see the sketch after this list).
  • CompletionRegisteringMethodInterceptor would need to mark the publication as failed on exception.
  • IncompleteEventPublications would have to get a getFailedPublications() method.
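
As a rough sketch only (not actual Spring Modulith code: failed_date is the hypothetical new column, failureThreshold an assumed java.time.Duration property, and jdbcTemplate a Spring JdbcTemplate in scope), such a query could look like this:

// Select publications considered failed, either explicitly (hypothetical
// failed_date column) or implicitly (incomplete for longer than the
// configured duration).
List<UUID> failed = jdbcTemplate.query(
    "SELECT id FROM event_publication " +
    "WHERE completion_date IS NULL " +
    "AND (failed_date IS NOT NULL OR publication_date < ?)",
    (rs, rowNum) -> UUID.fromString(rs.getString("id")),
    Timestamp.from(Instant.now().minus(failureThreshold)));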

Related tickets

@breun

breun commented Sep 13, 2024

Does not being able to tell that an event is being processed also mean that multi-instance apps are currently not an option?

I’m not a database expert, but I believe at least PostgreSQL supports row-level locking, which would allow concurrent processing of events by multiple instances, as opposed to some leader election approach.

@annnoo

annnoo commented Sep 18, 2024

@breun
We are currently using spring-modulith and the event publication registry in one of our projects. Multi-instance apps are possible, but the misconception we had was treating the event publication log as a "message queue" - it is really just a publication log. The table merely keeps track of which events have been sent and allows you to retry them on startup - and it can't distinguish between events that are currently being processed and "stuck" ones.

The whole processing (sending the event -> handler -> marking as finished) is not done by "submitting" the event to the table and having a "worker" pick it up. Processing always happens on the instance the event was sent from in the first place.

The publication log is just there to keep track of which events have been processed. The only information you currently have is whether there is a completion_date on the event and when it got published.

We've built our own retry mechanism around the log, in which we retry events that are at least n minutes old. But because it is a publication log, we have the issue that events sometimes get processed multiple times when an event takes a long time to be processed (either because one of the steps takes a long time, or because a lot of events were sent and can't be processed yet since all threads in our thread pool are busy).

And that's where this misconception comes into play: if you build your own retry mechanism while overlooking the fact that the table is not used for processing at all, you may run into these issues.

My thoughts around this topic

We would really wish for a way to distinguish between events that are currently being processed and events that have failed, but all implementations have edge cases that spring-modulith may or may not want to support.

If you have a dedicated status field (e.g. SUBMITTED, PROCESSING, FINISHED, FAILED), you can easily find out which events to retry based on the FAILED status and skip the PROCESSING ones - unless you have events that are stuck because the instance went down while processing them. To identify those in a multi-instance setup, you would have to keep track of which instances are currently active.

If you handle it via a failedDate column, you have to identify the ones currently being processed via an offset (as described in the issue description) - but here you have to be careful with longer-running tasks, because, as I mentioned, it can take a few minutes until an event is picked up (because all threads are being utilized).

In that case it could make sense to also have a column for when the event got picked up and the handler was triggered.
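
Purely as an illustration of that shape (made-up names, not an actual Spring Modulith API):

// Illustrative only - a possible status model as sketched above.
enum PublicationStatus {
    SUBMITTED,   // recorded together with the original transaction
    PROCESSING,  // picked up by a listener; a pickedUpDate would be set here
    FINISHED,    // listener completed successfully
    FAILED       // listener threw an exception
}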

Conclusion

Thinking more about it, a big problem with the event_publication table is the misconception I mentioned. For example, I expected that the event_publication table could be seen as a "light" version of event externalization, but it definitely does not work that way (and probably shouldn't be used that way). From my gut feeling (and from talking with colleagues about it), I am not the only one who stepped into that.

Maybe I am jumping to a slightly different topic that this issue isn't about, but I think the docs should make it clearer that the current event_publication mechanism should not be seen as an externalized event processing mechanism, and should point out its limitations.

And regarding what users are expecting (and what @breun mentioned): could it make sense to have "some kind" of event externalization for Postgres or other databases? Or at least some functionality for moving the message processing further into the database? I know this is a lot of work (and definitely not part of this issue - I just wanted to mention it here), but I have a feeling that that's what developers want and "see" in the event_publication table, which it is not.

Edit:
Removed

(unless you use the event externalization, meaning that events are sent and handled via Kafka, SQS, SNS... etc. - haven't touched this to be honest)

I just took a look into the docs, and event externalization means that you publish events to other systems so that other applications can consume them - not that you consume events from those systems.

@aahlenst
Contributor

aahlenst commented Nov 7, 2024

It is unclear to me what Modulith's responsibility should be: ensuring the delivery of events only or also dealing with problems.

To ensure the delivery of events, failedDate seems questionable. Either the event has been processed or not. Handling failures is the responsibility of the application code. The only failures the application cannot handle are failures caused by the event delivery mechanism (event can no longer be deserialized, …). But I have a hard time imagining a scenario where I might find failedDate alone useful. Either I need more diagnostics (see next paragraph), or I have to choose between dropping the event or retrying forever.

For handling problems, failedDate alone is inadequate. For effectively dealing with failed deliveries, I would have to distinguish whether the failures might be transient (think OptimisticLockingFailureException, some network problem) or permanent (object concerning the event has been deleted, …). Furthermore, I would want to keep track of the number of failures and stop retrying after some attempts. This means we're quickly getting into the territory of Spring Retry and friends.

Therefore, I think Modulith should focus on the delivery side of things, for example, by better tracking the status (event queued, event processing, …) and providing better docs, and perhaps some callbacks, on how to deal with event delivery problems, like event classes that have been removed or event listeners that no longer exist.

@ooraini

ooraini commented Dec 1, 2024

Event publication in Modulith is simple, and I think (my opinion) that expanding it will lead to scope creep, and you will not know when to stop.

Thinking about the problem differently: instead of an event handler that is recorded (and committed) as part of a transaction, what about scheduling a job? Now you are solving a different problem, running background jobs, which we have many solutions for, and they typically solve:

  • Locking (this issue)
  • Failure handling and retries
  • ...

I'm experimenting with db-scheduler and it does exactly that.

@breun

breun commented Dec 1, 2024

In order to identify them in a multi-instance setup you would have to keep track of which instance is currently active.

Does this mean you assume that only one instance in a multi-instance setup is actively doing work?

I’m still trying to wrap my head around this concept and its implications. I have the feeling I don’t grasp it completely yet, but my main feeling is now that it feels ‘dangerous’ to use this for high-traffic multi-instance services, which is a typical use case for me.

I’ve seen most Spring Modulith talks promote using events instead of direct method calls across module boundaries, and I see the benefits of decoupling via events. But I feel that this event publication mechanism could become a performance/availability risk when it becomes a dependency for every call to a high-traffic module method that publishes an event for other modules to listen to. In that sense a direct method call feels ‘safer’ than using events: no database that could have a full table or otherwise impact performance or availability.

Is this an irrational fear I have of the Spring Modulith event publication mechanism? Should I just configure Modulith to auto-remove completed events in a high-traffic scenario, to keep the event log from growing indefinitely, and otherwise not worry about it too much?

@ivangsa

ivangsa commented Mar 24, 2025

Including a failed_attempts column would help implement max-retries logic, as well as "some other node is working on it" logic when failed_attempts == 0.
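
For example (a hypothetical sketch with made-up table and column names), a claim that also caps retries could look like:

// Atomically claim a failed publication while capping retries; the
// conditional UPDATE matches the row for exactly one instance.
int claimed = jdbcTemplate.update(
    "UPDATE event_publication " +
    "SET status = 'PROCESSING', failed_attempts = failed_attempts + 1 " +
    "WHERE id = ? AND status = 'FAILED' AND failed_attempts < ?",
    id, maxRetries); // claimed == 1 means this instance won the claim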

@odrotbohm
Member Author

I've played with a couple of ideas that created new questions (who would've thought!? :D).

Event Publication States

Fundamentally, I'd like to introduce new states for event publications. A basic new state machine would look like this:
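
Roughly, in text form (a sketch of the transitions described below):

published --> processing --> completed
                  |
                  v
                failed --(resubmission)--> published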

The core goal here is to introduce the new processing state, so that we can clearly differentiate between publications currently being worked on and ones that have failed. Resubmissions would flip the state from failed back to published. The transition from published to processing would be triggered by the interceptor decorating the listeners.

An alternative state design could introduce a dedicated resubmitted state as follows:
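
Again as a rough textual sketch:

published --> processing --> completed
                  |
                  v
                failed --> resubmitted --> processing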

This would have the advantage of being able to differentiate between "fresh" submissions and resubmitted ones based on the state alone, without having to refer to further publication details (more on that below). I don't see the additional state having an impact on the complexity of the necessary database interactions, so I think it might be worth introducing, as making concepts explicit is a valuable thing in general. I'd love to hear your opinions on that.

Event Publication Details

As many of you have expressed in the discussion so far, it might be useful to capture additional information in the publications.

  • status – obviously needed
  • lastFailureDate, lastResubmissionDate – I think these would be helpful to identify publications that failed a long time ago and have not been resubmitted. Or, more precisely: ones where the time span between the two dates grows beyond a certain threshold.
  • numberOfCompletionAttempts / numberOfFailures – I am leaning towards the former, as it would allow us to increment that number during the state transition from published / resubmitted to processing, i.e. before the event listener logic is triggered. That logic might be hazardous to the system, and even the system crashing entirely would still leave us with the resubmission attempt recorded. Updating numberOfFailures, in contrast, would have to happen in response to a failure, in a system that has already seen some kind of failure and is thus less likely to succeed.

These additional properties would allow us to offer the following functionality out of the box:

  • explicitly select only failed publications and ones in progress
  • safely resubmit publications automatically in multi-instance scenarios (select … for update; see the sketch after this list)
  • automatically resubmit publications stuck in published / resubmitted in case the system crashed in between the commit of the original business transaction and the actual execution of the listener
  • automatic resubmission up to a configurable threshold (“don't retry more than 5 times”)
  • guard resubmission by the age of a publication (“don't retry if the publication has failed more than 5 hours ago”)
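
To make the multi-instance selection concrete, a resubmission query along these lines could be used (a sketch only, executed within a transaction; schema and column names are assumptions, and SKIP LOCKED needs database support):

// Pick failed publications eligible for resubmission and lock the rows so
// no other instance resubmits them concurrently.
List<UUID> eligible = jdbcTemplate.query(
    "SELECT id FROM event_publication " +
    "WHERE status = 'FAILED' " +
    "AND completion_attempts < ? " +   // configurable retry threshold
    "AND last_failure_date > ? " +     // don't retry publications that failed too long ago
    "FOR UPDATE SKIP LOCKED",          // concurrent instances skip locked rows
    (rs, rowNum) -> UUID.fromString(rs.getString("id")),
    maxAttempts, Timestamp.from(Instant.now().minus(maxAge)));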

Call to action

I'd love to hear whether you think anything obvious is missing or — alternatively — whether such an arrangement would allow you to achieve the things you want or currently do (albeit today in a more complex way). For the new functionality to be added, I see that we'll have to discuss which parts should be exposed as simple configuration options and which need callback APIs, but I'd like to postpone that discussion for now.

@ghferrari

@odrotbohm I think the general proposal to store information about the event lifecycle is a very good one. In particular, I think it brings a brilliant opportunity to solve some significant problems with the current version of Spring Modulith.

Based on your conversation with @breun above and other discussions (#402, #727), I've understood that:

  • Spring Modulith events are (at least initially) intended to be published and processed within the same JVM instance.
  • Spring Modulith serializes its events and stores them in your database.
  • It is possible and desirable to configure a Spring application to republish incomplete events on startup. For example, if your Spring app gets restarted before an event can be completed, it can be configured to "republish" these events, which I take to mean those events will then get processed within that new Spring instance.

This situation causes several problems:

  • It is more difficult to use Spring Modulith in situations where more than one instance of your Spring app is running (i.e. when your app is horizontally scaled). As mentioned in the other discussions, using Spring Modulith in a horizontally scaled context requires somehow arranging that only one instance of Spring is configured to republish events on startup. Otherwise, you could end up with multiple Spring instances processing the same event - with potentially catastrophic consequences.
  • Even if you (somehow) arrange for only one Spring instance to be republishing events, that means that the number of Spring app instances processing republished events will only ever be a maximum of 1. That is a clear bottleneck in a horizontally scaled application.
  • There's also a related bottleneck with events when they are published the first time. Suppose you have 5 instances of your Spring app running, and suppose one of your instances publishes 1000 events. There is currently no mechanism to share that workload over the five instances. Instead, these events will (in the first instance) only be attempted by the instance that published them.

On a side note, I should point out I found none of the above information in the current Spring Modulith documentation, and the discussions I mentioned above seem to show that others also did not realise that Spring Modulith worked like this. So, my first suggestion is that the current Modulith documentation really needs revising to add these details. Perhaps (just a guess) this stuff is obvious to some people (I don't know, maybe that's always how Spring Application Events have worked...) but it wasn't obvious to me and it's not spelled out in the documentation. Given how common horizontal scaling is, I would say the documentation needs updating urgently because otherwise the effects might be catastrophic.

But it seems to me that adding state information about the lifecycle of events provides just the opportunity you need to solve ALL of the above problems - and to be able to do it out of the box, without needing anything like ShedLock.

From @odrotbohm's draft state diagrams above, we see that events arrive in the database in state published and then move into state processing. In that case, it seems like it's going to be ultra-simple to ensure that events are processed by only one instance of a Spring application: simply wrap the transition from published to processing in an appropriate database transaction and voilà, only one app instance will be able to start processing a given event. In a single stroke, this solves all the problems above:

  • You can use a horizontally scaled Spring Modulith app out of the box without needing anything like ShedLock
  • You can, if you wish, make republishing events on startup the default. Or at least, users can configure that across all instances of their app.
  • All the scaling issues are removed - events and republished events can be processed by any instance of your app.
  • Any changes to the documentation can be minimal because most of the problems above are now avoided rather than left as an exercise to the reader. And because of that, you would no longer need to use an external event broker like Kafka to get around the current limitations.

@odrotbohm I'm sure I've oversimplified a little how easy this would be (e.g. maybe a Spring Modulith app will need to poll the database at intervals to see if there are any newly published events it can start processing...), but hopefully not by much. Does it sound feasible?

odrotbohm pinned this issue Apr 6, 2025
@odrotbohm
Member Author

Thanks for your feedback, that's appreciated! I would like this ticket to stay focused on the actual redesign of the lifecycle going forward, but here are a few closing remarks on the overall situation.

This ticket exists because we are aware of the limitations of the current approach. That said, that approach wasn't chosen accidentally. It's been the simplest thing we could get away with to deliver on the core of the task: providing a safety net for errors happening in transactional event listeners. We accepted a few cumbersome areas for developers, for example having to take care of distributed locking for multi-instance deployments, for multiple reasons: For one, we wanted to get feedback from real-world applications that would allow us to make more informed decisions about a revision than we could have made in 1.0. That revision is what we're discussing here. For another, none of the challenges you describe are unique to Spring Modulith (“potentially catastrophic consequences” 😉). Multi-instance deployments create those even for a standard Spring application that uses @Scheduled. Be reminded that we're dealing with reference documentation for an open-source project, not a book containing general architectural advice.

I think the fundamental EPR approach has proven useful and from the production deployments I have seen I'd say that it's a valuable solution with obvious areas for improvement. That's what we're working on.

While the technical challenges are understood and some of them will be addressed with this ticket, there's a theme to be identified in some of your remarks. First, the problem the event publication registry is trying to solve is clearly described in the reference documentation: a safety net for transactional event listeners, essentially – but not only – to allow application modules to be integrated in an eventually consistent way. If we started to augment the reference documentation with all the things that the EPR is not, we'd never run out of work. The chapter of the docs clearly talks about the context first, describes how the EPR helps and how it is implemented, and only then goes on to talk about the interaction with external systems. There's no work distribution mentioned or going on, and no messaging involved until that last part, and I feel that the confusion rather stems from already thinking about a message-broker-based system when all we talk about is Spring application events. I can see that folks regularly exposed to messaging systems might arrive at the docs with certain assumptions, but again, it's reference documentation, and we clearly differentiate between events and messages. If someone thinks “Kafka” every time I write “event”, repeating that we do not mean “Kafka” is not going to improve the documentation, but worsen it.

Now on to your technical feedback:

You can use a horizontally scaled Spring Modulith app out of the box without needing anything like ShedLock

It's not as easy, unfortunately. Depending on the transaction isolation level, different instances might still read the same data and try to update the same rows. In the relational world, we can get around this problem with a SELECT … FOR UPDATE. Other supported stores might still need a distributed lock in place. There is ongoing internal design work on how to build such functionality and distribute it properly across the Spring projects' ecosystem. It's likely that we're going to see something emerge in the Framework 7.0 timeframe, which means Spring Modulith 2.0 would be the version to help with that concern.

You can, if you wish, make republishing events on startup the default. Or at least, users can configure that across all instances of their app.

I am sincerely considering deprecating and eventually removing that flag, as it both creates all the wrong expectations and is not really used in practice. Real-world applications would have to augment it with some kind of scheduled resubmission logic anyway, as you would rather not have to restart an instance just to trigger that. Once you have the scheduling in place, the flag becomes useless.

All the scaling issues are removed - events and republished events can be processed by any instance of your app.

Again, I wish it were that simple. A large number of events to be resubmitted might still be an issue for the resubmission attempt. We're currently looking into techniques such as pagination to allow multiple instances to share the resubmission of publications. But at the same time, we'd like to allow developers to restrict resubmission to a single instance if needed, for example in cases in which resubmitting events in strict order is necessary.

@ivangsa

ivangsa commented Apr 8, 2025

Because SELECT … FOR UPDATE SKIP LOCKED is not supported by all relational databases, nor by MongoDB, maybe the same can be achieved with an atomic update claiming an event just before re-sending it to the event listener:

// Claim the event atomically: the conditional UPDATE succeeds for exactly one caller.
int updated = jdbcTemplate.update(
    "UPDATE outbox SET status = 'IN_PROGRESS' WHERE id = ? AND status = 'FAILED'",
    id);
if (updated == 1) { // successfully claimed; otherwise another instance claimed it
  // re-send to listener
}

And in MongoDB you could achieve the same effect with a findOneAndUpdate.
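
With Spring Data MongoDB, for example (a sketch; field and collection names are illustrative):

// findAndModify returns the document as it was *before* the update by
// default, so a non-null result means this instance won the claim.
Document claimed = mongoTemplate.findAndModify(
    Query.query(Criteria.where("_id").is(id).and("status").is("FAILED")),
    Update.update("status", "IN_PROGRESS"),
    Document.class, "event_publication");
if (claimed != null) {
    // re-send to listener
}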

This may avoid distributed locks for re-sending failed events while keeping the implementation simple.

@ivangsa

ivangsa commented Apr 8, 2025

Republishing events during startup remains challenging in multi-instance environments, as it’s unclear whether a particular instance is the initial one launching or simply an additional node scaling up.

Although the approach mentioned above (using an atomic update to claim an event) can prevent resending duplicate failed events, if the application crashes and leaves events in an "IN_PROGRESS" state, we’d want to mark those as "FAILED" only when the first instance starts.

To achieve this, we might need something like a heartbeat table or a similar mechanism...

@ghferrari

@odrotbohm @ivangsa As ever, thanks for seriously considering my feedback - I appreciate you both taking the time and giving the benefit of your expertise.

I still think the current documentation needs expanding with more information about how events work and their limitations and potential pitfalls. Without that, the reference documentation is simply incomplete, and I think the other conversations I pointed to demonstrate that: when several people feed back that they read the documentation carefully and still did not understand something very important, that's worth taking seriously. Naturally, I'm just one voice in the community, but if there were a vote, I'd say that adding the missing pieces to the current documentation is more important than building new features. And it only took a couple of paragraphs in my original comment...

My question/proposal about allowing Modulith events to be distributed across instances was a response to @odrotbohm's invitation to hear feedback from the community about the potential exciting benefits of the new lifecycle states. I appreciate you need to support multiple data storage types, which complicates the issue with transactions, and I appreciate that spreading workload across instances isn't the current target of these changes. I just thought it was worth exploring, and the improved lifecycle management seemed to point naturally towards enabling workload to be shared across multiple instances. And if it could be done without too much extra work, there could be great benefits.

Thanks again, both, and keep up the great work :-)

@ivangsa

ivangsa commented Apr 9, 2025

Hi @ghferrari, by distributing events across instances, do you mean that event A, which has N listeners, would have some of those listeners spread across different instances? I hadn't thought about this.

I was talking more about resubmitting failed events. When scheduling the resubmission on multiple instances, there is a chance that several instances pick up the same event for resubmission, resulting in duplicates.

Currently, the 'only' solution is to use distributed scheduling locks (e.g. ShedLock), because the event publication registry only offers completion_date == null as a status marker, which is not enough to lock/claim failed events across instances.

However, with the new event lifecycle, if there's a status column, then while looping through failed events (in DefaultEventPublicationRegistry.processIncompletePublications(...)), even if two instances are looping simultaneously, one instance could 'claim' an event and prevent the other instances from processing it too.

And this approach would work consistently for any SQL database or MongoDB. Even if it doesn't become part of the Spring Modulith library, you could still implement it yourself. There would be no need for distributed locks.

But spreading listeners across multiple instances is a totally different beast...

@breun

breun commented Apr 9, 2025

I still think the current documentation needs expanding with more information about how events work and their limitations and potential pitfalls. Without that, the reference documentation is simply incomplete, and I think the other conversations I pointed to demonstrate that: when several people feed back that they read the documentation carefully and still did not understand something very important, that's worth taking seriously.

I’d like to +1 this. When I initially learned about Spring Modulith and went through the documentation I thought that everything would Just Work™️ in a distributed setup with multiple application instances sharing a database.

How to use these events in a multi-instance setup is still not clear to me, but even if there is no (simple) answer to that, a warning that it isn’t supported out of the box would have avoided some misunderstandings for me and my colleagues.

@ghferrari

@ivangsa As a quick reply to your last comment: I don't see any important difference between publishing events and republishing them. Suppose you have 10 app instances running with identical configuration and without anything like ShedLock. If there's a reliable mechanism for allowing failed events to be resubmitted/re-adopted by just one of those instances, then you can use the same mechanism for just-published events, allowing them to be spread across app instances. The basic idea would be that the instance that initially publishes an event stores it in the database with status published, but that instance won't necessarily respond to the event. Instead, all instances poll the database for newly published events, and whichever one claims a published event first gets to run it.

If we want to discuss this further, I suggest moving to a different discussion thread unless @odrotbohm specifically wants to keep the discussion here. It's not unrelated to the lifecycle proposal, but separate enough to move to another thread if anyone wants to discuss it further.
