Listener registration failing with RunnerScaleSetNotFoundException #3935

tomhaynes · 2025-02-19T12:31:37Z

Checks

I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
I am using charts that are officially provided

Controller Version

0.9.3,0.10.1

Deployment Method

Helm

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions).
I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This has been happening this morning across our estate. Runnersets that had been happily running and registered are suddenly failing with this.

Describe the bug

Having previously been healthy, our listeners are failing to register with the Github API, throwing the following error:

2025/02/19 07:05:16 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "c16014ee-8847-41ae-87e9-0feb3061b89a": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.

This causes them to repeatedly retry until we have exhausted our API limits, causing all runners to cease to work.

Describe the expected behavior

Successful registration

Additional Context

Controller Logs

2025-02-19 12:29:28.363	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Created listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener"}
2025-02-19 12:29:28.350	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Creating listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener"}
2025-02-19 12:29:28.347	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Creating a listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}}
2025-02-19 12:29:27.545	
2025-02-19T12:29:27Z	INFO	AutoscalingListener	Listener pod is terminated	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener", "reason": "Error", "message": ""}

Runner Pod Logs

025-02-19 12:23:42.746	
2025/02/19 12:23:42 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "7efdb2aa-1b5d-471a-8557-855bd8eeea18": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:30.867	
2025/02/19 12:23:30 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "c0c41d56-8847-41ae-87e9-0feb3061b89a": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:26.668	
2025/02/19 12:23:26 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "fd0647bd-4c98-4217-902b-0b7ca3343818": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:22.694	
2025/02/19 12:23:22 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "fd064c6c-4c98-4217-902b-0b7ca3343818": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:10.578	
2025/02/19 12:23:10 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "9802e18f-984c-4049-ad4b-b3d837c00514": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:06.608	
2025/02/19 12:23:06 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "7efd86a8-1b5d-471a-8557-855bd8eeea18": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.

The text was updated successfully, but these errors were encountered:

github-actions · 2025-02-19T12:32:12Z

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

tomhaynes · 2025-02-19T12:33:15Z

@nikola-jokic following from the problems we saw a few weeks ago, could this be an issue on the Github API side?

nikola-jokic · 2025-02-19T13:32:21Z

Hey @tomhaynes,

It seems like the scale set has been removed at 2025-02-19T00:37:00Z. This might be an issue on ARC side. Can you please describe what was happening on the cluster during that time? The controller log and the listener log should help us understand what was going on around that time.

tomhaynes · 2025-02-19T13:44:37Z

Hi thanks for the response @nikola-jokic, I'm perhaps being stupid but where do you see that timestamp?

This is a development cluster, and its shutdown at midnight 2025-02-19T00:00:00Z. I did wonder if perhaps its a race condition between the controller shutting down, and it not correctly cleaning up the various CRDs that it controls?

We are also seeing this error now:

ERROR	Reconciler error	{"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"xxx","namespace":"xxx"}, "namespace": "xxx", "name": "xxx", "reconcileID": "89d4d0d7-130c-4c6d-8a02-06d27d1cf277", "error": "failed to get actions service admin connection on refresh: github api error: StatusCode 422, RequestID \"87F4:155BD3:62627C:7A7441:67B5D5B6\": {\"message\":\"Validation Failed\",\"documentation_url\":\"https://docs.github.com/rest\",\"status\":\"422\"}"}

Uninstalling a specific gha-runner-scale-set chart, cleaning up all associated CRDs and reinstalling does appear to resolve the problem for that runner.

Are there any logs to look out for in the controller that might indicate a non-graceful shutdown?

nikola-jokic · 2025-02-19T14:00:05Z

We looked into traces on the back-end side to understand what is going on. It is likely the race condition. If the controller shuts down without having enough time to clean up the environment, it can cause issues like this.
Having said that, we should invest more effort into making this kind of issue recoverable. Perhaps, we should try to re-install resources when we notice errors like the one you reported.

As for the log, this is also tricky. Basically, you would have to inspect the log and see that some steps that should be taken are missing. Having said that, it would be a good idea to log as soon as the shutdown is received, so you can spot these issues by checking logs below the termination mark. This solution cannot be perfect, especially when the controller is stopped without any graceful termination period, but it would help to diagnose issues with the cleanup process.

tomhaynes · 2025-02-19T14:23:45Z

We've got slightly closer possibly with one of the errors. A runnerset is throwing this error:

2025-02-19 14:11:43.776 | 2025/02/19 14:11:43 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "ea91597a-3ca5-498e-a6e6-82ea9d2779b1": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 1.

And I can see that the "Runner scale set" is missing when I look at the repository in Github UI. What would have removed that runner set on the github side? Could we raise a feature request to have the autoscalingrunnerset recreate it when this happens?

tomhaynes · 2025-02-19T15:32:06Z

could it be related to this? actions/runner#756

tomhaynes · 2025-02-19T16:28:25Z

we've worked out a semi unpleasant way to force re-registration:

# To re-install the runners in a cluster:
export namespace=$namespace

# Remove the annotations from the autoscalingrunner
kubectl -n $namespace get autoscalingrunnerset --no-headers | awk '{print $1}' | xargs kubectl -n $namespace patch autoscalingrunnersets --type=merge -p='{"metadata": {"annotations": {"actions.github.com/values-hash": null,"runner-scale-set-id": null}}}'

# Remove the ephemeralrunnersets
kubectl -n $namespace get ephemeralrunnerset --no-headers | awk '{print $1}' | xargs  kubectl -n $namespace delete ephemeralrunnerset

# Remove the annotation from the autoscalinglistener
kubectl -n $namespace get autoscalinglistener --no-headers | awk '{print $1}' | xargs kubectl -n $namespace patch autoscalinglistener --type=merge -p='{"metadata": {"annotations": {"actions.github.com/runner-spec-hash": null}}}'

.. which at least avoids the finalizer hell of helm uninstalls. It'd be great to understand what causes the runner sets to disappear on the Github repo side?

Also is there any way to request an API to list the runner sets on the repo? I saw it was raised here #2990 and the raiser was directed to the community page.. I tried and failed to see whether it has been requested there...

nikola-jokic · 2025-02-20T11:10:03Z

So the deletion probably occured inside the autoscaling runner set controller. The shell script you just wrote forces the controller to think this is a new installation and re-creates resources properly, removing old resources and starting from scratch.
If you have it, can you please provide the controller log before and after it was terminated. I would love to inspect what was going on and fix this so you don't have to use hacks to recover.

As for the API documentation, we did talk about documenting scale sets APIs, but not just yet. There are some improvements we want to do, and some of them would be considered as breaking changes.

dagi3d · 2025-02-21T21:59:43Z

solution proposed by @tomhaynes also worked for me as I was facing the exact same issue, thanks a lot

tomhaynes · 2025-02-24T12:11:49Z

Hi @nikola-jokic we've hit this issue quite a lot this morning. I've attached logs from an example listener, and the controller from shutdown time on Friday evening, and startup this morning. I can't see anything particularly relevant in them, mind. The controller does not seem to log anything at all at the point that it is shutdown. Could you please answer the following:

Can you point to where in the controller code it does this replica set deletion? Is there a higher log level that we could set to see where this is happening? We could also fork the code to add extra logging if that would help...
As I linked above there has been significant discussion around the replica sets being cleaned up on the Github side if they are idle for a time. We have not changed our setup at all, and it had previously been quite stable for a number of months.. which suggests something may have changed on the Github side. Also the fact that the issue mainly happens after a weekend suggests it is down to some kind of expiry. Is this also a possible explanation?
We've previously mentioned this possibly being a race condition between listener & controller deletion - am I right in thinking that we should try to keep the controller up for longer than the listeners? If that is the case we can look to add a prestop hook to the controller pod, to pause the TERM signal.

gha-arc-logs.txt

nikola-jokic · 2025-02-24T13:59:14Z

Hey @tomhaynes,

Just to confirm I understood correctly, you removed the installation on Friday along with the controller, and you re-installed it this morning?

To answer your questions, here is the area that is responsible for the cleanup. Basically, we uninstall the listener. Then we uninstall the ephemeral runner set which can take some time depending on the number of runners that are online. Once we are finished, we delete the scale set. At that point, the cleanup is successful.

To answer the second question, I would first need to understand how the environment is shut down. I think probably there is the root of the issue, so please help me understand the shutdown steps.

You are right, the lifetime of the listener must be shorter than the lifetime of the controller. Please help me understand the shutdown, so I can try to reproduce it properly.

tomhaynes · 2025-02-24T15:46:19Z

Hi @nikola-jokic thanks for the quick reply - we don't remove the installation / any CRDs, we just gracefully terminate the Kubernetes worker nodes i.e. shutdown the pods via SIGTERM. The entire cluster is shutdown at the same time, so there are no guarantees around which GHA pods shutdown first.

The next day / week the environment powers back up with the CRDs unchanged.

We've not changed this design for a long time - we used to use the summerwinds controllers but migrated over to arc in ~March last year.

tomhaynes · 2025-02-24T17:18:06Z

hmm the controller's helm chart doesn't support lifecycle hooks, I could add this in but if this was the issue surely others would be seeing the same..

tomhaynes · 2025-02-25T09:35:08Z

@nikola-jokic things have started fairly healthy this morning - given these issues seem to largely happen after the weekend, when our dev environments have been down for a few days, it seems highly probable that it is not some kind of shutdown race condition, but rather some kind of expiry process happening in the Github backend .. if I gave you a specific runner set for a specific repo do you have an api audit log that would show what removed it?

nikola-jokic · 2025-02-25T09:39:28Z

Hey @tomhaynes, yes please! It might be something related to session expiration, so please send the scale sets that are misbehaving, so we can see what wen't wrong there. Thank you for providing everything I'm asking for in a timely manner!

tomhaynes · 2025-02-25T09:47:57Z

great thanks @nikola-jokic, so over the weekend on this repo the following runner scale set disappeared (redacted after analysis done):

repo: https://github.com/REDACTED/REDACTED
runner scale set name: REDACTED
runner scale set id: 44

First incidence of the log error:

2025/02/24 07:22:18 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "eac4d5e6-16ab-4ff6-a77c-a2b4b2537055": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 44.

nikola-jokic · 2025-02-25T11:46:31Z

Hey @tomhaynes,

We found the root cause of this issue! We disabled the cleanup for your scale sets until we permanently fix it, so you shouldn't encounter similar problem anymore. I will keep you all posted about the progress, and if anyone else experiences a similar issue before the fix is done, please let me know!

And thank you again for providing all the necessary information we needed to find the root cause!

tomhaynes · 2025-02-25T13:49:32Z

Brilliant thanks @nikola-jokic! thank you for all your help

tomhaynes · 2025-02-25T13:57:06Z

We disabled to cleanup for your scale sets
@nikola-jokic can I just check the scope of this - it's for the entire ITV org? Are there any risks to disabling this?

nikola-jokic · 2025-02-25T15:09:13Z

There should be no risks, we just disabled the automatic cleanup for now, until we completely fix it. That just means that if you stop the cluster and the scale set is not talking to our API for more than 7 days, they won't be cleaned up automatically.

tomhaynes · 2025-02-26T10:22:37Z

Morning @nikola-jokic as I said above could you confirm this has been done for our entire org? We've seen more disappearing scale sets this morning 😢

nikola-jokic · 2025-02-26T11:43:13Z

Hey, our support reached out to your company to ask for specifics of your setup. This feature is scale set based, so if you have some scale sets registered on the repo level, and others registered on the org level, we need to enable this feature for org, and for each repository. Please let them know so we can properly enable it for all scale sets you have running. Sorry for the inconvenience, I thought it will be sorted out through that channel of communication.

cmiller01 · 2025-02-26T20:24:55Z

Chiming in that we're experiencing the same issues, where listener wasn't able to come up, and effectively reregistering scale set with: #3935 (comment) works

tomhaynes · 2025-02-27T09:30:30Z

@cmiller01 @dagi3d We've just had a reply from a separate support thread, so hopefully we should stop seeing these issues 🙏 :

The engineering team has found a different way to mitigate the issue and this is now rolled out globally.
 
Scale sets created more than a year ago will still get removed if they are offline for a day.
Scale sets created recently will not face this issue.

tomhaynes added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Feb 19, 2025

marktuckcp mentioned this issue Feb 20, 2025

Scaleset controllers stuck every night with a "RunnerScaleSetSessionConflictException: there is already an active session" #3922

Closed

4 tasks

nikola-jokic removed the needs triage Requires review from the maintainers label Feb 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Listener registration failing with RunnerScaleSetNotFoundException #3935

Listener registration failing with RunnerScaleSetNotFoundException #3935

tomhaynes commented Feb 19, 2025

github-actions bot commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 20, 2025

dagi3d commented Feb 21, 2025

tomhaynes commented Feb 24, 2025 •

edited

Loading

nikola-jokic commented Feb 24, 2025

tomhaynes commented Feb 24, 2025 •

edited

Loading

tomhaynes commented Feb 24, 2025

tomhaynes commented Feb 25, 2025

nikola-jokic commented Feb 25, 2025

tomhaynes commented Feb 25, 2025 •

edited

Loading

nikola-jokic commented Feb 25, 2025 •

edited

Loading

tomhaynes commented Feb 25, 2025

tomhaynes commented Feb 25, 2025

nikola-jokic commented Feb 25, 2025

tomhaynes commented Feb 26, 2025

nikola-jokic commented Feb 26, 2025

cmiller01 commented Feb 26, 2025

tomhaynes commented Feb 27, 2025

Listener registration failing with RunnerScaleSetNotFoundException #3935

Listener registration failing with RunnerScaleSetNotFoundException #3935

Comments

tomhaynes commented Feb 19, 2025

Checks

Controller Version

Deployment Method

Checks

To Reproduce

Describe the bug

Describe the expected behavior

Additional Context

Controller Logs

Runner Pod Logs

github-actions bot commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

tomhaynes commented Feb 19, 2025

nikola-jokic commented Feb 20, 2025

dagi3d commented Feb 21, 2025

tomhaynes commented Feb 24, 2025 • edited Loading

nikola-jokic commented Feb 24, 2025

tomhaynes commented Feb 24, 2025 • edited Loading

tomhaynes commented Feb 24, 2025

tomhaynes commented Feb 25, 2025

nikola-jokic commented Feb 25, 2025

tomhaynes commented Feb 25, 2025 • edited Loading

nikola-jokic commented Feb 25, 2025 • edited Loading

tomhaynes commented Feb 25, 2025

tomhaynes commented Feb 25, 2025

nikola-jokic commented Feb 25, 2025

tomhaynes commented Feb 26, 2025

nikola-jokic commented Feb 26, 2025

cmiller01 commented Feb 26, 2025

tomhaynes commented Feb 27, 2025

tomhaynes commented Feb 24, 2025 •

edited

Loading

tomhaynes commented Feb 24, 2025 •

edited

Loading

tomhaynes commented Feb 25, 2025 •

edited

Loading

nikola-jokic commented Feb 25, 2025 •

edited

Loading