fix: argoexec kill command in Windows to use osspecific.Kill #14352

Draft
wants to merge 1 commit into
base: main

@criscola criscola commented Apr 4, 2025

Fixes #14297 (see discussion also).

Motivation

When a Windows container fails, for example because of a crash, it goes into the Error state. The wait container then triggers the argoexec kill command to signal to the Argo workflow controller that the step should be retried by creating another Pod. For Windows containers, this command failed with exit code 64 (an error in argoexec) because os.Kill was used. Replacing os.Kill with osspecific.Kill did not solve the issue on its own: unlike Linux containers, Windows containers don't have a PID 1, so osspecific.Kill had to be adapted to also find the PID of the argoexec process. In our tests the fix works as expected: Windows containers in the Error state no longer get stuck, Argo creates another Pod, and the erroring Pod is cleaned up when the workflow completes.
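
To illustrate the PID lookup idea described above, here is a minimal, platform-independent sketch in Go. It is not the code from this PR: the `procInfo` type, the `findPIDByName` and `resolveKillTarget` helpers, the executable name `argoexec.exe`, and the convention that a requested PID of 1 means "the main process" are all assumptions made for illustration. In a real Windows implementation the process list would have to come from the operating system (for example a process snapshot), which is left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// procInfo is a hypothetical, simplified view of one process table entry.
type procInfo struct {
	PID  int
	Name string
}

// findPIDByName returns the PID of the first process whose executable name
// matches name. The comparison is case-insensitive because Windows file
// names are case-insensitive.
func findPIDByName(procs []procInfo, name string) (int, bool) {
	for _, p := range procs {
		if strings.EqualFold(p.Name, name) {
			return p.PID, true
		}
	}
	return 0, false
}

// resolveKillTarget maps the PID requested by the caller to the PID that
// should actually be signalled. Assumption: a request for PID 1 means
// "the main process", which on Windows has to be looked up by name.
func resolveKillTarget(requested int, procs []procInfo, name string) (int, error) {
	if requested != 1 {
		return requested, nil
	}
	pid, ok := findPIDByName(procs, name)
	if !ok {
		return 0, fmt.Errorf("no process named %q found", name)
	}
	return pid, nil
}

func main() {
	// Made-up process table for demonstration purposes only.
	procs := []procInfo{
		{PID: 4, Name: "System"},
		{PID: 1320, Name: "svchost.exe"},
		{PID: 4242, Name: "argoexec.exe"},
	}
	pid, err := resolveKillTarget(1, procs, "argoexec.exe")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("target pid: %d\n", pid)
}
```

Keeping the lookup separate from the actual kill call makes the name-matching logic testable on any platform, while the OS-specific part stays limited to listing processes and signalling the resolved PID.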

Modifications

Modifies the argoexec kill command to use osspecific.Kill and implements the corresponding Windows code for it.

Verification

Triggered a bluescreen by killing svchost.exe in the container, then observed that argoexec no longer fails when trying to kill the erroring Pod.

Successfully merging this pull request may close these issues.

Argoexec kill command does not work in case of Windows container failures