fix: argoexec kill command in Windows to use osspecific.Kill #14352
Fixes #14297 (see discussion also).
Motivation
When a Windows container fails, e.g. due to a crash, it goes into the Error state. The wait container then triggers the argoexec kill command to signal to the Argo workflow controller that the step should be retried by creating another Pod. In the Windows container case, the command failed with code 64 (an error in argoexec) because os.Kill was used. Replacing it with osspecific.Kill alone did not solve the issue: unlike Linux, Windows containers don't have a PID 1, so osspecific.Kill had to be adapted to also find the PID of the argoexec process. In our tests the fix works as expected: Windows containers in Error state no longer get stuck, Argo creates another Pod, and when the workflow completes the erroring Pod is cleaned up.
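To illustrate the PID lookup described above, here is a minimal sketch, not the code in this PR. It assumes the process list has already been obtained somehow; the `procInfo` type, the `targetPID` function, and the `argoexec.exe` image name are hypothetical names chosen for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// procInfo is a hypothetical, simplified view of one process-table entry.
type procInfo struct {
	PID  int
	Name string
}

// targetPID picks the PID that a kill should be sent to.
// Assumption: on non-Windows systems the target is PID 1, while on Windows
// the process has to be found by its image name (compared case-insensitively).
func targetPID(goos string, procs []procInfo, name string) (int, error) {
	if goos != "windows" {
		return 1, nil
	}
	for _, p := range procs {
		if strings.EqualFold(p.Name, name) {
			return p.PID, nil
		}
	}
	return 0, fmt.Errorf("process %q not found", name)
}

func main() {
	// Made-up process list standing in for a real process snapshot.
	procs := []procInfo{
		{PID: 4, Name: "System"},
		{PID: 1284, Name: "argoexec.exe"},
	}
	for _, goos := range []string{"linux", "windows"} {
		pid, err := targetPID(goos, procs, "argoexec.exe")
		fmt.Println(goos, pid, err)
	}
}
```

Keeping the lookup separate from the actual kill call makes the Windows-specific part easy to test without a real process table.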
Modifications
Modifies the argoexec kill command to use osspecific.Kill and implements the corresponding Windows code for it.
Verification
Triggered a bluescreen by killing svchost.exe in the container, then confirmed that argoexec no longer fails when trying to kill the erroring Pod.