Description
For a while, we have been receiving sporadic reports about Keycloak not working properly, both via Alertmanager and various other communication channels.
Investigation today revealed that this is likely related to the vault-agent sidecar container that runs in every Keycloak pod. This container regularly crashes with the following error:
2025-03-24T19:23:58.026Z [ERROR] agent: runtime error encountered:
error=
| template server: vault.write(internal-tls/issue/internal-tls -> fb6ab102): vault.write(internal-tls/issue/internal-tls -> fb6ab102): Error making API request.
|
| URL: PUT http://vault.vault.svc:8200/v1/internal-tls/issue/internal-tls
| Code: 400. Errors:
|
| * cannot satisfy request, as TTL would result in notAfter of 2025-07-22T19:23:58.023842036Z that is beyond the expiration of the CA certificate at 2025-06-26T23:39:49Z
exitCode=1
Error encountered during run, refer to logs for more details.
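Presumably the Vault CA certificate is the problem here: it may have been configured with an expiration of 1 year when Vault was installed. Per the error above, the requested TTL puts the leaf certificate's notAfter (2025-07-22) past the CA's own expiration (2025-06-26), so Vault refuses to issue it.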
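To make the numbers in the error message concrete, here is a minimal Python sketch of the date arithmetic. It uses only the three timestamps from the log above. The helper name `parse_rfc3339` and the variable names are made up for illustration, and it assumes that the log timestamp approximates the time of the request and that the template always asks for the same TTL.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse a UTC timestamp like the ones in the log (hypothetical helper).

    Digits beyond microsecond precision are truncated.
    """
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", ts)
    if m is None:
        raise ValueError(f"unsupported timestamp: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base.replace(tzinfo=timezone.utc) + timedelta(microseconds=int(frac))


# Values copied from the vault-agent error above.
requested_at = parse_rfc3339("2025-03-24T19:23:58.026Z")     # log timestamp
not_after = parse_rfc3339("2025-07-22T19:23:58.023842036Z")  # requested notAfter
ca_expiry = parse_rfc3339("2025-06-26T23:39:49Z")            # CA expiration

# Assumption: the log line was written at (almost) the time of the request,
# so notAfter minus the log timestamp approximates the requested TTL.
ttl_days = round((not_after - requested_at) / timedelta(days=1))
ca_days_left = (ca_expiry - requested_at).days
overshoot_days = (not_after - ca_expiry).days

# Assumption: the TTL is constant, so requests started failing once the CA
# had less than one TTL of lifetime left.
failing_since = ca_expiry - timedelta(days=ttl_days)

print(f"requested TTL:         ~{ttl_days} days")
print(f"CA lifetime remaining: {ca_days_left} days")
print(f"notAfter overshoot:    {overshoot_days} days")
print(f"failing since about:   {failing_since.isoformat()}")
```

Under those assumptions, the requested TTL works out to roughly 120 days while the CA had only 94 days left at the time of the logged request. This suggests that issuance would have started failing around 2025-02-26.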
In the 43 days since the Keycloak pod was created, it has been restarted 3892 times, which is about once every 16 minutes on average.
Keycloak's own logs show no major problems during the same timeframe.
Action items
- Fix the current issue
- Document how to fix this issue in the future in a runbook
- Expand the documentation in kubernetes/namespaces/vault/README.md as applicable
Out of scope for now
- Configure a metrics endpoint for Vault to monitor the CA certificate lifetime (will open a separate issue)
- Police DevOps members to take alerts seriously (electric shock therapy conflicts with Chris' pacemaker)
- Remove Keycloak
- Remove Vault