Ruler doesn't track internal errors vs user errors correctly #4333

Closed
@pstibrany

Description

The recent PR #4281 introduced new metrics for tracking errors from the ruler. The idea was that these failure metrics would only be incremented for queries that would normally result in 500 errors.

However, this doesn't quite work. There is a class of errors for which the querier would return 422, but which the ruler still treats as internal errors. Examples include a rule that fails with a "multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)" error, or a rule like label_replace(metric, "foo", "$1", "service", "[") that uses the invalid regex "[".
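To illustrate why the second example is a user error rather than an internal one, here is a minimal Go sketch that compiles the regex argument with the standard regexp package. The helper name compileAnchored and the exact anchoring wrapper are assumptions made for illustration; they are not copied from the Prometheus source.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAnchored is a hypothetical helper: it wraps the user-supplied
// pattern so that it must match the whole label value, then compiles it.
// The wrapper shape is an assumption, not the actual Prometheus code.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	return regexp.Compile("^(?:" + pattern + ")$")
}

func main() {
	// "[" opens a character class that is never closed, so compilation fails.
	if _, err := compileAnchored("["); err != nil {
		fmt.Println("invalid regex in rule:", err)
	}

	// A well-formed pattern compiles without error.
	if _, err := compileAnchored("(.*)"); err == nil {
		fmt.Println("valid regex")
	}
}
```

The failure depends only on the rule text, not on the state of the backend, which is why it belongs with the 422-style errors.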

Why do these return 422 from the querier, while the ruler doesn't recognize them as user errors? The ruler uses querier.TranslateToPromqlAPIError to translate errors to PromQL errors. This function was designed for use by the querier: it makes errors returned from the storage.Queryable implementation match the errors expected by the Prometheus API library.

However, the class of errors mentioned above doesn't "go" through the storage layer; these errors are generated by the PromQL engine itself. As such, they are internal Prometheus errors, but the Prometheus API layer can translate them to proper status codes. Unfortunately, the Cortex ruler doesn't use the Prometheus API library to run queries, and currently cannot do type assertions on these internal Prometheus errors.

(By "Prometheus API library/layer" I mean the github.com/prometheus/prometheus/web/api/v1 package, which implements the HTTP API in Prometheus.)
