add rules and alertmanager sidecar to cortex-helm-chart #150


Merged

nschad merged 20 commits into cortexproject:master from ruler-alertmanager-sidecar on Jun 18, 2021

Conversation

elliesaber
Contributor

This PR uses the same pattern that the official Grafana chart uses to dynamically configure dashboards, datasources, and notifiers.
I tested adding both the rules and alertmanager sidecars on my fork; they successfully discovered the rules and alertmanager config, which got added to the ruler and alertmanager.
The ConfigMaps can be put in the specified namespace, where they are automatically detected and added as files to the Ruler and/or AlertManager containers. This allows easy, extensible configuration and avoids having to store state in the Cortex system itself.
This feature is disabled by default; below is an example of how it can be enabled in values.yaml:

sidecar:
  rules:
    enabled: true
    searchNamespace: cortex-rules
  alertmanager:
    enabled: true
    searchNamespace: cortex-alertmanager
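
With this enabled, a k8s-sidecar container watches the search namespace for labeled ConfigMaps and writes their contents into the Ruler/AlertManager containers. As a rough sketch of a rule ConfigMap it could pick up (the cortex_rules label and the k8s-sidecar-target-directory annotation follow the working example shared later in this thread; the ConfigMap name and the rule itself are made up for illustration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: example-rules            # hypothetical name
  namespace: cortex-rules        # must match sidecar.rules.searchNamespace
  labels:
    cortex_rules: "1"            # label the rules sidecar watches for
  annotations:
    # the last path segment is the Cortex tenant ("fake" when auth is disabled)
    k8s-sidecar-target-directory: /tmp/rules/fake
data:
  example.yaml: |-
    groups:
      - name: example
        rules:
          - record: job:up:avg
            expr: avg by (job) (up)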

@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 3a6d686 to bc385f5 on May 11, 2021 07:00
Collaborator

@nschad nschad left a comment


Missing CHANGELOG.md documentation, and please don't bump the version :)

@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 914cdb1 to 26b71bd on June 7, 2021 18:05
@nschad
Collaborator

nschad commented Jun 7, 2021

Hi @amouhadi

We just merged #144. Due to the massive refactor you will have to rebase onto master and make sure that you don't add back the old files, since their location changed to a more easily maintainable folder structure.

For you that should be templates/alertmanager/* and templates/ruler/*.

Heads-up: we have deleted the role.yaml and the binding, since we dropped support for PSPs and it didn't really do anything else.

eamouhadi added 2 commits June 7, 2021 15:43
@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 9f303f5 to 8326acd on June 7, 2021 22:52
@elliesaber
Contributor Author

> Hi @amouhadi
>
> We just merged #144. Due to the massive refactor you will have to rebase onto master and make sure that you don't add back the old files, since their location changed to a more easily maintainable folder structure.
>
> For you that should be templates/alertmanager/* and templates/ruler/*.
>
> Heads-up: we have deleted the role.yaml and the binding, since we dropped support for PSPs and it didn't really do anything else.

Thank you @ShuzZzle, I merged from master and resolved the conflicts.

Collaborator

@nschad nschad left a comment


I want to test a few more things, but looks good

Signed-off-by: eamouhadi <[email protected]>
@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 9627888 to 47c6ad3 on June 8, 2021 22:17
@elliesaber
Contributor Author

> I want to test a few more things, but looks good

hi @ShuzZzle, did you have a chance to test things?

@nschad
Collaborator

nschad commented Jun 10, 2021

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?

I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?

Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?

Signed-off-by: eamouhadi <[email protected]>
@elliesaber
Contributor Author

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
>
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?

How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/

For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:

kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/fake
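
The last path segment of that annotation is the Cortex tenant ID (with auth_enabled: false everything lands in the default "fake" tenant), since the ruler's local rule storage is laid out as <directory>/<tenant>/<file>. As a sketch, two ConfigMaps could target two hypothetical tenants like this (the tenant names are made up for illustration):

metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/team-a   # rules only for tenant "team-a"
---
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/team-b   # rules only for tenant "team-b"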

@nschad
Collaborator

nschad commented Jun 10, 2021

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?
>
> How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
> Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/
>
> For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:
>
> kind: ConfigMap
> metadata:
>   annotations:
>     k8s-sidecar-target-directory: /tmp/rules/fake

Ah okay.

> In AlertManager, the data_dir and local storage directory should be the same. In the Ruler, there needs to be two separate volumes.

Why is that so? What is the technical difference?

Collaborator

@nschad nschad left a comment


Can you please recheck the chart? Make sure it's actually running properly without adding any custom/extra stuff in your cluster. Maybe share a very minimal Cortex configuration so people can easily test it out.

Also, the role binding is unfortunately missing.

@gburton1
Contributor

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?
>
> How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
> Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/
> For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:
>
> kind: ConfigMap
> metadata:
>   annotations:
>     k8s-sidecar-target-directory: /tmp/rules/fake
>
> Ah okay.
>
> In AlertManager, the data_dir and local storage directory should be the same. In the Ruler, there needs to be two separate volumes.
>
> Why is that so? What is the technical difference?

The Ruler allows you to pass static rules to it in a certain location. When it starts up, the Ruler consumes all the rules in that location (and keeps polling it over time) and inserts them into the live rule config that it manages in a separate location. AlertManager is simpler: it consumes its config from a single location.
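
In Cortex config terms (excerpted from the working configuration shared below in this thread), the Ruler therefore needs two directories, while AlertManager points both settings at the same path:

ruler:
  rule_path: /data/rules      # live rule config the Ruler manages itself
  storage:
    type: local
    local:
      directory: /tmp/rules   # static rules dropped off by the sidecar

alertmanager:
  data_dir: /data/
  storage:
    type: local
    local:
      path: /data             # same location as data_dir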

Signed-off-by: eamouhadi <[email protected]>
@nschad
Collaborator

nschad commented Jun 10, 2021

@gburton1 @amouhadi
Let me know when the missing role-binding is in :)

Signed-off-by: eamouhadi <[email protected]>
@elliesaber
Contributor Author

elliesaber commented Jun 10, 2021

> @gburton1 @amouhadi
> Let me know when the missing role-binding is in :)

Added rolebinding.yaml :)

Collaborator

@nschad nschad left a comment


Can you please share a minimal Cortex configuration with a configured sidecar? We have a very minimal working example in the ci folder of this repo where you can get started.

Signed-off-by: eamouhadi <[email protected]>
@gburton1
Contributor

Cortex config:

alertmanager:
  external_url: /api/prom/alertmanager
  enable_api: true
  data_dir: /data/
  storage:
    type: local
    local:
      path: /data
api:
  prometheus_http_prefix: /prometheus
auth_enabled: false
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        consistent_reads: true
        host: consul-consul-server.cortex:8500
        http_client_timeout: 20s
blocks_storage:
  azure:
    account_key: <redacted>
    account_name: <redacted>
    container_name: cortex
    endpoint_suffix: blob.core.usgovcloudapi.net
  backend: azure
  bucket_store:
    sync_dir: /data/tsdb-sync
    max_chunk_pool_bytes: 4294967296
  tsdb:
    dir: /data/tsdb
chunk_store:
  chunk_cache_config:
    memcached:
      expiration: 1h
    memcached_client:
      timeout: 1s
  max_look_back_period: 0s
distributor:
  pool:
    health_check_ingesters: true
  shard_by_all_labels: true
frontend:
  compress_responses: true
  log_queries_longer_than: 10s
ingester:
  lifecycler:
    final_sleep: 0s
    join_after: 0s
    num_tokens: 512
    ring:
      kvstore:
        consul:
          consistent_reads: true
          host: consul-consul-server.cortex:8500
          http_client_timeout: 20s
        prefix: collectors/
        store: consul
      replication_factor: 3
  max_transfer_retries: 0
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
limits:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
memberlist:
  join_members: []
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 100000000
querier:
  active_query_tracker_dir: /data/cortex/querier
  query_ingesters_within: 12h
  store_gateway_addresses: dns+cortex-store-gateway-headless.cortex-components:9095
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 1h
      memcached_client:
        timeout: 6s
  split_queries_by_interval: 24h
ruler:
  enable_alertmanager_discovery: false
  enable_api: true
  rule_path: /data/rules
  storage:
    type: local
    local:
      directory: /tmp/rules
schema:
  configs: []
server:
  grpc_listen_port: 9095
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 8080
storage:
  azure:
    account_key: null
    account_name: null
    container_name: null
  cassandra:
    addresses: null
    auth: true
    keyspace: cortex
    password: null
    username: null
  engine: blocks
  index_queries_cache_config:
    memcached:
      expiration: 1h
    memcached_client:
      timeout: 1s

Helm chart override values:

        sidecar:
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 50m
              memory: 50Mi
          rules:
            enabled: true
            searchNamespace: cortex-rules
          alertmanager:
            enabled: true
            searchNamespace: cortex-alertmanager
        useExternalConfig: true
        externalConfigVersion: x
        tags:
          blocks-storage-memcached: true
        config:
          storage:
            engine: blocks
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: nginx
          hosts:
            - host: <redacted>
              paths:
                - /
          tls:
            - hosts:
              - <redacted>
        store_gateway:
          replicas: 6
          persistentVolume:
            storageClass: managed-premium
          # store_gateway was crashing on OOMKilled error for large datasets, default was 1Gi limit
          resources:
            limits:
              memory: 7Gi
            requests:
              memory: 4Gi
        compactor:
          persistentVolume:
            size: 256Gi
            storageClass: managed-premium
          # compactor was crash looping on OOMKilled error, default was 1Gi limit
          resources:
            limits:
              memory: 10Gi
            requests:
              memory: 5Gi
        querier:
          resources:
            limits:
              memory: 7Gi
            requests:
              memory: 4Gi
        ingester:
          # WAL needs to be stored to a persistent disk which can survive in the event of an ingester failure
          statefulSet:
            enabled: true
          persistentVolume:
            enabled: true
            size: 64Gi
            storageClass: managed-premium
          # crash looping on OOMKilled error, default was 1Gi limit
          resources:
            limits:
              memory: 12Gi
            requests:
              memory: 12Gi
          extraArgs:
            ingester.max-series-per-metric: 100000
        ruler:
          extraArgs:
            log.level: debug
        alertmanager:
          enabled: true
          statefulSet:
            enabled: true
          persistentVolume:
            size: 8Gi
            storageClass: managed-premium

Config map with a rule:

apiVersion: v1
data:
  high-cpu.yaml: |-
    groups:
      - name: high-cpu
        rules:
          - alert: HighCPUusage
            expr: avg(100 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100) by (instance) > 95
            for: 30m
            labels:
              severity: warning
            annotations:
              description: Metrics from {{ $labels.job }} on {{ $labels.instance }} show CPU > 95% for 30m.
              title: Node {{ $labels.instance }} has high CPU usage
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/fake
  creationTimestamp: "2021-05-23T03:46:06Z"
  labels:
    argocd.argoproj.io/instance: grafana-telemetry-rules
    cortex_rules: "1"
  name: rules-cortex-9f99md47tc
  namespace: cortex-rules
  resourceVersion: "25639302"
  uid: 83fb43df-b3c8-4840-ad54-b11a4a1494f0

@nschad nschad force-pushed the ruler-alertmanager-sidecar branch from 405cc9f to 7002158 on June 12, 2021 16:26
@nschad nschad merged commit 52a402a into cortexproject:master Jun 18, 2021
@nschad nschad mentioned this pull request Jun 28, 2021
@vinitmasaun

I have tried to incorporate the Alertmanager sidecar for configuration, but the sidecar container is throwing the following error. I have deployed the ConfigMap to the same namespace. This is on AWS EKS version 1.19:

[2022-01-04 23:23:00] MaxRetryError when calling kubernetes: HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /api/v1/namespaces/cortex/configmaps?labelSelector=app%3Dcortex-alertmanager&watch=True (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f40ed3040>: Failed to establish a new connection: [Errno 111] Connection refused'))

Following is my ConfigMap yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-alertmanager
  namespace: cortex
  labels:
    app: cortex-alertmanager
data:
  alertmanager_config: |
    route:
      group_by:
        - alertname
        - instance
        - severity
        - job
        - namespace
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'splunk_webhook'
      routes:
        - receiver: 'splunk_webhook'
          continue: true
    receivers:
      - name: 'splunk_webhook'
        webhook_configs:
          - url: 'redacted'
            send_resolved: true
            http_config:
              basic_auth:
                username: 'redacted'
                password: 'redacted'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname']

@gburton1
Contributor

gburton1 commented Jan 4, 2022

localhost:80 is not going to be the right place to reach the Kubernetes API. It should go through the kubernetes service in the default namespace, which resolves to something like 10.0.0.1, so your requests from the sidecar should look like: GET https://10.0.0.1:443/api/v1...

We have been running the AlertManager and Ruler sidecar configurations in production for a while now.

@vinitmasaun

> localhost:80 is not going to be the right place to reach the Kubernetes API. It should go through the kubernetes service in the default namespace, which resolves to something like 10.0.0.1, so your requests from the sidecar should look like: GET https://10.0.0.1:443/api/v1...
>
> We have been running the AlertManager and Ruler sidecar configurations in production for a while now.

Right, but how can I control the kube API URL that the sidecar is constructing? Is there a values.yaml entry I should be setting for this in addition to the sidecar section? I am using cortex-helm-chart version 1.2.0. Following is my values.yaml snippet for the Alertmanager sidecar:

alertmanager:
  replicas: 1
  statefulSet:
    enabled: true
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
    enabled: true
    searchNamespace: cortex
    skipTlsVerify: true
    label: Name
    labelValue: cortex-alertmanager
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
  persistentVolume:
    enabled: true

@vinitmasaun

It looks like the localhost:80 issue is a bug in the kiwigrid/k8s-sidecar image tag 1.10.1 (kiwigrid/k8s-sidecar#114). I updated the tag value in my values.yaml to 1.11.1 and that resolved the kube API URL issue. However, now I am getting a permission-denied error when the ConfigMap is downloaded and the sidecar tries to store it in the data folder:

[2022-01-05 02:04:35] Working on ADDED configmap cortex/cortex-alertmanager
[2022-01-05 02:04:35] Received unknown exception: [Errno 13] Permission denied: '/data/alertmanager_config'

I've tried changing the following values in the values.yaml, but neither seems to resolve the permission-denied error:

alertmanager:
  replicas: 1
  statefulSet:
    enabled: true
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.11.1
    enabled: true
    searchNamespace: cortex
    skipTlsVerify: true
    label: cortex_alertmanager
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
    containerSecurityContext:
      enabled: false
      readOnlyRootFilesystem: false
  persistentVolume:
    enabled: true
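
A frequent cause of this error is a UID mismatch between the sidecar container and the owner of the alertmanager data volume. One commonly used fix is to set a pod-level fsGroup so Kubernetes makes the mounted volume group-writable; this is a sketch only, and whether chart 1.2.0 exposes a securityContext key for alertmanager, plus the exact UID/GID, are assumptions:

alertmanager:
  securityContext:
    # assumed pod-level securityContext key; fsGroup makes the kubelet
    # chown mounted volumes to this group so the sidecar can write /data
    fsGroup: 10001
    runAsUser: 10001
    runAsNonRoot: true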
