add rules and alertmanager sidecar to cortex-helm-chart #150


Merged

nschad merged 20 commits into cortexproject:master from ruler-alertmanager-sidecar on Jun 18, 2021

Conversation

elliesaber
Contributor

This PR uses the same pattern that the official Grafana chart uses to dynamically configure dashboards, datasources, and notifiers.
I tested adding both the rules and alertmanager sidecars on my fork; they successfully discovered the rules and alertmanager config, which got added to the ruler and alertmanager.
The ConfigMaps can be put in the specified namespace, where they are automatically detected and added as files to the Ruler and/or AlertManager containers. This allows easy, extensible configuration and avoids having to store state in the Cortex system itself.
This feature is disabled by default; below is an example of how it can be enabled in values.yaml:

sidecar:
  rules:
    enabled: true
    searchNamespace: cortex-rules
  alertmanager:
    enabled: true
    searchNamespace: cortex-alertmanager
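
With this enabled, a k8s-sidecar container watches the search namespace for labeled ConfigMaps and writes their contents into the Ruler/AlertManager containers. As a rough sketch of a rule ConfigMap it could pick up (the cortex_rules label and the k8s-sidecar-target-directory annotation follow the working example shared later in this thread; the ConfigMap name and the rule itself are made up for illustration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: example-rules            # hypothetical name
  namespace: cortex-rules        # must match sidecar.rules.searchNamespace
  labels:
    cortex_rules: "1"            # label the rules sidecar watches for
  annotations:
    # the last path segment is the Cortex tenant ("fake" when auth is disabled)
    k8s-sidecar-target-directory: /tmp/rules/fake
data:
  example.yaml: |-
    groups:
      - name: example
        rules:
          - record: job:up:avg
            expr: avg by (job) (up)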

@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 3a6d686 to bc385f5 on May 11, 2021 07:00
Collaborator

@nschad nschad left a comment


Missing CHANGELOG.md documentation, and please don't bump the version :)

@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 914cdb1 to 26b71bd on June 7, 2021 18:05
@nschad
Collaborator

nschad commented Jun 7, 2021

Hi @amouhadi

We just merged #144. Due to the massive refactor you will have to rebase onto master and make sure that you don't add back the old files, since their location changed to a more easily maintainable folder structure.

For you that should be templates/alertmanager/* and templates/ruler/*.

Heads-up: we have deleted the role.yaml and the binding, since we dropped support for PSPs and it didn't really do anything else.

eamouhadi added 2 commits June 7, 2021 15:43
@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 9f303f5 to 8326acd on June 7, 2021 22:52
@elliesaber
Contributor Author

> Hi @amouhadi
>
> We just merged #144. Due to the massive refactor you will have to rebase onto master and make sure that you don't add back the old files, since their location changed to a more easily maintainable folder structure.
>
> For you that should be templates/alertmanager/* and templates/ruler/*.
>
> Heads-up: we have deleted the role.yaml and the binding, since we dropped support for PSPs and it didn't really do anything else.

Thank you @ShuzZzle, I merged from master and resolved the conflicts.

Collaborator

@nschad nschad left a comment


I want to test a few more things, but looks good

Signed-off-by: eamouhadi <[email protected]>
@elliesaber elliesaber force-pushed the ruler-alertmanager-sidecar branch from 9627888 to 47c6ad3 on June 8, 2021 22:17
@elliesaber
Contributor Author

> I want to test a few more things, but looks good

hi @ShuzZzle, did you have a chance to test things?

@nschad
Collaborator

nschad commented Jun 10, 2021

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?

I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?

Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?

Signed-off-by: eamouhadi <[email protected]>
@elliesaber
Contributor Author

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
>
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?

How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/

For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:

kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/fake
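
The last path segment of that annotation is the Cortex tenant ID (with auth_enabled: false everything lands in the default "fake" tenant), since the ruler's local rule storage is laid out as <directory>/<tenant>/<file>. As a sketch, two ConfigMaps could target two hypothetical tenants like this (the tenant names are made up for illustration):

metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/team-a   # rules only for tenant "team-a"
---
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/team-b   # rules only for tenant "team-b"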

@nschad
Collaborator

nschad commented Jun 10, 2021

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?
>
> How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
> Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/
>
> For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:
>
> kind: ConfigMap
> metadata:
>   annotations:
>     k8s-sidecar-target-directory: /tmp/rules/fake

Ah okay.

> In AlertManager, the data_dir and local storage directory should be the same. In the Ruler, there needs to be two separate volumes.

Why is that so? What is the technical difference?

Collaborator

@nschad nschad left a comment


Can you please recheck the chart? Make sure it's actually running properly without adding any custom/extra stuff in your cluster. Maybe share a very minimal Cortex configuration so people can easily test it out.

Also, the role binding is unfortunately missing.

@gburton1
Contributor

> I want to test a few more things, but looks good
>
> hi @ShuzZzle, did you have a chance to test things?
>
> I am super unfamiliar with the concept and haven't really had time to review it yet. Also, it seems to me that some fields in the values.yaml (such as "SCProvider") are never actually used?
> Also, how does this work with Cortex's multi-tenancy? Everybody would get the same rules/alertmanager configs, right?
>
> How it works is that we create rules in ConfigMaps and they automatically get loaded into the ruler by the sidecar container. The same applies to the AlertManager config.
> Here is a blog post example of how Grafana uses the sidecar: https://johnharris.io/2019/03/dynamic-configuration-discovery-in-grafana/
> For the multi-tenancy question, the ConfigMap defines the tenant that the config ends up in:
>
> kind: ConfigMap
> metadata:
>   annotations:
>     k8s-sidecar-target-directory: /tmp/rules/fake
>
> Ah okay.
>
> In AlertManager, the data_dir and local storage directory should be the same. In the Ruler, there needs to be two separate volumes.
>
> Why is that so? What is the technical difference?

The Ruler allows you to pass static rules to it in a certain location. When it starts up, the Ruler consumes all the rules in that location (and keeps polling it over time) and inserts them into the live rule config that it manages in a separate location. AlertManager is simpler: it consumes its config from a single location.
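
In Cortex config terms (excerpted from the working configuration shared below in this thread), the Ruler therefore needs two directories, while AlertManager points both settings at the same path:

ruler:
  rule_path: /data/rules      # live rule config the Ruler manages itself
  storage:
    type: local
    local:
      directory: /tmp/rules   # static rules dropped off by the sidecar

alertmanager:
  data_dir: /data/
  storage:
    type: local
    local:
      path: /data             # same location as data_dir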

Signed-off-by: eamouhadi <[email protected]>
@nschad
Collaborator

nschad commented Jun 10, 2021

@gburton1 @amouhadi
Let me know when the missing role-binding is in :)

Signed-off-by: eamouhadi <[email protected]>
@elliesaber
Contributor Author

elliesaber commented Jun 10, 2021

> @gburton1 @amouhadi
> Let me know when the missing role-binding is in :)

Added rolebinding.yaml :)

Collaborator

@nschad nschad left a comment


Can you please share a minimal Cortex configuration with a configured sidecar? We have a very minimal working example in the ci folder of this repo where you can get started.

Signed-off-by: eamouhadi <[email protected]>
@gburton1
Contributor

Cortex config:

alertmanager:
  external_url: /api/prom/alertmanager
  enable_api: true
  data_dir: /data/
  storage:
    type: local
    local:
      path: /data
api:
  prometheus_http_prefix: /prometheus
auth_enabled: false
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        consistent_reads: true
        host: consul-consul-server.cortex:8500
        http_client_timeout: 20s
blocks_storage:
  azure:
    account_key: <redacted>
    account_name: <redacted>
    container_name: cortex
    endpoint_suffix: blob.core.usgovcloudapi.net
  backend: azure
  bucket_store:
    sync_dir: /data/tsdb-sync
    max_chunk_pool_bytes: 4294967296
  tsdb:
    dir: /data/tsdb
chunk_store:
  chunk_cache_config:
    memcached:
      expiration: 1h
    memcached_client:
      timeout: 1s
  max_look_back_period: 0s
distributor:
  pool:
    health_check_ingesters: true
  shard_by_all_labels: true
frontend:
  compress_responses: true
  log_queries_longer_than: 10s
ingester:
  lifecycler:
    final_sleep: 0s
    join_after: 0s
    num_tokens: 512
    ring:
      kvstore:
        consul:
          consistent_reads: true
          host: consul-consul-server.cortex:8500
          http_client_timeout: 20s
        prefix: collectors/
        store: consul
      replication_factor: 3
  max_transfer_retries: 0
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
limits:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
memberlist:
  join_members: []
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 100000000
querier:
  active_query_tracker_dir: /data/cortex/querier
  query_ingesters_within: 12h
  store_gateway_addresses: dns+cortex-store-gateway-headless.cortex-components:9095
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 1h
      memcached_client:
        timeout: 6s
  split_queries_by_interval: 24h
ruler:
  enable_alertmanager_discovery: false
  enable_api: true
  rule_path: /data/rules
  storage:
    type: local
    local:
      directory: /tmp/rules
schema:
  configs: []
server:
  grpc_listen_port: 9095
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 8080
storage:
  azure:
    account_key: null
    account_name: null
    container_name: null
  cassandra:
    addresses: null
    auth: true
    keyspace: cortex
    password: null
    username: null
  engine: blocks
  index_queries_cache_config:
    memcached:
      expiration: 1h
    memcached_client:
      timeout: 1s

Helm chart override values:

        sidecar:
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 50m
              memory: 50Mi
          rules:
            enabled: true
            searchNamespace: cortex-rules
          alertmanager:
            enabled: true
            searchNamespace: cortex-alertmanager
        useExternalConfig: true
        externalConfigVersion: x
        tags:
          blocks-storage-memcached: true
        config:
          storage:
            engine: blocks
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: nginx
          hosts:
            - host: <redacted>
              paths:
                - /
          tls:
            - hosts:
              - <redacted>
        store_gateway:
          replicas: 6
          persistentVolume:
            storageClass: managed-premium
          # store_gateway was crashing on OOMKilled error for large datasets, default was 1Gi limit
          resources:
            limits:
              memory: 7Gi
            requests:
              memory: 4Gi
        compactor:
          persistentVolume:
            size: 256Gi
            storageClass: managed-premium
          # compactor was crash looping on OOMKilled error, default was 1Gi limit
          resources:
            limits:
              memory: 10Gi
            requests:
              memory: 5Gi
        querier:
          resources:
            limits:
              memory: 7Gi
            requests:
              memory: 4Gi
        ingester:
          # WAL needs to be stored to a persistent disk which can survive in the event of an ingester failure
          statefulSet:
            enabled: true
          persistentVolume:
            enabled: true
            size: 64Gi
            storageClass: managed-premium
          # crash looping on OOMKilled error, default was 1Gi limit
          resources:
            limits:
              memory: 12Gi
            requests:
              memory: 12Gi
          extraArgs:
            ingester.max-series-per-metric: 100000
        ruler:
          extraArgs:
            log.level: debug
        alertmanager:
          enabled: true
          statefulSet:
            enabled: true
          persistentVolume:
            size: 8Gi
            storageClass: managed-premium

Config map with a rule:

apiVersion: v1
data:
  high-cpu.yaml: |-
    groups:
      - name: high-cpu
        rules:
          - alert: HighCPUusage
            expr: avg(100 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100) by (instance) > 95
            for: 30m
            labels:
              severity: warning
            annotations:
              description: Metrics from {{ $labels.job }} on {{ $labels.instance }} show CPU > 95% for 30m.
              title: Node {{ $labels.instance }} has high CPU usage
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /tmp/rules/fake
  creationTimestamp: "2021-05-23T03:46:06Z"
  labels:
    argocd.argoproj.io/instance: grafana-telemetry-rules
    cortex_rules: "1"
  name: rules-cortex-9f99md47tc
  namespace: cortex-rules
  resourceVersion: "25639302"
  uid: 83fb43df-b3c8-4840-ad54-b11a4a1494f0

@nschad nschad force-pushed the ruler-alertmanager-sidecar branch from 405cc9f to 7002158 on June 12, 2021 16:26
@nschad nschad merged commit 52a402a into cortexproject:master Jun 18, 2021
@nschad nschad mentioned this pull request Jun 28, 2021
@vinitmasaun

I have tried to incorporate the Alertmanager sidecar for configuration, but the sidecar container is throwing the following error. I have deployed the ConfigMap to the same namespace. This is on AWS EKS version 1.19:

[2022-01-04 23:23:00] MaxRetryError when calling kubernetes: HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /api/v1/namespaces/cortex/configmaps?labelSelector=app%3Dcortex-alertmanager&watch=True (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7f40ed3040>: Failed to establish a new connection: [Errno 111] Connection refused'))

Following is my ConfigMap yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-alertmanager
  namespace: cortex
  labels:
    app: cortex-alertmanager
data:
  alertmanager_config: |
    route:
      group_by:
        - alertname
        - instance
        - severity
        - job
        - namespace
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'splunk_webhook'
      routes:
        - receiver: 'splunk_webhook'
          continue: true
    receivers:
      - name: 'splunk_webhook'
        webhook_configs:
          - url: 'redacted'
            send_resolved: true
            http_config:
              basic_auth:
                username: 'redacted'
                password: 'redacted'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname']

@gburton1
Contributor

gburton1 commented Jan 4, 2022

localhost:80 is not going to be the right place to reach the Kubernetes API. It should go through the kubernetes service in the default namespace, which resolves to something like 10.0.0.1, so your requests from the sidecar should look like: GET https://10.0.0.1:443/api/v1...

We have been running the AlertManager and Ruler sidecar configurations in production for a while now.

@vinitmasaun

> localhost:80 is not going to be the right place to reach the Kubernetes API. It should go through the kubernetes service in the default namespace, which resolves to something like 10.0.0.1, so your requests from the sidecar should look like: GET https://10.0.0.1:443/api/v1...
>
> We have been running the AlertManager and Ruler sidecar configurations in production for a while now.

Right, but how can I control the kube API URL that the sidecar is constructing? Is there a values.yaml entry I should be setting for this in addition to the sidecar section? I am using cortex-helm-chart version 1.2.0. Following is my values.yaml snippet for the Alertmanager sidecar:

alertmanager:
  replicas: 1
  statefulSet:
    enabled: true
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
    enabled: true
    searchNamespace: cortex
    skipTlsVerify: true
    label: Name
    labelValue: cortex-alertmanager
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
  persistentVolume:
    enabled: true

@vinitmasaun

It looks like the localhost:80 issue is a bug in the kiwigrid/k8s-sidecar image tag 1.10.1 (kiwigrid/k8s-sidecar#114). I updated the tag value in my values.yaml to 1.11.1 and that resolved the kube API URL issue. However, now I am getting a permission-denied error when the ConfigMap is downloaded and the sidecar tries to store it in the data folder:

[2022-01-05 02:04:35] Working on ADDED configmap cortex/cortex-alertmanager
[2022-01-05 02:04:35] Received unknown exception: [Errno 13] Permission denied: '/data/alertmanager_config'

I've tried changing the following values in the values.yaml, but neither seems to resolve the permission-denied error:

alertmanager:
  replicas: 1
  statefulSet:
    enabled: true
  sidecar:
    image:
      repository: quay.io/kiwigrid/k8s-sidecar
      tag: 1.11.1
    enabled: true
    searchNamespace: cortex
    skipTlsVerify: true
    label: cortex_alertmanager
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
    containerSecurityContext:
      enabled: false
      readOnlyRootFilesystem: false
  persistentVolume:
    enabled: true
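
A frequent cause of this error is a UID mismatch between the sidecar container and the owner of the alertmanager data volume. One commonly used fix is to set a pod-level fsGroup so Kubernetes makes the mounted volume group-writable; this is a sketch only, and whether chart 1.2.0 exposes a securityContext key for alertmanager, plus the exact UID/GID, are assumptions:

alertmanager:
  securityContext:
    # assumed pod-level securityContext key; fsGroup makes the kubelet
    # chown mounted volumes to this group so the sidecar can write /data
    fsGroup: 10001
    runAsUser: 10001
    runAsNonRoot: true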
