Skip to content

Dicedb Opentelemetry integration with prometheus scraping #1462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

gsarmaonline
Copy link
Contributor

@gsarmaonline gsarmaonline commented Feb 7, 2025

PR - #1135

First PR for integrating prometheus scraping using Opentelemetry exporter.

DiceDB Observability with Opentelemetry (Otel)

This module implements the observability layer required for DiceDB to emit metrics.
The current implementation integrates Opentelemetry with Prometheus in scraping mode.
The current Prometheus exporter also exposes a HTTP client which Prometheus can leverage
to scrape data from.

Refer to the README.md file in the PR for more information.

@gsarmaonline
Copy link
Contributor Author

cc @JyotinderSingh @arpitbbhayani

@gsarmaonline gsarmaonline force-pushed the gsarma/opentel-prometheus branch from ebc770b to c45d01b Compare February 8, 2025 06:27
Comment on lines 24 to 32
The observability module should be available from all other modules. There should always
be one instance of the observability module running.
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go).
```go
func (dotel *DiceOtel) register() (err error) {
DiceStartCounter, _ = dotel.Meter.Int64Counter("dicedb_start", api.WithDescription("A counter for the start of the DiceDB server"))
// Add other metrics here
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis could just be a global variable that gets initialized during the bootup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics variable should be available globally but they should also be defined only after the DiceotelSrv object is defined.
I am thinking we can follow this approach:

  • Expose the DiceotelSrv globally to the dicedb package
  • Initialise the metrics inside the diceotel object and register them when the diceotel object is being initialised

This allows us to ensure that the list of global variables don't start piling up for metrics and can be accessed
through the DiceotelSrv global reference.


Access it anywhere in the repo and call the required functions:
```go
diceotel.DiceStartCounter.Add(ctx, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not keen on prefixing Dice for all metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, removing it.

be one instance of the observability module running.
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go).
```go
func (dotel *DiceOtel) register() (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge fan of classes and objects here given this can just be a global variable.

Comment on lines 20 to 30
type (
DiceOtel struct {
ctx context.Context

Meter api.Meter
}
)

var (
DiceotelSrv *DiceOtel
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is typical java. I do not see a value add. If we need it, we can always add it later.

Comment on lines 36 to 41
if err = dotel.setup(); err != nil {
return
}
if err = dotel.register(); err != nil {
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't both of these be in one function, why have two unnecessarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

// The exporter embeds a default OpenTelemetry Reader and
// implements prometheus.Collector, allowing it to be used as
// both a Reader and Collector.
exporter, err := prometheus.New()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we taking a strong dependency on prometheus here? given this is otel, this should be pluggable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case there is a requirement to move the metrics from one provider to another, changing the Exporters should be sufficient without having to changing the existing application code where the metrics are emitted.

So the prometheus exporter is pluggable and can be changed by adding another exporter type

@@ -42,10 +46,20 @@ func (t *IOThread) StartSync(ctx context.Context, shardManager *ShardManager, wa
_c.ClientID = t.ClientID
_c.Mode = t.Mode

execStartTime := time.Now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we call this variable __telXXXTime, this way we can follow the same variable name through the codebase. Calling this execStartTime here and something else at another place will create confusion. Open for other shorter variable names.

We can also go for the convention that __telXXXTime is all things telemetry. So, any and every variable that starts with this can be easily identified as a telemetry variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment on lines 54 to 62
diceotel.DiceCmdLatencyInMsHistogram.Record(
ctx,
time.Since(execStartTime).Milliseconds(),
metric.WithAttributeSet(
attribute.NewSet(
attribute.String("cmd", c.Cmd),
),
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great UX as a database engineer. Liked it.

Concerns are only around the naming convention. Two things

  1. We should also add a non-docker tutorial on seeing these metrics in action, checked in the repository.
  2. why do we have histogram in the name? this seems pretty weird. The metric registering should be about what we are registering and not how it would be displayed or processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For point 1, yes, let me add the non-docker tutorial as well for this.

For point 2, defining a metric as a histogram also adds more percentile wise information as well.
For example, the above command is represented in the following buckets:

➜  dice git:(gsarma/opentel-prometheus) curl http://localhost:8050/metrics | grep latency
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9966    0  9966    0     0  14.0M      0 --:--:-- --:--:-- --:--:-- 9732k
# HELP dicedb_command_exec_latency_milliseconds 
# TYPE dicedb_command_exec_latency_milliseconds histogram
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="0"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="25"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="50"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="75"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="100"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="250"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="750"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="1000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="2500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="7500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="+Inf"} 1
dicedb_command_exec_latency_milliseconds_sum{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 0
dicedb_command_exec_latency_milliseconds_count{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 1

So we primarily have counter, gauge and histograms as different metric types.

@gsarmaonline
Copy link
Contributor Author

@arpitbbhayani Thanks for the thorough review here.
I have tried to answer most of your comments. Please take a look whenever convenient.

@gsarmaonline
Copy link
Contributor Author

@arpitbbhayani I have updated the PR with the discussed changes. PTAL. Thanks!

@CLAassistant
Copy link

CLAassistant commented Mar 20, 2025

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants