-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Dicedb Opentelemetry integration with prometheus scraping #1462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Dicedb Opentelemetry integration with prometheus scraping #1462
Conversation
ebc770b
to
c45d01b
Compare
internal/diceotel/README.md
Outdated
The observability module should be available from all other modules. There should always | ||
be one instance of the observability module running. | ||
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go). | ||
```go | ||
func (dotel *DiceOtel) register() (err error) { | ||
DiceStartCounter, _ = dotel.Meter.Int64Counter("dicedb_start", api.WithDescription("A counter for the start of the DiceDB server")) | ||
// Add other metrics here | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THis could just be a global variable that gets initialized during the bootup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics variable should be available globally but they should also be defined only after the DiceotelSrv
object is defined.
I am thinking we can follow this approach:
- Expose the
DiceotelSrv
globally to the dicedb package - Initialise the metrics inside the diceotel object and register them when the diceotel object is being initialised
This allows us to ensure that the list of global variables don't start piling up for metrics and can be accessed
through the DiceotelSrv
global reference.
internal/diceotel/README.md
Outdated
|
||
Access it anywhere in the repo and call the required functions: | ||
```go | ||
diceotel.DiceStartCounter.Add(ctx, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not keen on prefixing Dice
for all metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, removing it.
internal/diceotel/README.md
Outdated
be one instance of the observability module running. | ||
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go). | ||
```go | ||
func (dotel *DiceOtel) register() (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a huge fan of classes and objects here given this can just be a global variable.
internal/diceotel/diceotel.go
Outdated
type ( | ||
DiceOtel struct { | ||
ctx context.Context | ||
|
||
Meter api.Meter | ||
} | ||
) | ||
|
||
var ( | ||
DiceotelSrv *DiceOtel | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is typical java. I do not see a value add. If we need it, we can always add it later.
internal/diceotel/diceotel.go
Outdated
if err = dotel.setup(); err != nil { | ||
return | ||
} | ||
if err = dotel.register(); err != nil { | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't both of these be in one function, why have two unnecessarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
internal/diceotel/diceotel.go
Outdated
// The exporter embeds a default OpenTelemetry Reader and | ||
// implements prometheus.Collector, allowing it to be used as | ||
// both a Reader and Collector. | ||
exporter, err := prometheus.New() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we taking a strong dependency on prometheus here? given this is otel, this should be pluggable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case there is a requirement to move the metrics from one provider to another, changing the Exporters should be sufficient without having to changing the existing application code where the metrics are emitted.
So the prometheus exporter is pluggable and can be changed by adding another exporter type
internal/server/ironhawk/iothread.go
Outdated
@@ -42,10 +46,20 @@ func (t *IOThread) StartSync(ctx context.Context, shardManager *ShardManager, wa | |||
_c.ClientID = t.ClientID | |||
_c.Mode = t.Mode | |||
|
|||
execStartTime := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we call this variable __telXXXTime
, this way we can follow the same variable name through the codebase. Calling this execStartTime
here and something else at another place will create confusion. Open for other shorter variable names.
We can also go for the convention that __telXXXTime
is all things telemetry. So, any and every variable that starts with this can be easily identified as a telemetry variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
internal/server/ironhawk/iothread.go
Outdated
diceotel.DiceCmdLatencyInMsHistogram.Record( | ||
ctx, | ||
time.Since(execStartTime).Milliseconds(), | ||
metric.WithAttributeSet( | ||
attribute.NewSet( | ||
attribute.String("cmd", c.Cmd), | ||
), | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a great UX as a database engineer. Liked it.
Concerns are only around the naming convention. Two things
- We should also add a non-docker tutorial on seeing these metrics in action, checked in the repository.
- why do we have
histogram
in the name? this seems pretty weird. The metric registering should be about what we are registering and not how it would be displayed or processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For point 1, yes, let me add the non-docker tutorial as well for this.
For point 2, defining a metric as a histogram also adds more percentile wise information as well.
For example, the above command is represented in the following buckets:
➜ dice git:(gsarma/opentel-prometheus) curl http://localhost:8050/metrics | grep latency
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9966 0 9966 0 0 14.0M 0 --:--:-- --:--:-- --:--:-- 9732k
# HELP dicedb_command_exec_latency_milliseconds
# TYPE dicedb_command_exec_latency_milliseconds histogram
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="0"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="25"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="50"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="75"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="100"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="250"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="750"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="1000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="2500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="7500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="+Inf"} 1
dicedb_command_exec_latency_milliseconds_sum{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 0
dicedb_command_exec_latency_milliseconds_count{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 1
So we primarily have counter, gauge and histograms as different metric types.
@arpitbbhayani Thanks for the thorough review here. |
@arpitbbhayani I have updated the PR with the discussed changes. PTAL. Thanks! |
PR - #1135
First PR for integrating prometheus scraping using Opentelemetry exporter.
DiceDB Observability with Opentelemetry (Otel)
This module implements the observability layer required for DiceDB to emit metrics.
The current implementation integrates Opentelemetry with Prometheus in scraping mode.
The current Prometheus exporter also exposes a HTTP client which Prometheus can leverage
to scrape data from.
Refer to the README.md file in the PR for more information.