
[RFC] [Semantic Convention] "Job" traces #1582


Conversation

johnnyshields

@johnnyshields johnnyshields commented Mar 27, 2021

This PR defines a semantic convention for "Job" traces.

A "Job" models a batch job or task, which can be enqueued, scheduled, or run on-demand.

Examples of job-related systems include DelayedJob, Resque, Rake, Airflow, Spring Batch, and Hangfire (all discussed below).

The spec for "Job" would be somewhere between "Messaging" and "FAAS". It would cover both the producer and consumer aspects of running jobs.

As noted above, OTEL instrumentation libraries are using the "Messaging" convention to represent Jobs today. However "Messaging", which is intended for Kafka, RabbitMQ, etc., is not a great fit:

  • Messaging systems focus on the delivery of messages irrespective of the message contents; job systems perform work/processing based on the job's instructions.
  • In a Messaging system the categorization is done by the "queue" or "topic" on which messages are transmitted; in a trace viewing system (e.g. Datadog, Lightstep, etc.) I would want to see messages labelled according to their queue. In Job systems, the categorization is based on job name/type, and the queue is just for worker (consumer) resource allocation. Hence I would want traces labelled as "ChargePaymentJob", "RefundPaymentJob", etc. even though all such jobs are in the same "payments" queue.

FAAS is also not ideal, as it is typically for serverless providers such as AWS Lambda, where the infrastructure running the job has been abstracted away and the primary focus is on the "function".
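To make the labelling difference concrete, here is a minimal sketch. The `job.name` and `job.queue_name` attribute names come from this proposal's draft and `messaging.destination` from the existing messaging convention; none of this is a ratified convention, it is purely illustrative:

```python
# Illustrative only: how spans would be labelled under each convention.

def messaging_span_label(attrs):
    # Messaging convention: spans are grouped by queue/topic.
    return attrs["messaging.destination"]

def job_span_label(attrs):
    # Proposed job convention: spans are grouped by job name/type;
    # the queue is only a secondary resource-allocation attribute.
    return attrs["job.name"]

msg = {"messaging.destination": "payments"}
charge = {"job.name": "ChargePaymentJob", "job.queue_name": "payments"}
refund = {"job.name": "RefundPaymentJob", "job.queue_name": "payments"}
```

Both jobs share the same "payments" queue, yet they surface as two distinct labels in a trace viewer.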

@johnnyshields johnnyshields requested review from a team March 27, 2021 06:54
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 27, 2021

CLA Signed

The committers are authorized under a signed CLA.

@johnnyshields johnnyshields force-pushed the semantic-convention-trace-job branch from f420fcb to 5208277 Compare March 27, 2021 06:56
@johnnyshields
Author

@iNikem
Contributor

iNikem commented Mar 28, 2021

I think we also need to do some modeling work. E.g. to answer the following questions:

  • What SpanKind has the root span?
  • If a job executor reads a batch of tasks and splits it into chunks, should we have nested spans as well? Should they have the same semantic conventions? Should we have a chunkId/itemId?
  • How do we decide when to use the messaging convention and when to use this new job convention? E.g. a Kafka consumer listening on a topic and then doing some job: is it messaging or a job? Why?

@johnnyshields
Author

johnnyshields commented Mar 28, 2021

@iNikem

What SpanKind has the root span?

For DelayedJob, Rake, Airflow, etc. it would be SpanKind.CONSUMER for the process that performs jobs. Jobs run on-demand or by a timer (as in Rake/Airflow) aren't exactly "consumers", but CONSUMER is the best fit among the existing span-kind values we have.

Any code that enqueues (and does not execute) the jobs would be SpanKind.PRODUCER.
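As a rough sketch of that kind-selection logic (a plain enum stands in for OpenTelemetry's SpanKind here, and the helper function is hypothetical):

```python
from enum import Enum

class SpanKind(Enum):
    # Stand-in for opentelemetry.trace.SpanKind
    PRODUCER = "producer"
    CONSUMER = "consumer"

def job_span_kind(enqueues_only: bool) -> SpanKind:
    # Code that enqueues (and does not execute) jobs -> PRODUCER.
    # The process performing jobs -> CONSUMER, even for timer-driven
    # or on-demand runs, as the closest existing span-kind value.
    return SpanKind.PRODUCER if enqueues_only else SpanKind.CONSUMER
```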

If a job executor reads a batch of tasks and splits it into chunks, should we have nested spans as well? Should they have the same semantic conventions? Should we have a chunkId/itemId?

In practice, child jobs are typically enqueued as new jobs with new job_ids. It would make more sense to do this as a parent_job_id attribute, which would be common across all child jobs spawned from the original master job. Since each child is executed asynchronously in its own process and thread, they would not appear as "nested" with the parent. AFAIK DelayedJob/Resque/etc. don't have such an attribute today, but Airflow likely does as it is DAG-based.
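A sketch of that parent_job_id linkage (uuid-based ids; the attribute names are hypothetical, not part of any convention):

```python
import uuid

def enqueue_child_jobs(parent, count):
    # Each child is a brand-new job with its own job_id; parent_job_id
    # is shared by all children spawned from the original master job.
    return [
        {"job.id": str(uuid.uuid4()),
         "job.parent_job_id": parent["job.id"]}
        for _ in range(count)
    ]

master = {"job.id": str(uuid.uuid4())}
children = enqueue_child_jobs(master, 3)
```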

How do we decide when to use the messaging convention and when to use this new job convention? E.g. a Kafka consumer listening on a topic and then doing some job: is it messaging or a job? Why?

A process which listens/consumes from Kafka and then performs a job synchronously is probably traced as a Messaging consumer, because its Kafka listener library will be instrumented. Theoretically you could also add a Job span, but it would be redundant with the message consumer span.

@johnnyshields johnnyshields changed the title [Semantic Convention] "Job" traces [RFC] [Semantic Convention] "Job" traces Mar 28, 2021
@iNikem
Contributor

iNikem commented Mar 28, 2021

In practice, child jobs are typically enqueued as new jobs with new job_ids. It would make more sense to do this as a parent_job_id attribute, which would be common across all child jobs spawned from the original master job. Since each child is executed asynchronously in its own process and thread, they would not appear as "nested" with the parent. AFAIK DelayedJob/Resque/etc. don't have such an attribute today, but Airflow likely does as it is DAG-based.

I am more familiar with https://spring.io/projects/spring-batch which, it seems to me, follows different approach, hence my questions.

@iNikem
Contributor

iNikem commented Mar 29, 2021

Btw, @johnnyshields have you seen open-telemetry/semantic-conventions#1640 ?

@johnnyshields
Author

@iNikem no, I had not seen that; funny, because I searched a bit. I think it is roughly compatible with this one.

@johnnyshields
Author

@mateuszrzeszutek

@arminru arminru added the area:semantic-conventions Related to semantic conventions label Mar 29, 2021
@mateuszrzeszutek
Member

This roughly corresponds to the first two spans for spring-batch that I listed here.

  • Is the CONSUMER/PRODUCER kind mandatory? It makes little sense for JSR-352 implementations (Spring Batch), as they do not concern themselves with queueing or scheduling; they just execute jobs in the current process (either synchronously or asynchronously). You can trigger Spring Batch jobs with an HTTP call (a button click on your website), and AFAIR this is what one of our clients does. Anyway, for job libraries that do not deal with queueing or messaging, INTERNAL would be a better choice for the span kind.
  • Is there any kind of result/status returned by at least one of those libs? Spring Batch jobs return a user-defined exit_status string, which can be pretty much anything; it is still worth capturing.
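A sketch of those two points (a plain enum stands in for OpenTelemetry's SpanKind, and `job.exit_status` is a hypothetical attribute name):

```python
from enum import Enum

class SpanKind(Enum):
    # Stand-in for opentelemetry.trace.SpanKind
    INTERNAL = "internal"
    CONSUMER = "consumer"

def in_process_job_span(name, exit_status):
    # JSR-352 / Spring Batch style: the job runs in the current
    # process, so INTERNAL fits better than CONSUMER; the user-defined
    # exit status string is captured as a span attribute.
    return {
        "name": name,
        "kind": SpanKind.INTERNAL,
        "attributes": {"job.exit_status": str(exit_status)},
    }

span = in_process_job_span("nightlyBillingJob", "COMPLETED WITH SKIPS")
```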

If a job executor reads a batch of tasks and splits it into chunks, should we have nested spans as well? Should they have the same semantic conventions? Should we have a chunkId/itemId?

In practice, child jobs are typically enqueued as new jobs with new job_ids. It would make more sense to do this as a parent_job_id attribute, which would be common across all child jobs spawned from the original master job. Since each child is executed asynchronously in its own process and thread, they would not appear as "nested" with the parent. AFAIK DelayedJob/Resque/etc. don't have such an attribute today, but Airflow likely does as it is DAG-based.

This is rather different from what Spring Batch jobs look like. Spring Batch jobs are hierarchical: every job is composed of steps, each step may be split into chunks (roughly equivalent to DB transactions), and each chunk may read, process, and write several items. Everything happens in the same process.
Spring Batch seems to be the only framework that structures its jobs this way, though: I don't think that the general job spec should concern itself with steps and items; we should only model the job level here. Steps, chunks, and items may be covered in a specific instrumentation spec (like the AWS one).
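For illustration, the hierarchy described above might be modeled like this (job and step names are hypothetical; only the top "job" level would be covered by the general spec):

```python
# Spring Batch-style hierarchy, all in one process:
# job -> steps -> chunks (~ DB transactions) -> items.
batch_job = {
    "name": "nightlyReconciliation",
    "steps": [
        {"name": "loadTransactions",
         "chunks": [
             {"items": ["tx-1", "tx-2", "tx-3"]},
             {"items": ["tx-4", "tx-5"]},
         ]},
    ],
}

def item_count(job):
    # Total items processed across all steps and chunks.
    return sum(len(chunk["items"])
               for step in job["steps"]
               for chunk in step["chunks"])
```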

Comment on lines +71 to +79
- id: scheduled_run_time
  type: string
  brief: >
    A string containing the time when the job was scheduled to run, specified in
    [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html)
    format expressed in [UTC](https://www.w3.org/TR/NOTE-datetime).
    For job attempts which are being retried, this should reflect when the current
    attempt was scheduled (e.g. using a back-off timer).
  examples: ["2021-03-23T13:47:06Z"]
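A sketch of producing a value for this attribute with Python's standard library, including a retry back-off (the helper name and the exponential back-off policy are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def scheduled_run_time(base, attempt=0, backoff_seconds=30):
    # ISO 8601 in UTC, e.g. "2021-03-23T13:47:06Z". For retried
    # attempts, reflect when the *current* attempt was scheduled,
    # here via a simple exponential back-off timer.
    delay = backoff_seconds * (2 ** attempt - 1)
    when = base + timedelta(seconds=delay)
    return when.strftime("%Y-%m-%dT%H:%M:%SZ")

base = datetime(2021, 3, 23, 13, 47, 6, tzinfo=timezone.utc)
```

The first attempt yields the base time unchanged; attempt 2 is scheduled 90 seconds later under this policy.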
Member


I wonder if we need a small, separate semantic convention for scheduled operations: you can run a cron job (or use a ScheduledExecutorService in Java, or Quartz, or...) that runs any code, not necessarily a batch job or a FAAS function.

Author


I think this is the key question remaining for this RFC.

@johnnyshields
Author

Spring batch seems to be the only framework that structures its jobs this way though

I believe Apache Airflow is similar to Spring Batch in this regard. I guess there is a distinction between "Batch/Scheduled Jobs" and "Queued Jobs".

@Oberon00 Oberon00 added the spec:trace Related to the specification/trace directory label Apr 6, 2021
Member

@Oberon00 Oberon00 left a comment


Re: #1582 (comment)

How do we decide when to use the messaging convention and when to use this new job convention? E.g. a Kafka consumer listening on a topic and then doing some job: is it messaging or a job? Why?

A process which listens/consumes from Kafka and then performs a job synchronously is probably traced as a Messaging consumer, because its Kafka listener library will be instrumented. Theoretically you could also add a Job span, but it would be redundant with the message consumer span.

I also think this is an important question. We probably want to avoid cases where the sender uses messaging spans and the receiver uses job spans, or vice versa. After looking at the attributes, I think that we should remove these:

  • job.queue_name: messaging.destination
  • net.peer.*: Also referenced by messaging.

If a job system uses queuing with a separate sender/producer of the jobs, messaging semantic conventions should be used additionally if further description of the queuing is desired.
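A sketch of that layering, with job attributes and messaging attributes on the same span (attribute names taken from this discussion and the existing messaging convention; not a ratified convention):

```python
# A job span from a queued job system can carry messaging attributes
# in addition to job attributes, rather than duplicating the queue
# under a job.* name.
job_attrs = {"job.name": "ChargePaymentJob"}
messaging_attrs = {
    "messaging.system": "rabbitmq",
    "messaging.destination": "payments",
}
span_attributes = {**job_attrs, **messaging_attrs}
```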

Important editorial note: Please generate a markdown table in a new file under specification/trace/semantic_conventions/ and add a link in the README in that directory. Alternatively, you can add it to the same markdown file as messaging and change the title.
Adding a markdown file also gives you room for additional general explanation text.

@johnnyshields
Author

johnnyshields commented Apr 6, 2021

@Oberon00 I disagree that Job systems that have "queues" should use Messaging semantics; that's not a good litmus test. "Queue" in the context of Jobs is usually an optional feature. For example, Ruby's Delayed Job will process all jobs in FIFO order, but it optionally allows you to set named "queues" and then assign worker resources to specific queues. Unlike a Messaging system, where the queue/topic is the primary feature, in a Job system the queue is an optional/secondary attribute.

A better litmus test is: "does work happen based on the contents/payload of the message?" A messaging system delivers the payload without looking inside; its job is simply to forward the message. A Job system does work based on the payload; a successfully worked Job is usually the end of the chain (unless a new/different message is generated as a result, e.g. an email-sending job).
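That litmus test could be sketched as (purely illustrative, not part of any proposed spec text):

```python
def pick_convention(inspects_payload_to_do_work: bool) -> str:
    # Messaging: forwards the payload without looking inside.
    # Job: performs work based on the payload's contents.
    return "job" if inspects_payload_to_do_work else "messaging"
```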

In addition, net.peer.* should still be used here because ultimately traces are sent from physical hosts, and we want to see which host sent the trace.


@arminru arminru reopened this Jun 7, 2021
@github-actions github-actions bot removed the Stale label Jun 8, 2021
@miketzian

Ping so that this doesn't become stale; this looks like a useful addition.

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 22, 2021
@jmacd jmacd removed the Stale label Jun 22, 2021
@jmacd
Contributor

jmacd commented Jul 12, 2021

We can't merge this without more reviews, and I believe there is more work to do.

@jmacd jmacd removed the Stale label Jul 12, 2021
@ahayworth
Contributor

@johnnyshields I can help with the documentation if needed - is there a preferred way you'd like to collaborate?

@johnnyshields
Author

@ahayworth yes, I need help with the markdown documentation for this. I have some work in progress which I'll try to get committed this weekend.

@johnnyshields johnnyshields requested a review from a team July 14, 2021 02:09
@johnnyshields
Author

@ahayworth I've naively committed all my work in progress on the docs. If you can just fork my PR and clean everything up that will be fine.

@github-actions

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@simaoribeiro

Hi @johnnyshields, are you planning to continue this work in the future?
As I mentioned in discussion #2170, my team is currently migrating from OpenTracing to OpenTelemetry, and a job scheduler specification is something that is missing.
My only experience is with Hangfire, but I would gladly help push this forward.

@johnnyshields
Author

I'd be glad if someone can help take over this work to get to the finish line. Unfortunately I don't have the bandwidth for this at the moment.
