Commit f280df9

hdshawkw authored and committed
feat: add scheduled time per task (#406)
Each task displays the sum of the time it has been idle and busy, as well as the total. The idle time includes the time between when a task is woken, and when the runtime actually polls that task.

There are cases where a task may be scheduled for a long time after being woken, before it is polled. This is especially likely if many tasks are woken at the same time and don't yield back to the runtime quickly.

To add visibility to this, the total time that a task is scheduled before being polled has been added. Additionally, a new task state `Scheduled` has been added. This is displayed in both the tasks table and in the task detail view.

In the `console-api`, the total `scheduled_time` for the task has been added to the `TaskStats` message in `tasks.proto`.

This is the first of two parts. In the second part (#409), a histogram for scheduled time will be added, the equivalent of the poll time histogram which is already available on the task detail screen.

To show a pathological case which may lead to needing to see the scheduled time, a new example has been added to the `console-subscriber`.

## PR Notes

This PR does something adjacent to what is described in #50, but not quite.

The unicode character used for a `SCHED` task is ⏫.

The second PR (#409) will record a scheduled time histogram the same as is recorded for poll times. These two changes should go in together (and certainly shouldn't be released separately). However, this PR is already quite big, so they'll be separated out.

The idea is that this PR isn't merged until the next one has also been reviewed and approved. It would be good to get some feedback at this stage though.

The task list view with the new column for `Sched` time:

<img width="1032" alt="a tasks table view for the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232456977-2921f884-4673-420f-ba4f-3646627d44db.png">

The `Task` block in the task detail view showing the new `Scheduled` time entry:
<img width="510" alt="The task block on the task detail view for the rx task in the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232457332-e455e086-9468-42c9-8fda-7965d8d1e6f8.png">
1 parent 4409443 commit f280df9

10 files changed, with 230 additions and 50 deletions
console-api/proto/common.proto

Lines changed: 6 additions & 3 deletions
@@ -177,10 +177,13 @@ message PollStats {
     // its poll method has completed.
     optional google.protobuf.Timestamp last_poll_ended = 5;
     // The total duration this object was being *actively polled*, summed across
-    // all polls. Note that this includes only polls that have completed and is
-    // not reflecting any inprogress polls. Subtracting `busy_time` from the
+    // all polls.
+    //
+    // Note that this includes only polls that have completed, and does not
+    // reflect any in-progress polls. Subtracting `busy_time` from the
     // total lifetime of the polled object results in the amount of time it
-    // has spent *waiting* to be polled.
+    // has spent *waiting* to be polled (including the `scheduled_time` value
+    // from `TaskStats`, if this is a task).
     google.protobuf.Duration busy_time = 6;
 }
 
console-api/proto/tasks.proto

Lines changed: 10 additions & 0 deletions
@@ -130,6 +130,16 @@ message Stats {
     common.PollStats poll_stats = 7;
     // The total number of times this task has woken itself.
     uint64 self_wakes = 8;
+    // The total duration this task was scheduled prior to being polled, summed
+    // across all poll cycles.
+    //
+    // Note that this includes only polls that have started, and does not
+    // reflect any scheduled state where the task hasn't yet been polled.
+    // Subtracting both `busy_time` (from the task's `PollStats`) and
+    // `scheduled_time` from the total lifetime of the task results in the
+    // amount of time it spent unable to progress because it was waiting on
+    // some resource.
+    google.protobuf.Duration scheduled_time = 9;
 }
 
 
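The `scheduled_time` comment above describes a subtraction: total lifetime minus `busy_time` minus `scheduled_time` gives the time the task spent waiting on some resource. A minimal sketch of that arithmetic follows. It uses plain `std::time::Duration` values rather than the generated protobuf types, and the function name `idle_time` is hypothetical, not part of `console-api`.

```rust
use std::time::Duration;

/// Hypothetical helper: the time a task spent waiting on a resource,
/// computed as described in the `scheduled_time` doc comment.
///
/// `total` is the task's lifetime; `busy` and `scheduled` stand in for the
/// `busy_time` and `scheduled_time` fields. Saturating subtraction is an
/// assumption made here so that in-progress or skewed values cannot
/// underflow; the real console may handle this differently.
fn idle_time(total: Duration, busy: Duration, scheduled: Duration) -> Duration {
    total.saturating_sub(busy).saturating_sub(scheduled)
}
```

For a task that has lived 10 seconds, with 2 seconds busy and 3 seconds scheduled, this yields 5 seconds of idle time.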
console-api/src/generated/rs.tokio.console.common.rs

Lines changed: 6 additions & 3 deletions
@@ -253,10 +253,13 @@ pub struct PollStats {
     #[prost(message, optional, tag="5")]
     pub last_poll_ended: ::core::option::Option<::prost_types::Timestamp>,
     /// The total duration this object was being *actively polled*, summed across
-    /// all polls. Note that this includes only polls that have completed and is
-    /// not reflecting any inprogress polls. Subtracting `busy_time` from the
+    /// all polls.
+    ///
+    /// Note that this includes only polls that have completed, and does not
+    /// reflect any in-progress polls. Subtracting `busy_time` from the
     /// total lifetime of the polled object results in the amount of time it
-    /// has spent *waiting* to be polled.
+    /// has spent *waiting* to be polled (including the `scheduled_time` value
+    /// from `TaskStats`, if this is a task).
     #[prost(message, optional, tag="6")]
     pub busy_time: ::core::option::Option<::prost_types::Duration>,
 }

console-api/src/generated/rs.tokio.console.tasks.rs

Lines changed: 11 additions & 0 deletions
@@ -167,6 +167,17 @@ pub struct Stats {
     /// The total number of times this task has woken itself.
     #[prost(uint64, tag="8")]
     pub self_wakes: u64,
+    /// The total duration this task was scheduled prior to being polled, summed
+    /// across all poll cycles.
+    ///
+    /// Note that this includes only polls that have started, and does not
+    /// reflect any scheduled state where the task hasn't yet been polled.
+    /// Subtracting both `busy_time` (from the task's `PollStats`) and
+    /// `scheduled_time` from the total lifetime of the task results in the
+    /// amount of time it spent unable to progress because it was waiting on
+    /// some resource.
+    #[prost(message, optional, tag="9")]
+    pub scheduled_time: ::core::option::Option<::prost_types::Duration>,
 }
 #[derive(Clone, PartialEq, ::prost::Message)]
 pub struct DurationHistogram {
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+//! Long scheduled time
+//!
+//! This example shows an application with a task that has an excessive
+//! time between being woken and being polled.
+//!
+//! It consists of a channel where a sender task sends a message
+//! through the channel and then immediately does a lot of work
+//! (simulated in this case by a call to `std::thread::sleep`).
+//!
+//! As soon as the sender task calls `send()` the receiver task gets
+//! woken, but because there's only a single worker thread, it doesn't
+//! get polled until after the sender task has finished "working" and
+//! yields (via `tokio::time::sleep`).
+//!
+//! In the console, this is visible in the `rx` task, which has very
+//! high scheduled times - in the detail screen you will see that around
+//! it is scheduled around 98% of the time. The `tx` task, on the other
+//! hand, is busy most of the time.
+use std::time::Duration;
+
+use console_subscriber::ConsoleLayer;
+use tokio::{sync::mpsc, task};
+use tracing::info;
+
+#[tokio::main(flavor = "multi_thread", worker_threads = 1)]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    ConsoleLayer::builder()
+        .with_default_env()
+        .publish_interval(Duration::from_millis(100))
+        .init();
+
+    let (tx, rx) = mpsc::channel::<u32>(1);
+    let count = 10000;
+
+    let jh_rx = task::Builder::new()
+        .name("rx")
+        .spawn(receiver(rx, count))
+        .unwrap();
+    let jh_tx = task::Builder::new()
+        .name("tx")
+        .spawn(sender(tx, count))
+        .unwrap();
+
+    let res_tx = jh_tx.await;
+    let res_rx = jh_rx.await;
+    info!(
+        "main: Joined sender: {:?} and receiver: {:?}",
+        res_tx, res_rx,
+    );
+
+    tokio::time::sleep(Duration::from_millis(200)).await;
+
+    Ok(())
+}
+
+async fn sender(tx: mpsc::Sender<u32>, count: u32) {
+    info!("tx: started");
+
+    for idx in 0..count {
+        let msg: u32 = idx;
+        let res = tx.send(msg).await;
+        info!("tx: sent msg '{}' result: {:?}", msg, res);
+
+        std::thread::sleep(Duration::from_millis(5000));
+        info!("tx: work done");
+
+        tokio::time::sleep(Duration::from_millis(100)).await;
+    }
+}
+
+async fn receiver(mut rx: mpsc::Receiver<u32>, count: u32) {
+    info!("rx: started");
+
+    for _ in 0..count {
+        let msg = rx.recv().await;
+        info!("rx: Received message: '{:?}'", msg);
+    }
+}
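The "around 98%" figure in the example's doc comment can be checked with a rough calculation. In each cycle the `rx` task is woken by `send()`, waits while `tx` blocks the single worker thread for 5000 ms, and then sits idle for roughly the 100 ms that `tx` sleeps. The sketch below assumes the time `rx` spends being polled is negligible and ignores runtime overhead; the function name `scheduled_fraction` is hypothetical.

```rust
/// Hypothetical helper: the fraction of one send/receive cycle that the
/// receiver spends scheduled (woken but not yet polled).
///
/// `blocking_ms` is how long the sender blocks the worker thread after
/// sending; `yield_ms` is how long the sender then sleeps asynchronously.
/// Assumes the receiver's own poll time is negligible.
fn scheduled_fraction(blocking_ms: u64, yield_ms: u64) -> f64 {
    blocking_ms as f64 / (blocking_ms + yield_ms) as f64
}
```

With the example's values, `scheduled_fraction(5000, 100)` is 5000 / 5100, or about 0.98.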

console-subscriber/src/stats.rs

Lines changed: 64 additions & 26 deletions
@@ -56,7 +56,7 @@ pub(crate) struct TaskStats {
     is_dropped: AtomicBool,
     // task stats
     pub(crate) created_at: Instant,
-    timestamps: Mutex<TaskTimestamps>,
+    dropped_at: Mutex<Option<Instant>>,
 
     // waker stats
     wakes: AtomicUsize,
@@ -100,12 +100,6 @@ pub(crate) struct ResourceStats {
     pub(crate) parent_id: Option<Id>,
 }
 
-#[derive(Debug, Default)]
-struct TaskTimestamps {
-    dropped_at: Option<Instant>,
-    last_wake: Option<Instant>,
-}
-
 #[derive(Debug, Default)]
 struct PollStats<H> {
     /// The number of polls in progress
@@ -118,9 +112,11 @@ struct PollStats<H> {
 #[derive(Debug, Default)]
 struct PollTimestamps<H> {
     first_poll: Option<Instant>,
+    last_wake: Option<Instant>,
     last_poll_started: Option<Instant>,
     last_poll_ended: Option<Instant>,
     busy_time: Duration,
+    scheduled_time: Duration,
     histogram: H,
 }
 
@@ -162,14 +158,16 @@ impl TaskStats {
             is_dirty: AtomicBool::new(true),
             is_dropped: AtomicBool::new(false),
             created_at,
-            timestamps: Mutex::new(TaskTimestamps::default()),
+            dropped_at: Mutex::new(None),
             poll_stats: PollStats {
                 timestamps: Mutex::new(PollTimestamps {
                     histogram: Histogram::new(poll_duration_max),
                     first_poll: None,
+                    last_wake: None,
                     last_poll_started: None,
                     last_poll_ended: None,
                     busy_time: Duration::new(0, 0),
+                    scheduled_time: Duration::new(0, 0),
                 }),
                 current_polls: AtomicUsize::new(0),
                 polls: AtomicUsize::new(0),
@@ -209,13 +207,14 @@ impl TaskStats {
     }
 
     fn wake(&self, at: Instant, self_wake: bool) {
-        let mut timestamps = self.timestamps.lock();
-        timestamps.last_wake = cmp::max(timestamps.last_wake, Some(at));
-        self.wakes.fetch_add(1, Release);
+        self.poll_stats.wake(at);
 
+        self.wakes.fetch_add(1, Release);
         if self_wake {
             self.wakes.fetch_add(1, Release);
         }
+
+        self.make_dirty();
     }
 
     pub(crate) fn start_poll(&self, at: Instant) {
@@ -235,8 +234,7 @@ impl TaskStats {
             return;
         }
 
-        let mut timestamps = self.timestamps.lock();
-        let _prev = timestamps.dropped_at.replace(dropped_at);
+        let _prev = self.dropped_at.lock().replace(dropped_at);
         debug_assert_eq!(_prev, None, "tried to drop a task twice; this is a bug!");
         self.make_dirty();
     }
@@ -257,16 +255,28 @@ impl ToProto for TaskStats {
 
     fn to_proto(&self, base_time: &TimeAnchor) -> Self::Output {
         let poll_stats = Some(self.poll_stats.to_proto(base_time));
-        let timestamps = self.timestamps.lock();
+        let timestamps = self.poll_stats.timestamps.lock();
         proto::tasks::Stats {
             poll_stats,
             created_at: Some(base_time.to_timestamp(self.created_at)),
-            dropped_at: timestamps.dropped_at.map(|at| base_time.to_timestamp(at)),
+            dropped_at: self.dropped_at.lock().map(|at| base_time.to_timestamp(at)),
             wakes: self.wakes.load(Acquire) as u64,
             waker_clones: self.waker_clones.load(Acquire) as u64,
             self_wakes: self.self_wakes.load(Acquire) as u64,
             waker_drops: self.waker_drops.load(Acquire) as u64,
             last_wake: timestamps.last_wake.map(|at| base_time.to_timestamp(at)),
+            scheduled_time: Some(
+                timestamps
+                    .scheduled_time
+                    .try_into()
+                    .unwrap_or_else(|error| {
+                        eprintln!(
+                            "failed to convert `scheduled_time` to protobuf duration: {}",
+                            error
+                        );
+                        Default::default()
+                    }),
+            ),
         }
     }
 }
@@ -287,7 +297,7 @@ impl DroppedAt for TaskStats {
         // avoid acquiring the lock if we know we haven't tried to drop this
         // thing yet
         if self.is_dropped.load(Acquire) {
-            return self.timestamps.lock().dropped_at;
+            return *self.dropped_at.lock();
         }
 
         None
@@ -466,18 +476,46 @@ impl ToProto for ResourceStats {
 // === impl PollStats ===
 
 impl<H: RecordPoll> PollStats<H> {
-    fn start_poll(&self, at: Instant) {
-        if self.current_polls.fetch_add(1, AcqRel) == 0 {
-            // We are starting the first poll
-            let mut timestamps = self.timestamps.lock();
-            if timestamps.first_poll.is_none() {
-                timestamps.first_poll = Some(at);
-            }
+    fn wake(&self, at: Instant) {
+        let mut timestamps = self.timestamps.lock();
+        timestamps.last_wake = cmp::max(timestamps.last_wake, Some(at));
+    }
 
-            timestamps.last_poll_started = Some(at);
+    fn start_poll(&self, at: Instant) {
+        if self.current_polls.fetch_add(1, AcqRel) > 0 {
+            return;
+        }
 
-            self.polls.fetch_add(1, Release);
+        // We are starting the first poll
+        let mut timestamps = self.timestamps.lock();
+        if timestamps.first_poll.is_none() {
+            timestamps.first_poll = Some(at);
         }
+
+        timestamps.last_poll_started = Some(at);
+
+        self.polls.fetch_add(1, Release);
+
+        // If the last poll ended after the last wake then it was likely
+        // a self-wake, so we measure from the end of the last poll instead.
+        // This also ensures that `busy_time` and `scheduled_time` don't overlap.
+        let scheduled = match std::cmp::max(timestamps.last_wake, timestamps.last_poll_ended) {
+            Some(scheduled) => scheduled,
+            None => return, // Async operations record polls, but not wakes
+        };
+
+        let elapsed = match at.checked_duration_since(scheduled) {
+            Some(elapsed) => elapsed,
+            None => {
+                eprintln!(
+                    "possible Instant clock skew detected: a poll's start timestamp \
+                    was before the wake time/last poll end timestamp\nwake = {:?}\n start = {:?}",
+                    scheduled, at
+                );
+                return;
+            }
+        };
+        timestamps.scheduled_time += elapsed;
     }
 
     fn end_poll(&self, at: Instant) {
@@ -534,7 +572,7 @@ impl<H> ToProto for PollStats<H> {
                 .map(|at| base_time.to_timestamp(at)),
             busy_time: Some(timestamps.busy_time.try_into().unwrap_or_else(|error| {
                 eprintln!(
-                    "failed to convert busy time to protobuf duration: {}",
+                    "failed to convert `busy_time` to protobuf duration: {}",
                     error
                 );
                 Default::default()