feat: add maximum evaluation time limit with PROMPTFOO_MAX_EVAL_TIME_MS #4322
Conversation
…MS env var and maxEvalTimeMs option
…d direct action-oriented guidance
Greptile: 6 file(s) reviewed, no comment(s).
Hey @mldangelo - I've reviewed your changes and found some issues that need to be addressed.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
Pull Request Overview

Adds a global time-limit feature for evaluations so users can cap total run time via `maxEvalTimeMs` or `PROMPTFOO_MAX_EVAL_TIME_MS`.

- Introduces a new `maxEvalTimeMs` option and corresponding environment variable for overall evaluation timeouts
- Implements global abort logic in `Evaluator.evaluate`, tracks pending steps, and records timeout results
- Updates types, schema, documentation, and tests to cover the new feature
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
| --- | --- |
| test/evaluator.test.ts | Refactored mocks and added a test for aborting when maxEvalTimeMs is exceeded |
| src/types/index.ts | Added maxEvalTimeMs to EvaluateOptionsSchema and the exported type |
| src/evaluator.ts | Implemented global timeout/abort logic, tracking, and pending-result handling |
| src/envars.ts | Introduced PROMPTFOO_MAX_EVAL_TIME_MS env var and getMaxEvalTimeMs helper |
| site/static/config-schema.json | Extended JSON schema with "maxEvalTimeMs" property |
| site/docs/usage/troubleshooting.md | Updated troubleshooting guide with triage steps and examples |
| site/docs/usage/command-line.md | Documented the new PROMPTFOO_MAX_EVAL_TIME_MS and updated related tips |
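The overview describes two entry points: the `maxEvalTimeMs` option and the `PROMPTFOO_MAX_EVAL_TIME_MS` environment variable. A quick way to exercise the env-var path in a CI harness might look like the sketch below; only the variable name comes from this PR, and the spawn invocation is illustrative:

```ts
// Hypothetical CI wrapper: cap the whole eval run via the new env var.
import { spawnSync } from 'node:child_process';

const result = spawnSync('npx', ['promptfoo', 'eval'], {
  stdio: 'inherit',
  // 10-minute global cap for the entire evaluation run
  env: { ...process.env, PROMPTFOO_MAX_EVAL_TIME_MS: String(10 * 60 * 1000) },
});
process.exitCode = result.status ?? 1;
```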
```ts
 * @returns The max duration in milliseconds, or the default value if not set.
 */
export function getMaxEvalTimeMs(defaultValue: number = 0): number {
  return getEnvInt('PROMPTFOO_MAX_EVAL_TIME_MS', defaultValue);
```
The implementation calls `getEnvInt` with a `defaultValue` parameter, but `getEnvInt` only accepts one argument. This means the `defaultValue` is ignored and may cause a type error. Change to: `const val = getEnvInt('PROMPTFOO_MAX_EVAL_TIME_MS'); return val != null ? val : defaultValue;`
```diff
- return getEnvInt('PROMPTFOO_MAX_EVAL_TIME_MS', defaultValue);
+ const val = getEnvInt('PROMPTFOO_MAX_EVAL_TIME_MS');
+ return val != null ? val : defaultValue;
```
```diff
  const originalReadFileSync = fs.readFileSync;
- fs.readFileSync = jest.fn().mockImplementation((path) => {
-   if (path.includes('test_file.txt')) {
+ jest.spyOn(fs, 'readFileSync').mockImplementation((path) => {
+   if (typeof path === 'string' && path.includes('test_file.txt')) {
      return '<h1>Sample Report</h1><p>This is a test report with some data for the year 2023.</p>';
    }
    return originalReadFileSync(path);
```
[nitpick] After using `jest.spyOn` on `fs.readFileSync`, the spy is never restored and could leak into other tests. Consider adding `afterEach(() => jest.restoreAllMocks());` or calling `.mockRestore()` on the spy.
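A minimal sketch of that cleanup, with an illustrative test body rather than the PR's actual test:

```ts
import fs from 'fs';

describe('file-based fixtures', () => {
  afterEach(() => {
    jest.restoreAllMocks(); // puts the real fs.readFileSync back after each test
  });

  it('serves the stubbed report', () => {
    jest.spyOn(fs, 'readFileSync').mockReturnValue('<h1>Sample Report</h1>');
    expect(fs.readFileSync('test_file.txt', 'utf8')).toBe('<h1>Sample Report</h1>');
  });
});
```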
```ts
logger.info(
  `[${numComplete}/${serialRunEvalOptions.length}] Running ${provider} with vars: ${vars}`,
);
try {
```
[nitpick] The large try/catch block covers both serial and concurrent evaluation logic, making the method hard to follow. Consider extracting serial and concurrent processing into separate helper functions for clarity.
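One shape such an extraction could take, as a hedged sketch; the class and method names are assumptions, not the PR's actual code:

```ts
interface EvalStep {
  run(): Promise<void>;
}

class EvaluatorSketch {
  async evaluate(serial: EvalStep[], concurrent: EvalStep[]): Promise<void> {
    await this.runSerial(serial);
    await this.runConcurrent(concurrent);
  }

  // Sequential path: one provider call at a time, easy to reason about.
  private async runSerial(steps: EvalStep[]): Promise<void> {
    for (const step of steps) {
      await step.run();
    }
  }

  // Concurrent path: fire everything and let failures surface individually.
  private async runConcurrent(steps: EvalStep[]): Promise<void> {
    await Promise.allSettled(steps.map((step) => step.run()));
  }
}
```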
Co-authored-by: gru-agent[bot] <185149714+gru-agent[bot]@users.noreply.github.com>
Tusk: ⏩ No test scenarios generated (24f7e7d).
Walkthrough

This change introduces a global evaluation timeout feature across the documentation, configuration schema, environment variable utilities, evaluator logic, and tests. Two new environment variables, …

Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint: npm error Exit handler never called!
Actionable comments posted: 3
🔭 Outside diff range comments (1)

src/evaluator.ts (1), lines 1027-1030: ⚠️ Potential issue: Global timeout signal is lost inside the per-step timeout wrapper

`processEvalStepWithTimeout` overwrites `evalStep.abortSignal`, dropping the global `maxEvalTimeMs` signal. Running steps will keep going after the global abort.

```diff
- const evalStepWithSignal = {
-   ...evalStep,
-   abortSignal: signal,
- };
+ const evalStepWithSignal = {
+   ...evalStep,
+   abortSignal: evalStep.abortSignal
+     ? /* combine signals */ ((AbortSignal as any).any
+         ? (AbortSignal as any).any([evalStep.abortSignal, signal])
+         : (() => {
+             const ac = new AbortController();
+             const handler = () => ac.abort();
+             evalStep.abortSignal.addEventListener('abort', handler);
+             signal.addEventListener('abort', handler);
+             return ac.signal;
+           })())
+     : signal,
+ };
```

This ensures both per-step and global cancellations propagate to the provider.
♻️ Duplicate comments (1)

test/evaluator.test.ts (1), lines 30-44: Redundant `result` variable can be inlined. Returning the object directly keeps the mock concise and removes one unnecessary allocation.
🧹 Nitpick comments (4)

site/docs/usage/command-line.md (1), lines 517-518: Add default values for the timeout environment variables. The table entries for `PROMPTFOO_EVAL_TIMEOUT_MS` and `PROMPTFOO_MAX_EVAL_TIME_MS` lack default values. Consider adding `0` to the Default column to reflect the fallback behavior.

site/docs/usage/troubleshooting.md (1), lines 175-181: Minor wording nitpick: "…providers that get stuck" → "…providers that become stuck" for a slightly more formal tone (mirrors the LanguageTool warning below).

🪛 LanguageTool: [style] ~178: The verb "get" can be informal. Consider replacing it with a form of "to be". Context: "...ndle custom providers or providers that get stuck - Prevent runaway costs from long-runni..." (GET_USED_ARE_USED)
test/evaluator.test.ts (2)

Lines 302-308: Optional: speed up the test by switching to fake timers. This test relies on real timers; `jest.useFakeTimers()` plus `advanceTimersByTime()` would make it complete instantly and avoid flakiness on busy CI runners.
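A sketch of the fake-timer variant; the `evaluate` call shape is assumed here, not promptfoo's actual API:

```ts
declare function evaluate(suite: unknown, opts: { maxEvalTimeMs: number }): Promise<unknown>;
declare const testSuite: unknown;

it('aborts once maxEvalTimeMs elapses', async () => {
  jest.useFakeTimers();
  try {
    const pending = evaluate(testSuite, { maxEvalTimeMs: 100 }); // hypothetical call
    jest.advanceTimersByTime(101); // fires the global setTimeout without real waiting
    await pending;
  } finally {
    jest.useRealTimers();
  }
});
```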
Lines 2297-2369: Test looks good, but add `afterEach` cleanup for mocks. `slowApiProvider.cleanup` is asserted, but the spy itself isn't restored. Add `afterEach(() => jest.restoreAllMocks())` in this describe block to avoid leakage into other tests.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (8)

- site/docs/usage/command-line.md (2 hunks)
- site/docs/usage/troubleshooting.md (1 hunks)
- site/static/config-schema.json (1 hunks)
- src/envars.ts (2 hunks)
- src/evaluator.ts (5 hunks)
- src/types/index.ts (1 hunks)
- test/envars.test.ts (2 hunks)
- test/evaluator.test.ts (3 hunks)
🧰 Additional context used

📓 Path-based instructions (1)

`src/**`: This is a CLI tool so errors need to be handled gracefully and logged with lots of information so the user can give us enough data to fix the issue or pass it to the developers.

Applied to:
- src/types/index.ts
- src/envars.ts
- src/evaluator.ts

🧬 Code Graph Analysis (1)

test/envars.test.ts (1) → src/envars.ts (1): `getMaxEvalTimeMs` (lines 399-401)
🪛 LanguageTool

site/docs/usage/troubleshooting.md: [style] ~178: The verb "get" can be informal. Consider replacing it with a form of "to be". Context: "...ndle custom providers or providers that get stuck - Prevent runaway costs from long-runni..." (GET_USED_ARE_USED)
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: Share Test
- GitHub Check: Redteam Custom Enterprise Server
- GitHub Check: Redteam
- GitHub Check: Test on Node 24.x and macOS-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 18.x and ubuntu-latest
- GitHub Check: Test on Node 18.x and macOS-latest
- GitHub Check: Test on Node 18.x and windows-latest
- GitHub Check: Build on Node 20.x
- GitHub Check: Style Check
- GitHub Check: Build Docs
- GitHub Check: Build on Node 24.x
- GitHub Check: Build on Node 22.x
- GitHub Check: Build on Node 18.x
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)

src/envars.ts (2)

Line 41: Added `PROMPTFOO_MAX_EVAL_TIME_MS` to `EnvVars`. This mirrors the existing `PROMPTFOO_EVAL_TIMEOUT_MS` entry and correctly expands the supported environment variables.

Lines 394-402: Implemented `getMaxEvalTimeMs` correctly. The function consistently uses `getEnvInt` with a default value and matches the pattern of `getEvalTimeoutMs`.

site/static/config-schema.json (1)

Lines 1523-1526: Schema extension for `maxEvalTimeMs` is correct. The new numeric property is appropriately added under `evaluateOptions` and aligns with `timeoutMs`.

test/envars.test.ts (1)

Lines 346-386: Questionable acceptance of negative timeouts.

The new negative-value test (lines 377-380) expects `getMaxEvalTimeMs()` to return `-1000`. A negative global timeout has no clear meaning (should the evaluation abort immediately? be unlimited?). Accepting it silently risks surprises later in `evaluator.ts`.

Consider clamping to `0` or rejecting values below 0 in `getMaxEvalTimeMs` (and adjusting the schema as suggested) instead of supporting an undefined semantic.

If negative values are intentionally supported, add documentation clarifying their effect and a unit test in `evaluator.test.ts` that demonstrates the expected behaviour.
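A minimal clamping sketch, assuming the two-argument `getEnvInt` shown earlier in this thread; treating non-positive values as "no limit" is one possible design choice here, not the PR's shipped behavior:

```ts
import { getEnvInt } from './envars'; // path is illustrative

export function getMaxEvalTimeMsClamped(defaultValue: number = 0): number {
  const raw = getEnvInt('PROMPTFOO_MAX_EVAL_TIME_MS', defaultValue);
  // Negative or non-finite values fall back to 0, i.e. "no global cap".
  return Number.isFinite(raw) && raw > 0 ? Math.floor(raw) : 0;
}
```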
```ts
  timeoutMs: z.number().optional(),
  /**
   * Maximum runtime in milliseconds for the entire evaluation. Default is 0 (no limit).
   */
  maxEvalTimeMs: z.number().optional(),
});
```
🛠️ Refactor suggestion

Make `maxEvalTimeMs` an integer ≥ 0 to prevent invalid configurations.

`timeoutMs` and `maxEvalTimeMs` serve the same purpose (time limits) and should behave identically. Using just `.number()` accepts floats and negative values even though neither makes sense for a timeout. Tighten the schema to match the runtime helper (`getMaxEvalTimeMs`, which floors the value to an int) and avoid unexpected behaviour.

```diff
-  /**
-   * Maximum runtime in milliseconds for the entire evaluation. Default is 0 (no limit).
-   */
-  maxEvalTimeMs: z.number().optional(),
+  /**
+   * Maximum runtime in milliseconds for the entire evaluation.
+   * Must be an integer ≥ 0. 0 means “no limit”.
+   */
+  maxEvalTimeMs: z.coerce.number().int().nonnegative().optional(),
```
📝 Committable suggestion (carefully review before committing: ensure it accurately replaces the highlighted code, contains no missing lines, and has no indentation issues):
```ts
  timeoutMs: z.number().optional(),
  /**
   * Maximum runtime in milliseconds for the entire evaluation.
   * Must be an integer ≥ 0. 0 means “no limit”.
   */
  maxEvalTimeMs: z.coerce.number().int().nonnegative().optional(),
});
```
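A quick behavioral check of the tightened schema (zod v3.20+ assumed, for `z.coerce`):

```ts
import { z } from 'zod';

const opts = z.object({
  maxEvalTimeMs: z.coerce.number().int().nonnegative().optional(),
});

opts.parse({});                        // ok: the field is optional
opts.parse({ maxEvalTimeMs: 5000 });   // ok
opts.parse({ maxEvalTimeMs: '5000' }); // ok: coerced to the number 5000
// opts.parse({ maxEvalTimeMs: -1 });  // throws: number must be >= 0
// opts.parse({ maxEvalTimeMs: 1.5 }); // throws: expected integer
```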
🤖 Prompt for AI Agents

In src/types/index.ts around lines 188 to 193, the maxEvalTimeMs field is currently defined as an optional number, which allows floats and negative values. To fix this, change the schema to require maxEvalTimeMs to be an integer greater than or equal to 0, matching the behavior of timeoutMs and the runtime helper getMaxEvalTimeMs. Use appropriate zod methods to enforce integer type and non-negativity to prevent invalid timeout configurations.
```ts
if (globalTimeout) {
  clearTimeout(globalTimeout);
}

if (evalTimedOut) {
  for (let i = 0; i < runEvalOptions.length; i++) {
    if (!processedIndices.has(i)) {
      const evalStep = runEvalOptions[i];
      const timeoutResult = {
        provider: {
          id: evalStep.provider.id(),
          label: evalStep.provider.label,
          config: evalStep.provider.config,
        },
        prompt: {
          raw: evalStep.prompt.raw,
          label: evalStep.prompt.label,
          config: evalStep.prompt.config,
        },
        vars: evalStep.test.vars || {},
        error: `Evaluation exceeded max duration of ${maxEvalTimeMs}ms`,
        success: false,
        failureReason: ResultFailureReason.ERROR,
        score: 0,
        namedScores: {},
        latencyMs: Date.now() - startTime,
        promptIdx: evalStep.promptIdx,
        testIdx: evalStep.testIdx,
        testCase: evalStep.test,
        promptId: evalStep.prompt.id || '',
      } as EvaluateResult;

      await this.evalRecord.addResult(timeoutResult);
      this.stats.errors++;
      const { metrics } = prompts[evalStep.promptIdx];
      if (metrics) {
        metrics.testErrorCount += 1;
        metrics.totalLatencyMs += timeoutResult.latencyMs;
      }
    }
  }
}
```
🛠️ Refactor suggestion

Unprocessed-step timeout results update prompt metrics but not top-level token usage. `this.stats.errors++` is updated, yet the aggregate `tokenUsage` isn't touched. If any tokens were spent before the global abort, totals will be under-reported. Consider adding the same token-usage accumulation you perform elsewhere.
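A sketch of what that accumulation might look like; the field names mirror common token-usage shapes and are assumptions, not necessarily promptfoo's actual TokenUsage type:

```ts
interface TokenUsageSketch {
  total: number;
  prompt: number;
  completion: number;
}

function addTokenUsage(aggregate: TokenUsageSketch, step?: Partial<TokenUsageSketch>): void {
  if (!step) {
    return; // nothing was spent before the abort
  }
  aggregate.total += step.total ?? 0;
  aggregate.prompt += step.prompt ?? 0;
  aggregate.completion += step.completion ?? 0;
}
```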
🤖 Prompt for AI Agents

In src/evaluator.ts between lines 1395 and 1436, the timeout handling code updates error counts and prompt-level metrics but does not update the top-level token usage statistics. To fix this, add the same token usage accumulation logic here as used elsewhere in the evaluation process, ensuring that any tokens consumed before the timeout are correctly added to the aggregate tokenUsage totals.
```ts
const startTime = Date.now();
const maxEvalTimeMs = options.maxEvalTimeMs ?? getMaxEvalTimeMs();
let evalTimedOut = false;
let globalTimeout: NodeJS.Timeout | undefined;
let globalAbortController: AbortController | undefined;
const processedIndices = new Set<number>();

if (maxEvalTimeMs > 0) {
  globalAbortController = new AbortController();
  options.abortSignal = options.abortSignal
    ? AbortSignal.any([options.abortSignal, globalAbortController.signal])
    : globalAbortController.signal;
  globalTimeout = setTimeout(() => {
    evalTimedOut = true;
    globalAbortController?.abort();
  }, maxEvalTimeMs);
}
```
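The snippet above pairs one AbortController with one setTimeout for the whole run. The same pattern in isolation, as a self-contained sketch rather than promptfoo code:

```ts
// Run tasks under a shared time budget: one controller, one timer, every task
// receives the shared signal and should stop work when it fires.
async function runWithBudget<T>(
  tasks: Array<(signal: AbortSignal) => Promise<T>>,
  budgetMs: number,
): Promise<PromiseSettledResult<T>[]> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await Promise.allSettled(tasks.map((task) => task(controller.signal)));
  } finally {
    clearTimeout(timer); // mirrors the clearTimeout(globalTimeout) cleanup in the diff
  }
}
```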
`AbortSignal.any` may break Node ≤ 18 and older TypeScript targets.

`AbortSignal.any()` is only available from Node 20 / lib.dom 2023. Building or running on older runtimes will fail at compile time and at runtime.

```diff
- options.abortSignal = options.abortSignal
-   ? AbortSignal.any([options.abortSignal, globalAbortController.signal])
-   : globalAbortController.signal;
+ options.abortSignal = options.abortSignal
+   ? // Fallback for Node <20
+     (AbortSignal as any).any
+     ? (AbortSignal as any).any([options.abortSignal, globalAbortController.signal])
+     : (() => {
+         const ac = new AbortController();
+         const handler = () => ac.abort();
+         options.abortSignal!.addEventListener('abort', handler);
+         globalAbortController.signal.addEventListener('abort', handler);
+         return ac.signal;
+       })()
+   : globalAbortController.signal;
```

Provide a polyfill or guard as above, or bump the minimum Node / TS target.
📝 Committable suggestion (carefully review before committing: ensure it accurately replaces the highlighted code, contains no missing lines, and has no indentation issues):
```ts
const startTime = Date.now();
const maxEvalTimeMs = options.maxEvalTimeMs ?? getMaxEvalTimeMs();
let evalTimedOut = false;
let globalTimeout: NodeJS.Timeout | undefined;
let globalAbortController: AbortController | undefined;
const processedIndices = new Set<number>();

if (maxEvalTimeMs > 0) {
  globalAbortController = new AbortController();
  options.abortSignal = options.abortSignal
    ? // Fallback for Node <20
      (AbortSignal as any).any
      ? (AbortSignal as any).any([options.abortSignal, globalAbortController.signal])
      : (() => {
          const ac = new AbortController();
          const handler = () => ac.abort();
          options.abortSignal!.addEventListener('abort', handler);
          globalAbortController.signal.addEventListener('abort', handler);
          return ac.signal;
        })()
    : globalAbortController.signal;
  globalTimeout = setTimeout(() => {
    evalTimedOut = true;
    globalAbortController?.abort();
  }, maxEvalTimeMs);
}
```
🤖 Prompt for AI Agents

In src/evaluator.ts around lines 531 to 547, the use of AbortSignal.any() is not supported in Node.js versions 18 and below or older TypeScript targets, causing build and runtime failures. To fix this, implement a polyfill or conditional check that only uses AbortSignal.any() if it exists; otherwise, fall back to an alternative approach such as manually combining multiple abort signals. This ensures compatibility with older Node.js versions without raising errors.
Adds a maximum evaluation time limit via a new `PROMPTFOO_MAX_EVAL_TIME_MS` environment variable and a `maxEvalTimeMs` API option. Includes comprehensive documentation and test coverage. Useful for CI/CD time limits, cost control, and preventing runaway evaluations.
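A usage sketch for the API option, based on the description above; the option name comes from this PR, while the rest of the call (prompts, providers, tests) is illustrative:

```ts
import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate(
  {
    prompts: ['Summarize: {{text}}'],
    providers: ['openai:gpt-4o-mini'],
    tests: [{ vars: { text: 'The quick brown fox…' } }],
  },
  { maxEvalTimeMs: 30 * 60 * 1000 }, // hard 30-minute ceiling for the whole run
);
console.log(results);
```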