Agentic Abstention

Do Agents Know When to Stop Instead of Act?

Han Luo*, Bingbing Wen*, Lucy Lu Wang
* Equal contribution
ACT -> OBSERVE -> ABSTAIN
Timely, delayed, and failed abstention trajectories in a web shopping task
Agentic abstention is about when to stop: an agent can abstain as soon as infeasibility is known, abstain late after unnecessary tool calls, or fail to abstain within the interaction budget.
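
The three outcomes in the caption can be written as a small classifier. This is a minimal sketch, not the authors' evaluation code: the function name `classify_abstention`, its parameters, and the 1-indexed turn convention are assumptions made for illustration, as is the choice to count any abstention at or before the turn where infeasibility becomes known as timely.

```python
from typing import Optional


def classify_abstention(
    abstain_turn: Optional[int],
    infeasible_known_turn: int,
    budget: int,
) -> str:
    """Label one trajectory on an abstention-warranted task.

    abstain_turn: turn at which the agent abstained, or None if it never did.
    infeasible_known_turn: first turn at which infeasibility is evident (assumed given).
    budget: maximum number of turns allowed.
    """
    if abstain_turn is None or abstain_turn > budget:
        return "failed"   # no abstention within the interaction budget
    if abstain_turn <= infeasible_known_turn:
        return "timely"   # stopped as soon as infeasibility was known
    return "delayed"      # stopped, but only after extra turns


print(classify_abstention(1, 1, 10))     # timely
print(classify_abstention(4, 1, 10))     # delayed
print(classify_abstention(None, 1, 10))  # failed
```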

Agents need a stopping policy, not only a task policy.

Tool-using LLM agents often face goals that are ambiguous, underspecified, or impossible in the current environment. A reliable agent should recognize when continued interaction is unlikely to help and stop with an abstention instead of spending turns on futile actions.

01 Answer

Complete the task when the available evidence supports a valid final response or action.

02 Act

Search, browse, inspect files, or gather more observations when uncertainty can still be reduced.

03 Abstain

Stop when the task is infeasible, contradictory, or missing information that cannot be recovered.
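
One way to picture the three options is as a per-turn decision inside an interaction loop. The sketch below is illustrative only: `Decision`, `run_episode`, `toy_policy`, and the action strings are hypothetical names and do not come from the benchmark code. The loop ends as soon as the policy answers or abstains, or when the turn budget runs out.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    ANSWER = "answer"
    ACT = "act"
    ABSTAIN = "abstain"


Policy = Callable[[List[str]], Tuple[Decision, str]]


def run_episode(policy: Policy, env_step: Callable[[str], str], budget: int) -> Tuple[Decision, int]:
    """Run until the policy answers or abstains, or the budget is exhausted."""
    observations: List[str] = []
    for turn in range(1, budget + 1):
        decision, payload = policy(observations)
        if decision is not Decision.ACT:
            return decision, turn              # terminal decision and the turn it happened on
        observations.append(env_step(payload))  # ACT: gather one more observation
    return Decision.ACT, budget                 # budget used up without answering or abstaining


def toy_policy(observations: List[str]) -> Tuple[Decision, str]:
    """Hypothetical policy: search once, abstain if the search comes back empty."""
    if not observations:
        return Decision.ACT, "search[red ceramic kettle]"
    if observations[-1] == "no results":
        return Decision.ABSTAIN, "no matching product in the catalog"
    return Decision.ANSWER, observations[-1]


print(run_episode(toy_policy, lambda action: "no results", budget=10))
```

In this toy setup the agent needs one search before the empty result justifies stopping, so the abstention lands on turn 2.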

More than 28,000 tasks across web, terminal, and QA settings.

The benchmark combines solvable tasks with abstention-warranted variants where the agent may need to interact before realizing that the request cannot be satisfied.

28K+ instructions
13 LLM-as-agent systems
2 agent scaffolds
3 interactive scenarios
Terminal task construction and rewritten abstention instructions
Task construction rewrites solvable environments into abstention-warranted cases while preserving the interaction interface.

WebShop

Shopping instructions are made unsolvable by ambiguous requests or missing catalog targets that only become clear after interaction.

Terminal-Bench

Terminal tasks are rewritten to include missing prerequisites, false premises, contradictions, or underspecified goals.

Interactive QA

AbstentionBench datasets are adapted into a multi-turn setting where agents can answer, abstain, or search.
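
A possible record layout for a solvable task and its abstention-warranted rewrite is sketched below. The `Task` class, its field names, and the example instructions are invented for illustration and are not the benchmark's actual schema; the only point is that the rewrite changes the instruction and category while the scenario, and therefore the interaction interface, stays the same.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Task:
    scenario: str                               # e.g. "web", "terminal", "qa"
    instruction: str
    abstention_category: Optional[str] = None   # None means the task is solvable

    @property
    def abstention_warranted(self) -> bool:
        return self.abstention_category is not None


def make_variant(task: Task, rewritten_instruction: str, category: str) -> Task:
    """Return an abstention-warranted copy; the scenario is left unchanged."""
    return replace(task, instruction=rewritten_instruction, abstention_category=category)


# Made-up example instructions, for illustration only.
base = Task("terminal", "Compress the logs directory into logs.tar.gz.")
variant = make_variant(
    base,
    "Compress the logs directory using the settings in archive.yaml.",
    "missing prerequisite",
)
print(base.abstention_warranted, variant.abstention_warranted)
```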

Most agents abstain too late, if they abstain at all.

Abstention recall improves with additional turns, but timely abstention remains low across settings. This indicates that agents often discover infeasibility only after unnecessary interactions.

Abstention recall curves across Web, Terminal, and QA settings
Abstention recall increases with larger interaction budgets, while early abstention remains difficult.
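
To make the budget-dependent recall concrete, the sketch below assumes that recall at budget k is the fraction of abstention-warranted tasks on which the agent abstains within the first k turns. This page does not give a formal definition, so treat that reading, the function name `abstention_recall_at_k`, and the sample data as assumptions.

```python
from typing import Optional, Sequence


def abstention_recall_at_k(abstain_turns: Sequence[Optional[int]], k: int) -> float:
    """Fraction of abstention-warranted tasks with an abstention by turn k.

    abstain_turns holds, per task, the 1-indexed turn of abstention,
    or None when the agent never abstained.
    """
    if not abstain_turns:
        raise ValueError("need at least one abstention-warranted task")
    hits = sum(1 for t in abstain_turns if t is not None and t <= k)
    return hits / len(abstain_turns)


# Hypothetical outcomes for five abstention-warranted tasks.
turns = [1, 3, None, 7, 1]
print(abstention_recall_at_k(turns, 1))   # 0.4
print(abstention_recall_at_k(turns, 10))  # 0.8
```

Under this reading, recall can only stay flat or rise as k grows, which matches the shape described for the curves; the gap between the k=1 value and the large-k value is the delayed-abstention share.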
Web 26.7%: timely recall for the strongest baseline, despite much higher eventual recall.

Terminal 21.6%: best timely recall on abstention tasks under the tested GPT-5.4-mini configurations.

All settings <40%: average timely recall for every evaluated model group in the reported comparisons.

Abstention recall by abstention category
Difficulty varies by abstention category. Missing targets and missing prerequisites create delayed-abstention failures.

Reasoning effects on AbsRec@1 and AbsRec@10
Reasoning can improve early recall, but may reduce overall recall.

Cumulative over-abstention rate by turn in Web and Terminal scenarios
Over-abstention increases with interaction, while reasoning helps mitigate it.

Timely and overall recall by model parameter count
Scale mainly helps eventual recall, not necessarily timely recall.

CONVOLVE turns full interaction trajectories into reusable stopping rules.

Rather than updating model parameters, CONVOLVE distills observed failures and timely abstention evidence into a playbook that is appended to the agent context.

01 Roll out

Run agents in WebShop using the original action interface.

02 Reflect

Analyze the full trajectory for evidence that made abstention warranted.

03 Curate

Compress repeated lessons into a concise playbook.

04 Reuse

Append the playbook to future agent context without changing tools.
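
The four steps can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the CONVOLVE implementation: `build_playbook`, `with_playbook`, and the stub functions are hypothetical, reflection is stubbed out where an LLM would presumably be used, and curation is replaced by a simple frequency count over lessons.

```python
from collections import Counter
from typing import Callable, List, Sequence

Trajectory = List[str]


def build_playbook(
    tasks: Sequence[str],
    rollout: Callable[[str], Trajectory],
    reflect: Callable[[Trajectory], List[str]],
    max_lessons: int = 5,
) -> str:
    lessons: Counter = Counter()
    for task in tasks:
        trajectory = rollout(task)            # 01 Roll out
        lessons.update(reflect(trajectory))   # 02 Reflect
    # 03 Curate: keep the most frequently repeated lessons (stand-in for real curation)
    top = [lesson for lesson, _ in lessons.most_common(max_lessons)]
    return "\n".join(f"- {lesson}" for lesson in top)


def with_playbook(system_prompt: str, playbook: str) -> str:
    """04 Reuse: append the playbook to the agent context; tools are untouched."""
    if not playbook:
        return system_prompt
    return f"{system_prompt}\n\nStopping playbook:\n{playbook}"


# Toy stubs with made-up content, for illustration only.
def toy_rollout(task: str) -> Trajectory:
    return [f"search[{task}]", "no results", f"search[{task} cheap]", "no results"]


def toy_reflect(trajectory: Trajectory) -> List[str]:
    if trajectory.count("no results") >= 2:
        return ["Abstain once an exact-match search returns no results."]
    return []


playbook = build_playbook(["kettle", "lamp"], toy_rollout, toy_reflect)
print(with_playbook("You are a shopping agent.", playbook))
```

Because the playbook is plain text added to the prompt, the same text could in principle be attached to a different model's context without retraining.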

CONVOLVE result table on WebShop
CONVOLVE uses 20 trajectories and improves timely abstention without model updates.
AbsRec@1 26.7 -> 57.4
AbsRec@10 83.2 -> 100.0
SPL 55.3 -> 78.9
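
For a quick read of the size of each change, the absolute gains can be computed directly from the before and after values listed above. The snippet uses only those reported numbers; the variable names are arbitrary.

```python
# (before, after) pairs copied from the reported WebShop results.
results = {
    "AbsRec@1": (26.7, 57.4),
    "AbsRec@10": (83.2, 100.0),
    "SPL": (55.3, 78.9),
}

# Absolute gain in points, rounded to one decimal to avoid float noise.
gains = {name: round(after - before, 1) for name, (before, after) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
# AbsRec@1: +30.7 points
# AbsRec@10: +16.8 points
# SPL: +23.6 points
```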

Lessons learned by smaller models can transfer to larger models, suggesting that the useful signal is the distilled stopping rule rather than only the model that produced it.

Reliable agents need better judgment about continued action.

01 Timing matters

Correct eventual abstention can still be inefficient if the agent keeps acting after infeasibility is already clear.

02 Environment evidence is hard

Missing target and missing prerequisite tasks are difficult because infeasibility is revealed through interaction.

03 Scale is not enough

Larger models improve eventual recall more than timely recall, so stronger models do not automatically stop earlier.

04 Scaffolds matter

Terminal results show that the same base model behaves differently under different agent scaffolds.

05 Context can teach stopping

CONVOLVE improves abstention by reusing lessons from full trajectories as explicit context.

Citation pending

BibTeX will be added after the public paper release.