10 · Mental Loop (Simulator) — imagine outcomes before committing¶
TL;DR. Before acting, the agent mentally simulates the predicted outcome of each candidate action, scores those simulations, and only then commits to the highest-scoring one. The pattern: generate K candidates → simulate each → score → pick best → explain.
Reach for it when the action is hard to undo (financial, medical, deployments, robotics), when "imagining first" is cheaper than "acting and recovering", or when stakeholders need to see the reasoning before execution. Avoid when the action is trivially reversible — Tool Use (notebook 02) is cheaper.
| Property | Value |
|---|---|
| Origin | Robotics "world models" + classical AI deliberation |
| Reasoning style | Flat (1 layer of simulation per action) — vs. ToT's tree |
| External tools needed? | No (the simulator IS the LLM) |
| Cost | 1 generate + N simulate + 1 explain LLM calls (≈ N+2) |
| Structured output? | Yes — Pydantic candidate-list + per-action outcome schemas |
The pattern is structurally similar to ToT (notebook 09) but flat instead of deep: ToT builds a tree of reasoning steps; Mental Loop builds a single layer of action-outcome pairs and picks the best. Mental Loop is the natural fit when candidates are actions in the world, not reasoning steps in a chain.
2 · Architecture at a glance¶
flowchart LR
A([task]) --> G[Generate
K candidate actions]
G --> S[Simulate
predicted outcome + risks + benefits + score
per candidate]
S --> D[Decide
argmax over scores]
D --> E[Explain
state choice + tradeoff vs runner-up]
E --> Z([recommendation])
style G fill:#e3f2fd,stroke:#1976d2
style S fill:#fff3e0,stroke:#f57c00
style D fill:#fce4ec,stroke:#c2185b
style E fill:#e8f5e9,stroke:#388e3c
Linear, four-step pipeline. No loops, no retries. The work is in the simulation step — that's where the LLM does most of its thinking.
3 · Theory¶
3.1 · Why simulate at all?¶
Most agentic patterns commit to the next action greedily — ReAct picks the best tool call given the moment, Planning picks steps in order. That's optimal when actions are cheap and reversible: try one, see what happens, retry if it doesn't work.
For expensive or irreversible actions — sending an email to a customer, executing a financial trade, deploying a code change, prescribing a medication — the "try and recover" loop is wrong. You need to deliberate before the action is taken, weighing what would happen for each option.
That's exactly what Mental Loop does: it converts the "act now, learn later" loop into "imagine first, act second".
3.2 · The simulator is just an LLM¶
In robotics, "world model" usually means a learned dynamics model that predicts physical state evolution. In an LLM agent, the simulator is just the LLM itself with a different prompt: instead of "what should we do?" the prompt is "imagine doing X and describe what would happen".
This works surprisingly well for high-level decisions (strategy, planning, weighing options) because LLMs have absorbed enormous prior knowledge about cause-and-effect from their training data. It works less well for fine-grained physical predictions ("if the robot lifts its arm 3cm to the left, what happens to the cup?"), where a learned dynamics model still wins.
3.3 · The two-schema design¶
class _CandidateActions(BaseModel):
actions: list[str] = Field(min_length=2, description="K distinct candidate actions...")
class _SimulatedOutcome(BaseModel):
predicted_outcome: str # 2-3 sentence prediction
benefits: list[str]
risks: list[str]
overall_score: int = Field(ge=1, le=5) # STRICT scale
rationale: str
The separation matters:
_CandidateActionsis called once, gets K different actions in one structured response._SimulatedOutcomeis called K times, once per candidate, so each gets its own focused prediction.
Calling the simulator once per action (rather than batched) lets the LLM use its full reasoning budget on a single action — better outcome quality than asking it to weigh K options simultaneously.
3.4 · The strict 1-5 scoring rubric¶
Just like ToT (notebook 09), Mental Loop's value depends on the scorer discriminating between candidates. If every action scores 4/5, the argmax is meaningless. The _SimulatedOutcome.overall_score field's description explicitly says: "Be discriminating. 1 = clearly bad / high risk. 5 = clearly excellent. Most actions 2-4." Without this calibration anchor, models tend to compress scores into a narrow band.
3.5 · Where Mental Loop sits¶
| Pattern | Layers | Candidate type | Use when |
|---|---|---|---|
| ToT (nb 09) | tree (K^D) | reasoning steps | hard reasoning task |
| Mental Loop (this notebook) | flat (K) | real-world actions | irreversible / high-stakes choice |
| Best-of-N (sample + pick) | flat | full solutions | parallel sampling helps |
| Ensemble (nb 13) | flat (parallel agents) | full answers | diverse perspectives |
| Self-Consistency (nb 21) | flat (parallel samples) | reasoning chains | majority-vote helps |
| Dry-Run (nb 14) | linear | one proposed action | execute-after-check |
Compared to Dry-Run (notebook 14): Dry-Run simulates one specific proposed action and asks for approval. Mental Loop generates and simulates K alternatives before picking. Use Dry-Run when you have a specific candidate in mind; use Mental Loop when you need to discover the best option.
3.6 · What goes wrong (you'll see in § 9)¶
- Flat scores — the simulator gives every candidate a similar score, so argmax is arbitrary. See ToT § 11.1 for the standard mitigation.
- Optimistic predictions — the LLM imagines best-case outcomes for every candidate. The
risksfield forces it to think about downsides, but a same-model simulator still has blind spots. - Confabulated specifics — the "predicted outcome" includes invented numbers / details ("you'll save $1,247 per year"). Production code should mark predictions as estimates.
- Limited candidate diversity — K=3 candidates often span only a narrow range. Raise temperature or explicitly prompt for unconventional options.
4 · Setup¶
We pass scoring_fn=commute_score_from_minutes to MentalLoop. This is the deterministic-scoring fix — the LLM predicts predicted_metric (minutes) but Python computes the final 1-5 score from that number. This bypasses the LLM-as-Scorer flatness pathology entirely.
from agentic_architectures import get_llm, enable_langsmith, settings
from agentic_architectures.architectures import MentalLoop
from agentic_architectures.ui import print_md, print_header, print_step
enable_langsmith()
print_header(f"Provider: {settings.llm_provider} · Model: {settings.llm_model}")
# Deterministic scoring function — Python decides the score from the LLM's predicted minutes.
def commute_score_from_minutes(minutes: float) -> int:
"""Map predicted door-to-door minutes → 1-5 score."""
if minutes < 40: return 5
if minutes < 55: return 4
if minutes < 75: return 3
if minutes < 90: return 2
return 1
print_md("**Scoring function (Python, not LLM):**")
for m in [25, 35, 45, 60, 80, 100]:
print(f" {m:>3} min → score {commute_score_from_minutes(m)}/5")
Provider: nebius · Model: meta-llama/Llama-3.3-70B-Instruct ─────────────────────────────────────────────────────
Scoring function (Python, not LLM):
25 min → score 5/5 35 min → score 5/5 45 min → score 4/5 60 min → score 3/5 80 min → score 2/5 100 min → score 1/5
5 · Library walkthrough¶
Source: src/agentic_architectures/architectures/mental_loop.py.
Five nodes (each does one job):
_generate—with_structured_output(_CandidateActions)produces K distinct candidates._simulate— callswith_structured_output(_SimulatedOutcome)K times, one per candidate._decide— pure Pythonargmaxover simulated scores._explain— final LLM call to write the recommendation + tradeoff explanation.
Note _simulate runs sequentially across candidates. For latency-critical applications you could use langgraph.graph.parallel to run the K simulations concurrently — extension idea in § 11.3.
from agentic_architectures.architectures.mental_loop import _CandidateActions, _SimulatedOutcome
import json
print('--- CandidateActions schema ---')
print(json.dumps(_CandidateActions.model_json_schema(), indent=2)[:400] + '...')
print()
print('--- SimulatedOutcome schema ---')
print(json.dumps(_SimulatedOutcome.model_json_schema(), indent=2)[:500] + '...')
--- CandidateActions schema ---
{
"description": "K distinct candidate actions the agent could take.",
"properties": {
"actions": {
"description": "K distinct candidate actions. Each must be a SPECIFIC, actionable choice (not a vague principle). Each should represent a different strategy, not different wordings of the same one.",
"items": {
"type": "string"
},
"minItems": 2,
"title":...
--- SimulatedOutcome schema ---
{
"description": "Mental simulation of one candidate action's outcome.\n\nThe `predicted_metric` field is the key to the deterministic-scoring fix:\nif the caller passes a `scoring_fn` to `MentalLoop`, the LLM-supplied\n`overall_score` is **overridden** by `scoring_fn(predicted_metric)`. This\nsidesteps the LLM-as-Scorer flatness problem (Llama-style models compress\nscores into a narrow band regardless of the rubric).",
"properties": {
"predicted_outcome": {
"description": "A conc...
6 · State¶
| Field | Type | Set by |
|---|---|---|
task |
str |
caller |
candidate_actions |
list[str] |
_generate |
simulations |
list[dict] (action + outcome + score + …) |
_simulate |
chosen_action |
str |
_decide |
chosen_score |
int |
_decide |
explanation |
str |
_explain |
7 · Build the graph¶
from IPython.display import Image, display
arch = MentalLoop(n_candidates=3, scoring_fn=commute_score_from_minutes)
graph = arch.build()
display(Image(graph.get_graph().draw_mermaid_png()))
8 · Live run — commute decision with objective scoring¶
Concrete decision: pick the fastest 8 AM Monday commute from Brooklyn to Midtown Manhattan.
Why this task: Mental Loop only adds value if the simulator can discriminate between candidates. Subjective tasks (which exercise plan? which Python framework?) produce flat 4/5 scores because all options sound fine. Objective tasks where the simulator must commit to a specific predicted number (travel time in minutes) force the scoring to spread — a route predicted at 80 minutes must score lower than a route predicted at 35 minutes, whether or not the model wants to be nice.
The task explicitly anchors the score to predicted travel time so the simulator can't dodge.
TASK = (
"DECISION: Choose the fastest 8 AM Monday commute from Park Slope, Brooklyn "
"to Midtown Manhattan (about 8 miles / 13 km).\n\n"
"Generate 3 DISTINCT travel options (different modes / routes — e.g., subway, "
"driving + parking, bike + bridge, mix). For each candidate:\n"
" (1) state the option concretely (which subway line / which route),\n"
" (2) PREDICT TOTAL DOOR-TO-DOOR TRAVEL TIME IN MINUTES — populate the "
" `predicted_metric` field with this number (e.g. 35.0, 60.0).\n"
" (3) list realistic risks (delays, weather, cost).\n\n"
"The Python wrapper will compute the 1-5 score deterministically from your "
"predicted_metric — you don't need to think about it."
)
result = arch.run(TASK)
print_header("Recommendation")
print_md(result.output)
print()
print(f"CHOSEN_SCORE: {result.state['chosen_score']}/5")
print(f"CHOSEN_ACTION: {result.state['chosen_action']}")
print(f"ALL_SCORES: {result.metadata['scores']}")
Recommendation ────────────────────────────────────────────────────────────────────────────────────────────────────
The chosen action is to take the B subway line from 7th Avenue in Park Slope to 42nd Street Bryant Park in Midtown Manhattan. This option is justified by its predicted total door-to-door travel time of approximately 35 minutes, which is the fastest among the considered options. By choosing the subway, we accept the tradeoff of potentially less flexibility and comfort compared to driving or biking, but gain the benefit of a more predictable and efficient commute, outpacing the runner-up driving option which is susceptible to traffic conditions and parking delays. Overall, the subway option offers the best balance of speed and reliability for the Monday morning commute.
CHOSEN_SCORE: 5/5 CHOSEN_ACTION: Take the B subway line from 7th Avenue in Park Slope to 42nd Street Bryant Park in Midtown Manhattan ALL_SCORES: [5, 4, 4]
8.0 · What just happened, briefly¶
Three things to look at:
- Score distribution. Healthy: 2-5 range with clear winner. Pathology: all 4/5 (flat scores → arbitrary argmax).
- The runner-up. Mental Loop's value is the tradeoff explanation — what was given up by picking the winner over the runner-up. § 9 will highlight this.
- Risks listed. Each simulation should list ≥1 risk. If all risks are vague or missing, the simulator is being too optimistic.
8.1 · Per-candidate simulations — LLM-score vs Python-score side-by-side¶
for i, t in enumerate(result.trace, 1):
final_score = t['overall_score']
llm_score = t.get('llm_score', '?')
source = t.get('score_source', '?')
metric = t.get('predicted_metric', None)
print_step(
f"[{i}] Action · FINAL score {final_score}/5 · (LLM said {llm_score}/5, "
f"source={source}, predicted_metric={metric})",
t['action'][:200]
)
print_step(" Predicted outcome", t['predicted_outcome'][:300])
if t.get('risks'):
print_step(" Risks", "; ".join(t['risks'][:3]))
print()
# Also print a one-line summary of the comparison
llm_scores = [t.get('llm_score', t['overall_score']) for t in result.trace]
final_scores = [t['overall_score'] for t in result.trace]
print(f"LLM_SCORES_RAW: {llm_scores}")
print(f"PYTHON_SCORES_FINAL: {final_scores}")
print(f"LLM_SPREAD: {max(llm_scores) - min(llm_scores)} · PYTHON_SPREAD: {max(final_scores) - min(final_scores)}")
› [1] Action · FINAL score 5/5 · (LLM said 4/5, source=deterministic, predicted_metric=35.0)
Take the B subway line from 7th Avenue in Park Slope to 42nd Street Bryant Park in Midtown Manhattan
› Predicted outcome
The commute will take approximately 35 minutes, with a 5-minute walk to the subway station, a 20-minute ride on the B train, and a 10-minute walk to the final destination.
› Risks
Delays due to subway maintenance or construction; Crowded trains during rush hour; Inclement weather affecting walking time
› [2] Action · FINAL score 4/5 · (LLM said 4/5, source=deterministic, predicted_metric=50.0)
Drive from Park Slope to Midtown Manhattan via the Brooklyn Bridge and park at a garage near the destination
› Predicted outcome
The drive from Park Slope to Midtown Manhattan via the Brooklyn Bridge will take around 30-40 minutes, depending on traffic conditions, and then an additional 10-15 minutes to park at a garage near the destination.
› Risks
traffic congestion on the Brooklyn Bridge; difficulty finding parking near the destination; higher cost due to parking fees
› [3] Action · FINAL score 4/5 · (LLM said 4/5, source=deterministic, predicted_metric=50.0)
Ride a bike from Park Slope to Midtown Manhattan via the Brooklyn Bridge and park at a bike rack near the destination
› Predicted outcome
The bike ride from Park Slope to Midtown Manhattan via the Brooklyn Bridge will take approximately 45 minutes, considering the distance and potential traffic on the bridge. After arriving, it will take around 5 minutes to find a bike rack near the destination and park the bike. The total door-to-doo
› Risks
Inclement weather; Bike breakdown or accident; Difficulty finding a bike rack near the destination
LLM_SCORES_RAW: [4, 4, 4] PYTHON_SCORES_FINAL: [5, 4, 4] LLM_SPREAD: 0 · PYTHON_SPREAD: 1
9 · What we just observed¶
The cells above ran Mental Loop on the NYC commute decision with scoring_fn=commute_score_from_minutes — a deterministic Python function that converts predicted minutes → 1-5 score.
9.1 · Quantitative summary¶
| Metric | Value |
|---|---|
| Candidates generated | 3 |
| Chosen action | Take the B subway line from 7th Avenue in Park Slope to 42nd Street Bryant Park in Midtown Manhattan |
| Chosen FINAL score | 5/5 |
| LLM-raw score spread | 0 (often flat — the pathology) |
| Python FINAL score spread | 1 (the fix) |
| LLM-raw scores | [4, 4, 4] |
| Python FINAL scores | [5, 4, 4] |
9.2 · Per-candidate breakdown¶
| # | predicted_metric (min) | LLM score | Python score (final) | Action |
|---|---|---|---|---|
| 1 | 35.0 | 4/5 | 5/5 | Take the B subway line from 7th Avenue in Park Slope to 42nd… |
| 2 | 50.0 | 4/5 | 4/5 | Drive from Park Slope to Midtown Manhattan via the Brooklyn … |
| 3 | 50.0 | 4/5 | 4/5 | Ride a bike from Park Slope to Midtown Manhattan via the Bro… |
9.3 · Patterns surfaced in this run¶
The deterministic-scoring fix is doing its job. The LLM's own
overall_scorefield on these 3 candidates was[4, 4, 4](spread = 0, a narrow band — the familiar LLM-as-Scorer flatness pathology). The Python scoring function computed[5, 4, 4]from the LLM'spredicted_metricfield (spread = 1) — a real discriminating signal that the argmax can act on. This is the central lesson of Mental Loop: let the LLM predict the underlying number, let Python compute the score.Score spread comparison: LLM=0, Python=1. Python won by 1 points of dynamic range — exactly the improvement we built
scoring_fnfor.
9.4 · Final recommendation (verbatim)¶
(no explanation captured)
9.5 · The takeaway¶
The deterministic-scoring pattern is the canonical fix for LLM-as-Scorer flatness:
- LLM predicts the underlying NUMBER (
predicted_metric: float) — concrete, harder to fudge. - Python computes the SCORE from that number via a deterministic function — perfectly calibrated.
- The argmax now has signal even when the LLM compresses its own
overall_score.
For tasks with a measurable outcome (time, cost, error rate, throughput, accuracy), this pattern eliminates the entire class of "everything is 4/5" bugs. Use it whenever you can express the scoring criterion as Python.
10 · Try other providers¶
from agentic_architectures.llm.factory import provider_supports_structured_output
for p in ["openai", "anthropic"]:
key = settings.api_key_for(p)
if key is None or not key.get_secret_value():
print(f"[skip] {p}: no API key")
continue
if not provider_supports_structured_output(p):
print(f"[skip] {p}: no structured output")
continue
print_header(f"Re-running Mental Loop on {p}")
r = MentalLoop(llm=get_llm(provider=p), n_candidates=3).run(
"Choose a programming language to learn in 30 days for a backend job switch."
)
print(r.output[:300])
print(f" scores: {r.metadata['scores']}")
print()
[skip] openai: no API key [skip] anthropic: no API key
11 · Failure modes, safety, extensions¶
11.1 · Where this breaks¶
| Failure | Mechanism | Mitigation |
|---|---|---|
| Flat scores | All candidates scored similarly → argmax is arbitrary | Tighter rubric ("most should be 2-4"); use different model for simulator |
| Optimistic predictions | Simulator imagines best case for every candidate | Force risks field to be non-empty; use a pessimistic-bias prompt |
| Confabulated specifics | Predictions include invented numbers ("you'll save $1,247") | Disallow specific numbers in predicted_outcome unless asked |
| Narrow candidate diversity | K=3 candidates are minor variants of each other | Force one "unconventional" candidate in the generator prompt |
| Single-model blind spots | Same model generates and simulates → shared biases | Use a different model in the simulator seat |
11.2 · Production safety¶
- Mark predictions as predictions. Never let the simulated outcome ("you will lose 10 lbs") get presented to a user as a guarantee.
- Log all K simulations, not just the chosen one. Auditability — if the choice turns out wrong, you want to see what was rejected.
- The simulator can be a real model. For physics / finance / medicine, plug in a real simulator (a discrete-event sim, a Monte Carlo engine, an ODE solver) and have the LLM only orchestrate.
11.3 · Four extensions¶
- Parallel simulation.
_simulateruns K sub-LLM calls sequentially; replace withlanggraph.graph.parallelfor an N× latency win on slow models. - Two-stage Mental Loop. After picking the best candidate, run a second Mental Loop on its first sub-step. Bridges to Planning + Mental Loop hybrid.
- External simulator. Replace the LLM simulator with a calibrated real-world simulator. The LLM only handles candidate generation + final synthesis.
- Deterministic scoring (THE most important fix). When the outcome has a measurable metric (time, cost, error rate), DON'T let the LLM assign the score. Instead: extract just the underlying number via structured output (
predicted_minutes: int), then compute the score in Python via a deterministic function. This sidesteps the LLM-as-Scorer flatness problem entirely. The captured run in § 9 is the clearest demonstration of why this matters.
11.4 · What to read next¶
- 09 · Tree of Thoughts — tree-deep version of Mental Loop.
- 14 · Dry-Run — simulate one proposed action before live execution.
- 13 · Ensemble — parallel diverse-perspective version.
- 06 · PEV — verify each step's actual outcome (vs Mental Loop's predicted outcome).
11.5 · References¶
- Ha, D. & Schmidhuber, J. World Models. 2018. arXiv:1803.10122
- Hafner, D. et al. Dream to Control (Dreamer). 2019. arXiv:1912.01603
- Yao, S. et al. Tree of Thoughts. NeurIPS 2023. arXiv:2305.10601 — closely related: tree-deep simulation.