The deterministic-picker pattern¶

This is the central technical pattern of the repo. It shows up in 8+ architectures and is the universal escape from the LLM-as-Scorer flat-band pathology.

The problem¶

Ask Llama-3.3-70B (or most instruction-tuned LLMs) to emit a single numeric quality score on a 1-5 or 1-10 scale, and you get this:

sample 1: score=4/5
sample 2: score=4/5
sample 3: score=4/5
sample 4: score=4/5

Regardless of how strict the rubric. Even when prompted "be calibrated, reserve 5/5 for genuine excellence", the model collapses to a narrow band. This is documented in Tree of Thoughts (nb 09), Mental Loop (nb 10), Ensemble (nb 13).

Architectures that depend on the score to pick something (beam search, MCTS, ranked retrieval, accept-or-reject loops) become arbitrary — there's no signal to discriminate on.

The fix¶

Don't ask the LLM for a number. Ask it for categorical features the score will be composed from, then have Python compose the deciding signal:

class _EditorCritique(BaseModel):
    is_on_brief: bool                          # LLM commits to a bool, not a number
    word_count: int
    has_concrete_imagery: bool
    avoids_cliches: bool
    is_engaging: bool

def _composite_score(features: dict, wc_range: tuple) -> int:
    score = 4 * features["is_on_brief"]
    score += 2 if wc_range[0] <= features["word_count"] <= wc_range[1] else 0
    score += 2 * features["has_concrete_imagery"]
    score += 1 * features["avoids_cliches"]
    score += 1 * features["is_engaging"]
    return score  # 0-10, with REAL SPREAD

The LLM can't flat-band 5 independent booleans the way it flat-bands one number. Python's score now ranges over [0, 10] honestly because it depends on 5 separate commitments.

Why this works¶

Granular commitment. Saying "yes, this avoids clichés" is a different cognitive operation than saying "this is a 6/10".
Auditable. You can show the user which features drove the score.
Python computes the number. The LLM never emits the deciding signal directly.

Where the pattern shows up¶

Architecture	LLM commits to	Python composes
Mental Loop (nb 10)	`predicted_metric: float`	`scoring_fn(predicted_metric) → int`
Ensemble (nb 13)	`categorical_answer: str`	`Counter(answers).most_common(1)`
Dry-Run (nb 14)	`irreversibility: int 1-5`	`approved = irreversibility < threshold`
RLHF Self-Improvement (nb 15)	5 booleans + word_count	weighted composite
Reflexive Metacognitive (nb 17)	`requires_credentials: bool, capability_match: int`	`if creds or cap<=2: route='escalate'`
Self-Consistency (nb 21)	per-sample `answer: str`	`Counter` majority vote
LATS (nb 22)	`(makes_progress, is_complete, avoids_loops, confidence)`	`5complete + 2progress + 1*no_loops + conf_weight`
Corrective RAG (nb 24)	per-doc `Literal[relevant, ambiguous, irrelevant]`	route from label counts
Self-RAG (nb 25)	per-doc 3 categorical reflection tokens	Python AND `is_relevant != not_relevant AND is_supported != no_support`
Adaptive RAG (nb 26)	`complexity: Literal[no_retrieval, single, multi]`	`if/elif` route
Debate (nb 28)	per-agent `answer: str`	`Counter` on final round
Constitutional AI (nb 32)	per-rule `verdict: Literal[pass, fail]`	`all(v == "pass")`
BrowserAgent (nb 34)	structured action with `target`	`_check_safety(action) → allowed: bool`

Architecturally immune by design¶

Some architectures have no LLM-as-Scorer step at all because their decisions are categorical or content-based:

Reflexion (nb 18) — pass/fail is a pure-Python checker (default_haiku_checker); recall is vector similarity (FAISS does its job)
Self-Discover (nb 19) — SELECT picks indices; ADAPT/IMPLEMENT produce text; SOLVE produces an answer
CoVe (nb 20) — REVISE makes keep/drop decisions per claim; confidence is categorical
GraphRAG (nb 27) — local vs global is categorical; traversal is mechanical
Voyager (nb 29) — reuse vs write_new is categorical; skills execute in subprocess
MemGPT (nb 31) — action is Literal[write_to_archival, search_archival, answer]
SWE-Agent (nb 33) — action is Literal[list, read, write, run_check, answer]
AWM (nb 35) — retrieve match / no-match

The takeaway¶

Whenever an architecture has a picker — a step that ranks, scores, or selects — apply this discipline:

Identify the categorical features the picker should decide on.
Pydantic-schema them with strict types (bool, int with bounds, Literal[...]).
Compose the deciding signal in Python using those features.
Keep the LLM's numeric output (if any) on the trace for comparison only — never as the deciding value.

This pattern is architectural, not a hyperparameter. Once you build it in, the architecture is immune to the flat-band pathology for the lifetime of the codebase.