The deterministic-picker pattern
This is the central technical pattern of the repo. It shows up in 8+ architectures and is the universal escape from the LLM-as-Scorer flat-band pathology.
The problem
Ask Llama-3.3-70B (or most instruction-tuned LLMs) to emit a single numeric quality score on a 1-5 or 1-10 scale, and the scores cluster in a narrow band regardless of how strict the rubric is. Even when prompted "be calibrated, reserve 5/5 for genuine excellence", the model still collapses to that band. This is documented in Tree of Thoughts (nb 09), Mental Loop (nb 10), and Ensemble (nb 13).
Architectures that depend on the score to pick something (beam search, MCTS, ranked retrieval, accept-or-reject loops) become arbitrary — there's no signal to discriminate on.
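To make "no signal to discriminate on" concrete, here is a minimal sketch. The scores are made-up illustrative values, not measurements from the notebooks, and `pick_best` and `tie_fraction` are hypothetical helpers standing in for the selection step of a beam search or ranked retrieval.

```python
from collections import Counter


def pick_best(candidates: list[str], scores: list[int]) -> str:
    """Return the highest-scoring candidate; max() breaks ties by list order."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]


def tie_fraction(scores: list[int]) -> float:
    """Fraction of candidates sharing the top score (1.0 means no signal at all)."""
    top = max(scores)
    return Counter(scores)[top] / len(scores)


# Illustrative values only: every candidate lands on the same score.
candidates = ["draft_a", "draft_b", "draft_c", "draft_d"]
flat_scores = [7, 7, 7, 7]

# With a full tie, the "winner" is whichever candidate happens to come first.
winner = pick_best(candidates, flat_scores)
reordered_winner = pick_best(candidates[::-1], flat_scores)
```

Reversing the candidate list changes the winner even though no score changed, which is what "arbitrary" means here: the pick is decided by ordering, not by quality.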
The fix
Don't ask the LLM for a number. Ask it for categorical features the score will be composed from, then have Python compose the deciding signal:

    from pydantic import BaseModel

    class _EditorCritique(BaseModel):
        is_on_brief: bool  # LLM commits to a bool, not a number
        word_count: int
        has_concrete_imagery: bool
        avoids_cliches: bool
        is_engaging: bool

    def _composite_score(features: dict, wc_range: tuple) -> int:
        # Booleans multiply as 0/1, so each weight is added in full or not at all.
        score = 4 * features["is_on_brief"]
        score += 2 if wc_range[0] <= features["word_count"] <= wc_range[1] else 0
        score += 2 * features["has_concrete_imagery"]
        score += 1 * features["avoids_cliches"]
        score += 1 * features["is_engaging"]
        return score  # 0-10, with REAL SPREAD

The LLM can't flat-band five independent feature commitments the way it flat-bands one number. The score computed in Python can now land anywhere in [0, 10], because it depends on five separate commitments rather than one.
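As a usage sketch, the composite can be run on two hand-written feature dictionaries. The feature values and the `(40, 80)` word-count range are invented for illustration; `composite_score` repeats the weighting shown above so that the block runs on its own.

```python
def composite_score(features: dict, wc_range: tuple[int, int]) -> int:
    """Same weighting as _composite_score, repeated so this block is standalone."""
    score = 4 * features["is_on_brief"]
    score += 2 if wc_range[0] <= features["word_count"] <= wc_range[1] else 0
    score += 2 * features["has_concrete_imagery"]
    score += 1 * features["avoids_cliches"]
    score += 1 * features["is_engaging"]
    return score


# Hypothetical critiques, as they might look after parsing the LLM's output.
strong_draft = {
    "is_on_brief": True,
    "word_count": 60,
    "has_concrete_imagery": True,
    "avoids_cliches": True,
    "is_engaging": False,
}
weak_draft = {
    "is_on_brief": False,
    "word_count": 120,
    "has_concrete_imagery": False,
    "avoids_cliches": True,
    "is_engaging": True,
}

strong = composite_score(strong_draft, (40, 80))  # 4 + 2 + 2 + 1 + 0 = 9
weak = composite_score(weak_draft, (40, 80))      # 0 + 0 + 0 + 1 + 1 = 2
```

The two drafts end up seven points apart, and the gap traces back to specific features (brief, length, imagery) rather than to a single opaque number.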
Why this works
- Granular commitment. Saying "yes, this avoids clichés" is a different cognitive operation than saying "this is a 6/10".
- Auditable. You can show the user which features drove the score.
- Python computes the number. The LLM never emits the deciding signal directly.
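The "auditable" point can be sketched as a per-feature breakdown. This is an illustrative variant, not code from the repo: `WEIGHTS`, `WORD_COUNT_WEIGHT`, and `score_breakdown` are hypothetical names, and the weights simply mirror the composite shown earlier.

```python
# Weights mirror the composite score: 4 / 2 / 1 / 1, plus 2 for word count in range.
WEIGHTS = {
    "is_on_brief": 4,
    "has_concrete_imagery": 2,
    "avoids_cliches": 1,
    "is_engaging": 1,
}
WORD_COUNT_WEIGHT = 2


def score_breakdown(features: dict, wc_range: tuple[int, int]) -> dict[str, int]:
    """Points contributed by each feature, so the total can be explained."""
    breakdown = {name: weight * bool(features[name]) for name, weight in WEIGHTS.items()}
    in_range = wc_range[0] <= features["word_count"] <= wc_range[1]
    breakdown["word_count_in_range"] = WORD_COUNT_WEIGHT * in_range
    return breakdown


# Hypothetical critique: on brief and engaging, but too long and short on imagery.
features = {
    "is_on_brief": True,
    "word_count": 95,
    "has_concrete_imagery": False,
    "avoids_cliches": True,
    "is_engaging": True,
}
breakdown = score_breakdown(features, (40, 80))
total = sum(breakdown.values())
missed = [name for name, points in breakdown.items() if points == 0]
```

Here `missed` names the features that cost the draft points, which is the explanation a single LLM-emitted number cannot give.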
Where the pattern shows up
| Architecture | LLM commits to | Python composes |
|---|---|---|
| Mental Loop (nb 10) | `predicted_metric: float` | `scoring_fn(predicted_metric) → int` |
| Ensemble (nb 13) | `categorical_answer: str` | `Counter(answers).most_common(1)` |
| Dry-Run (nb 14) | `irreversibility: int` 1-5 | `approved = irreversibility < threshold` |
| RLHF Self-Improvement (nb 15) | 5 booleans + `word_count` | weighted composite |
| Reflexive Metacognitive (nb 17) | `requires_credentials: bool`, `capability_match: int` | `if creds or cap<=2: route='escalate'` |
| Self-Consistency (nb 21) | per-sample `answer: str` | `Counter` majority vote |
| LATS (nb 22) | `(makes_progress, is_complete, avoids_loops, confidence)` | `5*complete + 2*progress + 1*no_loops + conf_weight` |
| Corrective RAG (nb 24) | per-doc `Literal[relevant, ambiguous, irrelevant]` | route from label counts |
| Self-RAG (nb 25) | per-doc 3 categorical reflection tokens | Python AND: `is_relevant != not_relevant AND is_supported != no_support` |
| Adaptive RAG (nb 26) | `complexity: Literal[no_retrieval, single, multi]` | `if/elif` route |
| Debate (nb 28) | per-agent `answer: str` | `Counter` on final round |
| Constitutional AI (nb 32) | per-rule `verdict: Literal[pass, fail]` | `all(v == "pass")` |
| BrowserAgent (nb 34) | structured action with target | `_check_safety(action) → allowed: bool` |
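Two of the simplest composers in the table, the `Counter` majority vote and the all-rules-pass check, can be sketched as follows. The function names, sample answers, and rule names are hypothetical and are not taken from the notebooks.

```python
from collections import Counter


def majority_answer(answers: list[str]) -> tuple[str, int]:
    """Most common answer and its vote count; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0]


def passes_all_rules(verdicts: dict[str, str]) -> bool:
    """True only if every per-rule verdict is the literal string 'pass'."""
    return all(v == "pass" for v in verdicts.values())


# Hypothetical per-sample answers from five independent LLM calls.
samples = ["42", "42", "41", "42", "17"]
answer, votes = majority_answer(samples)

# Hypothetical per-rule verdicts for one draft.
verdicts = {"no_personal_data": "pass", "cites_sources": "fail"}
approved = passes_all_rules(verdicts)
```

In both cases the LLM only ever emits a label, and the deciding value (a vote count or a boolean) is computed outside the model. Note that `all()` over an empty collection is `True`, so a pipeline built this way may want to reject an empty verdict set explicitly.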
Architecturally immune by design
Some architectures have no LLM-as-Scorer step at all because their decisions are categorical or content-based:
- Reflexion (nb 18): pass/fail is a pure-Python checker (`default_haiku_checker`); recall is vector similarity (FAISS does its job)
- Self-Discover (nb 19): SELECT picks indices; ADAPT/IMPLEMENT produce text; SOLVE produces an answer
- CoVe (nb 20): REVISE makes keep/drop decisions per claim; confidence is categorical
- GraphRAG (nb 27): local vs global is categorical; traversal is mechanical
- Voyager (nb 29): reuse vs write_new is categorical; skills execute in subprocess
- MemGPT (nb 31): action is `Literal[write_to_archival, search_archival, answer]`
- SWE-Agent (nb 33): action is `Literal[list, read, write, run_check, answer]`
- AWM (nb 35): retrieve match / no-match
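A categorical action leaves nothing to flat-band: the output is either one of the allowed labels or it is rejected. The sketch below uses the MemGPT action labels listed above, but `Action` and `parse_action` are hypothetical names and this is not the notebook's actual code.

```python
from typing import Literal, get_args

# The three labels from the MemGPT entry above.
Action = Literal["write_to_archival", "search_archival", "answer"]


def parse_action(raw: str) -> str:
    """Accept only the exact labels in Action; anything else is rejected."""
    label = raw.strip()
    if label not in get_args(Action):
        raise ValueError(f"unknown action: {label!r}")
    return label


chosen = parse_action("search_archival\n")

# A numeric "score" is not a valid action, so it can never steer the loop.
try:
    parse_action("7/10")
    rejected = False
except ValueError:
    rejected = True
```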
The takeaway
Whenever an architecture has a picker — a step that ranks, scores, or selects — apply this discipline:
- Identify the categorical features the picker should decide on.
- Define them in a Pydantic schema with strict types (`bool`, `int` with bounds, `Literal[...]`).
- Compose the deciding signal in Python using those features.
- Keep the LLM's numeric output (if any) on the trace for comparison only — never as the deciding value.
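The four steps above can be sketched end to end. To stay dependency-free, this uses a stdlib dataclass with manual validation as a stand-in for a Pydantic model. `DocGrade`, its fields, the 0-20 bound, and the keep rule in `decide` are all hypothetical, chosen only to show one strict type of each kind.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

Relevance = Literal["relevant", "ambiguous", "irrelevant"]


@dataclass(frozen=True)
class DocGrade:
    """Hypothetical picker features (steps 1 and 2): a label, a bool, a bounded int."""
    relevance: str       # must be one of the Relevance labels
    cites_source: bool
    overlap_terms: int   # bounded to 0..20 (an assumed limit)

    def __post_init__(self) -> None:
        if self.relevance not in get_args(Relevance):
            raise ValueError(f"bad relevance label: {self.relevance!r}")
        if not isinstance(self.cites_source, bool):
            raise TypeError("cites_source must be a bool")
        if not 0 <= self.overlap_terms <= 20:
            raise ValueError("overlap_terms out of bounds")


def decide(grade: DocGrade, llm_score: Optional[float] = None) -> dict:
    """Step 3: compose the decision from features. Step 4: log llm_score, never read it."""
    keep = grade.relevance == "relevant" or (
        grade.relevance == "ambiguous"
        and grade.cites_source
        and grade.overlap_terms >= 3
    )
    return {"keep": keep, "trace": {"llm_score": llm_score, "features": grade}}


# The same features give the same decision whatever number the LLM attached.
low = decide(DocGrade("ambiguous", True, 5), llm_score=2.0)
high = decide(DocGrade("ambiguous", True, 5), llm_score=9.0)
```

Because `decide` never branches on `llm_score`, the numeric output stays available for comparison on the trace without being able to change the outcome.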
This pattern is architectural, not a hyperparameter. Once you build it in, the architecture is immune to the flat-band pathology for the lifetime of the codebase.