Architecture Overview#
MTF runs four analysis phases followed by an optional follow-up chat. Each analysis phase fans out parallel agents, collects their reports, synthesises them in a single debate call, and (where applicable) waits for user approval before proceeding.
Pipeline#
flowchart TD
Input(["π User Input\nphenomenon description + images + toolkit data"])
subgraph GPD ["π§ GPD MCP SERVERS (optional)"]
direction TB
GV["verification\nchecks 5.1β5.19"]
GE["errors\n104 error classes"]
GP["protocols\n47+ domain protocols"]
GC["conventions\n18 subfields"]
GPat["patterns\ncross-session memory"]
GS["skills\ndomain discovery"]
end
subgraph LIT ["β LITERATURE PHASE"]
direction TB
LC["Auto domain classification\n+ lock conventions via GPD subfield_defaults\n+ pre-fetch DOMAIN_PATTERNS"]
L["L1 Β· L2 Β· L3\nN parallel agents\narxiv + Semantic Scholar\n+ GPD: check_error_classes, route_protocol,\nlookup_pattern, add_pattern"]
LD["π Debate\nsynthesis call + dimensional check postscript"]
LS["Plausibility screen\nlimiting_case_check per hypothesis"]
LU{"User approval"}
LC --> L --> LD --> LS --> LU
LU -->|"reject: add feedback"| L
end
subgraph FIT ["β‘ FITTING / QUALITATIVE EVALUATION PHASE"]
direction TB
FitChoice{"--no-fitting?"}
FW["Pre-fetch FITTING_WARNINGS"]
FT["toolkit check"]
F["F1 Β· F2 Β· F3\nM parallel fitting agents\nlmfit + numpy/scipy + GPD tools"]
FC["Phase physics checks β PHYSICS_VERDICT"]
FD["π Debate (fitting)"]
FU{"User approval"}
QE["Q1 Β· Q2 Β· Q3\nN parallel qualitative eval agents\n+ same GPD tools as ReviewerAgent"]
QD["π Debate (qualitative)"]
QU{"User approval"}
FitChoice -->|"fitting enabled (default)"| FW
FW --> FT --> F --> FC --> FD --> FU
FitChoice -->|"--no-fitting"| QE
QE --> QD --> QU
end
subgraph REV ["β’ REVIEW PHASE"]
direction TB
R["R1 Β· R2 Β· R3\nK parallel reviewer agents\n+ GPD: get_checklist, run_check, check_error_classes,\nlookup_pattern, add_pattern"]
P["P1 Β· P2\nN parallel proposal agents\n+ GPD: lookup_pattern, check_error_classes"]
RD["π Review Debate\nphysics-first ranking + dimensional check postscript"]
PD["π Proposal Synthesis\ndeduplicated, priority-ranked measurement list"]
FR["Final Report\nreview verdicts + ## Proposed Measurements"]
R --> RD
P --> PD
RD --> FR
PD --> FR
end
subgraph CHAT ["β£ FOLLOW-UP CHAT (optional)"]
direction TB
CQ{"Follow-up\nquestions?"}
CA["FollowUpChatAgent\nfull memory context\nmulti-turn Q&A loop"]
CQ -->|"yes"| CA
CA -->|"next question"| CA
end
GPD -.->|"tools"| L
GPD -.->|"tools"| F
GPD -.->|"tools"| R
Report(["π Final Report"])
subgraph IMG ["βͺ IMAGE DIGEST"]
direction TB
I["ImageDigestAgent\nClaude vision API\nparallel per image"]
IM["IMAGE_DATA\nin SharedMemory"]
I --> IM
end
Input --> IMG
IMG --> LIT
LIT -->|"approved hypotheses"| FIT
FIT --> REV
REV --> CQ
CQ -->|"no"| Report
CA -->|"exit"| Report
Phase 0: File Digest#
MTFOrchestrator.run() runs this before any analysis phase so that all downstream agents
can access extracted numerical data.
Step-by-step:
ImageDigestAgentspawns oneFileDigestSubagentper file viaasyncio.gather()β all files are digested concurrently.Each
FileDigestSubagentbase64-encodes the file and callsmessages.create()directly (notsdk.query()) with a multimodal content block.Images (PNG, JPG, GIF, WebP): the system prompt instructs the model to extract plot type, axis labels and units, all data series as Python lists of numbers, key quantitative features (peaks, plateaus, slopes, error bars, fit parameters), embedded annotations, and a brief physical interpretation.
PDFs: processed in up to two passes when
config.pdf_enhanced_extraction = True(default).Pass 1 (general digest): the full PDF is sent with
_PDF_SYSTEM_PROMPT, which extracts document metadata, physical system, key equations, experimental methods, all reported numerical values, conclusions, and a Figure Inventory enumerating every figure by page.Pass 2 (figure extraction): the same PDF is sent again with
_FIGURE_EXTRACTION_PROMPT, which iterates page-by-page and extracts each figure individually β type, axes, data series as numerical arrays, quantitative features, and physical significance.Both results are concatenated into a single structured digest. When
pdf_enhanced_extraction = False, only Pass 1 runs (same as the pre-existing behaviour).
Each digest is stored in
SharedMemoryasMemoryKind.IMAGE_DATAwithsource_fileandfilenamemetadata.If more than one file was provided, a second synthesis
messages.create()call combines all individual digests into a unified cross-file analysis (stored as a separateIMAGE_DATAentry withfilename="cross_file_synthesis").
Why messages.create() and not sdk.query(): The agent SDK does not expose multimodal
content blocks. The messages.create() call constructs the content list directly, alternating
an image or document block with a text block in the same user message.
Phase 1: Literature#
Pre-flight (before first fan-out)#
Before the first fan-out, three setup steps run once:
Auto domain classification:
MTFOrchestrator._classify_domains()calls GPDroute_protocolandroute_skillwith the phenomenon description, parses known domain names from the responses, and overwritesconfig.physics_domainsfor the run (ephemeral β no persistence). Falls back to the configured default if no domains are detected. The detected domains (or fallback notice) are stored asDOMAIN_CLASSIFICATIONfor audit. Controlled byconfig.auto_detect_domains(defaultFalse).Convention locking: The phase calls GPD
subfield_defaultsonce per domain inconfig.physics_domainsand stores each result asMemoryKind.CONVENTIONS. Every subsequent agent β across all three phases β sees these locked conventions in its prompt context, preventing silent mismatches (Fourier sign, metric signature, natural-unit choices, etc.) between agents working on the same phenomenon.Domain pattern pre-fetch:
_prefetch_domain_patterns()callslookup_patternwithcategory="convention-pitfall"per domain, storing results asDOMAIN_PATTERNS. These cross-session patterns appear in everyLiteratureAgentprompt context automatically.
Debate loop#
The phase runs up to config.max_debate_rounds iterations:
Fan-out:
NLiteratureAgentinstances are created and allinvestigate()calls run concurrently viaasyncio.gather(). Each agent:Prepends a
SharedMemorycontext block to its prompt (containingUSER_FEEDBACK,IMAGE_DATA,CONVENTIONS, andDOMAIN_PATTERNSentries).Calls
sdk.query()(an agentic streaming loop) with tools: arxiv search, Semantic Scholar, and GPDcheck_error_classes,route_protocol,lookup_pattern,add_pattern.Inside the agentic loop, the model may invoke tools multiple times before producing its final text response.
The system prompt instructs the agent to: (a) call
route_protocolfirst, (b) search both databases, (c) callcheck_error_classesfor each proposed hypothesis, (d) produce a structured report classifying each hypothesis by basis (first-principles / semi-empirical / empirical), verification status, and known failure modes, (e) calladd_patternfor any systematic errors found in a class of papers.The final report is stored as
MemoryKind.LITERATURE.
Debate:
DebateEngine.synthesize(phase="literature")collects all N reports and issues one plainmessages.create()call (not agentic). The synthesis system prompt instructs the model to resolve contradictions and surface the strongest hypotheses. No physics-first ranking criterion is added for the literature phase.Plausibility screen:
_screen_hypothesis_plausibility()extracts candidate hypotheses from the synthesis text and runslimiting_case_checkon each (classical_limit,zero_coupling,large_N) viaasyncio.gather(). Results are shown to the user as[PASS]/[WARN]/[FAIL]badges before the approval gate, and written asPHYSICS_VERDICT. Ifconfig.auto_reject_physics_failures=True, CRITICAL-FAIL hypotheses are removed from the approved list (with a non-empty fallback).User approval: The synthesis and plausibility badges are displayed. If the user approves, hypothesis lines are extracted (lines containing the keywords
hypothesis,proposed,model, ortheory) and stored asMemoryKind.HYPOTHESIS. The phase returns those hypothesis strings to the orchestrator.Rejection: If the user rejects, they are asked for guidance, which is stored as
MemoryKind.USER_FEEDBACK. The loop repeats from step 1 β the new agents will see the feedback in their prompt context.Max-rounds fallback: If
max_debate_roundsis exhausted without explicit approval, the last synthesis is used and the pipeline continues.
Phase 2: Fitting#
Toolkit resolution#
Before any fitting agent runs, a probe FittingAgent is created and asked which toolkit
items it needs for each hypothesis (identify_needed_toolkit_items()). Any item prefixed
with MISSING: in the response triggers an interactive request to the user.
User-provided values are handled on two paths:
Fast path: if
compile(value, '<string>', 'eval')succeeds β i.e. the value is a valid single Python expression β it is evaluated witheval()and registered directly.Slow path: complex input (function definitions, CSV text, code snippets, datasheets) is passed to a
ToolBuilderAgent, which writes and executesexec()-based parsing code to produce structureddata_itemsandmodel_items, then registers them inToolkitRegistry. On failure, the raw string is stored as a fallback.
Pre-dispatch warnings (before fan-out)#
_prefetch_fitting_warnings() runs before any fitting agent starts. For each
(domain, hypothesis) pair it fans out:
lookup_pattern(domain, "sign-error", hypothesis[:200])lookup_pattern(domain, "convergence-issue", hypothesis[:200])check_error_classes(description=hypothesis[:500])
Results are stored as FITTING_WARNINGS and appear in every FittingAgent prompt context
automatically, giving agents advance warning of known pitfalls for that model type.
Fan-out and rate limiting#
Fitting agents are launched under asyncio.Semaphore(config.fitting_semaphore_limit)
(default 6) to prevent API saturation. Two fan-out modes:
fitting_scope="per_hypothesis"(default): spawnMagents for each hypothesis sequentially, collecting all results before moving to synthesis.fitting_scope="all": spawnM Γ N_hypothesesagents simultaneously (all concurrently, bounded only by the semaphore).
Each FittingAgent.fit():
Prepends memory context (
LITERATURE,DEBATE,USER_FEEDBACK,IMAGE_DATA,CONVENTIONS,FITTING_WARNINGS,DOMAIN_PATTERNS) to the prompt.Calls
sdk.query()β the agentic loop calls GPD tools in order:route_protocolβget_protocolβsubfield_defaults.Generates lmfit Python code following the retrieved protocolβs checkpoints.
Pre-exec convention check: calls
convention_checkon the generated code beforeexec(). OnFAIL, the violation is written toPHYSICS_VERDICTand the agent retries once with the violation text in context (controlled byconfig.fitting_convention_checkandconfig.fitting_max_convention_retries).Strips markdown code fences, then passes the code to
run_fitting_code(), whichexec()s it in a namespace pre-seeded withnumpy,lmfit,scipy, and the userβsdatadict fromToolkitRegistry. The code must assign its output toresult.The
resultdict must include:parameters,uncertainties,chi_squared,reduced_chi_squared,assessment,protocol_followed,physical_parameter_ranges, andprotocol_checkpoints_satisfied.The fit output is stored as
MemoryKind.FIT_RESULT.
Phase physics checks (after fan-out)#
After all fitting agents complete, _run_phase_physics_checks() runs checks 5.1
(dimensional consistency) and 5.3 (limiting cases) on each fit report via
asyncio.gather(). Non-empty results are stored as PHYSICS_VERDICT entries with
source="phase_physics_check". These populate the PHYSICS_VERDICT context block that
DebateEngine injects into the synthesis call.
Debate and approval#
All fit reports are passed to DebateEngine.synthesize(phase="fitting"). The synthesis
system prompt adds a physics-first ranking criterion:
Physical correctness takes priority over fit quality.
Physics checks (5.1, 5.2, 5.3, 5.18) pass/fail
Parsimony (fewer free parameters)
First-principles basis
ChiΒ² (tiebreaker only)
The CONVENTIONS and PHYSICS_VERDICT memory entries (now populated by the phase checks)
are appended to the user content block sent to the synthesis call. After the synthesis,
DebateEngine extracts LaTeX/dimensional expressions from the text and appends an objective
dimensional check postscript (stored as both part of DEBATE and as PHYSICS_VERDICT).
The fitting synthesis is shown to the user. If rejected, feedback is stored and the pipeline continues regardless (there is no retry loop in the fitting phase).
No-Fitting Mode (--no-fitting)#
When --no-fitting is passed (or config.fitting_enabled = False), the fitting phase is
replaced by a qualitative evaluation phase. N QualitativeEvaluationAgent instances
run concurrently via asyncio.gather(), receiving the same GPD tools as ReviewerAgent.
Each agent evaluates all approved hypotheses against:
Established physical theory and first-principles arguments
Literature context accumulated in
LITERATUREandDEBATEmemory entriesQuantitative features extracted from user-supplied images (
IMAGE_DATA)
For each hypothesis the agent produces a verdict (SUPPORTED / PLAUSIBLE / SPECULATIVE / REJECTED), the specific numerical data that would be needed to upgrade to a quantitative fit, and the single most decisive confirming or refuting measurement.
Results are synthesized via DebateEngine.synthesize(phase="qualitative"), stored as
QUALITATIVE_EVAL, and a FITTING_SKIPPED flag is written to memory. ReviewerAgent
reads both kinds in its extra_kinds so the review phase adapts its report accordingly.
The qualitative phase runs an approval loop (same as the fitting phase); rejected rounds store user feedback and repeat.
Phase 3: Review#
K ReviewerAgent instances run concurrently via asyncio.gather(). Each agent:
Prepends memory context (
LITERATURE,DEBATE,FIT_RESULT,USER_FEEDBACK,IMAGE_DATA,CONVENTIONS,PHYSICS_VERDICT,INTEGRITY_WARNING) β the broadest context window of any agent type.Calls
sdk.query()with all 8 GPD tools available. The system prompt instructs the agent to:Call
check_error_classesfirst (top-15 error classes for the domain).Call
get_checklistonce per physics domain to obtain check IDs.Run mandatory checks for each fit result:
run_checkwith IDs5.1(dimensional),5.2(symmetry),5.3(limiting cases),5.18(fit-family mismatch), plusdimensional_checkif explicit equations are present.Call
lookup_patternto surface previously recorded errors in the same domain.Call
add_patternfor any confirmed new error pattern, so it persists to future sessions via GPDβs~/.gpd/store.
Produces a verdict for each hypothesis: SUPPORTED / PLAUSIBLE / SPECULATIVE / REJECTED, citing specific check IDs (e.g.
"REJECTED β check 5.1 FAIL: units inconsistent").Stores the verdict report as
MemoryKind.REVIEW.
Second-pass verification loop: When config.reviewer_verification_passes > 1, after the initial fan-out each reviewer is queried a second time with its own first-pass report and asked: βDid you miss anything? Check every claim, equation, parameter range, and citation again.β The updated reports replace the first-pass outputs before synthesis. Default is 1 (single pass, same behaviour as before).
Multi-model diversity: When config.reviewer_models is non-empty (e.g. ['claude-opus-4-6', 'claude-sonnet-4-6']), reviewer agent instances cycle through the list. This provides adversarial diversity β different models tend to catch different errors.
DebateEngine.synthesize(phase="review") collects all K reports, applies the same
physics-first ranking criterion as the fitting phase, and returns the final report string.
There is no user approval gate after the review phase; the report is returned directly to the
caller.
Measurement Proposal Sub-Agents#
N ProposalAgent instances run concurrently with the reviewer agents in a single
asyncio.gather() call. Each agent reads the full accumulated memory context and proposes
a prioritized list of new experiments and measurements that would discriminate between
the competing hypotheses. Proposals specify: observable to measure, expected signal per
hypothesis, discriminating power (HIGH / MEDIUM / LOW), equipment requirements, and
required sensitivity.
DebateEngine.synthesize(phase="proposals") collects all N proposal reports and produces
a deduplicated, priority-ranked list (HIGH discriminating power first) with a single
βBottom lineβ recommendation. The result is stored as MemoryKind.PROPOSALS and appended
to the final report under a ## Proposed Measurements heading.
Both synthesis calls (review verdicts and proposals) complete before the final report is returned to the user.
Phase 4: Follow-up Chat#
After the final report is shown, MTFOrchestrator._run_followup_chat() offers an optional
interactive Q&A session.
Opt-in gate: the user is asked
"Would you like to ask follow-up questions?". Declining skips the phase entirely; the orchestrator returns the final report string unchanged.Single agent: one
FollowUpChatAgentis created. It has no tools β follow-up questions are answered purely from the fullSharedMemorycontext, which at this point contains allLITERATURE,DEBATE,HYPOTHESIS,FIT_RESULT,REVIEW,PROPOSALS,USER_FEEDBACK,IMAGE_DATA,CONVENTIONS,PHYSICS_VERDICT,FITTING_WARNINGS, andQUALITATIVE_EVALentries.The agentβs system prompt includes a pressure resistance paragraph: if the user pushes back on an analysis conclusion, the agent is instructed to review the specific evidence supporting the original conclusion rather than accommodating the userβs preference. Changing a verdict requires new evidence or a logical argument.
Multi-turn loop: each question is sent to
sdk.query()with the full memory context prepended and the accumulated conversation history appended. The history is formatted as an alternatingUser: β¦ / Assistant: β¦dialogue block and grows with each exchange, so the agent can refer back to earlier answers. The loop exits when the user submits an empty line or typesexit/quit.
Why a single agent (not a panel): The reviewer and proposal agents already produced their
specialised verdicts; those are stored in SharedMemory and injected into every follow-up
prompt automatically. A single agent answering from that rich context is faster and produces
more coherent multi-turn replies than re-running a full fan-out + debate cycle per question.
Debate Engine internals#
DebateEngine.synthesize() is always a single plain messages.create() call β never an
agentic loop β keeping synthesis fast and deterministic.
The call constructs its user content block by concatenating:
Full
SharedMemorycontext (all entries).extra_contextstring (typically the phenomenon description or hypothesis list).All
CONVENTIONSentries (if present).All
PHYSICS_VERDICTentries (if present).The numbered list of agent reports.
The system prompt is phase-dependent: for "fitting" and "review" phases, the
four-criterion physics-first ranking paragraph is appended. For "literature" it is omitted.
Anti-consensus instruction (non-proposals phases): For all phases except proposals (i.e. literature, fitting, review, and qualitative), the system prompt includes an instruction to preserve genuine disagreements rather than smoothing them into false consensus. If agents reach contradictory conclusions, both views are presented with the evidential tension explained. This instruction is intentionally not applied to the proposals phase, which explicitly requires deduplication and merging.
Dimensional check postscript (fitting and review phases only): After the messages.create()
call, DebateEngine._append_dimensional_check() extracts LaTeX inline equations ($...$) and
dimensional expressions ([M][L]...) from the synthesis text (up to 5 candidates) and calls
dimensional_check on them. The result is appended to the synthesis as an
--- OBJECTIVE DIMENSIONAL CHECK --- block and also stored as PHYSICS_VERDICT with
source="debate_dimensional_check". This is a pure postscript β no second LLM call β and is
intentionally embedded in the DEBATE entry so downstream agents see it alongside the synthesis.
The synthesis output (including any postscript) is stored as MemoryKind.DEBATE with a phase
metadata tag, making it visible to all subsequent agents via _build_prompt().
Key design decisions#
Shared memory over retrieval. SharedMemory is a plain Python list of MemoryEntry
objects passed by reference. BaseAgent._build_prompt() calls memory.format_context()
to prepend a === SHARED CONTEXT === β¦ === END CONTEXT === block before the task text,
so agents always see prior debate summaries and user feedback without a separate retrieval
step.
Debate is a single API call. Synthesis is one messages.create() call for speed and
predictability. An agentic synthesis loop would be slower and harder to reason about.
Fitting code runs via exec(). The fitting agent generates Python code; run_fitting_code()
executes it in a namespace pre-populated with numpy, lmfit, scipy, and the userβs data
dict. The code must assign its output to result. Markdown fences are stripped before
execution.
Image digestion uses the multimodal API directly. ImageDigestAgent calls
messages.create() with base64-encoded image/PDF content blocks β not sdk.query() β because
the agent SDK does not expose multimodal content blocks.
Concurrency is bounded. Fitting agents are rate-limited by
asyncio.Semaphore(config.fitting_semaphore_limit) (default 6) to prevent API saturation
when fitting_scope="per_hypothesis" spawns N_hypotheses Γ M concurrent agents.
Fitting integrity checks guard against fabrication. run_fitting_code() wraps lmfit.minimize and Model in sentinels; if no real optimizer call is detected post-exec, an INTEGRITY_WARNING is stored in SharedMemory and becomes visible to ReviewerAgent. This directly addresses the risk of LLMs hardcoding plausible-looking result values.
File layout#
mtf/
βββ config.py MTFConfig dataclass
βββ memory.py SharedMemory + MemoryEntry + MemoryKind
βββ debate.py DebateEngine
βββ interface.py HumanInterface ABC + CLIInterface
βββ gui.py StreamlitInterface + Streamlit app
βββ orchestrator.py MTFOrchestrator.run()
βββ cli.py mtf entry point
βββ utils.py Shared helpers β strip_fences()
βββ agents/
β βββ base.py BaseAgent (sdk.query wrapper + _build_prompt)
β βββ image_digest.py ImageDigestAgent + FileDigestSubagent
β βββ literature.py LiteratureAgent
β βββ fitting.py FittingAgent
β βββ qualitative.py QualitativeEvaluationAgent (--no-fitting mode)
β βββ reviewer.py ReviewerAgent
β βββ proposal.py ProposalAgent
β βββ tool_builder.py ToolBuilderAgent
β βββ followup.py FollowUpChatAgent (post-report Q&A)
βββ phases/
β βββ literature_phase.py convention lock + debate loop + approval gate
β βββ fitting_phase.py toolkit resolution + fan-out + debate
β βββ qualitative_phase.py fan-out qualitative eval + debate (--no-fitting mode)
β βββ review_phase.py fan-out + final debate
βββ tools/
β βββ arxiv_search.py sdk.Tool wrapping arxiv.Client
β βββ semantic_search.py sdk.Tool wrapping semanticscholar API
β βββ fitting_tools.py run_fitting_code() β exec-based sandboxed runner
β βββ gpd_mcp.py GPDMCPClient
βββ toolkit/
βββ registry.py ToolkitRegistry