Agents#
All agents except ImageDigestAgent / FileDigestSubagent extend BaseAgent, which wraps
sdk.query() and prepends a SharedMemory context block before every call.
BaseAgent#
BaseAgent (mtf/agents/base.py) is the foundation for all agentic calls.
_build_prompt(task, extra_kinds)
Calls memory.format_context(*extra_kinds) to produce:
=== SHARED CONTEXT ===
[USER_FEEDBACK] …
[IMAGE_DATA] …
[CONVENTIONS] …
=== END CONTEXT ===
This block is prepended to the task string before being sent to sdk.query(). Each
concrete agent specifies which MemoryKind values it needs via extra_kinds — agents
only see the memory entries relevant to their role.
After the context block and task string, _build_prompt() appends up to two safety suffixes:

- `_HONESTY_REMINDER`: always appended; instructs the agent not to use shortcut phrases like ‘this becomes’ or ‘for consistency’ to skip steps, and not to claim verification unless it was explicitly performed.
- `_CONVENTION_REMINDER`: appended only when `CONVENTIONS` entries are present in memory; reminds the agent that physics conventions are locked for this run and must not revert to textbook defaults.
_query(task, extra_kinds)
Builds the prompt, then iterates over sdk.query() chunks collecting text. The agentic
loop inside sdk.query() handles multi-turn tool use automatically; _query() collects
only final text chunks.
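As a toy illustration of this assembly (the class and function bodies below are simplified stand-ins, not the real mtf/agents/base.py API), the context block and task might be combined like this:

```python
from enum import Enum

class MemoryKind(Enum):
    USER_FEEDBACK = "USER_FEEDBACK"
    IMAGE_DATA = "IMAGE_DATA"
    CONVENTIONS = "CONVENTIONS"

class SharedMemory:
    """Toy stand-in: maps a MemoryKind to a list of text entries."""
    def __init__(self):
        self._entries = {}

    def add(self, kind, text):
        self._entries.setdefault(kind, []).append(text)

    def format_context(self, *kinds):
        # Produce the fenced context block shown above, one line per entry.
        lines = ["=== SHARED CONTEXT ==="]
        for kind in kinds:
            for entry in self._entries.get(kind, []):
                lines.append(f"[{kind.value}] {entry}")
        lines.append("=== END CONTEXT ===")
        return "\n".join(lines)

def build_prompt(memory, task, extra_kinds):
    # Context block first, then the task string, mirroring _build_prompt().
    return memory.format_context(*extra_kinds) + "\n\n" + task
```

Each concrete agent would pass only the `extra_kinds` relevant to its role, so the block stays small.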
FileDigestSubagent#
| Field | Value |
|---|---|
| Used by | `ImageDigestAgent` |
| API | `messages.create()` |
| Memory written | None (results returned to `ImageDigestAgent`) |
A stateless, leaf-level worker that digests exactly one file. It does not touch
SharedMemory — the coordinating ImageDigestAgent handles storage.
Processing:

- Images (PNG, JPG, GIF, WebP): sends a content block of type `"image"` with base64 source data alongside a text prompt. The system prompt instructs extraction of: plot type; all axis labels, units, and scales; every data series as a Python list of numbers; key quantitative features (peaks, plateaus, slopes, error bars, fit parameters); embedded annotations; and a brief physical interpretation.
- PDFs: sends a content block of type `"document"` with base64 PDF data.
  - Standard pass (`_PDF_SYSTEM_PROMPT`): extracts document type, title, authors, physical system, key equations (reproduced symbolically), experimental methods and parameters, all reported numerical values with units, conclusions, and a Figure Inventory listing every figure by page number and caption.
  - Figure-extraction pass (`_FIGURE_EXTRACTION_PROMPT`, enabled when `config.pdf_enhanced_extraction = True`): a second API call with the same document block, using a dedicated prompt that iterates page by page and extracts each figure individually: type; axes (labels, units, scale, range); every data series as numerical arrays; key quantitative features; and physical significance. The two pass results are combined into a single sectioned digest: `## General Document Digest` followed by `## Figure-by-Figure Extraction`.
The MIME type is detected via mimetypes.guess_type(); unrecognised formats default to
image/png.
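The detection step can be sketched as follows (the helper name is hypothetical; the fallback behaviour is as described above):

```python
import mimetypes

def detect_media_type(path: str) -> str:
    # Unrecognised extensions fall back to image/png, as FileDigestSubagent does.
    media_type, _ = mimetypes.guess_type(path)
    return media_type or "image/png"
```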
ImageDigestAgent#
| Field | Value |
|---|---|
| Phase | ⓪ Pre-processing |
| API | `messages.create()` |
| Memory written | `IMAGE_DATA` |
Coordinates parallel file digestion and optional cross-file synthesis.
digest_all(file_paths) — called by the orchestrator:
1. Spawns one `FileDigestSubagent` per file, all running concurrently via `asyncio.gather()`.
2. Stores each digest in `SharedMemory` as `IMAGE_DATA` with `source_file` and `filename` metadata.
3. If more than one file was provided, issues an additional `messages.create()` call (the synthesis call) that receives all individual digests as sections of one user message and produces a unified cross-file analysis. This synthesis is stored as a separate `IMAGE_DATA` entry with `filename="cross_file_synthesis"`.
The synthesis system prompt asks the model to: summarise the combined experiment, consolidate all numerical data (identifying shared axes, flagging contradictions), describe physical connections and patterns across files, produce a unified list of key quantitative features, and highlight open questions or anomalies.
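A minimal sketch of the fan-out, with a dummy coroutine standing in for the real FileDigestSubagent call and a plain dict standing in for SharedMemory (all names here are illustrative):

```python
import asyncio

async def digest_one(path: str) -> str:
    # Stand-in for FileDigestSubagent; the real call hits the messages API.
    await asyncio.sleep(0)
    return f"digest of {path}"

async def digest_all(file_paths):
    # One subagent per file, all launched concurrently via asyncio.gather().
    digests = await asyncio.gather(*(digest_one(p) for p in file_paths))
    store = dict(zip(file_paths, digests))  # stand-in for IMAGE_DATA entries
    if len(file_paths) > 1:
        # Stand-in for the synthesis call that sees all digests at once.
        store["cross_file_synthesis"] = " | ".join(digests)
    return store

result = asyncio.run(digest_all(["a.png", "b.pdf"]))
```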
Why messages.create() and not sdk.query(): The agent SDK does not expose
multimodal content blocks. The Anthropic messages API is called directly so that
image/document blocks can be placed in the content list alongside text blocks.
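A sketch of such a content list, following the base64 block shapes of the Anthropic messages API (the helper function and prompt text are illustrative):

```python
import base64

def build_image_content(image_bytes: bytes, media_type: str, prompt: str) -> list:
    # Image block first, then the text prompt, in a single user message.
    return [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.b64encode(image_bytes).decode("ascii"),
            },
        },
        {"type": "text", "text": prompt},
    ]

content = build_image_content(b"\x89PNG...", "image/png",
                              "Describe every data series.")
```

PDF inputs use the same shape with `"type": "document"` and `"media_type": "application/pdf"`.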
LiteratureAgent#
| Field | Value |
|---|---|
| Phase | ① Literature |
| API | `sdk.query()` |
| Tools | arxiv search, Semantic Scholar, GPD: `route_protocol`, `check_error_classes`, `add_pattern` |
| Memory context read | |
| Memory written | `LITERATURE` |
N instances run concurrently in each debate round (asyncio.gather()).
Accepts an optional config: MTFConfig parameter. When config.citation_verification = True (default), the task prompt includes a citation re-verification step capped at config.citation_verification_max (default 10) citations.
Agentic loop (inside sdk.query()):
The system prompt instructs a fixed tool-call order:
1. Call `route_protocol` with a description of the phenomenon to identify what computation methodology the relevant papers should follow.
2. Search arxiv and Semantic Scholar thoroughly, prioritising recent, highly-cited work.
3. For each proposed hypothesis, call `check_error_classes` to get the top-15 most likely physics error classes; error-prone approaches are flagged in the report.
4. If systematic errors are found in a class of papers (convention pitfalls, sign errors, missing factors), call `add_pattern` to record them in the cross-session pattern store.
5. Re-verify up to `config.citation_verification_max` of the most important citations by calling the search tool again with the exact paper title to cross-check author names, year, and venue. Unverified citations are flagged `[UNVERIFIED: <reason>]`.
The agent also receives DOMAIN_PATTERNS in context — pre-fetched pitfall patterns for the
physics domain, written to memory before the fan-out.
Report structure produced:
Summary of the phenomenon
Most relevant papers with citations
Hypotheses ranked by plausibility, each classified by:
Basis: first-principles / semi-empirical / purely empirical
Verification status: experimentally confirmed / theoretical prediction / disputed
Known failure modes (from
check_error_classes)
Key equations or models from the literature
Error-prone aspects per hypothesis
The report is stored as LITERATURE and returned to the phase for debate.
FittingAgent#
| Field | Value |
|---|---|
| Phase | ② Fitting |
| API | `sdk.query()` |
| Tools | GPD: `route_protocol`, `get_protocol`, `subfield_defaults`, `add_pattern` |
| Memory context read | |
| Memory written | `FIT_RESULT` |
M instances run per hypothesis, all rate-limited by asyncio.Semaphore.
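The semaphore-bounded fan-out might look like this (names are illustrative; the sleep stands in for the real agentic fit call):

```python
import asyncio

async def fit_one(hypothesis: str, sem: asyncio.Semaphore) -> str:
    # Only `limit` fits run at once; the rest queue on the semaphore.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real sdk.query() loop
        return f"fit result for {hypothesis}"

async def fit_all(hypotheses, limit=3):
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fit_one(h, sem) for h in hypotheses))

results = asyncio.run(fit_all(["H1", "H2", "H3", "H4"]))
```

`asyncio.gather()` preserves input order, so results line up with the hypothesis list even though completion order may differ.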
identify_needed_toolkit_items(hypothesis) — probe call before the main fan-out:
Asks the agent which data items and model functions are needed. Items prefixed with
MISSING: in the response trigger an interactive toolkit-resolution loop in the phase.
This probe also reads FITTING_WARNINGS and DOMAIN_PATTERNS from context.
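A sketch of how the phase might separate available items from `MISSING:` ones (the response format and helper name are assumptions, not the real parser):

```python
def split_probe_response(text: str):
    """Separate available toolkit items from ones flagged MISSING: (format assumed)."""
    available, missing = [], []
    for line in text.splitlines():
        item = line.strip().lstrip("-* ").strip()
        if not item:
            continue
        if item.startswith("MISSING:"):
            # These trigger the interactive toolkit-resolution loop.
            missing.append(item[len("MISSING:"):].strip())
        else:
            available.append(item)
    return available, missing

available, missing = split_probe_response(
    "- temperature_series\n- MISSING: lattice_constant\n- field_model"
)
```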
fit(hypothesis) — agentic loop:
The system prompt instructs:
1. Call `route_protocol` with a description of what is being fit.
2. Call `get_protocol` with the returned protocol name to get the full step-by-step methodology with mandatory checkpoints, used as a blueprint for the fitting code.
3. Call `subfield_defaults` with the relevant subfield to get canonical conventions (sign, Fourier, natural units, gauge) and embed them in the code.
4. Write lmfit Python code following the protocol’s checkpoints.
5. If the fit fails to converge or produces unphysical parameters, call `add_pattern` to record the convergence issue in the cross-session pattern store.
6. Anti-fabrication rule: the result dict must be populated directly from the lmfit `MinimizerResult`; hardcoding result values is explicitly prohibited.
The agent receives FITTING_WARNINGS and DOMAIN_PATTERNS in context — pre-fetched
pitfall warnings for the specific hypothesis/domain combination, written before the fan-out.
Pre-exec convention check (phase-level, not agent-level):
After the agentic loop generates code, fit() calls convention_check on the generated
code before exec(). If the check returns FAIL, the violation is written to
PHYSICS_VERDICT and the agent retries once with the violation text in context
(controlled by config.fitting_convention_check and config.fitting_max_convention_retries).
The generated code is stripped of markdown fences, then executed by run_fitting_code()
via exec() in a namespace seeded with numpy, lmfit, scipy, and the user’s data
dict from ToolkitRegistry. The code must assign its output to a variable named result.
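A toy version of this exec contract, with a simplified seed namespace in place of the real numpy/lmfit/scipy one (the helper name and the generated snippet are illustrative):

```python
def run_generated_code(code: str, seed: dict) -> dict:
    # Strip markdown fences, then exec in a seeded namespace;
    # the code must leave its output in a variable named `result`.
    body = code.strip()
    if body.startswith("`"):
        body = "\n".join(body.splitlines()[1:-1])  # drop opening/closing fence lines
    namespace = dict(seed)
    exec(body, namespace)
    if "result" not in namespace:
        raise RuntimeError("generated code did not assign `result`")
    return namespace["result"]

FENCE = "`" * 3  # avoids literal backtick fences in this example's own source
generated = (
    f"{FENCE}python\n"
    "result = {'chi2': sum((y - 2 * x) ** 2 for x, y in data)}\n"
    f"{FENCE}"
)
fit_result = run_generated_code(generated, {"data": [(1.0, 2.1), (2.0, 3.9)]})
```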
Post-exec integrity checks (when config.fitting_result_integrity_check = True, default): run_fitting_code() wraps lmfit.minimize and Model in sentinels to detect whether a real optimizer call was made. After exec(), _validate_result() checks: optimizer was called (warns if not — possible hardcoded result), chi² ≥ 0 (negative is physically impossible), parameters dict non-empty. Any warnings are stored as MemoryKind.INTEGRITY_WARNING with source='fitting_integrity_check'.
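The sentinel-and-validation idea can be sketched as follows (class and function names are hypothetical; the real checks wrap `lmfit.minimize` and `Model`):

```python
class OptimizerSentinel:
    """Wraps an optimizer entry point to record whether it was actually called."""
    def __init__(self, fn):
        self._fn = fn
        self.called = False

    def __call__(self, *args, **kwargs):
        self.called = True
        return self._fn(*args, **kwargs)

def validate_result(result: dict, optimizer_called: bool) -> list:
    # Mirrors the three checks described above; returns warning strings.
    warnings = []
    if not optimizer_called:
        warnings.append("optimizer never called: possible hardcoded result")
    if result.get("chi2", 0.0) < 0:
        warnings.append("negative chi2 is physically impossible")
    if not result.get("params"):
        warnings.append("empty parameters dict")
    return warnings

minimize = OptimizerSentinel(lambda: {"chi2": 1.2, "params": {"a": 3.0}})
fit = minimize()
warnings = validate_result(fit, minimize.called)
```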
The result dict must contain:
| Key | Meaning |
|---|---|
| | Best-fit parameter values |
| | Parameter uncertainties |
| | χ² of the fit |
| | Reduced χ² |
| | Narrative quality assessment |
| | Name of the GPD protocol retrieved |
| | Map of parameter → whether it falls within physical bounds |
| | List of checkpoint names that passed |
The fit output and hypothesis text are stored as FIT_RESULT.
QualitativeEvaluationAgent#
| Field | Value |
|---|---|
| Phase | ② Qualitative Evaluation (when `--no-fitting`) |
| API | `sdk.query()` |
| Tools | GPD: |
| Memory context read | |
| Memory written | `QUALITATIVE_EVAL` |
Used instead of FittingAgent when the pipeline runs with --no-fitting. N instances run
concurrently. Each evaluates all hypotheses against established theory, literature context,
and image-extracted data — without numerical fitting. For each hypothesis, the agent produces:
a theoretical plausibility assessment, expected observational signatures, the specific data
that would be needed to upgrade the assessment to a quantitative fit, a verdict
(SUPPORTED / PLAUSIBLE / SPECULATIVE / REJECTED), and the single most decisive measurement
that would confirm or refute it. Results are synthesized via
DebateEngine.synthesize(phase="qualitative") and stored as QUALITATIVE_EVAL. The
ReviewerAgent reads QUALITATIVE_EVAL in its extra_kinds so the review phase adapts its
tone accordingly.
ReviewerAgent#
| Field | Value |
|---|---|
| Phase | ③ Review |
| API | `sdk.query()` |
| Tools | GPD: `check_error_classes`, `get_checklist`, `run_check`, `dimensional_check`, `lookup_pattern`, `add_pattern` |
| Memory context read | everything accumulated across all prior phases |
| Memory written | `REVIEW` |
K instances run concurrently. ReviewerAgent reads the widest memory context of any
agent type — it sees everything accumulated across all prior phases.
Agentic loop (inside sdk.query()):
The system prompt mandates the following sequence:
1. `check_error_classes`: identify the top-15 most relevant error classes to watch for.
2. `get_checklist`: fetch the domain-specific checklist; call it once per physics domain and merge the lists if the phenomenon spans multiple domains.
3. For each fit result, run mandatory checks:
   - `run_check("5.1", …)`: dimensional consistency
   - `run_check("5.2", …)`: symmetry requirements
   - `run_check("5.3", …)`: limiting cases (does the model recover known limits?)
   - `run_check("5.18", …)`: fit-family mismatch (is the model family appropriate?)
   - `dimensional_check`: if explicit equations appear in the fit results
4. `lookup_pattern`: surface previously recorded errors in the same domain/category.
5. `add_pattern`: record any confirmed new error for future cross-session use.
Verdict format:
Each hypothesis receives exactly one verdict label:
SUPPORTED / PLAUSIBLE / SPECULATIVE / REJECTED
with the relevant check IDs cited, e.g.:
REJECTED — check 5.1 FAIL: units of σ inconsistent with RHS
Hypotheses are ranked by: (1) physics check results, (2) parsimony, (3) first-principles basis, (4) chi² last — mirroring the ranking criterion in the debate synthesis.
The review report is stored as REVIEW.
Exhaustive review requirement: The system prompt instructs the agent to re-read its entire review after completing the main steps and enumerate all issues found — not just the most prominent one — under an ‘Additional concerns:’ section.
ProposalAgent#
| Field | Value |
|---|---|
| Phase | ③ Review (parallel with ReviewerAgent) |
| API | `sdk.query()` |
| Tools | GPD: |
| Memory context read | |
| Memory written | |
Runs concurrently with ReviewerAgent instances inside the review phase. Proposes a prioritized set of new measurements and experiments that would best discriminate between competing hypotheses. Each proposal includes: observable to measure, expected signal per hypothesis, discriminating power (HIGH / MEDIUM / LOW), equipment requirements, and required sensitivity. A “Bottom line” recommendation names the single most cost-effective measurement. Results are synthesized via DebateEngine.synthesize(phase="proposals") into a deduplicated, ranked proposals list that is appended to the final report as ## Proposed Measurements.
FollowUpChatAgent#
| Field | Value |
|---|---|
| Phase | ④ Follow-up Chat (post-report, optional) |
| API | `sdk.query()` |
| Memory context read | full SharedMemory context |
| Memory written | None |
Created by MTFOrchestrator._run_followup_chat() after the final report is shown. A single
instance handles the entire Q&A session; no tools are provided because the full SharedMemory
context already contains all analysis results.
Multi-turn memory: FollowUpChatAgent maintains a local _history list. Each chat()
call prepends the accumulated User: … / Assistant: … dialogue to the task string before
calling _query(), giving the agent conversational memory across turns despite sdk.query()
being stateless per call.
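A minimal sketch of this history mechanism (the class is a stand-in for the agent, not its real interface):

```python
class ChatHistory:
    """Accumulates User/Assistant turns and prepends them to each new task."""
    def __init__(self):
        self._history = []  # mirrors FollowUpChatAgent._history

    def build_task(self, question: str) -> str:
        # Prepend the full dialogue so each stateless call sees prior turns.
        turns = [f"User: {u}\nAssistant: {a}" for u, a in self._history]
        return "\n".join(turns + [f"User: {question}"])

    def record(self, question: str, answer: str) -> None:
        self._history.append((question, answer))

chat = ChatHistory()
chat.record("Which hypothesis won?", "H2, with reduced chi2 of 1.1.")
task = chat.build_task("Why not H1?")
```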
Pressure resistance: The system prompt includes an explicit instruction that changing a position requires new evidence or a logical argument — user insistence alone is not sufficient. The agent is directed to cite specific results, fit parameters, or reviewer verdicts when defending a conclusion, and is permitted to disagree with the user.
Loop behaviour: managed by the orchestrator — questions are read via interface.ask(),
responses are displayed via interface.show(), and the loop exits on empty input or
exit / quit.
ToolBuilderAgent#
| Field | Value |
|---|---|
| Phase | ② Fitting (on demand) |
| API | `sdk.query()` |
| Memory written | |
Invoked only when a fitting agent identifies a required toolkit item that the user supplies
in a complex form (multi-line function definition, CSV text, datasheet). The agent writes
exec()-based parsing code to convert the raw input into structured data_items and
model_items dicts, then registers them in ToolkitRegistry.
If parsing fails, the raw string is stored as a fallback and the phase reports the error.
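A sketch of what such parsing code might look like for CSV input, including the raw-string fallback (the helper name and item shapes are assumptions, not the real ToolBuilderAgent output):

```python
import csv
import io

def parse_csv_to_data_item(name: str, raw: str) -> dict:
    """Convert CSV text into a column-oriented data item; fall back to the raw string."""
    try:
        rows = list(csv.reader(io.StringIO(raw.strip())))
        header, *body = rows
        columns = {h: [float(r[i]) for r in body] for i, h in enumerate(header)}
        return {name: columns}
    except (ValueError, IndexError):
        # Parsing failed: store the raw string so the phase can report the error.
        return {name: raw}

item = parse_csv_to_data_item("iv_curve", "V,I\n0.1,0.02\n0.2,0.05")
```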