Pipeline & CLI¶

TrialMatchAI is one end-to-end pipeline built from an ordered registry of idempotent stages. Every command is a slice of this pipeline. Because each stage detects and skips work that is already done, a run "just works" from any starting state — finished stages are cheap no-ops, unfinished ones run.

The stages¶

#	Stage	What it does	Idempotency check
1	`prepare`	embed + entity-annotate the trial corpus	per-trial prepared file
2	`concepts`	build the entity-linking concept store	concept table present
3	`index`	build the LanceDB search tables	both tables present
4	`ingest`	import patient inputs into canonical profiles	per-patient profile
5	`expand`	CoT query expansion of patient summaries	`query_expanded` marker
6	`match`	retrieval + reranking + CoT eligibility + ranking	per-patient `ranked_trials.json`
7	`eval`	score results against qrels (benchmark runs)	benchmark-only

The single command¶

trialmatchai pipeline [selection] [options]

Selection — run any subset (the unit of modularity is the stage):

Flag	Meaning	Example
(none)	run every stage, skipping what's done	`trialmatchai pipeline`
`--only`	run exactly these stages	`--only match,eval`
`--from` / `--to`	run a contiguous slice	`--from index --to match`
`--skip`	omit stages (great for ablation)	`--skip expand`
`--force`	redo stages even if done (`all` = everything)	`--force match`

Options: --input (repeatable patient files/dirs), --format, --trials-json-folder, --processed-trials-folder, --processed-criteria-folder, --concepts / --concepts-csv / --synonym-csv, --config.

trialmatchai pipeline --only prepare,index             # build the search index
trialmatchai pipeline --input patient.fhir.json        # ingest + match one patient
trialmatchai pipeline --skip concepts,expand           # leaner run for an ablation
trialmatchai pipeline --force all                       # rebuild everything from scratch

Ablation¶

Stage flags double as ablation knobs — toggle a component and compare:

trialmatchai pipeline --skip expand     # matching without LLM query expansion
trialmatchai pipeline --skip concepts   # without entity-concept linking

Component backends (reranker, CoT, search mode bm25/vector/hybrid) are set in the config; see Architecture.

Presets (the same pipeline, named)¶

These are thin wrappers over the pipeline that add their own setup:

Command	Equivalent slice	Adds
`trialmatchai build`	`--to index`	build manifest; bootstrap-aware prepare
`trialmatchai e2e`	`--from index --to match`	patient ingestion convenience
`trialmatchai trec`	`--from index --to eval` (per track)	official topics + qrels + corpus restriction
`trialmatchai run`	`--only match`	match already-staged profiles
`trialmatchai index`	`--only prepare,index`	—

Every command is idempotent and resumable: re-running continues from the last completed work.

Reports¶

Matching writes a self-contained, offline HTML report (no server, no build step, no CDN). It is emitted automatically at the end of a run — <output_dir>/<patient_id>/report.html per patient and <output_dir>/index.html as a front page across all patients — and can be regenerated from existing results without re-matching:

trialmatchai report --patient 1009     # one patient        -> <output_dir>/1009/report.html
trialmatchai report --all              # unified front page -> <output_dir>/index.html

Auto-emit is gated by reporting.emit_html (default true; disabled for trec benchmark sweeps).

Python API¶

from trialmatchai.config.config_loader import load_config
from trialmatchai.pipeline import StageContext, run_pipeline

ctx = StageContext(config=load_config(), inputs=["patient.txt"])
run_pipeline(ctx, from_stage="index", to_stage="match")

See the API reference for StageContext, Stage, select_stages, and run_pipeline.