
Pipeline & CLI

TrialMatchAI is one end-to-end pipeline built from an ordered registry of idempotent stages. Every command is a slice of this pipeline. Because each stage detects and skips work that is already done, a run "just works" from any starting state — finished stages are cheap no-ops, unfinished ones run.
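The skip-if-done behaviour can be pictured with a minimal Python sketch. `SketchStage`, `make_stage`, and `run_stages` are hypothetical names used only for illustration, not TrialMatchAI's actual classes, and the "done" state is modelled as a flag in a dict rather than the on-disk checks the real stages use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SketchStage:
    """Hypothetical stand-in for a pipeline stage (not the real Stage class)."""
    name: str
    is_done: Callable[[Dict[str, bool]], bool]  # idempotency check
    run: Callable[[Dict[str, bool]], None]      # performs the stage's work


def make_stage(name: str) -> SketchStage:
    # Toy model: a stage is "done" when its flag is set in the state dict.
    def is_done(state: Dict[str, bool]) -> bool:
        return state.get(name, False)

    def run(state: Dict[str, bool]) -> None:
        state[name] = True

    return SketchStage(name, is_done, run)


def run_stages(stages: List[SketchStage], state: Dict[str, bool]) -> List[str]:
    """Run stages in order, skipping finished ones; return the names that ran."""
    executed = []
    for stage in stages:
        if stage.is_done(state):
            continue  # finished stage: cheap no-op
        stage.run(state)
        executed.append(stage.name)
    return executed


registry = [make_stage(n) for n in ("prepare", "concepts", "index")]
state = {"prepare": True}             # pretend `prepare` finished earlier
first = run_stages(registry, state)   # runs only the unfinished stages
second = run_stages(registry, state)  # nothing left to do
print(first)   # ['concepts', 'index']
print(second)  # []
```

Running the same registry twice shows the resumable property: the second pass does no work because every check reports done.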

The stages

| # | Stage | What it does | Idempotency check |
|---|-------|--------------|-------------------|
| 1 | prepare | embed + entity-annotate the trial corpus | per-trial prepared file |
| 2 | concepts | build the entity-linking concept store | concept table present |
| 3 | index | build the LanceDB search tables | both tables present |
| 4 | ingest | import patient inputs into canonical profiles | per-patient profile |
| 5 | expand | CoT query expansion of patient summaries | query_expanded marker |
| 6 | match | retrieval + reranking + CoT eligibility + ranking | per-patient ranked_trials.json |
| 7 | eval | score results against qrels (benchmark runs) | benchmark-only |
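As an illustration of a per-patient idempotency check, the sketch below treats the match stage as done for a patient once a `ranked_trials.json` file exists. The `<output_dir>/<patient_id>/ranked_trials.json` layout and the helper names `match_is_done` and `pending_patients` are assumptions made for this example; the real check may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def match_is_done(output_dir: Path, patient_id: str) -> bool:
    """Assumed check: matching is done when ranked_trials.json exists."""
    return (output_dir / patient_id / "ranked_trials.json").is_file()


def pending_patients(output_dir: Path, patient_ids: List[str]) -> List[str]:
    """Return the patients that still need matching, in input order."""
    return [p for p in patient_ids if not match_is_done(output_dir, p)]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Simulate a previous run that finished patient 1009 only.
    (out / "1009").mkdir()
    (out / "1009" / "ranked_trials.json").write_text(json.dumps([]))
    todo = pending_patients(out, ["1009", "1010"])

print(todo)  # ['1010']
```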

The single command

trialmatchai pipeline [selection] [options]

Selection — run any subset (the unit of modularity is the stage):

| Flag | Meaning | Example |
|------|---------|---------|
| (none) | run every stage, skipping what's done | trialmatchai pipeline |
| --only | run exactly these stages | --only match,eval |
| --from / --to | run a contiguous slice | --from index --to match |
| --skip | omit stages (great for ablation) | --skip expand |
| --force | redo stages even if done (all = everything) | --force match |

Options: --input (repeatable patient files/dirs), --format, --trials-json-folder, --processed-trials-folder, --processed-criteria-folder, --concepts / --concepts-csv / --synonym-csv, --config.

trialmatchai pipeline --only prepare,index             # build the search index
trialmatchai pipeline --input patient.fhir.json        # ingest + match one patient
trialmatchai pipeline --skip concepts,expand           # leaner run for an ablation
trialmatchai pipeline --force all                       # rebuild everything from scratch
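The selection flags amount to picking a subset of an ordered list. The sketch below is one plausible reading of them; `pick_stages` is a hypothetical helper (not the library's `select_stages`), and how the real CLI combines flags such as `--only` with `--skip` is an assumption here.

```python
from typing import List, Optional, Sequence

# Stage order as listed in the stages table.
STAGES = ["prepare", "concepts", "index", "ingest", "expand", "match", "eval"]


def pick_stages(
    only: Optional[Sequence[str]] = None,
    from_stage: Optional[str] = None,
    to_stage: Optional[str] = None,
    skip: Sequence[str] = (),
) -> List[str]:
    """Hypothetical selection helper; always returns stages in pipeline order."""
    named = [*(only or ()), *skip, *(s for s in (from_stage, to_stage) if s)]
    for name in named:
        if name not in STAGES:
            raise ValueError(f"unknown stage: {name}")
    if only is not None:
        # --only: exactly these stages, but still in pipeline order.
        chosen = [s for s in STAGES if s in only]
    else:
        # --from / --to: a contiguous, inclusive slice.
        start = STAGES.index(from_stage) if from_stage else 0
        stop = STAGES.index(to_stage) + 1 if to_stage else len(STAGES)
        chosen = STAGES[start:stop]
    # --skip: drop stages from whatever was chosen (assumed precedence).
    return [s for s in chosen if s not in skip]


print(pick_stages(only=["match", "eval"]))
print(pick_stages(from_stage="index", to_stage="match"))
print(pick_stages(skip=["concepts", "expand"]))
```

Filtering against `STAGES` rather than trusting the order of the arguments keeps the result in pipeline order even if a caller writes `--only eval,match`.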

Ablation

Stage flags double as ablation knobs — toggle a component and compare:

trialmatchai pipeline --skip expand     # matching without LLM query expansion
trialmatchai pipeline --skip concepts   # without entity-concept linking

Component backends (reranker, CoT, search mode bm25/vector/hybrid) are set in the config; see Architecture.
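An ablation sweep can be scripted by generating one command per variant. The sketch below only builds the argument lists and prints them; it uses no flags beyond `--skip` as documented above, and the variant names and the `ablation_command` helper are illustrative assumptions.

```python
import shlex
from typing import Dict, List

BASE = ["trialmatchai", "pipeline"]

# Variant name -> stages to skip (names chosen for this example).
VARIANTS: Dict[str, List[str]] = {
    "full": [],
    "no_expand": ["expand"],
    "no_concepts": ["concepts"],
    "no_concepts_no_expand": ["concepts", "expand"],
}


def ablation_command(skip: List[str]) -> List[str]:
    """Build the argv for one ablation variant (does not execute it)."""
    return BASE + (["--skip", ",".join(skip)] if skip else [])


commands = {name: ablation_command(skip) for name, skip in VARIANTS.items()}
for name, argv in commands.items():
    print(f"{name:>24}: {shlex.join(argv)}")
```

Because finished stages are skipped, a later variant may need `--force match` or a separate output location to actually re-run matching; that is an inference from the idempotency rules, not documented behaviour.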

Presets (the same pipeline, named)

These are thin wrappers over the pipeline that add their own setup:

| Command | Equivalent slice | Adds |
|---------|------------------|------|
| trialmatchai build | --to index | build manifest; bootstrap-aware prepare |
| trialmatchai e2e | --from index --to match | patient ingestion convenience |
| trialmatchai trec | --from index --to eval (per track) | official topics + qrels + corpus restriction |
| trialmatchai run | --only match | match already-staged profiles |
| trialmatchai index | --only prepare,index | |

Every command is idempotent and resumable: re-running continues from the last completed work.

Reports

Matching writes a self-contained, offline HTML report (no server, no build step, no CDN). It is emitted automatically at the end of a run: <output_dir>/<patient_id>/report.html per patient, plus <output_dir>/index.html as a front page across all patients. Reports can also be regenerated from existing results without re-matching:

trialmatchai report --patient 1009     # one patient        -> <output_dir>/1009/report.html
trialmatchai report --all              # unified front page -> <output_dir>/index.html

Auto-emit is gated by reporting.emit_html (default true; disabled for trec benchmark sweeps).
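The report layout and the `reporting.emit_html` gate can be summarised in a few lines of Python. This is a sketch of the path scheme described above, not the project's code; `report_paths` and its `emit_html` parameter are hypothetical stand-ins for the config lookup.

```python
from pathlib import Path
from typing import Dict, List


def report_paths(
    output_dir: str, patient_ids: List[str], emit_html: bool = True
) -> Dict[str, Path]:
    """Return the report files a run would write, or {} when auto-emit is off."""
    if not emit_html:
        return {}
    root = Path(output_dir)
    # One report per patient: <output_dir>/<patient_id>/report.html
    paths = {pid: root / pid / "report.html" for pid in patient_ids}
    # Front page across all patients: <output_dir>/index.html
    paths["index"] = root / "index.html"
    return paths


paths = report_paths("results", ["1009", "1010"])
disabled = report_paths("results", ["1009"], emit_html=False)
for key, path in paths.items():
    print(key, "->", path.as_posix())
```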

Python API

from trialmatchai.config.config_loader import load_config
from trialmatchai.pipeline import StageContext, run_pipeline

ctx = StageContext(config=load_config(), inputs=["patient.txt"])
run_pipeline(ctx, from_stage="index", to_stage="match")

See the API reference for StageContext, Stage, select_stages, and run_pipeline.