API reference¶
Auto-generated from docstrings. The pipeline is the primary public API; the orchestration stages and evaluation metrics are documented for programmatic use.
Pipeline¶
trialmatchai.pipeline ¶
The single TrialMatchAI pipeline: an ordered registry of idempotent stages.
Every command is a slice of this one pipeline. Each stage wraps an
already-idempotent orchestration function (it internally skips work that is done),
so the driver only decides which stages to run from the user's selection
(--only / --skip / --from / --to) and which to force (--force).
Because each stage is idempotent, running the whole pipeline from any starting state "just works": finished stages are cheap no-ops, unfinished ones run. That is the "one e2e workflow, maximally modular, never redo finished work" contract — a stage is the unit of modularity, and the e2e run is simply "run every stage".
STAGES
module-attribute
¶
STAGES = (
    Stage(
        "prepare",
        _run_prepare,
        "embed + entity-annotate the trial corpus",
    ),
    Stage(
        "concepts",
        _run_concepts,
        "build the entity-linking concept store",
    ),
    Stage(
        "link",
        _run_link,
        "link extracted entities to concept IDs (idempotent)",
    ),
    Stage(
        "index",
        _run_index,
        "build the LanceDB search tables",
    ),
    Stage(
        "ingest",
        _run_ingest,
        "import patient inputs into canonical profiles",
    ),
    Stage(
        "expand",
        _run_expand,
        "CoT query expansion of patient summaries",
    ),
    Stage(
        "match",
        _run_match,
        "retrieval + reranking + CoT eligibility + ranking",
    ),
    Stage(
        "eval",
        _run_eval,
        "score results against qrels (benchmark runs)",
    ),
)
StageContext
dataclass
¶
Everything the stages need, resolved once and threaded through the run.
Source code in src/trialmatchai/pipeline.py
Stage
dataclass
¶
select_stages ¶
Resolve the user's selection into an ordered list of stages to run.
Source code in src/trialmatchai/pipeline.py
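As a rough illustration of how a selection could be resolved against the ordered registry, here is a minimal sketch. It is not the real select_stages: the helper name select_stage_names, its parameters, and the flag semantics (--from/--to as an inclusive slice, --only as a filter, --skip as an exclusion) are assumptions inferred from the flag names alone.

```python
from typing import Iterable, Optional, Sequence

# Stage names in registry order, as listed in STAGES.
STAGE_NAMES = (
    "prepare", "concepts", "link", "index",
    "ingest", "expand", "match", "eval",
)


def select_stage_names(
    names: Sequence[str] = STAGE_NAMES,
    *,
    only: Optional[Iterable[str]] = None,
    skip: Optional[Iterable[str]] = None,
    start: Optional[str] = None,  # assumed meaning of --from (inclusive)
    stop: Optional[str] = None,   # assumed meaning of --to (inclusive)
) -> list[str]:
    """Hypothetical selection: slice by start/stop, then apply only/skip."""
    only_set = set(only or ())
    skip_set = set(skip or ())
    requested = only_set | skip_set | {n for n in (start, stop) if n}
    unknown = requested - set(names)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    lo = names.index(start) if start else 0
    hi = names.index(stop) if stop else len(names) - 1
    if lo > hi:
        raise ValueError("start stage must not come after stop stage")
    selected = list(names[lo : hi + 1])
    if only_set:
        selected = [n for n in selected if n in only_set]
    # Registry order is preserved regardless of the order the user typed.
    return [n for n in selected if n not in skip_set]
```

Because the result always follows registry order, a caller can pass stage names in any order and still get a valid pipeline slice.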
run_pipeline ¶
Run the selected pipeline slice, freeing GPU models once at the end.
Source code in src/trialmatchai/pipeline.py
Orchestration stages¶
trialmatchai.orchestration ¶
Idempotent end-to-end orchestration for TrialMatchAI.
Chains the three pipeline stages — ingest patient inputs, build the search index, run matching — and skips work that is already done:
- ingest: a patient is skipped if its canonical profile already exists.
- index: the stage is skipped if the search tables already exist.
- match: a patient is skipped if it already has a non-empty ranked_trials.json.
Both the general trialmatchai e2e command and the TREC preset are thin
wrappers over these stages, so idempotency behaves identically everywhere.
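The match-stage skip rule above can be sketched as follows. This is only an illustration: the helper names are hypothetical, the <output_dir>/<patient_id>/ layout is taken from the HTML report section, and treating "non-empty" as "parses to a non-empty JSON value" is an assumption.

```python
import json
from pathlib import Path


def match_is_done(patient_dir: Path) -> bool:
    """True if the patient already has a non-empty ranked_trials.json."""
    ranked = patient_dir / "ranked_trials.json"
    if not ranked.is_file():
        return False
    try:
        data = json.loads(ranked.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # An unreadable or truncated file counts as "not done" (assumption).
        return False
    return bool(data)


def patients_to_match(
    output_dir: Path, patient_ids: list[str], force: bool = False
) -> list[str]:
    """Return the patients that still need matching, in input order."""
    if force:
        return list(patient_ids)
    return [pid for pid in patient_ids if not match_is_done(output_dir / pid)]
```

If the returned list is empty, a driver can return early without loading any models, which is the behaviour run_matching describes for a fully completed run.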
ingest_inputs ¶
Import patient inputs (any supported format) into canonical profiles.
Skips a patient whose profile already exists unless force. Returns the
number of profiles available afterwards.
Source code in src/trialmatchai/orchestration.py
expand_queries ¶
Enrich each patient's matching summary via the CoT query expander.
No-op unless query_expansion.enabled. Loads the model once, enriches
every summary, then frees it before the match stage loads its own model.
Idempotent: a summary already marked query_expanded is skipped.
Source code in src/trialmatchai/orchestration.py
build_index ¶
build_index(
config,
*,
processed_trials_folder="data/processed_trials",
processed_criteria_folder="data/processed_criteria",
nct_filter=None,
force=False,
)
Build the LanceDB search tables, optionally restricted to nct_filter.
Skips when both tables already exist unless force. The backend (and thus
the target db path) comes from config['search_backend'].
Source code in src/trialmatchai/orchestration.py
run_matching ¶
Run the matching pipeline with per-patient resume.
When resuming, the expensive model stack is not loaded at all if every patient is already done. The resume state is also invalidated when the search index the matches were produced against has changed (for example, after a corpus rebuild), so stale ranked_trials.json files are not served after a re-index.
Source code in src/trialmatchai/orchestration.py
prepare_corpus ¶
prepare_corpus(
config,
*,
trials_json_folder,
processed_trials_folder,
processed_criteria_folder,
force=False,
log_every=500,
)
Embed + annotate normalized trial JSONs into processed_*; resumable.
Streams one trial at a time (bounded memory), skips trials already prepared so an interrupted build picks up where it left off, and isolates per-trial failures so one bad document cannot abort the whole corpus.
Source code in src/trialmatchai/orchestration.py
build_system ¶
build_system(
config,
*,
trials_json_folder=None,
processed_trials_folder="data/processed_trials",
processed_criteria_folder="data/processed_criteria",
force_prepare=False,
force_reindex=False,
link_concepts=False,
)
Run the setup half (prepare -> link -> index), idempotent, with a manifest.
Each stage is resumable and recorded in .trialmatchai_build.json next to
the processed data, so an interrupted build can be re-run and picks up
where it left off.
Source code in src/trialmatchai/orchestration.py
build_state ¶
build_state(
config,
*,
processed_trials_folder="data/processed_trials",
processed_criteria_folder="data/processed_criteria",
)
Report what the build half has produced — used by build --status.
Source code in src/trialmatchai/orchestration.py
Registry updater¶
trialmatchai.registry.updater ¶
RegistryUpdater ¶
Source code in src/trialmatchai/registry/updater.py
RegistryUpdateConfig
dataclass
¶
Source code in src/trialmatchai/registry/updater.py
RegistryUpdateReport
dataclass
¶
Source code in src/trialmatchai/registry/updater.py
Evaluation metrics¶
trialmatchai.trec.metrics ¶
Ranking-quality metrics for TREC evaluation.
These complement recall@k (the retrieval-side metric in qrels). nDCG here is:
- tie-aware (McSherry & Najork, 2008): trials sharing the same ranking score form a tie group, and each member is given the AVERAGE positional discount over the ranks the group spans (truncated at k). The result is the EXPECTED nDCG over all random orderings of the tied trials, so it is invariant to arbitrary tie-breaking — it rewards only genuinely ordering a more-relevant trial above a less-relevant one.
- condensed: computed over the labeled-and-retrieved trials only, with the IDCG normalized to that same set. It measures the quality of the final ranking of the trials the model actually evaluated, decoupled from recall.
Gain is linear (gain = relevance grade), matching trec_eval's default and the legacy evaluation.
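The tie-aware, linear-gain computation described above can be sketched as follows. The function names and the (score, grade) input shape are hypothetical and do not mirror the signatures of ndcg_at_k or condensed_ndcg; the input is assumed to be already condensed to labeled-and-retrieved trials.

```python
import math
from itertools import groupby
from typing import Sequence


def tie_aware_dcg(scored: Sequence[tuple[float, float]], k: int) -> float:
    """Expected DCG@k for (score, grade) pairs; equal scores form a tie group."""
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    dcg = 0.0
    rank = 1  # 1-based rank at which the next tie group starts
    for _, group in groupby(ordered, key=lambda pair: pair[0]):
        grades = [grade for _, grade in group]
        last = rank + len(grades) - 1
        # Sum the discounts of the ranks the group spans; ranks beyond k add 0.
        discount = sum(
            1.0 / math.log2(r + 1) for r in range(rank, min(last, k) + 1)
        )
        # Each member receives the average discount; gain is the grade itself.
        dcg += sum(grades) * discount / len(grades)
        rank = last + 1
    return dcg


def tie_aware_ndcg(scored: Sequence[tuple[float, float]], k: int) -> float:
    """Normalize by the ideal DCG@k over the same (condensed) set of grades."""
    ideal = sorted((grade for _, grade in scored), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return tie_aware_dcg(scored, k) / idcg if idcg > 0 else 0.0
```

With two trials tied on score, one of grade 2 and one of grade 0, the value at k=1 is 0.5: the relevant trial lands at rank 1 in half of the possible orderings, which is exactly the expectation the average discount encodes.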
ndcg_at_k ¶
Tie-aware nDCG@k. ordered_ids should be the condensed (labeled) list.
Source code in src/trialmatchai/trec/metrics.py
condensed_ndcg ¶
Tie-aware nDCG@k for each cutoff, condensed to labeled-and-retrieved trials.
ranked_ids is the final ranking order; grade_of is the qrels grade for
judged trials. Only trials present in grade_of are kept (condensed).
Source code in src/trialmatchai/trec/metrics.py
precision_at_k ¶
Standard binary P@k over the final ranked list (hard cutoff k).
Source code in src/trialmatchai/trec/metrics.py
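For orientation, binary P@k with a hard cutoff can be sketched like this. The function name and parameters are hypothetical, and two choices are assumptions: the denominator is always k (even when fewer than k trials were ranked), and unjudged trials count as not relevant.

```python
from typing import Mapping, Sequence


def precision_at_k_sketch(
    ranked_ids: Sequence[str],
    grade_of: Mapping[str, int],
    k: int,
    threshold: int = 1,
) -> float:
    """Fraction of the top-k ranked trials whose grade meets the threshold."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(
        1 for trial_id in ranked_ids[:k]
        if grade_of.get(trial_id, 0) >= threshold  # unjudged -> not relevant
    )
    return hits / k  # hard cutoff: divide by k, not by len(ranked_ids[:k])
```

Calling it with threshold=1 and threshold=2 gives the "relevant" and "eligible" variants of P@10 that evaluate reports.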
trialmatchai.trec.qrels ¶
Official TREC relevance judgments (qrels): download, parse, corpus, metrics.
The per-track NCT corpus pool is derived directly from the qrels (the set of
judged trials) — replacing the previously-checked-in Unique_NCT_IDs lists.
Evaluation computes recall@k of the retrieval against the same qrels.
TREC Clinical Trials relevance grades: 0 = not relevant, 1 = excluded (the trial
matches the condition but the patient is excluded), 2 = eligible. By default a
trial counts as relevant at grade >= 1 (matching the legacy recall evaluation);
pass threshold=2 to score eligible-only.
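The effect of the relevance threshold on recall@k can be illustrated with a short sketch. The function name and signature are hypothetical, and returning 0.0 when a query has no relevant trials at the chosen threshold is an assumption.

```python
from typing import Mapping, Sequence


def recall_at_k_sketch(
    retrieved_ids: Sequence[str],
    grade_of: Mapping[str, int],
    k: int,
    threshold: int = 1,
) -> float:
    """Share of the relevant judged trials found in the top-k retrieved list."""
    relevant = {tid for tid, grade in grade_of.items() if grade >= threshold}
    if not relevant:
        return 0.0  # assumption: no relevant trials -> report 0.0
    found = relevant.intersection(retrieved_ids[:k])
    return len(found) / len(relevant)
```

With threshold=1 both excluded (grade 1) and eligible (grade 2) trials count as relevant; with threshold=2 only eligible trials do, so the same retrieval list can score differently under the two settings.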
parse_qrels ¶
Parse a TREC qrels file into {query_id: {nct_id: relevance}}.
Lines are <topic> <iteration> <nct_id> <relevance> (whitespace
separated). The query id is f"{id_prefix}{topic}" to match the imported
topic ids and the per-patient results folders.
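A minimal sketch of this parsing step, operating on text rather than a file, is shown below. The function names are hypothetical (the real parse_qrels signature is not shown here), and raising on malformed lines is an assumption. The second helper illustrates how a judged-trial pool like the one corpus_ncts returns could be derived.

```python
def parse_qrels_text(text: str, id_prefix: str = "") -> dict[str, dict[str, int]]:
    """Parse '<topic> <iteration> <nct_id> <relevance>' lines into nested dicts."""
    qrels: dict[str, dict[str, int]] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(parts)}")
        topic, _iteration, nct_id, relevance = parts
        qrels.setdefault(f"{id_prefix}{topic}", {})[nct_id] = int(relevance)
    return qrels


def judged_pool(qrels: dict[str, dict[str, int]]) -> set[str]:
    """Union of all judged trial ids across queries."""
    pool: set[str] = set()
    for judgments in qrels.values():
        pool.update(judgments)
    return pool
```

For example, with illustrative (not real) ids and a made-up prefix, the text "1 0 NCT00000001 2" parsed with id_prefix="topic-" yields {"topic-1": {"NCT00000001": 2}}.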
Source code in src/trialmatchai/trec/qrels.py
corpus_ncts ¶
The judged-trial pool across all queries (used to restrict the index).
evaluate ¶
Per-query and mean metrics over the patients in results_dir.
Two complementary families:
- recall@k — retrieval quality (first-level candidate list).
- tie-aware nDCG@{5,10,20} + P@10 — ranking quality of the final ranked_trials.json, condensed to labeled-and-retrieved trials. nDCG is order-invariant on ties (McSherry-Najork); P@10 is reported for both "relevant" (grade>=1) and "eligible" (grade==2).
Source code in src/trialmatchai/trec/qrels.py
HTML report¶
trialmatchai.interop.exporters.html_report ¶
Self-contained HTML results report for a matched patient.
Joins ranked_trials.json + the per-trial CoT eligibility evaluations + trial
metadata + the patient matching summary into one offline report.html (no
server, no build step). All dynamic content is embedded as a JSON island and
rendered client-side via safe DOM APIs, so there is no server-side HTML
templating to escape and no templating dependency.
build_report_model ¶
build_report_model(
*,
patient_summary,
ranked,
eligibility_by_id,
meta_by_id,
cot_by_id=None,
generated_at,
run_info=None,
)
Pure join of a patient's result artifacts into a render-ready model.
No I/O — the caller supplies already-loaded data, so this is unit-testable.
Trials keep ranked_trials.json order; rank is the 1-based position.
Source code in src/trialmatchai/interop/exporters/html_report.py
profile_to_model ¶
profile_to_model(
patient_dir,
*,
summary_dir=None,
trial_meta_folders=None,
generated_at=None,
run_info=None,
)
Read a patient's result dir into a render-ready model (no HTML).
patient_dir is <output_dir>/<patient_id>/. Metadata folders are tried
in order (processed_trials, then trials_jsons); ids with no metadata (e.g.
Dutch NL… registry trials) degrade to id + score + verdict only.
Source code in src/trialmatchai/interop/exporters/html_report.py
profile_to_html_report ¶
profile_to_html_report(
patient_dir,
*,
summary_dir=None,
trial_meta_folders=None,
generated_at=None,
run_info=None,
)
Read a patient's result dir and return a self-contained single-patient report.
Source code in src/trialmatchai/interop/exporters/html_report.py
render_unified_html ¶
One self-contained report over many patients: a front page listing every patient that drills into the per-patient view client-side.
Source code in src/trialmatchai/interop/exporters/html_report.py
render_html_report ¶
Embed the model as a tag-safe JSON island in the static template.
Accepts a single-patient model ({"patient", "trials", ...}) or a unified
one ({"patients": [...]}). A single model is wrapped so the template
always reads DATA.patients and a one-patient report skips the front page.
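One common way to make a JSON island tag-safe is to escape the characters that could close the script element, as in the sketch below. It is not the module's implementation: the function names, the element id "report-data", and the exact wrapped shape ({"patients": [model]}) are assumptions.

```python
import json


def wrap_model(model: dict) -> dict:
    """Wrap a single-patient model so consumers always read model['patients']."""
    return model if "patients" in model else {"patients": [model]}


def to_json_island(model: dict) -> str:
    """Serialize the model into a script element that cannot be closed early."""
    payload = json.dumps(wrap_model(model), ensure_ascii=False)
    # Replace characters with JSON unicode escapes; JSON.parse decodes them
    # back, but the HTML parser never sees a literal '<', '>' or '&'.
    payload = (
        payload.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )
    return f'<script id="report-data" type="application/json">{payload}</script>'
```

On the client, such an island would typically be read with JSON.parse on the element's text content and written into the page through DOM APIs such as textContent, so free text from trial records is never interpreted as markup.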