API reference

Auto-generated from docstrings. The pipeline is the primary public API; the orchestration stages and evaluation metrics are documented for programmatic use.

Pipeline

trialmatchai.pipeline

The single TrialMatchAI pipeline: an ordered registry of idempotent stages.

Every command is a slice of this one pipeline. Each stage wraps an already-idempotent orchestration function (it internally skips work that is done), so the driver only decides which stages to run from the user's selection (--only / --skip / --from / --to) and which to force (--force).

Because each stage is idempotent, running the whole pipeline from any starting state works: finished stages are cheap no-ops and unfinished ones run. This is the "one e2e workflow, maximally modular, never redo finished work" contract: a stage is the unit of modularity, and the e2e run is simply "run every stage".

STAGES `module-attribute`

STAGES = (
    Stage(
        "prepare",
        _run_prepare,
        "embed + entity-annotate the trial corpus",
    ),
    Stage(
        "concepts",
        _run_concepts,
        "build the entity-linking concept store",
    ),
    Stage(
        "link",
        _run_link,
        "link extracted entities to concept IDs (idempotent)",
    ),
    Stage(
        "index",
        _run_index,
        "build the LanceDB search tables",
    ),
    Stage(
        "ingest",
        _run_ingest,
        "import patient inputs into canonical profiles",
    ),
    Stage(
        "expand",
        _run_expand,
        "CoT query expansion of patient summaries",
    ),
    Stage(
        "match",
        _run_match,
        "retrieval + reranking + CoT eligibility + ranking",
    ),
    Stage(
        "eval",
        _run_eval,
        "score results against qrels (benchmark runs)",
    ),
)

StageContext `dataclass`

Everything the stages need, resolved once and threaded through the run.

Source code in src/trialmatchai/pipeline.py

@dataclass
class StageContext:
    """Everything the stages need, resolved once and threaded through the run."""

    config: dict[str, Any]
    trials_json_folder: Path | None = None
    processed_trials_folder: Path = Path("data/processed_trials")
    processed_criteria_folder: Path = Path("data/processed_criteria")
    inputs: list[str] = field(default_factory=list)
    input_format: str = "auto"
    with_entities: bool = True
    nct_filter: set[str] | None = None
    concepts: str | None = None  # "open" -> build the open concept store
    concept_csv: str | None = None
    synonym_csv: str | None = None
    qrels: dict | None = None  # provided by the TREC preset -> enables eval
    results_dir: Path | None = None
    force: set[str] = field(default_factory=set)

    def forced(self, name: str) -> bool:
        return name in self.force or "all" in self.force
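The `forced` check above can be exercised in isolation. The following is a minimal sketch, assuming a trimmed-down stand-in class (`ForceOnlyContext` is hypothetical and carries only the `force` field); it shows that a stage counts as forced either when it is named explicitly or when the wildcard `"all"` is present.

```python
from dataclasses import dataclass, field


@dataclass
class ForceOnlyContext:
    """Hypothetical stand-in holding only the `force` field of StageContext."""

    force: set[str] = field(default_factory=set)

    def forced(self, name: str) -> bool:
        # Same rule as StageContext.forced: explicit name or the "all" wildcard.
        return name in self.force or "all" in self.force


# Forcing a single stage leaves the others untouched.
ctx = ForceOnlyContext(force={"index"})
only_index = [ctx.forced("index"), ctx.forced("match")]

# The "all" wildcard forces every stage name.
ctx_all = ForceOnlyContext(force={"all"})
everything = [ctx_all.forced("index"), ctx_all.forced("match")]

print(only_index)  # [True, False]
print(everything)  # [True, True]
```

With the default empty set, no stage is forced, so each stage falls back to its own skip logic.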

Stage `dataclass`

Source code in src/trialmatchai/pipeline.py

@dataclass(frozen=True)
class Stage:
    name: str
    run: Callable[[StageContext], None]
    help: str

select_stages

select_stages(
    *, only=None, skip=(), from_stage=None, to_stage=None
)

Resolve the user's selection into an ordered list of stages to run.

Source code in src/trialmatchai/pipeline.py

def select_stages(
    *,
    only: Sequence[str] | None = None,
    skip: Sequence[str] = (),
    from_stage: str | None = None,
    to_stage: str | None = None,
) -> list[Stage]:
    """Resolve the user's selection into an ordered list of stages to run."""
    if only:
        _validate(only)
        chosen = set(only)
        return [s for s in STAGES if s.name in chosen]

    for endpoint in (from_stage, to_stage):
        if endpoint is not None:
            _validate([endpoint])
    _validate(skip)

    start = STAGE_NAMES.index(from_stage) if from_stage else 0
    end = STAGE_NAMES.index(to_stage) + 1 if to_stage else len(STAGES)
    if start > end - 1:
        raise ValueError(f"--from {from_stage} is after --to {to_stage}")
    skipped = set(skip)
    return [s for s in STAGES[start:end] if s.name not in skipped]
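To make the selection rules concrete, here is a name-only sketch of the same logic that runs without the package installed. `select_names` is a hypothetical helper, validation is omitted, and `STAGE_NAMES` is assumed to be the stage names in registry order (its definition is not shown in the listing above).

```python
from __future__ import annotations

from collections.abc import Sequence

# Assumption: stage names in the same order as the STAGES registry.
STAGE_NAMES = ["prepare", "concepts", "link", "index", "ingest", "expand", "match", "eval"]


def select_names(
    *,
    only: Sequence[str] | None = None,
    skip: Sequence[str] = (),
    from_stage: str | None = None,
    to_stage: str | None = None,
) -> list[str]:
    """Name-only mirror of select_stages (hypothetical, no validation)."""
    if only:
        # `only` returns early; registry order is kept regardless of input order.
        chosen = set(only)
        return [n for n in STAGE_NAMES if n in chosen]
    start = STAGE_NAMES.index(from_stage) if from_stage else 0
    end = STAGE_NAMES.index(to_stage) + 1 if to_stage else len(STAGE_NAMES)
    if start > end - 1:
        raise ValueError(f"--from {from_stage} is after --to {to_stage}")
    skipped = set(skip)
    return [n for n in STAGE_NAMES[start:end] if n not in skipped]


by_only = select_names(only=["match", "ingest"])
by_range = select_names(from_stage="link", to_stage="expand", skip=["index"])

print(by_only)   # ['ingest', 'match']
print(by_range)  # ['link', 'ingest', 'expand']
```

Two properties follow from the listing: the `to_stage` endpoint is inclusive, and `only` returns before `skip` or the range endpoints are consulted, so combining `only` with the other selectors does not change the result.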

run_pipeline

run_pipeline(
    ctx,
    *,
    only=None,
    skip=(),
    from_stage=None,
    to_stage=None,
)

Run the selected pipeline slice, freeing GPU models once at the end.

Source code in src/trialmatchai/pipeline.py

def run_pipeline(
    ctx: StageContext,
    *,
    only: Sequence[str] | None = None,
    skip: Sequence[str] = (),
    from_stage: str | None = None,
    to_stage: str | None = None,
) -> int:
    """Run the selected pipeline slice, freeing GPU models once at the end."""
    stages = select_stages(only=only, skip=skip, from_stage=from_stage, to_stage=to_stage)
    if not stages:
        logger.warning("No stages selected; nothing to do.")
        return 0
    logger.info("Pipeline: %s", " -> ".join(s.name for s in stages))
    try:
        for stage in stages:
            logger.info("================ stage: %s ================", stage.name)
            stage.run(ctx)
    finally:
        from trialmatchai.orchestration import free_models

        free_models()
    logger.info("Pipeline complete: %s", " -> ".join(s.name for s in stages))
    return 0
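The `try`/`finally` in `run_pipeline` means cleanup runs even when a stage raises, and the exception still reaches the caller. The sketch below illustrates that control flow with hypothetical stand-ins; `free_models` here only records that it was called and is not the library function.

```python
from typing import Callable

events: list[str] = []


def free_models() -> None:
    # Hypothetical stand-in for trialmatchai.orchestration.free_models.
    events.append("free_models")


def ok_stage() -> None:
    events.append("ok")


def failing_stage() -> None:
    events.append("failing")
    raise RuntimeError("stage failed")


def never_reached() -> None:
    events.append("never")


def run(stages: list[Callable[[], None]]) -> int:
    # Same shape as run_pipeline: run stages in order, always clean up once.
    try:
        for stage in stages:
            stage()
    finally:
        free_models()
    return 0


try:
    run([ok_stage, failing_stage, never_reached])
except RuntimeError as exc:
    events.append(f"propagated: {exc}")

print(events)
# ['ok', 'failing', 'free_models', 'propagated: stage failed']
```

Stages after the failing one do not run, and the final "Pipeline complete" log line in the listing is likewise skipped on failure because it sits after the `finally` block.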

Orchestration stages

trialmatchai.orchestration

Idempotent end-to-end orchestration for TrialMatchAI.

Chains three of the pipeline stages (ingest patient inputs, build the search index, run matching) and skips work that is already done:

ingest: a patient is skipped if its canonical profile already exists.
index: the stage is skipped if the search tables already exist.
match: a patient is skipped if it already has a non-empty ranked_trials.json.

Both the general trialmatchai e2e command and the TREC preset are thin wrappers over these stages, so idempotency behaves identically everywhere.
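As an illustration of the match skip rule described above, the following sketch checks whether a patient still needs matching. The helper name `needs_matching`, the one-directory-per-patient layout, and the reading of "non-empty" as "file exists and has at least one byte" are all assumptions made for this example, not the library's implementation.

```python
import tempfile
from pathlib import Path


def needs_matching(patient_dir: Path) -> bool:
    """Return True when the patient still has to be matched (hypothetical helper)."""
    ranked = patient_dir / "ranked_trials.json"
    # Assumption: "non-empty" means the file exists and has at least one byte.
    return not (ranked.is_file() and ranked.stat().st_size > 0)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    done, empty, missing = root / "p1", root / "p2", root / "p3"
    for d in (done, empty, missing):
        d.mkdir()
    # p1 has results, p2 has an empty file, p3 has no file at all.
    (done / "ranked_trials.json").write_text('["placeholder"]')
    (empty / "ranked_trials.json").write_text("")
    todo = [d.name for d in (done, empty, missing) if needs_matching(d)]

print(todo)  # ['p2', 'p3']
```

Treating an empty file the same as a missing one lets an interrupted run be retried without a force flag.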

ingest_inputs

ingest_inputs(
    config,
    inputs,
    *,
    input_format="auto",
    with_entities=True,
    force=False,
)

Import patient inputs (any supported format) into canonical profiles.

Skips a patient whose profile already exists unless force is set. Returns the number of profiles available afterwards.
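The skip-unless-force contract and the return value can be modelled with a toy in-memory store. This is only a sketch: `ingest` and the dictionary store are hypothetical and do not reflect how `ingest_inputs` reads inputs or writes profiles.

```python
def ingest(profiles: dict[str, str], incoming: dict[str, str], *, force: bool = False) -> int:
    """Toy model of the skip-unless-force contract; not the library's code."""
    for patient_id, profile in incoming.items():
        if patient_id in profiles and not force:
            continue  # profile already exists: skip it
        profiles[patient_id] = profile
    return len(profiles)  # number of profiles available afterwards


store = {"p1": "v1"}

# Without force, the existing p1 profile is kept and only p2 is added.
count = ingest(store, {"p1": "v2", "p2": "v1"})
kept = store["p1"]

# With force, p1 is overwritten; the count stays the same.
forced_count = ingest(store, {"p1": "v2"}, force=True)
replaced = store["p1"]

print(count, kept)             # 2 v1
print(forced_count, replaced)  # 2 v2
```

The returned count includes skipped patients, matching the description "profiles available afterwards" rather than "profiles written in this call".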