Citation-Grounded AI for Medical Information and MSL Use

Umair Khan · 9 min read
Medical Affairs · GraphRAG · Medical Information · Citations · MSL

A field medical team gets an unsolicited question from an oncologist: what's the evidence for using a particular targeted agent after progression on a first-generation TKI? The MSL needs an answer that's accurate, current, and citable, because anything they communicate has to be on-label, defensible, and traceable to the source. If they paste that question into a general-purpose chatbot, they might get a fluent paragraph. They might also get a fabricated trial name, a citation to a paper that doesn't exist, or a confident claim about a drug indication that was never approved.

For medical affairs, fluent and wrong is worse than no answer at all. The whole function exists to be the trustworthy scientific interface. That's why the question isn't "can AI write a good-sounding medical response," but "can the system show its work, and will it shut up when it doesn't actually know." Those two properties, provenance and refusal, are what separate a tool a medical affairs team can use from a liability.

What does citation-grounded AI mean for medical information?

Citation-grounded AI means every claim in a generated response points to a specific source document the system actually retrieved, with an inline citation a human can open and verify, rather than text the model produced from its training distribution. If a claim has no source behind it, a grounded system doesn't make the claim.

The contrast is with an ungrounded large language model, which generates plausible text from patterns it learned during training. It has no live connection to the literature, no way to distinguish a real trial from a statistically likely-sounding one, and no mechanism to tell you where a statement came from, because there is no "where." The fluency is real; the grounding is absent.

A grounded system inverts the order of operations. First it retrieves the actual evidence (papers, abstracts, drug labels, trial records) for the question. Then any generated text is constrained to what those retrieved sources support, and each statement carries the citation back to its source. The model's job shrinks from "know oncology" to "summarize these specific retrieved documents faithfully, and cite them." That's a job a language model can do reliably. Knowing all of oncology from memory is not.
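As a minimal sketch of that order of operations, the snippet below keeps a drafted claim only when every citation on it points at a document that was actually retrieved. The `Source` and `Claim` types, the `keep_grounded` and `render` helpers, and the `SRC-…` identifiers are all hypothetical names chosen for illustration; this is not UNMIRI's implementation. It checks that a citation exists and resolves, not that the source truly supports the claim, which still needs a faithfulness check and a human reader.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    source_id: str  # placeholder identifier, e.g. where a PubMed ID would go
    title: str


@dataclass(frozen=True)
class Claim:
    text: str
    source_ids: tuple[str, ...]  # citations the drafting step attached


def keep_grounded(claims: list[Claim], retrieved: list[Source]) -> list[Claim]:
    """Keep only claims that cite at least one source, all of them retrieved."""
    known = {s.source_id for s in retrieved}
    return [
        c for c in claims
        if c.source_ids and all(sid in known for sid in c.source_ids)
    ]


def render(claims: list[Claim]) -> str:
    """Join claims into text, each followed by its inline citation marker."""
    return " ".join(f"{c.text} [{', '.join(c.source_ids)}]" for c in claims)


# Illustrative data only: none of these are real documents or findings.
retrieved = [Source("SRC-1", "Placeholder trial report")]
drafted = [
    Claim("Claim supported by a retrieved source.", ("SRC-1",)),
    Claim("Claim with no source behind it.", ()),
    Claim("Claim citing a document that was never retrieved.", ("SRC-9",)),
]
answer = keep_grounded(drafted, retrieved)
print(render(answer))
```

Only the first claim survives. The uncited claim and the claim pointing at an unretrieved document are dropped rather than softened, which is the behavior the paragraph above describes.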

UNMIRI's Engine 3 does literature Q&A this way: it retrieves from live PubMed and Europe PMC full text, ranks results using a Neo4j evidence graph carrying CIViC and ClinVar citations, and returns answers with inline citations and provenance attached to each. The user can open every source. Nothing is asserted without a pointer.

Why does refusal matter as much as accuracy?

A medical information tool that always answers is dangerous, because the cases where the evidence is thin are exactly the cases where a confident wrong answer does the most harm. A grounded system that refuses when retrieval comes back empty is doing the single most important thing right.

This is counterintuitive if you're used to consumer AI, where an answer for every prompt is the product. In medical affairs it's the opposite. The valuable behavior is the system saying, in effect, "the retrieved literature doesn't support a confident answer to this," and stopping there. That's not a failure. That's the system correctly identifying the edge of the evidence.

Ungrounded models almost never do this. Asked about a variant-drug combination that has no real evidence base, a general LLM will usually still produce something, because producing text is what it does. A grounded system has a natural off-switch: if retrieval returns nothing of substance, there's nothing to ground a claim on, so it declines rather than inventing. UNMIRI's Q&A is built to refuse on thin evidence instead of hallucinating, and for a medical affairs team that boundary is the whole point. You can trust a tool that knows what it doesn't know far more than one that always sounds sure.
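The off-switch can be sketched as a gate that runs before any text is generated. Everything here is an assumption made for illustration: the `Passage` and `Response` types, the `answer_or_refuse` function, the relevance score in [0, 1], and the two thresholds. A real system would tune what counts as "nothing of substance" against its own retrieval layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

REFUSAL_TEXT = (
    "The retrieved literature does not support a confident answer to this question."
)


@dataclass(frozen=True)
class Passage:
    source_id: str
    relevance: float  # hypothetical retrieval score in [0, 1]


@dataclass(frozen=True)
class Response:
    refused: bool
    text: str
    source_ids: Tuple[str, ...]


def answer_or_refuse(
    passages: List[Passage],
    summarize: Callable[[List[Passage]], str],
    min_relevance: float = 0.5,  # assumed threshold, not a published value
    min_sources: int = 1,        # assumed threshold, not a published value
) -> Response:
    """Generate only when enough distinct, relevant sources were retrieved."""
    usable = [p for p in passages if p.relevance >= min_relevance]
    distinct = sorted({p.source_id for p in usable})
    if len(distinct) < min_sources:
        # Nothing to ground a claim on: decline instead of generating.
        return Response(True, REFUSAL_TEXT, ())
    return Response(False, summarize(usable), tuple(distinct))


def toy_summarizer(passages: List[Passage]) -> str:
    """Stand-in for the generation step; a real one would call a model."""
    return f"Summary of {len(passages)} retrieved passage(s)."


empty = answer_or_refuse([], toy_summarizer)
weak = answer_or_refuse([Passage("SRC-1", 0.1)], toy_summarizer)
strong = answer_or_refuse([Passage("SRC-1", 0.9)], toy_summarizer)
print(empty.refused, weak.refused, strong.refused)
```

The design point is that the summarizer is never called on the refusal path, so there is no generated text to leak into the response.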

How does GraphRAG improve grounding over vector search alone?

A typed evidence graph resolves entity identity before retrieval, so a question about EGFR L858R doesn't pull evidence for EGFR T790M just because the two appear in the same review. Vector similarity collapses that distinction; the graph preserves it, which is what makes the citations actually correct.

This is the same architectural point that governs the rest of UNMIRI's stack, and it matters just as much for literature Q&A. Two variants in the same gene can sit in adjacent sentences of a review article, score near-identical under cosine similarity, and point to completely different drugs. A retrieval layer that ranks purely on embedding similarity will happily return T790M evidence for an L858R question. The citation will look real. It will be attached to the wrong claim.
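A toy calculation shows how little a single variant token moves a similarity score. The sketch uses bag-of-words vectors as a stand-in for learned embeddings, which is an assumption made to keep the example self-contained; real embedding models behave differently in detail, and the function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with counts; a crude stand-in for an embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


q_l858r = "evidence for EGFR L858R in lung cancer"
q_t790m = "evidence for EGFR T790M in lung cancer"
score = cosine(bag_of_words(q_l858r), bag_of_words(q_t790m))
print(round(score, 3))
```

Six of the seven tokens are shared, so the score is 6/7, about 0.86, even though the one token that differs is the only one that determines which drug the evidence is about.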

GraphRAG fixes this by resolving the entities first. The gene, the variant, the drug, the tumor type are discrete nodes with typed relationships, anchored to public knowledge-base identifiers (CIViC entries, ClinVar records). Retrieval starts from the right node and traverses to the literature attached to it, so the evidence that surfaces is evidence about the thing actually asked. The full argument for why this matters in oncology is in the vector-RAG post; the short version is that in a field where one amino acid changes the drug, similarity is not identity, and citations are only useful if they're attached to the right entity.
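Resolving the entity first can be sketched with a small in-memory graph. This is not Neo4j and not UNMIRI's schema: the `EvidenceGraph` class, the `variant:EGFR:…` node identifiers, and the `SRC-…` source identifiers are hypothetical, and a production graph would anchor nodes to real CIViC and ClinVar identifiers instead.

```python
from typing import Dict, List, Optional, Set


class EvidenceGraph:
    """Toy typed graph: variant nodes, alias lookup, and evidence edges."""

    def __init__(self) -> None:
        self._aliases: Dict[str, str] = {}        # surface form -> node id
        self._evidence: Dict[str, Set[str]] = {}  # node id -> source ids

    def add_variant(self, node_id: str, aliases: List[str]) -> None:
        self._evidence.setdefault(node_id, set())
        for alias in aliases:
            self._aliases[alias.lower()] = node_id

    def link_evidence(self, node_id: str, source_id: str) -> None:
        if node_id not in self._evidence:
            raise KeyError(f"unknown node: {node_id}")
        self._evidence[node_id].add(source_id)

    def resolve(self, mention: str) -> Optional[str]:
        """Map a mention to exactly one node, or None if it is unknown."""
        return self._aliases.get(mention.lower())

    def evidence_for(self, mention: str) -> List[str]:
        """Resolve first, then return only sources attached to that node."""
        node_id = self.resolve(mention)
        if node_id is None:
            return []
        return sorted(self._evidence[node_id])


graph = EvidenceGraph()
graph.add_variant("variant:EGFR:L858R", ["EGFR L858R", "EGFR p.Leu858Arg"])
graph.add_variant("variant:EGFR:T790M", ["EGFR T790M", "EGFR p.Thr790Met"])
graph.link_evidence("variant:EGFR:L858R", "SRC-A")  # placeholder source ids
graph.link_evidence("variant:EGFR:T790M", "SRC-B")

print(graph.evidence_for("egfr l858r"))
print(graph.evidence_for("EGFR G719X"))
```

The L858R lookup can only reach sources linked to the L858R node, however similar the surrounding text is, and an unrecognized mention returns nothing rather than the nearest neighbor.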

So the pipeline is: live PubMed and Europe PMC retrieval for breadth and recency, ranked and disambiguated through the Neo4j evidence graph for correctness, with CIViC and ClinVar providing the curated, identifier-anchored backbone. Each returned answer carries inline citations and a provenance trail. The evidence tiers, where they apply, use the public AMP/ASCO/CAP 2017 framework (I-A, I-B, II-C, II-D, III, IV), so the grading is a published standard a reviewer can check, not a proprietary black box.
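Because the tier labels are a fixed public vocabulary, they can be validated and ordered mechanically. The sketch below sorts evidence items by the tier sequence listed above and rejects any label outside it. Using the tiers as a sort key is an assumption made for illustration; the function name and the `SRC-…` identifiers are hypothetical.

```python
from typing import List, Tuple

# Tier labels in the order listed above (AMP/ASCO/CAP 2017 framework).
TIER_ORDER = ("I-A", "I-B", "II-C", "II-D", "III", "IV")
TIER_RANK = {tier: rank for rank, tier in enumerate(TIER_ORDER)}


def sort_by_tier(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (source_id, tier) pairs by tier; reject labels outside the framework."""
    unknown = sorted({tier for _, tier in items if tier not in TIER_RANK})
    if unknown:
        raise ValueError(f"unrecognized tier label(s): {unknown}")
    return sorted(items, key=lambda item: TIER_RANK[item[1]])


# Illustrative data only: placeholder sources with arbitrary tiers.
evidence = [("SRC-3", "II-C"), ("SRC-1", "I-A"), ("SRC-4", "III"), ("SRC-2", "I-B")]
ordered = sort_by_tier(evidence)
print(ordered)
```

Rejecting unknown labels outright, rather than sorting them last, keeps a mistyped or proprietary grade from slipping into output that a reviewer expects to check against the published framework.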

How does this fit a medical information inquiry workflow?

A grounded system drafts a citation-backed response to an inquiry, and a human reviews and approves it before anything is sent. The AI accelerates the literature work; it never auto-sends, and it never substitutes for medical review.

Medical information at an oncology biotech runs on inquiries: an HCP asks a question, the team responds with an accurate, on-label, referenced answer, and the whole exchange is logged. The slow part is the literature work: finding the right papers, pulling the right evidence, assembling the citations. That's where grounded AI earns its place.

UNMIRI's medical-information inquiry workflow drafts a response that's citation-grounded from the start, with every supporting source attached. Then a person reviews it. The draft is always human-review-only and is never auto-sent. That design choice is deliberate and non-negotiable: the AI does the retrieval and the first-draft assembly, the medical reviewer does the judgment, and the audit trail captures both. It compresses the time to a good response without removing the human accountability that the function requires.
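One way to make "never auto-sent" structural rather than procedural is to give the draft a status that only a named reviewer can advance. The sketch below is a hypothetical model of that idea, not UNMIRI's code: the `InquiryDraft` class, the `Status` values, and the `ReviewRequired` exception are all assumed names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    SENT = "sent"


class ReviewRequired(Exception):
    """Raised when something tries to send a draft no human has approved."""


@dataclass
class InquiryDraft:
    inquiry_id: str
    body: str
    source_ids: Tuple[str, ...]
    status: Status = Status.DRAFT
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._log("system", "drafted")

    def _log(self, actor: str, action: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((stamp, actor, action))

    def approve(self, reviewer: str) -> None:
        """Only a named human reviewer moves a draft to APPROVED."""
        if not reviewer.strip():
            raise ValueError("a named reviewer is required")
        if self.status is not Status.DRAFT:
            raise ValueError(f"cannot approve from status {self.status.value}")
        self.status = Status.APPROVED
        self._log(reviewer, "approved")

    def send(self, sender: str) -> None:
        """Sending is impossible unless approval has already happened."""
        if self.status is not Status.APPROVED:
            raise ReviewRequired(f"inquiry {self.inquiry_id} has not been approved")
        self.status = Status.SENT
        self._log(sender, "sent")


draft = InquiryDraft("INQ-001", "Draft response text.", ("SRC-1",))
draft.approve("reviewer@example.com")
draft.send("medinfo@example.com")
print([action for _, _, action in draft.audit_log])
```

The audit log records who drafted, who approved, and who sent, so both the AI's contribution and the reviewer's judgment are captured in the same trail.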

The same posture extends to drug-safety context. When a response needs label information, UNMIRI pulls boxed warnings, indications, and warnings from openFDA, and can surface FAERS real-world adverse-event signal counts with the caveats stated plainly: these are spontaneous reports, subject to reporting bias, with no denominator, so a signal count is not an incidence rate and not a causal claim. Presenting that honestly, with the limitations attached, is itself a form of grounding. The alternative (a clean-looking number with no caveat) is the kind of confident-but-misleading output the whole architecture exists to prevent.
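The same principle can be enforced in the output layer: a signal count is only ever rendered together with its caveats. This sketch makes no openFDA calls and uses placeholder values throughout; the `SignalCount` type, the `format_signal` function, and the example drug, event, and count are invented for illustration and are not real data.

```python
from dataclasses import dataclass

# Caveats paraphrased from the limitations described above.
FAERS_CAVEATS = (
    "Spontaneous reports, subject to reporting bias.",
    "No denominator: a report count is not an incidence rate.",
    "A report count is not a causal claim.",
)


@dataclass(frozen=True)
class SignalCount:
    drug: str
    event: str
    report_count: int


def format_signal(signal: SignalCount) -> str:
    """Render a count with every caveat attached; there is no caveat-free path."""
    if signal.report_count < 0:
        raise ValueError("report_count cannot be negative")
    lines = [f"{signal.drug} / {signal.event}: {signal.report_count} FAERS reports"]
    lines.extend(f"  Caveat: {caveat}" for caveat in FAERS_CAVEATS)
    return "\n".join(lines)


# Placeholder values, not real adverse-event data.
example = SignalCount(drug="DRUG-X", event="EVENT-Y", report_count=12)
rendered = format_signal(example)
print(rendered)
```

Keeping the caveats inside the only rendering function means a downstream template cannot show the clean-looking number on its own.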

What should a medical affairs leader ask a vendor?

Ask three things: can every claim in a response be traced to a specific source document, what does the system do when the evidence is thin, and does anything get sent without human review. A vendor whose answer to the first is "the model knows" is selling fluency, not grounding.

Those three questions cut through most of the AI pitch in this space. Traceability tells you whether the citations are real and correctly attached. Refusal behavior tells you whether the system respects the edge of the evidence. The human-review boundary tells you whether the vendor understands that medical affairs cannot delegate accountability to a model. If the answers are clean citations on every claim, an explicit refusal path, and human-in-the-loop on everything that goes out, you're looking at a tool built for this work. If not, you're looking at a chatbot with a medical theme.

Engine 3, UNMIRI's literature-intelligence product, is built to pass that test: grounded retrieval over live PubMed and Europe PMC, disambiguated through a Neo4j evidence graph on CIViC and ClinVar, inline citations on every answer, refusal on thin evidence, and human-review-only drafting throughout. It's in early access, and we're precise about that. For a medical affairs team that needs AI it can actually defend, grounding isn't a feature. It's the prerequisite.

Frequently asked questions

What does citation-grounded AI mean for medical information?

Every claim in a generated response points to a specific source document the system actually retrieved, with an inline citation a human can open and verify, rather than text produced from a model's training distribution. If a claim has no source behind it, a grounded system doesn't make the claim. UNMIRI's Engine 3 retrieves from live PubMed and Europe PMC, ranks through a Neo4j evidence graph carrying CIViC and ClinVar citations, and attaches provenance to every answer.

Why should a medical information tool refuse to answer?

Because the cases where the evidence is thin are exactly the cases where a confident wrong answer does the most harm. An ungrounded model almost always produces something, even with no real evidence base. A grounded system has a natural off-switch: if retrieval returns nothing of substance, there's nothing to ground a claim on, so it declines rather than inventing. For medical affairs, a tool that knows what it doesn't know is more trustworthy than one that always sounds sure.

Does the AI send medical information responses automatically?

No. UNMIRI's medical-information inquiry workflow drafts a citation-grounded response with every supporting source attached, then a human reviews and approves before anything is sent. The draft is always human-review-only and is never auto-sent. The AI accelerates the literature work; the medical reviewer keeps the judgment and the accountability, and the audit trail captures both.

Umair Khan

Founder and CTO, UNMIRI

Building UNMIRI, a precision oncology infrastructure company with four product surfaces: cross-vendor NGS interpretation, genomics-aware decision support, oncology literature intelligence, and a free cross-vendor unification tool for clinicians. Writing here on architecture, clinical data, and HIPAA-ready AI.

Clinical advisors: UNMIRI is in active conversations with multiple board-certified pathologists about formal advisory roles. Public introductions land on the About page once each engagement is formalized and the advisor approves being named.
