Mapping Oncology KOLs From Open Data: A Practical Guide

Umair Khan · 8 min read
Medical Affairs · KOL · OpenAlex · ORCID · MSL

Every medical affairs team keeps a KOL list. The list is usually a spreadsheet, it's usually six months out of date, and it usually reflects who the team already knew rather than who's actually shaping the field right now. Think of the person who published the resistance-mechanism paper that everyone's citing this quarter, the rising investigator running three trials in your subtype, or the author whose ASCO abstract just reframed a sequencing question. They're often not on the list, because the list was built from relationships, not from data.

KOL mapping from public data flips that. Instead of starting from who you know, you start from the open record of who's actually doing the work, and you let the relationships follow. The data to do this well is public, attributable, and queryable.

What open data can you use to identify oncology KOLs?

The core public sources are OpenAlex for publications and citation counts, ORCID for disambiguated author identity, ROR for institution identity, ClinicalTrials.gov for trial principal-investigator activity, and congress abstract metadata for presentation footprint. Together they describe an author's scholarly and clinical output without any proprietary KOL database.

Each source answers a different question:

  • OpenAlex is an open catalog of scholarly works with author and citation data. It answers: what has this person published, how often is it cited, and on what topics. Citation counts are a blunt instrument on their own, but as one signal among several they tell you about durable influence versus a single lucky paper.
  • ORCID is the persistent identifier for a researcher. It solves the disambiguation problem that wrecks every naive KOL list: there are a lot of authors named J. Smith, and an ORCID tells you which one is which. If you've ever merged two people or split one person into two on a spreadsheet, ORCID is the fix.
  • ROR (Research Organization Registry) does the same for institutions. It maps "MD Anderson," "M.D. Anderson Cancer Center," and "UT MD Anderson" to one canonical identifier, so you can see an institution's full footprint instead of three fragments of it.
  • ClinicalTrials.gov tells you who's running trials, in what phase, in which indication, at which site. Trial PI activity is one of the strongest signals of clinical influence in oncology, and it's entirely public.
  • Congress abstract metadata, pulled via Crossref from journal supplements (ASCO in JCO, AACR in Cancer Research, ESMO in Annals of Oncology, ASH in Blood), tells you who's presenting where. Presentation footprint is influence made visible, and it updates several times a year. The mechanics of that feed are covered in the congress-intelligence post.

None of this requires a licensed KOL graph. It's all CC0 or public-API data, which means the work is verifiable. A medical affairs director can trace any ranking back to the underlying records, which matters when you're justifying a budget or an advisory-board invite.
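One way to keep that traceability concrete is to log the exact query URL behind each pull, so a reviewer can open it and see the same records. The sketch below builds such a URL for OpenAlex. The helper name is hypothetical, and the filter names (`authorships.author.orcid`, `from_publication_date`) and the `per-page` parameter are written from memory of the OpenAlex API, so treat them as assumptions to check against its documentation before relying on them.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def openalex_works_url(orcid: str, from_date: str, per_page: int = 50) -> str:
    """Build a reproducible OpenAlex query URL for one author's recent works.

    Filter names are assumptions to verify against the OpenAlex docs.
    """
    filters = ",".join(
        [
            f"authorships.author.orcid:{orcid}",
            f"from_publication_date:{from_date}",
        ]
    )
    params = {"filter": filters, "per-page": per_page}
    # Keep ":" and "," unescaped so the logged URL stays human-readable.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':,')}"


# 0000-0002-1825-0097 is ORCID's documented sample iD (a fictional researcher).
url = openalex_works_url("0000-0002-1825-0097", "2024-01-01")
print(url)
```

Storing the URL next to each record costs nothing, and it turns "where did this number come from" into a link rather than a conversation.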

How do you turn that data into an influence score?

An influence score combines several public signals (recent publication volume in the relevant area, citation weight, trial PI roles, congress presentation footprint, and recency) rather than leaning on any single metric. The point is a defensible, multi-factor view, not a vanity h-index.

A single number is dangerous if it's a black box, so the right approach is a transparent composite. UNMIRI's KOL surfacing scores authors from the open signals above, scoped to the gene, drug, tumor type, or mechanism you care about. The scoping is what makes it useful: a global "top oncologist" list is noise, but "the most active investigators in MET exon 14 NSCLC over the last 18 months" is a list a medical affairs team can act on.
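As a rough illustration of what scoping means in practice, the sketch below filters a handful of records down to one topic and an 18-month window, then counts activity per author. The record layout, topic tags, and author keys are all hypothetical stand-ins; it is not a description of how UNMIRI implements scoping.

```python
from collections import Counter
from datetime import date


def months_back(as_of: date, months: int) -> date:
    """First day of the calendar month `months` months before `as_of`."""
    total = as_of.year * 12 + (as_of.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


def in_scope(records, required_topics, as_of, months=18):
    """Keep records that carry every required topic tag and fall in the window."""
    cutoff = months_back(as_of, months)
    return [
        r for r in records
        if required_topics <= r["topics"] and r["date"] >= cutoff
    ]


# Hypothetical records; "author" stands in for an ORCID iD.
RECORDS = [
    {"author": "author-a", "topics": {"MET exon 14", "NSCLC"}, "date": date(2025, 1, 10)},
    {"author": "author-a", "topics": {"MET exon 14", "NSCLC"}, "date": date(2022, 3, 1)},
    {"author": "author-b", "topics": {"HER2-low", "breast"}, "date": date(2025, 2, 1)},
    {"author": "author-c", "topics": {"MET exon 14", "NSCLC"}, "date": date(2024, 6, 5)},
    {"author": "author-c", "topics": {"MET exon 14", "NSCLC", "resistance"}, "date": date(2025, 3, 20)},
]

scoped = in_scope(RECORDS, {"MET exon 14", "NSCLC"}, as_of=date(2025, 6, 15))
activity = Counter(r["author"] for r in scoped)
print(activity.most_common())
```

The older record for author-a and the off-topic record for author-b both drop out, which is the point: the list reflects the question being asked, not lifetime output.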

A few design choices matter here:

  • Recency weighting. A landmark paper from 2014 matters, but a clinician who's published nothing in your area since then is a different kind of contact than one who presented at the last three congresses. The score should reflect current activity, not just lifetime achievement.
  • Topic scoping over global rank. Influence is local to a question. The right KOL for a HER2-low breast conversation is not the right KOL for a BRAF V600E melanoma conversation, even at the same institution.
  • Multiple signal types. A pure-citation ranking overweights review authors and senior PIs. Adding trial PI activity surfaces the people actually running the studies, and adding congress footprint surfaces the people actively presenting. Each signal corrects for the others' blind spots.
  • Transparency. Every score should decompose into its inputs. If an author ranks high, you should be able to see it's because of, say, four recent trials as PI plus two ASCO abstracts plus a heavily cited mechanism paper, each with a link back to the source record.

That last point is the difference between a tool a regulatory-minded medical affairs team will adopt and one they'll quietly distrust. A ranking you can't explain is a ranking you can't defend in front of compliance.
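A minimal sketch of a composite that decomposes into its inputs might look like the following. The weights, the two-year half-life, and the `Signal` structure are assumptions picked for illustration, not UNMIRI's actual scoring, and the identifiers are placeholders rather than real records.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed weights and half-life, chosen only for illustration.
WEIGHTS = {"publication": 1.0, "citation": 0.5, "trial_pi": 3.0, "congress": 2.0}
HALF_LIFE_DAYS = 730


@dataclass(frozen=True)
class Signal:
    kind: str          # one of the WEIGHTS keys
    source_id: str     # DOI or NCT ID, so the row links back to a public record
    when: date
    magnitude: float = 1.0


def recency_weight(when: date, as_of: date) -> float:
    """Exponential decay: a signal loses half its weight every HALF_LIFE_DAYS."""
    age_days = max((as_of - when).days, 0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def score(signals, as_of: date):
    """Return (total, rows); each row is one signal's contribution to the total."""
    rows = [
        (s.source_id, s.kind,
         round(WEIGHTS[s.kind] * s.magnitude * recency_weight(s.when, as_of), 3))
        for s in signals
    ]
    return round(sum(c for _, _, c in rows), 3), rows


AS_OF = date(2025, 1, 1)
# Placeholder identifiers, not real records.
SIGNALS = [
    Signal("trial_pi", "NCT00000000", AS_OF),
    Signal("congress", "10.0000/example-abstract", AS_OF - timedelta(days=HALF_LIFE_DAYS)),
]
total, rows = score(SIGNALS, AS_OF)
for source_id, kind, contribution in rows:
    print(f"{kind:<10} {source_id:<26} {contribution}")
print("total", total)
```

Because the total is just the sum of the printed rows, anyone reviewing a ranking can see which records drove it and argue with a specific weight instead of with the whole number.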

What does a KOL map actually do for an MSL team?

It powers two concrete workflows: pre-call preparation (a one-page brief on an investigator's recent publications, trials, and congress activity before an MSL meets them) and congress planning (which KOLs to prioritize at an upcoming meeting based on who's presenting and who's active in your area).

The pre-call pack is the everyday value. Before an MSL sits down with an investigator, they should know that investigator's last six months: what they've published, what trials they're running, where they presented, and which of your scientific topics intersect their work. Building that by hand is an hour of searching per call. Generating it from a live data layer is a click, and it's current rather than a snapshot from whenever someone last updated the deck.
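To make "a click" less abstract, here is a small sketch that renders a plain-text brief from already-collected records. The record fields, the 183-day window standing in for "six months", and the identifiers are all hypothetical; a real pack would pull these rows from the live sources.

```python
from datetime import date, timedelta

SECTIONS = (
    ("publication", "Publications"),
    ("trial", "Trials"),
    ("congress", "Congress presentations"),
)


def pre_call_brief(name, records, as_of, window_days=183):
    """Render a plain-text brief of one investigator's recent activity."""
    cutoff = as_of - timedelta(days=window_days)
    recent = sorted(
        (r for r in records if r["date"] >= cutoff),
        key=lambda r: r["date"],
        reverse=True,
    )
    lines = [f"Pre-call brief: {name} (activity since {cutoff.isoformat()})"]
    for kind, heading in SECTIONS:
        items = [r for r in recent if r["kind"] == kind]
        lines.append(f"{heading} ({len(items)})")
        lines.extend(
            f"  - {r['date'].isoformat()} {r['title']} [{r['source_id']}]"
            for r in items
        )
    return "\n".join(lines)


# Hypothetical records with placeholder identifiers.
RECORDS = [
    {"kind": "publication", "title": "Resistance mechanisms review",
     "source_id": "10.0000/example-1", "date": date(2025, 4, 2)},
    {"kind": "trial", "title": "Phase 2 study (PI)",
     "source_id": "NCT00000000", "date": date(2025, 2, 14)},
    {"kind": "publication", "title": "Older cohort analysis",
     "source_id": "10.0000/example-2", "date": date(2023, 9, 1)},
]

brief = pre_call_brief("Dr. Example", RECORDS, as_of=date(2025, 6, 1))
print(brief)
```

Every line carries its source identifier, so the MSL can open the underlying paper or trial record straight from the brief.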

Congress planning is the periodic value. Before ASCO or ESMO, a medical team wants to know which of the people presenting are high-priority for their asset, and which sessions to staff. A KOL map that ingests the congress abstract feed answers that directly, ranked and scoped, instead of leaving it to whoever happens to recognize names in the program.
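The same scoped scores can drive a staffing shortlist. The sketch below joins a hypothetical abstract feed to per-author scores and ranks both presenters and sessions. The field names, the threshold, and the choice to sum scores per abstract within a session are assumptions made for the example.

```python
from collections import defaultdict


def plan_congress(abstracts, scores, min_score=1.0):
    """Rank presenters and sessions using topic-scoped influence scores.

    abstracts: dicts with "presenter" (an author key) and "session".
    scores: author key -> scoped score; authors without a score count as 0.
    Session totals sum one score per abstract, so two abstracts by the same
    presenter in one session count twice.
    """
    per_presenter = {}
    per_session = defaultdict(float)
    for a in abstracts:
        s = scores.get(a["presenter"], 0.0)
        if s < min_score:
            continue  # below the priority threshold for this asset
        per_presenter[a["presenter"]] = s
        per_session[a["session"]] += s

    def by_score(item):
        return (-item[1], item[0])  # highest score first, ties broken by name

    return (
        sorted(per_presenter.items(), key=by_score),
        sorted(per_session.items(), key=by_score),
    )


# Hypothetical feed and scores.
ABSTRACTS = [
    {"presenter": "author-a", "session": "Lung oral"},
    {"presenter": "author-c", "session": "Lung oral"},
    {"presenter": "author-c", "session": "Poster hall B"},
    {"presenter": "author-z", "session": "Poster hall B"},
]
SCORES = {"author-a": 2.5, "author-c": 4.0}

presenters, sessions = plan_congress(ABSTRACTS, SCORES)
print(presenters)
print(sessions)
```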

And because the whole thing is built on identifiers (ORCID for people, ROR for institutions, NCT IDs for trials, DOIs for publications and abstracts), it stays clean over time. You're not re-deduplicating a spreadsheet every quarter. The identity layer does that work.
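As a sketch of what the identity layer buys you, the code below validates ORCID iDs with their check character (ISO 7064 MOD 11-2, as described in ORCID's documentation of the identifier structure) and then merges name variants under one iD. The helper names and the row layout are hypothetical.

```python
import re
from collections import defaultdict

ORCID_RE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_is_valid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_RE.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return chars[-1] == ("X" if check == 10 else str(check))


def group_by_orcid(rows):
    """Merge name variants under one ORCID iD; set aside rows that fail validation."""
    people, rejected = defaultdict(set), []
    for row in rows:
        if orcid_is_valid(row["orcid"]):
            people[row["orcid"]].add(row["name"])
        else:
            rejected.append(row)
    return dict(people), rejected


# 0000-0002-1825-0097 is ORCID's documented sample iD (a fictional researcher).
ROWS = [
    {"name": "J. Carberry", "orcid": "0000-0002-1825-0097"},
    {"name": "Josiah Carberry", "orcid": "0000-0002-1825-0097"},
    {"name": "J. Carberry", "orcid": "0000-0002-1825-0098"},  # mistyped last digit
]
people, rejected = group_by_orcid(ROWS)
print(people)
print(len(rejected), "row(s) need review")
```

The two spellings collapse into one person, and the mistyped iD is flagged for review instead of silently creating a second "J. Carberry".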

How does this avoid the licensing and privacy traps?

It uses only open, attributable public data about professional scholarly and clinical output. There's no PHI involved at any point, because medical affairs operates above the patient layer, and there's no dependency on a proprietary KOL graph that you can't audit or afford.

Two things to be precise about. First, privacy: KOL mapping is about public professional activity, the same kind of information that appears on a faculty page or a published author byline. No patient data touches it, by design, which is consistent with how UNMIRI treats every medical affairs workflow. The queries are gene, drug, tumor type, author, and institution. None of those are PHI.

Second, licensing: the established KOL-intelligence vendors run large proprietary graphs at six- and seven-figure annual costs, and you generally can't see how the rankings are produced. Building from OpenAlex, ORCID, ROR, ClinicalTrials.gov, and open congress metadata means the inputs are public and the method is inspectable. For a 50-to-500-person biotech that needs defensible KOL intelligence without an enterprise-scale contract, that's the difference between having this capability and not.

Putting it together

Start from the open record of who's doing the work, disambiguate people with ORCID and institutions with ROR, combine publication, citation, trial, and congress signals into a transparent score scoped to your scientific questions, and keep every number traceable to its source. That gives you a KOL map that's current, defensible, and yours to audit, feeding pre-call packs and congress planning instead of a stale spreadsheet.

Engine 3, UNMIRI's literature-intelligence product, builds this map on the same evidence graph that drives its congress and literature surveillance. It's in early access. If you're rebuilding a KOL list for the next planning cycle, this is the data layer worth building it on.

Frequently asked questions

What open data identifies oncology KOLs?

OpenAlex for publications and citation counts, ORCID for disambiguated author identity, ROR for institution identity, ClinicalTrials.gov for trial principal-investigator activity, and congress abstract metadata (pulled via Crossref from journal supplements) for presentation footprint. Together these describe an author's scholarly and clinical output without any proprietary KOL database, and every ranking traces back to a public, verifiable record.

How do you score KOL influence defensibly?

Combine several public signals rather than leaning on a single metric: recent publication volume in the relevant area, citation weight, trial PI roles, congress presentation footprint, and recency. Scope the score to a specific gene, drug, tumor type, or mechanism rather than producing a global rank, and decompose every score into its inputs so a medical affairs team can see and defend why an author ranks where they do.

Does KOL mapping involve PHI or licensed databases?

No to both. KOL mapping uses public professional data (the same kind of information on a faculty page or a published byline), so no patient data is involved, consistent with how UNMIRI treats every medical affairs workflow. And because it builds from OpenAlex, ORCID, ROR, ClinicalTrials.gov, and open congress metadata, it does not depend on a proprietary six- or seven-figure KOL graph that you can't audit.

Umair Khan

Founder and CTO, UNMIRI

Building UNMIRI, a precision oncology infrastructure company with four product surfaces: cross-vendor NGS interpretation, genomics-aware decision support, oncology literature intelligence, and a free cross-vendor unification tool for clinicians. Writing here on architecture, clinical data, and HIPAA-ready AI.

Clinical advisors: UNMIRI is in active conversations with multiple board-certified pathologists about formal advisory roles. Public introductions land on the About page once each engagement is formalized and the advisor approves being named.
