Clinical Research Series — Blog 4: Evidence Grading & Citation Quality#
Not all evidence is equal. A meta-analysis of 10,000 patients outweighs a single case report.
Ask a doctor about a treatment, and they won't just tell you "studies show it works." They'll tell you what kind of studies. A systematic review of 15 randomized controlled trials is fundamentally different from a single case report. Both are indexed in PubMed. Both have PMIDs. But one represents the highest level of medical evidence, and the other is an anecdote with a DOI.
Our Clinical Research Agent needed to understand this distinction. When it returns 10 articles about GLP-1 receptor agonists, the user needs to know which ones are meta-analyses (trust these), which are RCTs (strong evidence), and which are case reports (interesting but not conclusive). Without evidence grading, a list of PubMed articles is just a list. With it, it becomes a structured evidence synthesis.
The Clinical Research Series#
| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & Domain Analysis | System design, evidence hierarchy, medical APIs |
| 2 | PubMed & ClinicalTrials.gov Integration | E-utilities, v2 API, OpenFDA, circuit breakers |
| 3 | Safety & Hallucination Prevention | Uncited claim detection, credibility scoring |
| 4 | Evidence Grading & Citation Quality (this post) | Level I-V, PMID/NCT citations, journal tiers |
| 5 | Medical Frontend & Production Readiness | Teal theme, EvidencePanel, SafetyAlert |
The Evidence Hierarchy#
Evidence-Based Medicine (EBM) organizes research into a hierarchy based on study design. This isn't opinion; it's a well-established framework that is widely taught in medical schools. The idea is simple: study designs that reduce bias produce more reliable results.
Here's the hierarchy, from strongest to weakest:
| Level | Study Design | Why It's Strong (or Weak) | Badge Color |
|---|---|---|---|
| Level I | Systematic reviews, meta-analyses | Aggregates multiple studies, reduces individual study bias | Emerald |
| Level II | Randomized controlled trials (RCTs) | Random assignment eliminates selection bias | Blue |
| Level III | Non-randomized controlled studies | Controlled but susceptible to selection bias | Amber |
| Level IV | Case series, case reports | Descriptive only, no control group | Orange |
| Level V | Expert opinion, editorials | No empirical data, authority-based | Gray |
A Level I meta-analysis that pools 10 RCTs with 10,000 total patients tells you something very different from a Level IV case report about 3 patients. Both might support the same conclusion, but the meta-analysis provides dramatically stronger evidence. The badge colors in our frontend make this hierarchy instantly visible — emerald for the best evidence, gray for the weakest.
The EvidenceLevel Enum#
The grading system starts with a Python enum and a dataclass that pairs each level with its display metadata:
```python
from dataclasses import dataclass
from enum import Enum


class EvidenceLevel(Enum):
    LEVEL_I = "I"      # Systematic reviews, meta-analyses
    LEVEL_II = "II"    # Randomized controlled trials (RCTs)
    LEVEL_III = "III"  # Non-randomized controlled studies
    LEVEL_IV = "IV"    # Case series, case reports
    LEVEL_V = "V"      # Expert opinion, editorials, reviews


@dataclass
class EvidenceGrade:
    level: EvidenceLevel
    label: str
    description: str
    color: str  # For frontend badge
```
The color field is intentional. Every EvidenceGrade carries its own badge color because the frontend needs to render evidence levels as colored badges. By embedding the color in the data model rather than mapping it in the frontend, we keep the single source of truth in the backend. The frontend just reads article.evidence_level and renders the corresponding badge — no translation layer needed.
The grade info dictionary maps each level to its full display data:
```python
_GRADE_INFO = {
    EvidenceLevel.LEVEL_I: EvidenceGrade(
        level=EvidenceLevel.LEVEL_I,
        label="Level I",
        description="Systematic review or meta-analysis",
        color="emerald",
    ),
    EvidenceLevel.LEVEL_II: EvidenceGrade(
        level=EvidenceLevel.LEVEL_II,
        label="Level II",
        description="Randomized controlled trial",
        color="blue",
    ),
    EvidenceLevel.LEVEL_III: EvidenceGrade(
        level=EvidenceLevel.LEVEL_III,
        label="Level III",
        description="Non-randomized controlled study",
        color="amber",
    ),
    EvidenceLevel.LEVEL_IV: EvidenceGrade(
        level=EvidenceLevel.LEVEL_IV,
        label="Level IV",
        description="Case series or case report",
        color="orange",
    ),
    EvidenceLevel.LEVEL_V: EvidenceGrade(
        level=EvidenceLevel.LEVEL_V,
        label="Level V",
        description="Expert opinion or editorial",
        color="gray",
    ),
}
```
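To make the single-source-of-truth idea concrete, here is a minimal sketch of how a grade might be flattened into a JSON-friendly payload for the frontend. The `grade_to_payload` helper and the payload keys are assumptions for illustration, not part of the project's code; the enum and dataclass are trimmed copies of the definitions above.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class EvidenceLevel(Enum):
    LEVEL_I = "I"
    LEVEL_V = "V"


@dataclass
class EvidenceGrade:
    level: EvidenceLevel
    label: str
    description: str
    color: str


def grade_to_payload(grade: EvidenceGrade) -> dict:
    """Hypothetical helper: turn a grade into plain JSON-friendly values."""
    payload = asdict(grade)
    # asdict() keeps the Enum member as-is, so swap in its string value.
    payload["level"] = grade.level.value
    return payload


grade = EvidenceGrade(
    level=EvidenceLevel.LEVEL_I,
    label="Level I",
    description="Systematic review or meta-analysis",
    color="emerald",
)
payload = grade_to_payload(grade)
print(payload)
```

With a payload like this, the badge color travels with the grade, so the frontend never needs its own level-to-color table.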
Publication Type Mapping#
PubMed articles come with a PublicationType field in their XML metadata. This is the primary signal for evidence grading. A publication type of "Meta-Analysis" maps directly to Level I. A publication type of "Randomized Controlled Trial" maps to Level II.
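As a rough illustration of where that signal comes from, the sketch below pulls `PublicationType` values out of an article record with the standard library's `xml.etree.ElementTree`. The sample XML is hand-written and heavily simplified, and `extract_publication_types` is a hypothetical helper name, not a function from the project.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a PubMed article record.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <ArticleTitle>Example title</ArticleTitle>
      <PublicationTypeList>
        <PublicationType>Journal Article</PublicationType>
        <PublicationType>Meta-Analysis</PublicationType>
      </PublicationTypeList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def extract_publication_types(article_xml: str) -> list[str]:
    """Return the text of every PublicationType element, in document order."""
    root = ET.fromstring(article_xml.strip())
    return [el.text.strip() for el in root.iter("PublicationType") if el.text]


types = extract_publication_types(SAMPLE_XML)
print(types)
```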
The mapping uses Python sets for O(1) lookup:
```python
_LEVEL_I_TYPES = {
    "meta-analysis", "systematic review", "systematic reviews",
    "meta analysis", "cochrane review", "umbrella review",
}

_LEVEL_II_TYPES = {
    "randomized controlled trial", "rct",
    "clinical trial, phase iii", "clinical trial, phase iv",
    "multicenter study", "pragmatic clinical trial",
}

_LEVEL_III_TYPES = {
    "controlled clinical trial", "observational study", "cohort study",
    "comparative study", "clinical trial, phase ii", "non-randomized",
    "cross-sectional study", "prospective study", "retrospective study",
}

_LEVEL_IV_TYPES = {
    "case reports", "case report", "case series",
    "clinical trial, phase i",
}

_LEVEL_V_TYPES = {
    "editorial", "comment", "letter", "review", "guideline",
    "practice guideline", "expert opinion", "consensus",
}
```
A few things to note about the mapping:
Phase III/IV trials are Level II. By Phase III, a drug is being tested in large randomized populations. Phase IV covers post-marketing studies; this mapping treats them as RCT-level evidence, although many Phase IV studies are observational in practice.
Phase II trials are Level III. Phase II trials test efficacy in a moderate-sized group but often lack full randomization or use single-arm designs. They're controlled but not at RCT rigor.
Phase I trials are Level IV. Phase I trials test safety in small groups (typically a few dozen participants). They're closer to case series than controlled studies.
"Review" is Level V, not Level I. This is a common confusion. A "review" in PubMed is typically a narrative review, in which the authors summarize a topic without a predefined search protocol. A "systematic review" follows a rigorous methodology with defined search criteria, inclusion/exclusion rules, and often a PRISMA flowchart. They are fundamentally different, and the evidence hierarchy reflects that.
The grade_evidence() Function#
The grading function walks the hierarchy from Level I down to Level V. At each level it accepts either a matching publication type or a matching keyword in the title and abstract:
```python
def grade_evidence(
    publication_types: list[str], title: str = "", abstract: str = ""
) -> EvidenceGrade:
    """Grade evidence level based on publication type and content."""
    pub_types_lower = {pt.lower() for pt in publication_types}
    text_lower = f"{title} {abstract}".lower()

    # Check in order of evidence strength
    if (pub_types_lower & _LEVEL_I_TYPES
            or "meta-analysis" in text_lower
            or "systematic review" in text_lower):
        level = EvidenceLevel.LEVEL_I
    elif (pub_types_lower & _LEVEL_II_TYPES
            or "randomized" in text_lower):
        level = EvidenceLevel.LEVEL_II
    elif (pub_types_lower & _LEVEL_III_TYPES
            or "cohort" in text_lower
            or "observational" in text_lower):
        level = EvidenceLevel.LEVEL_III
    elif (pub_types_lower & _LEVEL_IV_TYPES
            or "case report" in text_lower):
        level = EvidenceLevel.LEVEL_IV
    elif pub_types_lower & _LEVEL_V_TYPES:
        level = EvidenceLevel.LEVEL_V
    else:
        # Default to Level V if unknown
        level = EvidenceLevel.LEVEL_V

    return _GRADE_INFO[level]
```
Why Check in Order of Strength?#
The function checks from Level I down to Level V. This matters because an article can have multiple publication types. A paper might be tagged as both "Meta-Analysis" and "Review." By checking Level I first, we assign it the strongest applicable level. This is the correct behavior — a meta-analysis that's also a review should be graded as Level I, not Level V.
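The precedence rule can be seen in isolation with a small table-driven sketch. The `LEVELS_BY_STRENGTH` list and `strongest_level` function are illustrative names, and the type sets are deliberately trimmed; this is not the project's implementation, just the same strongest-first idea in miniature.

```python
# Ordered strongest-first; a trimmed stand-in for the full type sets.
LEVELS_BY_STRENGTH = [
    ("I", {"meta-analysis", "systematic review"}),
    ("II", {"randomized controlled trial"}),
    ("V", {"review", "editorial"}),
]


def strongest_level(publication_types: list[str]) -> str:
    """Return the first (strongest) level whose type set overlaps the tags."""
    tags = {pt.lower() for pt in publication_types}
    for level, known_types in LEVELS_BY_STRENGTH:
        if tags & known_types:
            return level
    return "V"  # Nothing recognized: fall back to the weakest level.


print(strongest_level(["Review", "Meta-Analysis"]))  # I
print(strongest_level(["Review"]))                   # V
```

Because the loop returns on the first overlap, the order of the list is the whole policy: reversing it would grade the doubly tagged paper as Level V.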
The Keyword Fallback#
Not every PubMed article has accurate publication type metadata. Some articles are clearly meta-analyses — with "meta-analysis" right in the title — but their PublicationType field only says "Journal Article." The keyword fallback catches these:
```python
if "meta-analysis" in text_lower or "systematic review" in text_lower:
    level = EvidenceLevel.LEVEL_I
```
This is a pragmatic concession. In an ideal world, we'd rely solely on structured metadata. In reality, PubMed's publication type tagging is inconsistent, especially for older articles or non-English journals. The keyword fallback provides a safety net that catches the obvious cases.
The same logic applies down the hierarchy: "randomized" in the text suggests Level II, "cohort" or "observational" suggests Level III, "case report" suggests Level IV. These aren't perfect heuristics, but they're far better than defaulting everything to Level V.
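One known weakness of plain substring checks is that `"randomized" in text_lower` is also true for "non-randomized". A possible refinement, sketched below under the assumption that a regular expression is acceptable here, is to require a word boundary and reject a preceding "non". The pattern and the `mentions_randomized` name are illustrative, not taken from the project.

```python
import re

# "randomized" or "randomised" as a whole word, unless preceded by
# "non-" or "non ". The word boundary already rejects "nonrandomized".
_RANDOMIZED = re.compile(r"(?<!non-)(?<!non )\brandomi[sz]ed\b")


def mentions_randomized(text: str) -> bool:
    """True if the text mentions randomization that is not negated by 'non'."""
    return _RANDOMIZED.search(text.lower()) is not None


print(mentions_randomized("A randomized trial of semaglutide"))
print(mentions_randomized("A non-randomized pilot study"))
```

This is still a heuristic. A sentence such as "patients were not randomized" would match, so it narrows the false positives rather than removing them.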
The Level V Default#
If neither publication types nor keywords match any level, the function defaults to Level V. This is deliberate conservatism — if we can't determine the evidence level, we assume the weakest. In medicine, overestimating evidence quality is far more dangerous than underestimating it. A Level I badge on a Level IV study could mislead a clinician. A Level V badge on a Level II study is inconvenient but safe.
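Note that `grade_evidence()` as listed accepts a keyword match at a stronger level even when the metadata points to a weaker one: a paper tagged "Case Reports" whose abstract mentions "randomized" is graded Level II. One way to lean further into the conservative default is a two-pass variant that consults keywords only when no publication type is recognized. The sketch below is an alternative design under that assumption, with trimmed type sets and hypothetical names, not the project's code.

```python
# Trimmed, illustrative tables; ordered strongest-first.
_TYPE_LEVELS = [
    ("I", {"meta-analysis", "systematic review"}),
    ("II", {"randomized controlled trial"}),
    ("III", {"observational study", "cohort study"}),
    ("IV", {"case reports", "case series"}),
    ("V", {"editorial", "review", "letter"}),
]
_KEYWORD_LEVELS = [
    ("I", ("meta-analysis", "systematic review")),
    ("II", ("randomized",)),
    ("III", ("cohort", "observational")),
    ("IV", ("case report",)),
]


def grade_two_pass(publication_types: list[str], title: str = "",
                   abstract: str = "") -> str:
    """Trust structured metadata first; use keywords only as a fallback."""
    tags = {pt.lower() for pt in publication_types}
    for level, known_types in _TYPE_LEVELS:
        if tags & known_types:
            return level
    text = f"{title} {abstract}".lower()
    for level, keywords in _KEYWORD_LEVELS:
        if any(keyword in text for keyword in keywords):
            return level
    return "V"  # Unknown: assume the weakest level.


print(grade_two_pass(["Case Reports"], abstract="unlike randomized trials..."))
print(grade_two_pass(["Journal Article"], title="A meta-analysis of GLP-1 RAs"))
```

The trade-off is that an article tagged only "Review" stays at Level V even if its title says "systematic review", so this variant errs toward underestimating rather than overestimating.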
Medical Credibility Scoring#
Evidence level tells you about the study design. Credibility scoring tells you about the source. A meta-analysis published in The Lancet is more credible than one posted on a preprint server. Both might be Level I evidence, but they have very different levels of peer-review scrutiny.
The credibility scoring system uses a tiered approach:
# High-impact medical journals _TIER1_JOURNALS = { "the lancet", "new england journal of medicine", "nejm", "jama", "bmj", "nature medicine", "nature", "science", "cell", "annals of internal medicine", "circulation", "journal of clinical oncology", "the lancet oncology", "lancet infectious diseases", "plos medicine", } # Government and institutional sources _TIER1_DOMAINS = { "nih.gov", "cdc.gov", "who.int", "fda.gov", "clinicaltrials.gov", "cochranelibrary.com", "ncbi.nlm.nih.gov", "pubmed.ncbi.nlm.nih.gov", } # Preprint servers _PREPRINT_DOMAINS = { "medrxiv.org", "biorxiv.org", "arxiv.org", "preprints.org", }
The Scoring Function#
The score_medical_credibility function returns a score from 0-100 along with a tier, reasoning, and recommendation:
```python
# `tool` and `_result` are defined elsewhere in the project (not shown here).
@tool
def score_medical_credibility(
    url: str, source_type: str, journal: str = "", domain: str = ""
) -> dict:
    """Score the credibility of a medical source."""
    journal_lower = journal.lower() if journal else ""
    domain_lower = domain.lower() if domain else ""

    # PubMed articles: high baseline credibility
    if source_type == "pubmed":
        if any(j in journal_lower for j in _TIER1_JOURNALS):
            return _result(url, 98, "tier1_journal",
                           "High-impact peer-reviewed journal",
                           "Safe to cite")
        return _result(url, 90, "peer_reviewed",
                       "Peer-reviewed article indexed in PubMed",
                       "Safe to cite")

    # Clinical trial registries
    if source_type == "clinicaltrials":
        return _result(url, 92, "trial_registry",
                       "Registered clinical trial on ClinicalTrials.gov",
                       "Safe to cite")

    # FDA data
    if source_type == "openfda":
        return _result(url, 95, "government",
                       "Official FDA adverse event data",
                       "Safe to cite")

    # Preprints: moderate credibility with warning
    if source_type == "preprint" or any(
        d in domain_lower for d in _PREPRINT_DOMAINS
    ):
        return _result(url, 55, "preprint",
                       "Preprint — not yet peer-reviewed",
                       "Cite with caution — not peer-reviewed")

    # Unknown web sources
    return _result(url, 40, "unverified",
                   "Unverified web source",
                   "Use caution — verify independently")
```
The Score Tiers#
| Source | Score | Tier | Recommendation |
|---|---|---|---|
| NEJM, Lancet, JAMA, BMJ | 98 | tier1_journal | Safe to cite |
| FDA data (OpenFDA) | 95 | government | Safe to cite |
| ClinicalTrials.gov | 92 | trial_registry | Safe to cite |
| Any other PubMed article | 90 | peer_reviewed | Safe to cite |
| Academic (.edu) | 80 | academic | Safe to cite |
| Mayo Clinic, UpToDate | 70 | medical_site | Requires context |
| medRxiv, bioRxiv | 55 | preprint | Cite with caution |
| Unknown web source | 40 | unverified | Verify independently |
The gap between 90 (general PubMed) and 98 (tier-1 journal) is intentional. A PubMed-indexed article in the Journal of Obscure Regional Medicine is still peer-reviewed, and that's worth a 90. But an article in the New England Journal of Medicine has passed some of the most rigorous peer review in medicine. The extra 8 points reflect that difference in editorial standards.
The preprint score of 55 is deliberately uncomfortable. Preprints are important — especially for emerging research — but they haven't been peer-reviewed. The score and the "Cite with caution" recommendation ensure that preprints are never presented with the same authority as peer-reviewed work.
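One caveat about the tier-1 check: it uses substring matching (`j in journal_lower`), so short entries such as "science" or "cell" will also match unrelated journal names that merely contain those words. A stricter alternative, sketched here with a trimmed journal set and hypothetical helper names, is to normalize the name and require an exact match.

```python
_TIER1 = {"the lancet", "the lancet oncology", "nature", "science", "bmj"}


def normalize_journal(name: str) -> str:
    """Lowercase, collapse runs of whitespace, and drop a trailing period."""
    return " ".join(name.lower().split()).rstrip(".")


def is_tier1_exact(journal: str) -> bool:
    return normalize_journal(journal) in _TIER1


def is_tier1_substring(journal: str) -> bool:
    # Mirrors the substring style of check for comparison.
    return any(j in journal.lower() for j in _TIER1)


made_up = "Journal of Obscure Regional Science"  # Fictional journal name.
print(is_tier1_substring(made_up), is_tier1_exact(made_up))
```

The cost of exact matching is maintenance: every sub-journal that should count as tier-1 has to be listed explicitly, and abbreviations need their own entries.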
Citation Format: Making Evidence Verifiable#
Every citation in the Clinical Research Agent is a clickable link. This isn't just good UX — it's a fundamental requirement for trust. A medical claim without a verifiable citation is just an assertion.
PMID Citations#
PubMed articles use PMID (PubMed Identifier) links:
PMID: 38547893 → https://pubmed.ncbi.nlm.nih.gov/38547893/
The URL format is stable and predictable. Any valid PMID can be turned into a direct link to the article's PubMed page, which includes the title, authors, abstract, journal, publication date, and links to the full text. This is the gold standard for medical citations — a single identifier that resolves to a complete bibliographic record.
In our frontend, PMID links render as clickable badges:
```jsx
<a
  href={`https://pubmed.ncbi.nlm.nih.gov/${article.pmid}/`}
  target="_blank"
  rel="noopener noreferrer"
  className="inline-flex items-center gap-1 rounded-full border border-border bg-card px-2 py-0.5 text-[10px] font-mono text-primary hover:border-primary/40 transition-colors"
>
  PMID: {article.pmid}
  <ExternalLink className="h-2.5 w-2.5" />
</a>
```
Every PMID badge opens in a new tab. Setting rel="noopener noreferrer" is a security best practice for external links opened with target="_blank". The monospace font signals that this is a technical identifier, not prose.
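On the backend side, the same URL can be built defensively before it ever reaches the frontend. The sketch below assumes PMIDs arrive as strings and rejects anything that is not purely ASCII digits; `pubmed_url` is a hypothetical helper name, not a function from the project.

```python
def pubmed_url(pmid: str) -> str:
    """Build the PubMed link for a PMID, rejecting non-numeric input."""
    pmid_str = pmid.strip()
    if not (pmid_str.isascii() and pmid_str.isdigit()):
        raise ValueError(f"not a valid PMID: {pmid!r}")
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid_str}/"


print(pubmed_url("38547893"))
# https://pubmed.ncbi.nlm.nih.gov/38547893/
```

Validating here means a malformed identifier fails loudly during processing instead of producing a badge that links nowhere.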
NCT ID Citations#
Clinical trials use NCT IDs (National Clinical Trial identifiers):
NCT05553626 → https://clinicaltrials.gov/study/NCT05553626
The NCT ID format follows ClinicalTrials.gov's v2 URL structure. Like PMIDs, these are stable identifiers that resolve to full trial records including the protocol, eligibility criteria, endpoints, enrollment status, and results (if available).
```jsx
<a
  href={`https://clinicaltrials.gov/study/${trial.nct_id}`}
  target="_blank"
  rel="noopener noreferrer"
  className="inline-flex items-center gap-1 rounded-full border border-border bg-card px-2 py-0.5 text-[10px] font-mono text-primary hover:border-primary/40 transition-colors"
>
  {trial.nct_id}
  <ExternalLink className="h-2.5 w-2.5" />
</a>
```
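NCT IDs have a fixed shape ("NCT" followed by eight digits), which makes them easy to detect in free text. As an illustration, the sketch below finds NCT IDs in a passage and maps each to its ClinicalTrials.gov URL. The `find_trial_links` helper is a hypothetical name and is not part of the project.

```python
import re

# "NCT" followed by exactly eight digits, as a standalone token.
_NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_trial_links(text: str) -> dict[str, str]:
    """Map each NCT ID found in the text to its ClinicalTrials.gov URL."""
    return {
        nct_id: f"https://clinicaltrials.gov/study/{nct_id}"
        for nct_id in _NCT_PATTERN.findall(text)
    }


links = find_trial_links("Results were reported for NCT05553626 last year.")
print(links)
```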
Why Clickable Links Matter#
A non-clickable citation is a suggestion. A clickable citation is an invitation to verify. When a doctor reads "Level I meta-analysis (PMID: 38547893)" and can click through to the actual article in 2 seconds, they can confirm the claim themselves. When the PMID is just text, verification requires copying the ID, opening PubMed, and pasting it into the search bar. That small friction is the difference between citations that get checked and citations that get trusted on faith.
In medicine, trust on faith is how people get hurt.
Real Evidence from Our Tests#
When we query "GLP-1 receptor agonists for Type 2 Diabetes with cardiovascular disease," the evidence grader classifies the returned articles automatically. Here's what a real result set looks like:
| Article | Journal | Evidence Level | Credibility |
|---|---|---|---|
| "GLP-1 receptor agonists for type 2 diabetes — systematic review and meta-analysis" | BMJ | Level I | 98 |
| "Cardiovascular outcomes with GLP-1 receptor agonists: a meta-analysis of randomized clinical trials" | Diabetologia | Level I | 90 |
| "Semaglutide and cardiovascular outcomes in patients with type 2 diabetes" | NEJM | Level II | 98 |
| "Cohort study of GLP-1 RA adherence and cardiovascular events" | Diabetes Care | Level III | 90 |
| "Expert consensus on GLP-1 RA use in clinical practice" | The Lancet Diabetes & Endocrinology | Level V | 98 |
The first two articles — both meta-analyses — get emerald Level I badges and high credibility scores. The BMJ article scores 98 because BMJ is a tier-1 journal. The Diabetologia article scores 90 as a standard PubMed article. The NEJM article is Level II (an RCT) but scores 98 for credibility because NEJM is tier-1. The expert consensus is Level V (the weakest evidence level) despite being published in a tier-1 journal — because evidence level and credibility are orthogonal dimensions.
This separation matters. A Level V expert opinion in The Lancet is authoritative but not empirical. A Level I meta-analysis in a regional journal is empirical but may have less editorial scrutiny. The two scores together give a more nuanced picture than either alone.
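Because the two dimensions are independent, they can be combined into a sort key without collapsing them into a single number. The sketch below orders the articles from the table by evidence level first and credibility second. That ordering policy and the dict keys are assumptions for illustration; the post does not specify how the agent ranks results.

```python
_LEVEL_RANK = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}

# Journal, level, and credibility values taken from the table above.
articles = [
    {"journal": "The Lancet Diabetes & Endocrinology", "level": "V", "credibility": 98},
    {"journal": "Diabetologia", "level": "I", "credibility": 90},
    {"journal": "NEJM", "level": "II", "credibility": 98},
    {"journal": "BMJ", "level": "I", "credibility": 98},
    {"journal": "Diabetes Care", "level": "III", "credibility": 90},
]

# Strongest evidence first; within a level, most credible source first.
ranked = sorted(
    articles,
    key=lambda a: (_LEVEL_RANK[a["level"]], -a["credibility"]),
)

for article in ranked:
    print(article["level"], article["credibility"], article["journal"])
```

Under this policy the Level V consensus statement lands last despite its score of 98, which matches the point that source credibility cannot substitute for study design.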
Putting It Together: The Evidence Pipeline#
When an article arrives from PubMed, it flows through both scoring systems:
```text
PubMed article
├── publication_types: ["Meta-Analysis", "Journal Article"]
├── title: "Systematic review of..."
├── journal: "BMJ"
└── pmid: "38547893"
        │
        ├── grade_evidence(publication_types, title)
        │       └── Level I (emerald badge)
        │
        └── score_medical_credibility(url, "pubmed", journal="BMJ")
                └── Score: 98, Tier: tier1_journal
```
The frontend receives both scores and renders them together: the evidence level badge and the credibility bar appear on every article card. A user scanning the evidence panel can instantly see "This is a Level I meta-analysis from a top-tier journal with a credibility score of 98" or "This is a Level IV case report from a general PubMed journal with a credibility score of 90."
The combination eliminates ambiguity. Without evidence grading, 10 articles look the same. Without credibility scoring, a preprint looks the same as a Lancet publication. With both, the user has the information they need to weigh the evidence — which is the entire point of evidence-based medicine.
What's Next#
Evidence grading tells you how strong each piece of evidence is. Credibility scoring tells you how much to trust the source. But the user still needs to see all of this — badges, bars, citations, trial cards, safety alerts — in an interface designed for medical research.
In Blog 5, we build the medical frontend: evidence panel with Level I-V badges, clinical trial cards with phase and status indicators, safety alert banners, and the disclaimer system that makes the whole thing responsible to deploy.
This is post 4 of 5 in the Clinical Research Series. The full series covers architecture, data sources, safety, evidence grading, and frontend deployment for a medical AI research tool.
All code is open source: github.com/MinhQuanBuiSco/clinical-research-agent