
Clinical Research Series — Blog 4: Evidence Grading & Citation Quality

Not all evidence is equal. A meta-analysis of 10,000 patients outweighs a single case report. Build an evidence grading system (Level I-V) with PMID citation verification and journal credibility scoring.


Ask a doctor about a treatment, and they won't just tell you "studies show it works." They'll tell you what kind of studies. A systematic review of 15 randomized controlled trials is fundamentally different from a single case report. Both are published in PubMed. Both have PMIDs. But one represents the highest level of medical evidence, and the other is an anecdote with a DOI.

Our Clinical Research Agent needed to understand this distinction. When it returns 10 articles about GLP-1 receptor agonists, the user needs to know which ones are meta-analyses (trust these), which are RCTs (strong evidence), and which are case reports (interesting but not conclusive). Without evidence grading, a list of PubMed articles is just a list. With it, it becomes a structured evidence synthesis.


The Clinical Research Series

| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & Domain Analysis | System design, evidence hierarchy, medical APIs |
| 2 | PubMed & ClinicalTrials.gov Integration | E-utilities, v2 API, OpenFDA, circuit breakers |
| 3 | Safety & Hallucination Prevention | Uncited claim detection, credibility scoring |
| 4 | Evidence Grading & Citation Quality (this post) | Level I-V, PMID/NCT citations, journal tiers |
| 5 | Medical Frontend & Production Readiness | Teal theme, EvidencePanel, SafetyAlert |

The Evidence Hierarchy

Evidence-Based Medicine (EBM) organizes research into a hierarchy based on study design. This isn't opinion: it's a well-established framework that is a standard part of medical training. The idea is simple: study designs that reduce bias produce more reliable results.

Here's the hierarchy, from strongest to weakest:

| Level | Study Design | Why It's Strong (or Weak) | Badge Color |
|---|---|---|---|
| Level I | Systematic reviews, meta-analyses | Aggregates multiple studies, reduces individual study bias | Emerald |
| Level II | Randomized controlled trials (RCTs) | Random assignment eliminates selection bias | Blue |
| Level III | Non-randomized controlled studies | Controlled but susceptible to selection bias | Amber |
| Level IV | Case series, case reports | Descriptive only, no control group | Orange |
| Level V | Expert opinion, editorials | No empirical data, authority-based | Gray |

A Level I meta-analysis that pools 10 RCTs with 10,000 total patients tells you something very different from a Level IV case report about 3 patients. Both might support the same conclusion, but the meta-analysis provides dramatically stronger evidence. The badge colors in our frontend make this hierarchy instantly visible — emerald for the best evidence, gray for the weakest.


The EvidenceLevel Enum

The grading system starts with a Python enum and a dataclass that pairs each level with its display metadata:

from enum import Enum
from dataclasses import dataclass

class EvidenceLevel(Enum):
    LEVEL_I = "I"      # Systematic reviews, meta-analyses
    LEVEL_II = "II"    # Randomized controlled trials (RCTs)
    LEVEL_III = "III"  # Non-randomized controlled studies
    LEVEL_IV = "IV"    # Case series, case reports
    LEVEL_V = "V"      # Expert opinion, editorials, reviews

@dataclass
class EvidenceGrade:
    level: EvidenceLevel
    label: str
    description: str
    color: str  # For frontend badge

The color field is intentional. Every EvidenceGrade carries its own badge color because the frontend needs to render evidence levels as colored badges. By embedding the color in the data model rather than mapping it in the frontend, we keep the single source of truth in the backend. The frontend just reads article.evidence_level and renders the corresponding badge — no translation layer needed.

The grade info dictionary maps each level to its full display data:

_GRADE_INFO = {
    EvidenceLevel.LEVEL_I: EvidenceGrade(
        level=EvidenceLevel.LEVEL_I,
        label="Level I",
        description="Systematic review or meta-analysis",
        color="emerald",
    ),
    EvidenceLevel.LEVEL_II: EvidenceGrade(
        level=EvidenceLevel.LEVEL_II,
        label="Level II",
        description="Randomized controlled trial",
        color="blue",
    ),
    EvidenceLevel.LEVEL_III: EvidenceGrade(
        level=EvidenceLevel.LEVEL_III,
        label="Level III",
        description="Non-randomized controlled study",
        color="amber",
    ),
    EvidenceLevel.LEVEL_IV: EvidenceGrade(
        level=EvidenceLevel.LEVEL_IV,
        label="Level IV",
        description="Case series or case report",
        color="orange",
    ),
    EvidenceLevel.LEVEL_V: EvidenceGrade(
        level=EvidenceLevel.LEVEL_V,
        label="Level V",
        description="Expert opinion or editorial",
        color="gray",
    ),
}
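
Because the badge color lives in the data model, the backend can hand the frontend a flat, JSON-safe object. Below is a minimal sketch of that serialization step. It assumes a hypothetical `grade_to_payload` helper (not part of the original code) and uses a trimmed copy of the enum to stay self-contained.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class EvidenceLevel(Enum):
    # Trimmed copy of the enum above, just enough for the example.
    LEVEL_I = "I"
    LEVEL_V = "V"


@dataclass
class EvidenceGrade:
    level: EvidenceLevel
    label: str
    description: str
    color: str  # For frontend badge


def grade_to_payload(grade: EvidenceGrade) -> dict:
    """Hypothetical helper: flatten a grade into JSON-safe primitives."""
    payload = asdict(grade)
    # asdict() keeps the Enum member as-is, and json.dumps() cannot
    # serialize it, so replace it with the plain string value.
    payload["level"] = grade.level.value
    return payload


grade = EvidenceGrade(
    level=EvidenceLevel.LEVEL_I,
    label="Level I",
    description="Systematic review or meta-analysis",
    color="emerald",
)
print(json.dumps(grade_to_payload(grade)))
```

The frontend can then read `level` and `color` straight from the payload without keeping its own level-to-color table.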

Publication Type Mapping

PubMed articles come with a PublicationType field in their XML metadata. This is the primary signal for evidence grading. A publication type of "Meta-Analysis" maps directly to Level I. A publication type of "Randomized Controlled Trial" maps to Level II.

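The mapping uses Python sets, which give O(1) membership tests and let the grader match all of an article's types against a level with a single intersection: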

_LEVEL_I_TYPES = {
    "meta-analysis", "systematic review", "systematic reviews",
    "meta analysis", "cochrane review", "umbrella review",
}
_LEVEL_II_TYPES = {
    "randomized controlled trial", "rct", "clinical trial, phase iii",
    "clinical trial, phase iv", "multicenter study", "pragmatic clinical trial",
}
_LEVEL_III_TYPES = {
    "controlled clinical trial", "observational study", "cohort study",
    "comparative study", "clinical trial, phase ii", "non-randomized",
    "cross-sectional study", "prospective study", "retrospective study",
}
_LEVEL_IV_TYPES = {
    "case reports", "case report", "case series", "clinical trial, phase i",
}
_LEVEL_V_TYPES = {
    "editorial", "comment", "letter", "review", "guideline",
    "practice guideline", "expert opinion", "consensus",
}

A few things to note about the mapping:

Phase III/IV trials are Level II. By Phase III, a drug is typically being tested in large randomized populations. Phase IV covers post-marketing studies; many of these are observational rather than randomized, so grading them as RCT-level evidence is a simplification in this mapping.

Phase II trials are Level III. Phase II trials test efficacy in a moderate-sized group but often lack full randomization or use single-arm designs. They're controlled but not at RCT rigor.

Phase I trials are Level IV. Phase I trials test safety in small groups (typically a few dozen participants). They're closer to case series than controlled studies.

"Review" is Level V, not Level I. This is a common confusion. A plain "Review" tag in PubMed usually indicates a narrative review: authors summarizing a topic without a predefined search protocol. A "systematic review" follows a rigorous methodology with defined search criteria, inclusion/exclusion rules, and often a PRISMA flowchart. They are fundamentally different, and the evidence hierarchy reflects that.

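Since the five sets are meant to be mutually exclusive, a publication type that accidentally lands in two of them would silently be graded by whichever level is checked first. A small sanity check can catch that early. The sketch below uses abbreviated copies of the sets and a hypothetical `build_type_index` helper; neither the helper nor the use of level labels as dictionary keys comes from the original code.

```python
# Abbreviated copies of the mapping sets, keyed by level label.
TYPES_BY_LEVEL = {
    "I": {"meta-analysis", "systematic review"},
    "II": {"randomized controlled trial", "clinical trial, phase iii"},
    "III": {"controlled clinical trial", "clinical trial, phase ii"},
    "IV": {"case reports", "clinical trial, phase i"},
    "V": {"editorial", "review"},
}


def build_type_index(types_by_level: dict) -> dict:
    """Invert the mapping to {publication_type: level}, rejecting duplicates."""
    index = {}
    for level, pub_types in types_by_level.items():
        for pub_type in pub_types:
            if pub_type in index:
                raise ValueError(
                    f"{pub_type!r} is mapped to both "
                    f"{index[pub_type]} and {level}"
                )
            index[pub_type] = level
    return index


TYPE_INDEX = build_type_index(TYPES_BY_LEVEL)
print(TYPE_INDEX["review"])             # V
print(TYPE_INDEX["systematic review"])  # I
```

Running a check like this once at import time (or in a unit test) keeps the "Review is Level V, Systematic Review is Level I" distinction from being broken by a later edit to the sets.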

The grade_evidence() Function

The grading function walks the levels from strongest to weakest. At each level it accepts either a matching publication type or a keyword match in the title and abstract:

def grade_evidence(
    publication_types: list[str],
    title: str = "",
    abstract: str = ""
) -> EvidenceGrade:
    """Grade evidence level based on publication type and content."""
    pub_types_lower = {pt.lower() for pt in publication_types}
    text_lower = f"{title} {abstract}".lower()

    # Check in order of evidence strength
    if (pub_types_lower & _LEVEL_I_TYPES
        or "meta-analysis" in text_lower
        or "systematic review" in text_lower):
        level = EvidenceLevel.LEVEL_I
    elif (pub_types_lower & _LEVEL_II_TYPES
          or "randomized" in text_lower):
        level = EvidenceLevel.LEVEL_II
    elif (pub_types_lower & _LEVEL_III_TYPES
          or "cohort" in text_lower
          or "observational" in text_lower):
        level = EvidenceLevel.LEVEL_III
    elif (pub_types_lower & _LEVEL_IV_TYPES
          or "case report" in text_lower):
        level = EvidenceLevel.LEVEL_IV
    elif pub_types_lower & _LEVEL_V_TYPES:
        level = EvidenceLevel.LEVEL_V
    else:
        # Default to Level V if unknown
        level = EvidenceLevel.LEVEL_V

    return _GRADE_INFO[level]

Why Check in Order of Strength?

The function checks from Level I down to Level V. This matters because an article can have multiple publication types. A paper might be tagged as both "Meta-Analysis" and "Review." By checking Level I first, we assign it the strongest applicable level. This is the correct behavior — a meta-analysis that's also a review should be graded as Level I, not Level V.

The Keyword Fallback

Not every PubMed article has accurate publication type metadata. Some articles are clearly meta-analyses — with "meta-analysis" right in the title — but their PublicationType field only says "Journal Article." The keyword fallback catches these:

if "meta-analysis" in text_lower or "systematic review" in text_lower:
    level = EvidenceLevel.LEVEL_I

This is a pragmatic concession. In an ideal world, we'd rely solely on structured metadata. In reality, PubMed's publication type tagging is inconsistent, especially for older articles or non-English journals. The keyword fallback provides a safety net that catches the obvious cases.

The same logic applies down the hierarchy: "randomized" in the text suggests Level II, "cohort" or "observational" suggests Level III, "case report" suggests Level IV. These aren't perfect heuristics, but they're far better than defaulting everything to Level V.
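
One caveat with plain substring checks: `"randomized" in text_lower` is also true for "non-randomized", and it misses the British spelling "randomised". If that matters for your corpus, a regex-based fallback is one option. The patterns and the `keyword_level` name below are assumptions for illustration, not the project's code.

```python
import re

# Assumed patterns: word boundaries avoid partial-word hits, and the
# negative lookbehinds skip "non-randomized" / "non randomized".
_KEYWORD_PATTERNS = [
    ("I", re.compile(r"\b(?:meta[- ]?analys[ie]s|systematic reviews?)\b")),
    ("II", re.compile(r"(?<!non-)(?<!non )\brandomi[sz]ed\b")),
    ("III", re.compile(r"\b(?:cohort|observational)\b")),
    ("IV", re.compile(r"\bcase (?:reports?|series)\b")),
]


def keyword_level(title: str, abstract: str = ""):
    """Return the strongest level whose pattern matches, or None."""
    text = f"{title} {abstract}".lower()
    for level, pattern in _KEYWORD_PATTERNS:
        if pattern.search(text):
            return level
    return None


print(keyword_level("A non-randomized pilot study of liraglutide"))
print(keyword_level("Semaglutide in type 2 diabetes: a randomised controlled trial"))
```

The first call finds no keyword, while the second is recognized as Level II. Abstracts often mention earlier meta-analyses in passing, so restricting the Level I pattern to the title alone would be a stricter choice.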

The Level V Default

If neither publication types nor keywords match any level, the function defaults to Level V. This is deliberate conservatism — if we can't determine the evidence level, we assume the weakest. In medicine, overestimating evidence quality is far more dangerous than underestimating it. A Level I badge on a Level IV study could mislead a clinician. A Level V badge on a Level II study is inconvenient but safe.
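
One consequence of the `or` conditions in `grade_evidence()` is that a keyword hit at a higher level wins over a publication type at a lower level: an article tagged "Editorial" whose abstract mentions a meta-analysis is graded Level I. If that is too generous for your use case, a stricter two-pass variant consults keywords only when the structured metadata is uninformative. This is a sketch of an alternative, not the project's implementation; the abbreviated sets, the title-only keyword list, and the name `grade_two_pass` are assumptions.

```python
# Abbreviated type sets, ordered from strongest to weakest level.
_TYPES_BY_STRENGTH = [
    ("I", {"meta-analysis", "systematic review"}),
    ("II", {"randomized controlled trial"}),
    ("III", {"observational study", "comparative study"}),
    ("IV", {"case reports"}),
    ("V", {"editorial", "comment", "letter", "review"}),
]
# Title keywords, used only as a fallback.
_TITLE_KEYWORDS = [
    ("I", ("meta-analysis", "systematic review")),
    ("II", ("randomized controlled trial",)),
    ("III", ("cohort", "observational")),
    ("IV", ("case report",)),
]


def grade_two_pass(publication_types: list[str], title: str = "") -> str:
    """Return a level label, trusting metadata before title keywords."""
    normalized = {pt.strip().lower() for pt in publication_types}

    # Pass 1: structured metadata, strongest level first.
    for level, known_types in _TYPES_BY_STRENGTH:
        if normalized & known_types:
            return level

    # Pass 2: only reached when no publication type was recognized,
    # e.g. the article is tagged just "Journal Article".
    title_lower = title.lower()
    for level, keywords in _TITLE_KEYWORDS:
        if any(keyword in title_lower for keyword in keywords):
            return level

    # Nothing matched: assume the weakest level.
    return "V"


print(grade_two_pass(["Editorial"], "Why the latest meta-analysis is wrong"))
print(grade_two_pass(["Journal Article"], "GLP-1 agonists: a meta-analysis"))
```

Here the editorial stays at Level V because its publication type is recognized, while the untagged article is still rescued by the title keyword.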


Medical Credibility Scoring

Evidence level tells you about the study design. Credibility scoring tells you about the source. A meta-analysis published in The Lancet is more credible than one posted on a preprint server. Both might be Level I evidence, but they have very different levels of peer-review scrutiny.

The credibility scoring system uses a tiered approach:

# High-impact medical journals
_TIER1_JOURNALS = {
    "the lancet", "new england journal of medicine", "nejm", "jama",
    "bmj", "nature medicine", "nature", "science", "cell",
    "annals of internal medicine", "circulation",
    "journal of clinical oncology", "the lancet oncology",
    "lancet infectious diseases", "plos medicine",
}

# Government and institutional sources
_TIER1_DOMAINS = {
    "nih.gov", "cdc.gov", "who.int", "fda.gov", "clinicaltrials.gov",
    "cochranelibrary.com", "ncbi.nlm.nih.gov", "pubmed.ncbi.nlm.nih.gov",
}

# Preprint servers
_PREPRINT_DOMAINS = {
    "medrxiv.org", "biorxiv.org", "arxiv.org", "preprints.org",
}

The Scoring Function

The score_medical_credibility function returns a score from 0-100 along with a tier, reasoning, and recommendation:

@tool
def score_medical_credibility(
    url: str, source_type: str,
    journal: str = "", domain: str = ""
) -> dict:
    """Score the credibility of a medical source."""
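    journal_lower = journal.lower() if journal else ""
    # Normalize the domain too; the preprint check below relies on it.
    domain_lower = domain.lower() if domain else ""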

    # PubMed articles — high baseline credibility
    if source_type == "pubmed":
        if any(j in journal_lower for j in _TIER1_JOURNALS):
            return _result(url, 98, "tier1_journal",
                "High-impact peer-reviewed journal", "Safe to cite")
        return _result(url, 90, "peer_reviewed",
            "Peer-reviewed article indexed in PubMed", "Safe to cite")

    # Clinical trial registries
    if source_type == "clinicaltrials":
        return _result(url, 92, "trial_registry",
            "Registered clinical trial on ClinicalTrials.gov",
            "Safe to cite")

    # FDA data
    if source_type == "openfda":
        return _result(url, 95, "government",
            "Official FDA adverse event data", "Safe to cite")

    # Preprints — moderate credibility with warning
    if source_type == "preprint" or any(
        d in domain_lower for d in _PREPRINT_DOMAINS
    ):
        return _result(url, 55, "preprint",
            "Preprint — not yet peer-reviewed",
            "Cite with caution — not peer-reviewed")

    # Unknown web sources
    return _result(url, 40, "unverified",
        "Unverified web source",
        "Use caution — verify independently")

The Score Tiers

| Source | Score | Tier | Recommendation |
|---|---|---|---|
| NEJM, Lancet, JAMA, BMJ | 98 | tier1_journal | Safe to cite |
| Any PubMed article | 90 | peer_reviewed | Safe to cite |
| FDA data (OpenFDA) | 95 | government | Safe to cite |
| ClinicalTrials.gov | 92 | trial_registry | Safe to cite |
| Academic (.edu) | 80 | academic | Safe to cite |
| Mayo Clinic, UpToDate | 70 | medical_site | Requires context |
| medRxiv, bioRxiv | 55 | preprint | Cite with caution |
| Unknown web source | 40 | unverified | Verify independently |

The gap between 90 (general PubMed) and 98 (tier-1 journal) is intentional. A PubMed-indexed article in the Journal of Obscure Regional Medicine is still peer-reviewed, and that's worth a 90. But an article in the New England Journal of Medicine has survived some of the most selective peer review in medicine. That extra 8 points reflects the difference in editorial standards.

The preprint score of 55 is deliberately uncomfortable. Preprints are important — especially for emerging research — but they haven't been peer-reviewed. The score and the "Cite with caution" recommendation ensure that preprints are never presented with the same authority as peer-reviewed work.


Citation Format: Making Evidence Verifiable

Every citation in the Clinical Research Agent is a clickable link. This isn't just good UX — it's a fundamental requirement for trust. A medical claim without a verifiable citation is just an assertion.

PMID Citations

PubMed articles use PMID (PubMed Identifier) links:

PMID: 38547893 → https://pubmed.ncbi.nlm.nih.gov/38547893/

The URL format is stable and predictable. Any valid PMID can be turned into a direct link to the article's PubMed page, which includes the title, authors, abstract, journal, publication date, and links to the full text. This is the gold standard for medical citations — a single identifier that resolves to a complete bibliographic record.
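
On the backend, a cheap format check before building the link keeps malformed identifiers out of the UI. This is a sketch with a hypothetical `pmid_to_url` helper; it assumes a PMID is a positive integer without leading zeros, and it only validates the shape of the ID, not that the article exists.

```python
import re

# Assumption: a PMID is a positive integer written without leading zeros.
_PMID_PATTERN = re.compile(r"[1-9]\d*")


def pmid_to_url(pmid) -> str:
    """Build the PubMed URL for a PMID given as str or int."""
    text = str(pmid).strip()
    if not _PMID_PATTERN.fullmatch(text):
        raise ValueError(f"not a valid PMID: {pmid!r}")
    return f"https://pubmed.ncbi.nlm.nih.gov/{text}/"


print(pmid_to_url("38547893"))
```

Confirming that a PMID resolves to a real record would need a lookup against PubMed itself, which is outside the scope of this sketch.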

In our frontend, PMID links render as clickable badges:

<a
  href={`https://pubmed.ncbi.nlm.nih.gov/${article.pmid}/`}
  target="_blank"
  rel="noopener noreferrer"
  className="inline-flex items-center gap-1 rounded-full
    border border-border bg-card px-2 py-0.5
    text-[10px] font-mono text-primary
    hover:border-primary/40 transition-colors"
>
  PMID: {article.pmid}
  <ExternalLink className="h-2.5 w-2.5" />
</a>

Every PMID badge opens in a new tab. Setting rel="noopener noreferrer" is a security best practice for external links opened with target="_blank". The monospace font signals that this is a technical identifier, not prose.

NCT ID Citations

Clinical trials use NCT IDs (National Clinical Trial identifiers):

NCT05553626 → https://clinicaltrials.gov/study/NCT05553626

The URL follows the /study/ path used by the current ClinicalTrials.gov site. Like PMIDs, NCT IDs are stable identifiers that resolve to full trial records, including the protocol, eligibility criteria, endpoints, enrollment status, and results (if available). The badge markup mirrors the PMID version:

<a
  href={`https://clinicaltrials.gov/study/${trial.nct_id}`}
  target="_blank"
  rel="noopener noreferrer"
  className="inline-flex items-center gap-1 rounded-full
    border border-border bg-card px-2 py-0.5
    text-[10px] font-mono text-primary
    hover:border-primary/40 transition-colors"
>
  {trial.nct_id}
  <ExternalLink className="h-2.5 w-2.5" />
</a>

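If citations arrive embedded in generated prose rather than as structured fields, they first have to be found before they can be linked. The sketch below scans free text for both identifier styles and maps each to its URL. The function name and the exact patterns are assumptions; the NCT pattern relies on the "NCT" plus eight digits format.

```python
import re

# Assumed patterns for identifiers written inline in prose.
_PMID_IN_TEXT = re.compile(r"\bPMID:?\s*(\d{1,9})\b")
_NCT_IN_TEXT = re.compile(r"\bNCT\d{8}\b")


def extract_citation_links(text: str) -> dict:
    """Map each citation label found in the text to its canonical URL."""
    links = {}
    for match in _PMID_IN_TEXT.finditer(text):
        pmid = match.group(1)
        links[f"PMID: {pmid}"] = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    for match in _NCT_IN_TEXT.finditer(text):
        nct_id = match.group(0)
        links[nct_id] = f"https://clinicaltrials.gov/study/{nct_id}"
    return links


sample = "Level I meta-analysis (PMID: 38547893); trial NCT05553626 is recruiting."
for label, url in extract_citation_links(sample).items():
    print(label, "->", url)
```
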
A non-clickable citation is a suggestion. A clickable citation is an invitation to verify. When a doctor reads "Level I meta-analysis (PMID: 38547893)" and can click through to the actual article in 2 seconds, they can confirm the claim themselves. When the PMID is just text, verification requires copying the ID, opening PubMed, and pasting it into the search bar. That small friction is the difference between citations that get checked and citations that get trusted on faith.

In medicine, trust on faith is how people get hurt.


Real Evidence from Our Tests

When we query "GLP-1 receptor agonists for Type 2 Diabetes with cardiovascular disease," the evidence grader classifies the returned articles automatically. Here's what a real result set looks like:

| Article | Journal | Evidence Level | Credibility |
|---|---|---|---|
| "GLP-1 receptor agonists for type 2 diabetes — systematic review and meta-analysis" | BMJ | Level I | 98 |
| "Cardiovascular outcomes with GLP-1 receptor agonists: a meta-analysis of randomized clinical trials" | Diabetologia | Level I | 90 |
| "Semaglutide and cardiovascular outcomes in patients with type 2 diabetes" | NEJM | Level II | 98 |
| "Cohort study of GLP-1 RA adherence and cardiovascular events" | Diabetes Care | Level III | 90 |
| "Expert consensus on GLP-1 RA use in clinical practice" | Lancet Diabetes & Endocrinology | Level V | 98 |

The first two articles — both meta-analyses — get emerald Level I badges and high credibility scores. The BMJ article scores 98 because BMJ is a tier-1 journal. The Diabetologia article scores 90 as a standard PubMed article. The NEJM article is Level II (an RCT) but scores 98 for credibility because NEJM is tier-1. The expert consensus is Level V (the weakest evidence level) despite being published in a tier-1 journal — because evidence level and credibility are orthogonal dimensions.

This separation matters. A Level V expert opinion in The Lancet is authoritative but not empirical. A Level I meta-analysis in a regional journal is empirical but may have less editorial scrutiny. The two scores together give a more nuanced picture than either alone.
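
Keeping the two dimensions separate also makes ordering explicit. One plausible way to rank a result set is by evidence level first and credibility second. That ordering is an assumption, not something this post specifies; the rows below are taken from the table above.

```python
# Lower rank means stronger evidence.
LEVEL_RANK = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}

# Rows from the result table, deliberately out of order.
articles = [
    {"journal": "Lancet Diabetes & Endocrinology", "level": "V", "credibility": 98},
    {"journal": "Diabetologia", "level": "I", "credibility": 90},
    {"journal": "NEJM", "level": "II", "credibility": 98},
    {"journal": "BMJ", "level": "I", "credibility": 98},
    {"journal": "Diabetes Care", "level": "III", "credibility": 90},
]


def rank_articles(items: list) -> list:
    """Sort by evidence level (strongest first), then credibility (highest first)."""
    return sorted(
        items,
        key=lambda a: (LEVEL_RANK[a["level"]], -a["credibility"]),
    )


for article in rank_articles(articles):
    print(article["level"], article["credibility"], article["journal"])
```

With this key, the Level V consensus statement sorts last despite its score of 98, and the two meta-analyses are ordered by credibility within Level I.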


Putting It Together: The Evidence Pipeline

When an article arrives from PubMed, it flows through both scoring systems:

PubMed article
  ├── publication_types: ["Meta-Analysis", "Journal Article"]
  ├── title: "Systematic review of..."
  ├── journal: "BMJ"
  └── pmid: "38547893"
          │
          ├── grade_evidence(publication_types, title)
          │   └── Level I (emerald badge)
          │
          └── score_medical_credibility(url, "pubmed", journal="BMJ")
              └── Score: 98, Tier: tier1_journal

The frontend receives both scores and renders them together: the evidence level badge and the credibility bar appear on every article card. A user scanning the evidence panel can instantly see "This is a Level I meta-analysis from a top-tier journal with a credibility score of 98" or "This is a Level IV case report from a general PubMed journal with a credibility score of 90."

The combination eliminates ambiguity. Without evidence grading, 10 articles look the same. Without credibility scoring, a preprint looks the same as a Lancet publication. With both, the user has the information they need to weigh the evidence — which is the entire point of evidence-based medicine.


What's Next

Evidence grading tells you how strong each piece of evidence is. Credibility scoring tells you how much to trust the source. But the user still needs to see all of this — badges, bars, citations, trial cards, safety alerts — in an interface designed for medical research.

In Blog 5, we build the medical frontend: evidence panel with Level I-V badges, clinical trial cards with phase and status indicators, safety alert banners, and the disclaimer system that makes the whole thing responsible to deploy.


This is post 4 of 5 in the Clinical Research Series. The full series covers architecture, data sources, safety, evidence grading, and frontend deployment for a medical AI research tool.

All code is open source: github.com/MinhQuanBuiSco/clinical-research-agent