Clinical Research Series — Blog 2: PubMed & ClinicalTrials.gov Integration
36 million peer-reviewed articles. 500,000 clinical trials. All searchable via free APIs. No API key required.
PubMed, ClinicalTrials.gov, and OpenFDA are the three pillars of open medical data. PubMed indexes the biomedical journals that matter, from the New England Journal of Medicine to obscure regional journals. ClinicalTrials.gov is the registry for clinical trials run in the US and for many international ones. OpenFDA exposes the adverse event reports submitted to the FDA for marketed drugs. Together, they give you access to the same primary literature that clinicians and researchers use to make decisions.
In this post, I'll show how the Clinical Research Agent integrates all three — with XML parsing for PubMed, JSON extraction for ClinicalTrials.gov, adverse event queries for OpenFDA, and circuit breakers that keep the pipeline running when any individual API goes down.
The Clinical Research Series
| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & Domain Analysis | System design, evidence hierarchy, medical APIs |
| 2 | PubMed & ClinicalTrials.gov Integration (this post) | E-utilities, v2 API, OpenFDA, circuit breakers |
| 3 | Safety & Hallucination Prevention | Uncited claim detection, credibility scoring |
| 4 | Evidence Grading & Citation Quality | Level I-V, PMID/NCT citations, journal tiers |
| 5 | Medical Frontend & Production Readiness | Teal theme, EvidencePanel, SafetyAlert |
PubMed E-utilities: The Two-Step Pipeline
PubMed's API is called E-utilities. It's old-school — XML-based, no JSON option for article data — but it's incredibly reliable and comprehensive. The search works in two steps:
- ESearch — send a query, get back a list of PubMed IDs (PMIDs)
- EFetch — send PMIDs, get back full article metadata as XML
This two-step pattern is deliberate. ESearch is fast and returns just IDs. EFetch is slower but returns everything: title, authors, abstract, journal, publication types, DOIs. By separating them, PubMed lets you do lightweight searches without pulling full article data.
Step 1: ESearch — Finding PMIDs
```python
import httpx
from strands import tool

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


@tool
def pubmed_search(query: str, max_results: int = 10) -> dict:
    """Search PubMed for peer-reviewed biomedical literature.

    Args:
        query: Medical search query (supports MeSH terms).
        max_results: Maximum articles to return (default 10).

    Returns:
        Dict with 'articles' containing title, authors, journal,
        year, abstract, pmid, doi.
    """
```
The @tool decorator comes from the Strands Agents SDK. It exposes this function as a tool the agent can call — the agent sees the docstring, decides when to use it, and passes the arguments. The search URL construction is straightforward:
```python
    from urllib.parse import quote_plus  # URL-encode the user-supplied query

    api_key_param = f"&api_key={settings.pubmed_api_key}" if settings.pubmed_api_key else ""

    # Step 1: ESearch, get PMIDs
    search_url = (
        f"{EUTILS_BASE}/esearch.fcgi?db=pubmed&retmode=json&retmax={max_results}"
        f"&sort=relevance&term={quote_plus(query)}{api_key_param}"
    )
    with httpx.Client(timeout=15) as client:
        search_resp = client.get(search_url)
        search_resp.raise_for_status()
        search_data = search_resp.json()

    pmids = search_data.get("esearchresult", {}).get("idlist", [])
    if not pmids:
        return {"articles": [], "total_count": 0}
    total_count = int(search_data.get("esearchresult", {}).get("count", 0))
```
ESearch supports MeSH terms, PubMed's controlled medical vocabulary. When the agent sends "Type 2 Diabetes AND GLP-1 receptor agonists", PubMed's automatic term mapping translates the untagged terms to matching MeSH headings and expands the search. The sort=relevance parameter selects PubMed's Best Match ranking. We don't control how it weights its signals, so we treat the order as a sensible default rather than as a measure of evidence quality.
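When you already know the MeSH headings you want, you can tag them explicitly instead of relying on term mapping. The following is a minimal sketch, not code from the agent: build_esearch_params is a hypothetical helper, and the two headings are just examples. It builds the query parameters as a dict and lets the standard library do the URL encoding.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_params(mesh_terms, max_results=10, api_key=None):
    """Build ESearch parameters from a list of MeSH headings (hypothetical helper)."""
    # Quote each heading and tag it so PubMed searches the MeSH field explicitly.
    term = " AND ".join(f'"{heading}"[MeSH Terms]' for heading in mesh_terms)
    params = {
        "db": "pubmed",
        "retmode": "json",
        "retmax": max_results,
        "sort": "relevance",
        "term": term,
    }
    if api_key:
        params["api_key"] = api_key
    return params


params = build_esearch_params(["Diabetes Mellitus, Type 2", "Metformin"])
# urlencode escapes spaces, quotes, commas, and brackets in the term.
url = f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

Passing the same dict to httpx as params= would have the same effect, since httpx also encodes query parameters.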
Step 2: EFetch — Extracting Article Metadata
This is where it gets interesting. EFetch returns XML — not JSON — and the XML structure is deeply nested. Every article lives inside a <PubmedArticle> element with <MedlineCitation> and <Article> children:
```python
    # Step 2: EFetch, get article details (XML)
    import xml.etree.ElementTree as ET

    ids_str = ",".join(pmids)
    fetch_url = (
        f"{EUTILS_BASE}/efetch.fcgi?db=pubmed&retmode=xml&id={ids_str}{api_key_param}"
    )
    with httpx.Client(timeout=30) as client:
        fetch_resp = client.get(fetch_url)
        fetch_resp.raise_for_status()

    # Parse XML
    root = ET.fromstring(fetch_resp.text)
```
The timeout is 30 seconds for EFetch versus 15 for ESearch. Fetching full article data for 10 PMIDs can take a while, especially when PubMed is under load.
Extracting Structured Fields
Each article requires careful extraction. Here's how we pull the key fields:
```python
    articles = []
    for article_elem in root.findall(".//PubmedArticle"):
        medline = article_elem.find("MedlineCitation")
        if medline is None:
            continue
        pmid = medline.findtext("PMID", "")
        article = medline.find("Article")
        if article is None:
            continue

        # Title
        title = article.findtext("ArticleTitle", "")

        # Authors (max 5 to save tokens)
        authors = []
        author_list = article.find("AuthorList")
        if author_list is not None:
            for author in author_list.findall("Author")[:5]:
                last = author.findtext("LastName", "")
                first = author.findtext("ForeName", "")
                if last:
                    authors.append(f"{last} {first}".strip())

        # Journal: full title, plus the abbreviation used in the article dict.
        # Assumption: the abbreviation comes from <ISOAbbreviation>.
        journal_elem = article.find("Journal")
        journal = journal_elem.findtext("Title", "") if journal_elem is not None else ""
        journal_abbrev = (
            journal_elem.findtext("ISOAbbreviation", "") if journal_elem is not None else ""
        )

        # Year (with MedlineDate fallback)
        pub_date = article.find(".//PubDate")
        year = pub_date.findtext("Year", "") if pub_date is not None else ""
        if not year:
            medline_date = pub_date.findtext("MedlineDate", "") if pub_date is not None else ""
            year = medline_date[:4] if medline_date else ""
```
The year extraction has a fallback worth explaining. Most PubMed articles have a clean <Year> element. But some records, often older articles or journals with multi-month issues, carry a free-text <MedlineDate> such as "2024 Jan-Feb" instead. We grab the first 4 characters as the year. It's not perfect, because it assumes the string starts with the year, but it covers the common formats.
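If you want the fallback to tolerate a MedlineDate that does not begin with the year, a regular expression is a small step up from slicing. This is a sketch under assumptions: extract_year is a hypothetical helper, and the "Winter 2019" input is an invented example of an awkward format rather than a value taken from a real record.

```python
import re
import xml.etree.ElementTree as ET

# First plausible 4-digit year (1800-2099) anywhere in the string.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def extract_year(pub_date):
    """Return a 4-digit year from a <PubDate> element, or "" if none is found."""
    if pub_date is None:
        return ""
    year = pub_date.findtext("Year", "")
    if year:
        return year
    # Fallback: search the free-text MedlineDate instead of slicing it.
    match = YEAR_RE.search(pub_date.findtext("MedlineDate", ""))
    return match.group(1) if match else ""


clean = ET.fromstring("<PubDate><Year>2023</Year><Month>Mar</Month></PubDate>")
ranged = ET.fromstring("<PubDate><MedlineDate>2024 Jan-Feb</MedlineDate></PubDate>")
season_first = ET.fromstring("<PubDate><MedlineDate>Winter 2019</MedlineDate></PubDate>")

years = [extract_year(e) for e in (clean, ranged, season_first)]
```

Slicing the first four characters of the third input would have produced "Wint", which is the kind of value that quietly breaks sorting by year downstream.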
Abstracts and Publication Types
Abstracts in PubMed are often structured — split into labeled sections like "BACKGROUND:", "METHODS:", "RESULTS:", "CONCLUSIONS:". We concatenate them:
```python
        # Abstract (structured abstracts have one AbstractText per labeled section)
        abstract_elem = article.find("Abstract")
        abstract = ""
        if abstract_elem is not None:
            abstract_parts = abstract_elem.findall("AbstractText")
            abstract = " ".join(
                (part.get("Label", "") + ": " if part.get("Label") else "")
                # itertext() also keeps text nested in inline tags such as <i> or <sub>,
                # which part.text alone would cut off at the first tag.
                + "".join(part.itertext())
                for part in abstract_parts
            )

        # DOI
        doi = ""
        for id_elem in article_elem.findall(".//ArticleId"):
            if id_elem.get("IdType") == "doi":
                doi = id_elem.text or ""
                break

        # Publication type (for evidence grading)
        pub_types = []
        for pt in article.findall(".//PublicationType"):
            if pt.text:
                pub_types.append(pt.text)
```
The publication_types field is critical. It's what the evidence grader uses to classify articles as Level I (meta-analysis), Level II (RCT), Level III (cohort study), and so on. Without this field, we'd have no way to grade evidence strength.
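To make the dependency concrete, here is a deliberately tiny sketch of how publication types could map to levels. It is not the series' grader (that is the subject of Part 4): the LEVEL_BY_PUB_TYPE table and strongest_level are hypothetical, and treating PubMed's "Observational Study" type as a stand-in for cohort designs is an assumption made for illustration.

```python
# Hypothetical mapping for illustration; the real grader may use different rules.
LEVEL_BY_PUB_TYPE = {
    "Meta-Analysis": 1,
    "Randomized Controlled Trial": 2,
    "Observational Study": 3,  # assumption: stands in for cohort designs
}


def strongest_level(pub_types):
    """Return the best (lowest-numbered) level among the types, or None if ungraded."""
    levels = [LEVEL_BY_PUB_TYPE[pt] for pt in pub_types if pt in LEVEL_BY_PUB_TYPE]
    return min(levels) if levels else None


rct_level = strongest_level(["Journal Article", "Randomized Controlled Trial"])
unknown_level = strongest_level(["Journal Article"])
```

An article often carries several publication types at once, so the sketch takes the strongest one and returns None when nothing in the list is recognized.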
The final article dict includes everything the downstream pipeline needs:
```python
        articles.append({
            "pmid": pmid,
            "title": title,
            "authors": authors,
            "journal": journal,
            "journal_abbrev": journal_abbrev,
            "year": year,
            "abstract": abstract[:1000],  # Truncate long abstracts
            "doi": doi,
            "url": f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/",
            "publication_types": pub_types,
        })
```
We truncate abstracts to 1000 characters. A full abstract can be 3000+ characters, and when you're pulling 10 articles per sub-query across 3-4 sub-queries, that adds up fast. The first 1000 characters usually contain the objective, methods, and key results — enough for the synthesis agent to work with.
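A plain slice can stop in the middle of a word or a number. If that matters for your synthesis prompt, a word-boundary cut is a cheap refinement. This is a sketch only: truncate_abstract is a hypothetical helper, and the "..." marker is my own convention, not something the agent emits.

```python
def truncate_abstract(text, limit=1000):
    """Cut text to at most `limit` characters, backing up to the last space."""
    if len(text) <= limit:
        return text
    cut = text[: limit - 3]  # leave room for the "..." marker
    # Prefer ending on a word boundary; fall back to a hard cut if there is no space.
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut).rstrip() + "..."


long_abstract = "word " * 400  # 2000 characters
shortened = truncate_abstract(long_abstract)
```

The result never exceeds the limit, so the token budget reasoning above still holds.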
ClinicalTrials.gov v2 API: Structured Trial Data
ClinicalTrials.gov recently migrated from their legacy XML API to a modern JSON REST API (v2). This is a significant improvement — the v2 API returns clean, structured JSON with predictable field paths.
```python
API_BASE = "https://clinicaltrials.gov/api/v2"


@tool
def clinicaltrials_search(condition: str, intervention: str = "", max_results: int = 10) -> dict:
    """Search ClinicalTrials.gov for clinical trials.

    Args:
        condition: Medical condition (e.g., 'Type 2 Diabetes').
        intervention: Drug or procedure name (optional).
        max_results: Maximum trials to return (default 10).

    Returns:
        Dict with 'trials' containing nct_id, title, status, phase,
        sponsor, conditions, interventions.
    """
```
The tool takes a condition and optional intervention. This maps directly to how clinicians think about trials: "What trials exist for [condition] using [intervention]?" The search combines them with AND logic:
```python
    # Build query
    query_parts = [condition]
    if intervention:
        query_parts.append(intervention)
    query_str = " AND ".join(query_parts)

    params = {
        "query.term": query_str,
        "pageSize": min(max_results, 20),
        "format": "json",
        "fields": "NCTId,BriefTitle,OverallStatus,Phase,LeadSponsorName,Condition,"
                  "InterventionName,StartDate,PrimaryCompletionDate,EnrollmentCount,"
                  "StudyType,BriefSummary",
        "sort": "LastUpdatePostDate:desc",
    }

    with httpx.Client(timeout=15) as client:
        resp = client.get(f"{API_BASE}/studies", params=params)
        resp.raise_for_status()
        data = resp.json()
```
The fields parameter is important. ClinicalTrials.gov returns dozens of fields per study — we request only the ones we need to keep the response size manageable. The sort=LastUpdatePostDate:desc ensures we see the most recently updated trials first, which are more likely to have current results.
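The agent puts both inputs into query.term. The v2 API also accepts dedicated query.cond and query.intr parameters, which restrict each term to condition and intervention fields respectively. The sketch below shows that alternative; build_trial_params is a hypothetical helper, and whether the narrower matching is better for your queries is something to test rather than assume.

```python
def build_trial_params(condition, intervention="", max_results=10):
    """Build /studies parameters using the condition and intervention search areas."""
    params = {
        "query.cond": condition,  # matched against condition fields only
        "pageSize": min(max_results, 20),
        "format": "json",
        "sort": "LastUpdatePostDate:desc",
    }
    if intervention:
        params["query.intr"] = intervention  # matched against intervention fields only
    return params


with_drug = build_trial_params("Type 2 Diabetes", "semaglutide")
condition_only = build_trial_params("Type 2 Diabetes", max_results=50)
```

The trade-off is recall versus precision: query.term also matches the drug name when it only appears in a title or summary, while the dedicated parameters keep a trial that merely mentions the drug from showing up as if it tested it.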
Parsing the Deeply Nested Response
The v2 API response is well-structured but deeply nested. Every piece of data lives inside protocolSection modules:
```python
    trials = []
    for study in data.get("studies", []):
        protocol = study.get("protocolSection", {})
        id_module = protocol.get("identificationModule", {})
        status_module = protocol.get("statusModule", {})
        sponsor_module = protocol.get("sponsorCollaboratorsModule", {})
        design_module = protocol.get("designModule", {})
        conditions_module = protocol.get("conditionsModule", {})
        interventions_module = protocol.get("armsInterventionsModule", {})
        description_module = protocol.get("descriptionModule", {})

        nct_id = id_module.get("nctId", "")

        # Extract interventions
        intervention_names = []
        for iv in interventions_module.get("interventions", []):
            name = iv.get("name", "")
            if name:
                intervention_names.append(name)

        trials.append({
            "nct_id": nct_id,
            "title": id_module.get("briefTitle", ""),
            "status": status_module.get("overallStatus", ""),
            "phase": ", ".join(design_module.get("phases", [])),
            "sponsor": sponsor_module.get("leadSponsor", {}).get("name", ""),
            "conditions": conditions_module.get("conditions", []),
            "interventions": intervention_names,
            "enrollment": design_module.get("enrollmentInfo", {}).get("count", ""),
            "study_type": design_module.get("studyType", ""),
            "summary": description_module.get("briefSummary", "")[:500],
            "url": f"https://clinicaltrials.gov/study/{nct_id}",
        })
```
The NCT ID (like NCT01234567) is the universal identifier for clinical trials. It's the trial equivalent of a PMID — a unique, persistent identifier that anyone can use to look up the full trial record. Our synthesized reports cite trials by NCT ID, just as they cite articles by PMID.
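Because the format is fixed (the letters NCT followed by eight digits), trial citations are easy to check mechanically. The helpers below are a hypothetical sketch, not part of the agent: one validates a single ID, the other pulls the distinct IDs out of a block of report text so they can be compared against the trials that were actually retrieved.

```python
import re

NCT_RE = re.compile(r"\bNCT\d{8}\b")


def is_nct_id(value):
    """True if the whole string is a well-formed NCT ID."""
    return NCT_RE.fullmatch(value) is not None


def find_nct_ids(text):
    """Return the distinct NCT IDs in text, in order of first appearance."""
    return list(dict.fromkeys(NCT_RE.findall(text)))


report = "Results from NCT01234567 were replicated in NCT07654321 (see also NCT01234567)."
cited = find_nct_ids(report)
```

A well-formed ID is not necessarily a real one, so a check like this complements the retrieval step rather than replacing it.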
OpenFDA: Drug Adverse Events
OpenFDA provides access to FDA's adverse event reporting system (FAERS). When a patient or healthcare provider reports a side effect, it goes into FAERS. OpenFDA makes this data searchable via a simple REST API:
```python
API_BASE = "https://api.fda.gov/drug"


@tool
def openfda_drug_search(drug_name: str, max_results: int = 5) -> dict:
    """Search OpenFDA for drug adverse events and safety information.

    Args:
        drug_name: Name of the drug (e.g., 'metformin').
        max_results: Maximum results to return (default 5).

    Returns:
        Dict with 'events' containing drug info, reactions,
        seriousness, and report date.
    """
```
The search query uses OpenFDA's field-specific syntax. We search by patient.drug.medicinalproduct — the drug name as reported:
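```python
    search_query = f'patient.drug.medicinalproduct:"{drug_name}"'
    params = {
        "search": search_query,
        "limit": min(max_results, 10),
    }

    with httpx.Client(timeout=15) as client:
        resp = client.get(f"{API_BASE}/event.json", params=params)
        # OpenFDA answers 404 when a search has no matches. Treat that as an
        # empty result so it does not count as a failure for the circuit breaker.
        if resp.status_code == 404:
            return {"events": []}
        resp.raise_for_status()
        data = resp.json()
```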
Extracting Adverse Event Data
Each adverse event report contains the drugs involved, the reactions experienced, and seriousness indicators:
```python
    events = []
    for result in data.get("results", []):
        # Extract drug info
        drugs = []
        for drug in result.get("patient", {}).get("drug", []):
            drugs.append({
                "name": drug.get("medicinalproduct", ""),
                "indication": drug.get("drugindication", ""),
                "dosage": drug.get("drugdosagetext", ""),
                "route": drug.get("drugadministrationroute", ""),
            })

        # Extract reactions
        reactions = [
            r.get("reactionmeddrapt", "")
            for r in result.get("patient", {}).get("reaction", [])
            if r.get("reactionmeddrapt")
        ]

        events.append({
            "report_date": result.get("receivedate", ""),
            "serious": result.get("serious", ""),
            "seriousness_death": result.get("seriousnessdeath", ""),
            "seriousness_hospitalization": result.get("seriousnesshospitalization", ""),
            "drugs": drugs[:3],
            "reactions": reactions[:10],
            "country": result.get("occurcountry", ""),
        })
```
The seriousness fields are particularly important for safety. OpenFDA categorizes events by outcome — death, hospitalization, life-threatening, disability. When the agent surfaces adverse events, these fields help distinguish between minor side effects and serious safety signals.
We limit drugs to 3 per event and reactions to 10. Adverse event reports can list dozens of concomitant medications and reactions, but for our purposes, we need the signal, not the noise.
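One way to pull signal out of a handful of reports is to tally reaction terms across them. The sketch below assumes events shaped like the dicts built above; top_reactions is a hypothetical helper and the sample data is made up for the example. With only 5 reports per query, the counts are a hint about what to read next, not a frequency estimate.

```python
from collections import Counter


def top_reactions(events, n=5):
    """Count how often each reaction term appears across extracted events."""
    counts = Counter(
        reaction.upper()  # normalize case so identical terms are counted together
        for event in events
        for reaction in event.get("reactions", [])
    )
    return counts.most_common(n)


# Made-up sample in the shape produced by the extraction loop.
sample_events = [
    {"reactions": ["Nausea", "Lactic acidosis"]},
    {"reactions": ["NAUSEA"]},
    {"reactions": ["Diarrhoea", "Nausea"]},
]

most_common = top_reactions(sample_events, n=1)
```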
The Circuit Breaker Pattern
All three medical APIs use the same circuit breaker pattern. This is critical for a multi-source research pipeline — if PubMed goes down, you don't want the entire query to fail. You want the agent to continue with ClinicalTrials.gov and OpenFDA, note that PubMed was unavailable, and still produce a useful (if incomplete) report.
Here's the pattern, shared across all three tools:
```python
import time

_failure_count = 0
_circuit_open_until = 0.0
_FAILURE_THRESHOLD = 3
_RECOVERY_TIMEOUT = 120  # Higher for medical APIs
```
The circuit has three states:
- Closed (normal): requests go through. `_failure_count < 3`.
- Open (tripped): requests are immediately rejected with a `circuit_breaker: True` flag. `_failure_count >= 3` and `time.time() < _circuit_open_until`.
- Half-open (recovery): the timeout has elapsed, so we reset the counter and try again. Because the counter restarts at zero, the circuit reopens only after three more failures.
```python
    # At the top of every tool call:
    global _failure_count, _circuit_open_until  # both are reassigned below
    if _failure_count >= _FAILURE_THRESHOLD:
        if time.time() < _circuit_open_until:
            return {"articles": [], "error": "PubMed temporarily unavailable", "circuit_breaker": True}
        _failure_count = 0  # Half-open: try again

    try:
        ...  # API calls and parsing
    # At the bottom, on failure:
    except Exception as e:
        _failure_count += 1
        if _failure_count >= _FAILURE_THRESHOLD:
            _circuit_open_until = time.time() + _RECOVERY_TIMEOUT
            logger.error("pubmed_circuit_opened", failures=_failure_count)
        return {"articles": [], "error": str(e)}
```
The recovery timeout is 120 seconds — deliberately higher than the 60 seconds you might use for a typical web API. Medical APIs are government-run and sometimes experience sustained slowdowns during peak hours. A 2-minute backoff gives them time to recover without hammering them with retries.
When a circuit opens, the agent sees a response like {"articles": [], "error": "PubMed temporarily unavailable", "circuit_breaker": True}. The agent knows the source is down and can adjust its synthesis accordingly — noting which databases were unavailable rather than presenting incomplete results as complete.
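The same logic can be packaged as a small class so each API gets its own instance instead of its own set of module globals. This is a sketch, not the agent's code: the CircuitBreaker name, the injectable clock, and the record_success reset are my additions. The reset is there so that only consecutive failures trip the circuit.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with the same threshold and timeout as above."""

    def __init__(self, failure_threshold=3, recovery_timeout=120.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests do not have to sleep
        self._failures = 0
        self._open_until = 0.0

    def allow_request(self):
        """Return False while the circuit is open."""
        if self._failures >= self.failure_threshold:
            if self._clock() < self._open_until:
                return False
            self._failures = 0  # half-open: let calls through again
        return True

    def record_success(self):
        self._failures = 0  # a success breaks the run of failures

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open_until = self._clock() + self.recovery_timeout


# Drive the breaker with a fake clock to see the three states.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
open_state = breaker.allow_request()       # circuit is open
now[0] = 121.0
half_open_state = breaker.allow_request()  # timeout elapsed
```

A tool would call allow_request() before the HTTP request and record_success() or record_failure() afterwards, returning the circuit_breaker flag when the first call comes back False.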
Rate Limits: What You Need to Know
Each API has different rate limits, and exceeding them will get your IP temporarily blocked:
| API | Rate Limit | With API Key | Notes |
|---|---|---|---|
| PubMed E-utilities | 3 requests/second | 10 requests/second | Free API key from NCBI |
| ClinicalTrials.gov v2 | No published limit | N/A | No API key needed |
| OpenFDA | 240 requests/minute, 1,000/day | 240 requests/minute, 120,000/day | API key optional; it raises the daily cap |
PubMed is the most restrictive. At 3 requests per second without a key, a single research query with 4 sub-queries (each making 1 ESearch + 1 EFetch = 2 calls) consumes 8 requests — almost 3 seconds of rate limit capacity. With a free NCBI API key (sign up at ncbi.nlm.nih.gov), this drops to under 1 second.
We handle this implicitly through the circuit breaker rather than explicit rate limiting. If PubMed returns a 429 (Too Many Requests), it counts as a failure. Three failures in a row trips the circuit, and the 120-second recovery timeout ensures we don't pile on.
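ClinicalTrials.gov v2 is the most generous: no published rate limit and no API key required. OpenFDA's 240/minute limit is rarely a concern since we only fetch 5 events per drug query, but keyless access is also capped at 1,000 requests per day per IP address, so a busy deployment should register for a free API key.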
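If you would rather stay under the limit than recover from a 429, a client-side throttle is only a few lines. The sketch below is an alternative to the implicit approach, not something the agent ships: Throttle is a hypothetical class that spaces calls by a minimum interval, with the clock and sleep functions injectable so it can be exercised without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)


# Simulate the 8 PubMed calls of one research query at 3 requests/second.
fake_now = [100.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(3, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(8):
    throttle.wait()
total_wait = sum(slept)
```

The first call goes straight through and the other seven each wait one interval, so back-to-back calls are stretched over roughly 7/3 of a second. In the real tool you would call throttle.wait() immediately before each client.get.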
Source-Capturing Tool Wrappers
In the orchestrator, the raw search tools are wrapped with source-capturing versions. These wrappers call the real tool, then enrich each result with evidence grading and credibility scoring before adding it to a shared capture list:
```python
@tool
def capturing_pubmed_search(query: str, max_results: int = 10) -> dict:
    """Search PubMed for peer-reviewed biomedical literature."""
    result = pubmed_search(query=query, max_results=max_results)
    for article in result.get("articles", []):
        pmid = article.get("pmid", "")
        if pmid and not any(a["pmid"] == pmid for a in captured_articles):
            # Grade evidence based on publication types
            evidence = grade_evidence(
                publication_types=article.get("publication_types", []),
                title=article.get("title", ""),
                abstract=article.get("abstract", ""),
            )
            # Score credibility
            cred = score_medical_credibility(
                url=article.get("url", ""),
                source_type="pubmed",
                journal=article.get("journal", ""),
            )
            enriched = {
                **article,
                "evidence_level": evidence.level.value,
                "evidence_label": evidence.label,
                "evidence_description": evidence.description,
                "evidence_color": evidence.color,
                "credibility_score": cred.get("score", 90),
                "credibility_tier": cred.get("tier", "peer_reviewed"),
            }
            captured_articles.append(enriched)
    # The excerpt stops here; the append to captured_sources and the return are not shown.
```
This pattern is worth studying. The wrapper:
- Calls the real tool (which handles the API call and circuit breaker)
- Deduplicates by PMID (`not any(a["pmid"] == pmid for a in captured_articles)`)
- Grades evidence level using publication type metadata
- Scores credibility using journal name and domain
- Enriches the article dict with grading and credibility fields
- Adds to both the typed list (`captured_articles`) and the unified source list (`captured_sources`)
The same pattern applies to capturing_clinicaltrials_search. By the time the orchestrator reaches the synthesis step, it has a fully enriched, deduplicated set of articles and trials — each with evidence level, credibility score, and a URL the user can click to verify.
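Since both wrappers do the same deduplicate-then-enrich dance, the shared part can be factored out. The following is a sketch under assumptions: capture_unique and the stub enrich function are hypothetical, and it tracks seen IDs in a set rather than rescanning the list, which is my substitution for the any(...) check.

```python
def capture_unique(results, captured, seen_ids, id_key, enrich):
    """Append enriched copies of unseen results to `captured`; return how many were added."""
    added = 0
    for item in results:
        item_id = item.get(id_key, "")
        if not item_id or item_id in seen_ids:
            continue  # skip results without an ID and ones captured earlier
        seen_ids.add(item_id)
        captured.append({**item, **enrich(item)})
        added += 1
    return added


def stub_enrich(item):
    """Placeholder for the grading and credibility calls."""
    return {"credibility_tier": "peer_reviewed"}


captured_articles, seen_pmids = [], set()
first = capture_unique(
    [{"pmid": "111"}, {"pmid": "222"}], captured_articles, seen_pmids, "pmid", stub_enrich
)
# A second sub-query returns one repeat and one new article.
second = capture_unique(
    [{"pmid": "222"}, {"pmid": "333"}], captured_articles, seen_pmids, "pmid", stub_enrich
)
```

Passing "nct_id" as the id_key gives the same behavior for trials. With a few dozen sources per query the speed difference is negligible, so the main gain is having one place to change the capture logic.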
What's Next
We now have three medical data sources integrated with circuit breakers, structured extraction, and source enrichment. But raw data isn't safe data. In Blog 3: Safety & Hallucination Prevention, we'll build the guardrails that catch uncited medical claims, flag dose recommendations, score source credibility, and ensure every response includes a medical disclaimer — because in healthcare, what the agent doesn't say matters as much as what it does.
All code is open source: github.com/MinhQuanBuiSco/clinical-research-agent