AI Engineering · 18 min

Deep Research Agent Series — Blog 5: Cloud-Native Infrastructure on AWS

Deploy a production AI research platform with 9 AWS CDK stacks — VPC with NAT Gateway, ECS Fargate behind ALB, DynamoDB single-table design, S3 for reports, SQS with DLQ for async jobs, and CloudFront multi-origin CDN.


Code without infrastructure is a demo. You can build the most elegant multi-agent pipeline in the world, but if it only runs on localhost:8000, it's a science project. In this post, we deploy the Deep Research Agent to AWS with 9 CDK stacks — from VPC networking to CloudFront CDN. Every piece is Infrastructure as Code, environment-aware, and production-ready.


Series Navigation

| Part | Topic | Status |
| --- | --- | --- |
| 1 | Architecture & Vision | Published |
| 2 | Multi-Agent Orchestration | Published |
| 3 | Smart Search & Source Intelligence | Published |
| 4 | Real-time Streaming & WebSocket | Published |
| 5 | Cloud-Native Infrastructure on AWS | You are here |
| 6 | Security & Production Hardening | Published |

The 9-Stack Architecture

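The infrastructure is split into 9 CDK stacks. Each stack owns a single concern and exports values that downstream stacks consume. This means you can deploy or update any layer without touching the rest. Teardown has to follow the dependency order in reverse, because CloudFormation will not delete a stack while another stack still imports its exports.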

| Stack | Purpose | Key Resources |
| --- | --- | --- |
| Network | VPC & connectivity | VPC, 2 AZs, NAT Gateway, VPC Endpoints |
| Data | Persistence layer | DynamoDB table, S3 bucket |
| Auth | User management | Cognito User Pool, Google OAuth |
| Compute | Application runtime | ECS Fargate, ALB, auto-scaling |
| WAF | API protection | Rate limiting, OWASP rules, bot control |
| CDN | Content delivery | CloudFront, S3 origin, ALB origin |
| Guardrails | AI safety | Bedrock Guardrails (content, PII, prompt injection) |
| Observability | Monitoring | CloudWatch dashboard, alarms |
| CICD | Deployment | GitHub OIDC federation |

Environment configuration lives in config/dev.py and config/prod.py. A single config object flows through every stack, controlling instance sizes, scaling limits, domain names, and feature flags. No hardcoded values, no environment drift.
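As a rough illustration of what such a config object could look like, here is a minimal sketch. The class name `EnvConfig`, its fields, and `load_config` are hypothetical and not taken from the repository; the task limits and task size come from the values quoted later in this post, and the production NAT Gateway count assumes one gateway in each of two AZs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical per-environment settings passed to every stack."""

    env_name: str
    min_tasks: int               # lower bound for ECS auto-scaling
    max_tasks: int               # upper bound for ECS auto-scaling
    nat_gateways: int            # 1 for dev; one per AZ for production
    task_cpu: int = 1024         # 1 vCPU, expressed in ECS CPU units
    task_memory_mib: int = 2048  # 2 GB


# Values mirror the dev/prod numbers quoted in this post.
DEV = EnvConfig(env_name="dev", min_tasks=1, max_tasks=2, nat_gateways=1)
PROD = EnvConfig(env_name="prod", min_tasks=2, max_tasks=6, nat_gateways=2)


def load_config(env_name: str) -> EnvConfig:
    """Return the config for an environment, failing loudly on typos."""
    configs = {cfg.env_name: cfg for cfg in (DEV, PROD)}
    try:
        return configs[env_name]
    except KeyError:
        raise ValueError(f"Unknown environment: {env_name!r}") from None
```

Freezing the dataclass means a stack cannot mutate the shared settings by accident, which is one way to keep environments from drifting.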


VPC & Network Architecture

The network stack is the foundation everything else builds on. The design optimizes for security, cost, and the specific needs of an AI research platform that makes outbound API calls.

Two Availability Zones provide high availability. If one AZ goes down, the ALB routes traffic to the other. For a dev environment this is sufficient — production can extend to three.

Public subnets host the Application Load Balancer and the NAT Gateway. Nothing else runs here. The ALB terminates HTTPS and forwards traffic to private subnets. The NAT Gateway gives private resources a path to the internet.

Private subnets host every Fargate task. The application containers never have public IP addresses. All outbound traffic — Tavily API calls, Bedrock invocations, anything external — routes through the NAT Gateway.

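VPC Endpoints are the cost optimization play. Instead of routing AWS service traffic through the NAT Gateway (which charges per GB processed), we create private endpoints for Bedrock, S3, DynamoDB, SQS, and Secrets Manager. S3 and DynamoDB use gateway endpoints, which carry no additional charge. Bedrock, SQS, and Secrets Manager use interface endpoints, which bill hourly and per GB, so their savings depend on traffic volume. In every case the traffic stays on the AWS backbone, avoids NAT data processing charges, and sees lower latency.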

Internet
    |
CloudFront --> [WAF]
    |
+-------------------------------------+
|              VPC                     |
|  +-----------+   +-----------+      |
|  | Public    |   | Public    |      |
|  | Subnet    |   | Subnet    |      |
|  | (ALB)     |   | (NAT GW)  |      |
|  +-----+-----+   +-----------+      |
|        |                             |
|  +-----------+   +-----------+      |
|  | Private   |   | Private   |      |
|  | Subnet    |   | Subnet    |      |
|  | (Fargate) |   | (Fargate) |      |
|  +-----------+   +-----------+      |
|                                      |
|  VPC Endpoints: Bedrock, S3,         |
|  DynamoDB, SQS, Secrets Manager      |
+-------------------------------------+

The NAT Gateway is the single most expensive resource in this architecture (~$30/month). For development, one NAT Gateway is enough. In production, you'd deploy one per AZ for redundancy.
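To make the cost reasoning concrete, the sketch below estimates a NAT Gateway's monthly bill from an hourly charge plus a per-GB processing charge. The default rates are assumptions chosen to be consistent with the ~$30/month figure above; check current AWS pricing for your region before relying on them. The function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def nat_gateway_monthly_cost(
    gb_processed: float,
    hourly_rate: float = 0.045,   # assumed USD per gateway-hour
    per_gb_rate: float = 0.045,   # assumed USD per GB processed
) -> float:
    """Fixed hourly charge plus data processing for one NAT Gateway."""
    return hourly_rate * HOURS_PER_MONTH + per_gb_rate * gb_processed


def gateway_endpoint_savings(gb_moved: float, per_gb_rate: float = 0.045) -> float:
    """NAT processing cost avoided by sending S3/DynamoDB traffic
    through gateway endpoints, which have no per-GB charge."""
    return per_gb_rate * gb_moved
```

With these assumed rates, the fixed charge alone accounts for most of the bill at development traffic levels, which is why the gateway shows up in the budget even when almost no data flows through it.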


DynamoDB Single-Table Design

Instead of creating separate tables for conversations, messages, and research jobs, we use a single-table design with composite partition and sort keys. All entities coexist in one table, differentiated by key prefixes.

| Entity | PK | SK | Description |
| --- | --- | --- | --- |
| Conversation | USER#{user_id} | CONV#{conv_id} | User's conversations |
| Message | CONV#{conv_id} | MSG#{timestamp}#{msg_id} | Conversation messages |
| Research Job | SESSION#{session_id} | RESEARCH#{research_id} | Async research jobs |
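The key layout in the table can be captured in a few helper functions so that prefixes are never typed by hand. This is a minimal sketch; the function names are hypothetical and the repository may build its keys differently.

```python
def conversation_key(user_id: str, conv_id: str) -> dict:
    """Key for a conversation item, stored in the user's partition."""
    return {"PK": f"USER#{user_id}", "SK": f"CONV#{conv_id}"}


def message_key(conv_id: str, timestamp: str, msg_id: str) -> dict:
    """Key for a message item, stored in the conversation's partition."""
    return {"PK": f"CONV#{conv_id}", "SK": f"MSG#{timestamp}#{msg_id}"}


def research_job_key(session_id: str, research_id: str) -> dict:
    """Key for an async research job item."""
    return {"PK": f"SESSION#{session_id}", "SK": f"RESEARCH#{research_id}"}


def entity_type(item: dict) -> str:
    """Infer the entity type from the sort key prefix."""
    prefix = item["SK"].split("#", 1)[0]
    return {"CONV": "conversation", "MSG": "message", "RESEARCH": "research_job"}[prefix]
```

Centralizing key construction keeps readers and writers in agreement on the prefixes, which matters because a mistyped prefix silently returns an empty query result.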

This design has three concrete benefits. First, one table means one billing item, with no provisioned capacity to manage across multiple tables. We use pay-per-request pricing, which suits variable workloads where traffic spikes during research bursts and drops to near zero overnight. Second, each core access pattern reads a single partition: a user's conversations share the USER#{user_id} partition, and a conversation's messages share the CONV#{conv_id} partition. Listing a user's conversations is a single query with a begins_with condition on the sort key. Third, access patterns are explicit. Every query maps to a key condition, with no scans and no secondary indexes for the core flows.

import boto3

# Placeholder name: substitute the table created by the data stack.
table = boto3.resource("dynamodb").Table("YOUR_TABLE_NAME")

# List a user's conversations: a single-partition query
conversations = table.query(
    KeyConditionExpression="PK = :pk AND begins_with(SK, :prefix)",
    ExpressionAttributeValues={
        ":pk": f"USER#{user_id}",
        ":prefix": "CONV#",
    },
)["Items"]

# Get all messages in a conversation, sorted by timestamp
messages = table.query(
    KeyConditionExpression="PK = :pk AND begins_with(SK, :prefix)",
    ExpressionAttributeValues={
        ":pk": f"CONV#{conv_id}",
        ":prefix": "MSG#",
    },
)["Items"]

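Messages come back in chronological order because the sort key embeds the timestamp, provided the timestamp uses a format that sorts lexicographically (for example, ISO 8601 in UTC). There is no need to set ScanIndexForward: it defaults to true, and ascending order is exactly what a chat history needs.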
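The ordering guarantee depends on how the timestamp is written into the sort key, because DynamoDB compares string sort keys byte by byte. The sketch below assumes a fixed-width ISO 8601 UTC timestamp; the post does not state the actual format, and `message_sort_key` is a hypothetical helper.

```python
from datetime import datetime, timezone


def message_sort_key(sent_at: datetime, msg_id: str) -> str:
    """Build a sort key whose string order matches chronological order.

    A fixed-width UTC timestamp is used so that comparing the strings
    character by character gives the same result as comparing the times.
    """
    ts = sent_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return f"MSG#{ts}#{msg_id}"


def in_chat_order(sort_keys: list[str]) -> list[str]:
    """Mimic the ascending order a query returns by default."""
    return sorted(sort_keys)
```

Unpadded numeric timestamps do not have this property: as strings, "10" sorts before "9", so epoch values would need zero-padding to a fixed width.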


ECS Fargate & ALB

The compute stack runs the FastAPI backend as containerized Fargate tasks behind an Application Load Balancer.

Fargate tasks run in private subnets with no public IP. Each task gets 1 vCPU and 2 GB of memory — enough for the FastAPI server, WebSocket connections, and the Strands SDK agent runtime. The Docker image uses a multi-stage build: a builder stage installs dependencies with uv, and the runtime stage copies only the virtual environment. Final image size sits under 300 MB.

The ALB lives in public subnets and terminates HTTPS using an ACM certificate. It routes /api/* and /ws/* to the Fargate target group. Health checks hit the /health endpoint every 30 seconds — if a task fails three consecutive checks, the ALB drains connections and ECS replaces it.

Auto-scaling is configured on both CPU and memory utilization. When average CPU exceeds 70%, ECS adds tasks (up to a configurable maximum). When load drops, it scales back down after a cooldown period. For dev, min/max is 1/2. For production, 2/6.
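To get a feel for how a 70% CPU target interacts with the min/max bounds, here is a simplified proportional model. It is an illustration only: it is not the algorithm Application Auto Scaling runs, it ignores cooldowns and memory, and the function name is hypothetical.

```python
import math


def desired_task_count(
    current_tasks: int,
    avg_cpu_percent: float,
    target_percent: float = 70.0,
    min_tasks: int = 1,
    max_tasks: int = 2,
) -> int:
    """Scale the task count in proportion to load, then clamp to the bounds."""
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    proportional = math.ceil(current_tasks * avg_cpu_percent / target_percent)
    return max(min_tasks, min(max_tasks, proportional))
```

For example, two production tasks averaging 90% CPU would suggest a third task under this model, while six tasks at 95% stay at six because the maximum caps the result.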

The key detail here is that WebSocket connections are long-lived. The ALB's idle timeout is set to 3600 seconds (one hour) to avoid dropping research sessions mid-stream. The timeout fires when no data crosses the connection, so with the default of 60 seconds any research step that stays silent for more than a minute would cost the client its connection.


SQS for Async Research

Not every research job needs to block a WebSocket connection. For programmatic API access and batch research, we use an SQS queue with a dedicated worker process.

The architecture is simple: main queue + dead letter queue (DLQ). The worker polls the main queue with 20-second long polling, which minimizes empty responses and reduces API calls. Each message gets a 5-minute visibility timeout, which is enough time for even complex research pipelines to complete. If processing fails, the message becomes visible again for retry once the timeout expires. After the maximum number of receives (the redrive policy's maxReceiveCount), SQS automatically moves it to the DLQ for manual inspection.

import json
import logging

logger = logging.getLogger(__name__)

# _get_sqs_client, settings, run_research_pipeline, and save_result
# are defined elsewhere in the project.


def poll_once():
    """Poll SQS for one research job and process it."""
    sqs = _get_sqs_client()
    response = sqs.receive_message(
        QueueUrl=settings.research_queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,     # long polling
        VisibilityTimeout=300,  # hide the message for 5 minutes while we work
    )
    messages = response.get("Messages", [])
    if not messages:
        return

    message = messages[0]
    try:
        job = json.loads(message["Body"])
        result = run_research_pipeline(job["query"], job["session_id"])
        save_result(result)
        # Delete only after the result is saved, so a crash leads to a retry
        sqs.delete_message(
            QueueUrl=settings.research_queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    except Exception:
        logger.exception("Research job failed, leaving for retry")
        # Message becomes visible again after visibility timeout
        # After max retries, SQS moves it to DLQ
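The worker reads `query` and `session_id` from the message body, so the producer has to send JSON with those two fields. A minimal sketch of that contract follows; `build_job_body` and `parse_job_body` are hypothetical helpers and may not match how the repository serializes jobs.

```python
import json

REQUIRED_FIELDS = ("query", "session_id")


def build_job_body(query: str, session_id: str) -> str:
    """Serialize a research job into the JSON body the worker expects."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return json.dumps({"query": query, "session_id": session_id})


def parse_job_body(body: str) -> dict:
    """Parse a message body and check it carries the required fields."""
    job = json.loads(body)
    if not isinstance(job, dict):
        raise ValueError("job body must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in job]
    if missing:
        raise ValueError(f"job is missing fields: {missing}")
    return job
```

The string returned by `build_job_body` would be passed as the `MessageBody` argument of the SQS client's `send_message` call. Validating on the producer side keeps malformed jobs out of the queue, so fewer messages burn through their retries and land in the DLQ.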

The DLQ is critical for production. Without it, a poison message (one that always fails processing) would be redelivered again and again until the queue's retention period expires, consuming compute and delaying the jobs behind it. The DLQ captures these failures so you can diagnose and replay them.
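The redrive behavior can be illustrated with a small in-memory simulation. This is a toy model, not SQS itself: it ignores visibility timeouts and ordering guarantees, and `drain_queue` is a hypothetical name. It only shows how a receive limit stops a poison message from being retried indefinitely.

```python
from collections import deque


def drain_queue(bodies, handler, max_receive_count=3):
    """Process messages with retries; return (processed, dead_lettered).

    A message that has already been received max_receive_count times
    is moved to the dead letter list instead of being delivered again.
    """
    queue = deque((body, 0) for body in bodies)  # (body, receive_count)
    processed, dead_lettered = [], []
    while queue:
        body, receive_count = queue.popleft()
        if receive_count >= max_receive_count:
            dead_lettered.append(body)
            continue
        try:
            handler(body)
        except Exception:
            # Failed: put it back with its receive count incremented
            queue.append((body, receive_count + 1))
        else:
            processed.append(body)
    return processed, dead_lettered
```

In this model a handler that always raises for one message sees that message exactly `max_receive_count` times before it is set aside, while healthy messages continue to flow.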


CloudFront Multi-Origin CDN

CloudFront sits in front of everything, serving as both a CDN for static assets and a reverse proxy for the API. The distribution uses two origins with behavior-based routing.

S3 origin serves the React SPA. The root path (/) and all static assets (.html, .js, .css, .svg) are fetched from an S3 bucket via an Origin Access Identity. CloudFront caches these at edge locations worldwide, so users in Tokyo and London get the same sub-50ms load times.

ALB origin handles API and WebSocket traffic. Paths matching /api/* and /ws/* are forwarded to the Application Load Balancer. CloudFront passes all headers, cookies, and query strings through to the backend — no caching on dynamic content. A custom header (X-Origin-Verify) shared between CloudFront and ALB ensures that direct-to-ALB requests are rejected.
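The shared-header check boils down to comparing a secret value on every request. The post enforces this between CloudFront and the ALB; the sketch below expresses the same rule as plain Python to show the logic. The function name and the way the secret is supplied are assumptions.

```python
import hmac


def is_from_cloudfront(headers: dict, expected_secret: str) -> bool:
    """Return True only if X-Origin-Verify carries the shared secret.

    Header names are compared case-insensitively, and the secret is
    compared in constant time to avoid leaking it through timing.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-origin-verify", "")
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

A request that reaches the ALB directly lacks the header (or guesses it wrong) and fails the check, which is what makes the CloudFront path the only usable entry point.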

CloudFront
├── /        --> S3 (React SPA)
├── /api/*   --> ALB (FastAPI)
└── /ws/*    --> ALB (WebSocket)
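The routing tree can be modeled as an ordered list of path patterns with a default origin. This is a sketch of the matching idea only, using Python's `fnmatchcase` as a stand-in for CloudFront's case-sensitive wildcard matching; `select_origin` and the origin labels are hypothetical.

```python
from fnmatch import fnmatchcase

# Ordered cache behaviors: the first matching pattern wins.
BEHAVIORS = [
    ("/api/*", "ALB"),
    ("/ws/*", "ALB"),
]
DEFAULT_ORIGIN = "S3"  # everything else is served as the React SPA


def select_origin(path: str) -> str:
    """Return the origin that would serve the given request path."""
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN
```

Note that a path such as /apidocs does not match /api/* and falls through to the default origin, so API routes need to live under the prefix exactly.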

SPA routing requires special handling. When a user navigates to /chat/abc123 and refreshes the page, S3 would return an error because no object exists at that key (a 404, or a 403 when the bucket policy does not grant s3:ListBucket). A CloudFront Function intercepts these requests before they reach the origin and rewrites them to /index.html, letting React Router handle the route client-side. No more error pages on refresh.
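The rewrite decision itself is small. CloudFront Functions are written in JavaScript, so the following Python is only a model of the rule, and the exact rule used in the repository is not shown in this post. The assumption here is that any path whose last segment has a file extension is a static asset, and everything else is a client-side route.

```python
import posixpath


def rewrite_spa_uri(uri: str) -> str:
    """Map client-side routes to /index.html and leave asset paths alone."""
    last_segment = uri.rsplit("/", 1)[-1]
    _, extension = posixpath.splitext(last_segment)
    if extension:
        return uri  # looks like a file, e.g. /assets/app.js
    return "/index.html"
```

One limitation of this heuristic: a route whose last segment contains a dot, such as /docs/v1.2, would be treated as a file and passed through unchanged.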


Deployment

The entire infrastructure deploys with a single command:

./deploy.sh dev   # Sources .env, creates Google OAuth provider, deploys all 9 stacks

The deploy script handles the ceremony that CDK alone cannot. It sources environment variables from .env (Tavily API keys, Google OAuth credentials), creates the Cognito identity provider for Google sign-in (which must exist before the CDK stack deploys), and then runs cdk deploy --all with the correct environment context.

Stack dependencies are declared explicitly in CDK. The compute stack depends on the network and data stacks. The CDN stack depends on compute. CDK resolves the dependency graph and deploys in the right order automatically. A full deploy from scratch takes about 15 minutes — most of that time is CloudFront distribution creation.
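The ordering that CDK derives is a topological sort of the dependency graph. The sketch below reproduces the idea with Python's standard library, using only the dependencies named in this post; the other stacks are left out because their dependencies are not stated, and `deploy_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each stack maps to the set of stacks it depends on.
DEPENDS_ON = {
    "compute": {"network", "data"},
    "cdn": {"compute"},
}


def deploy_order(depends_on: dict) -> list:
    """Return a deployment order in which dependencies always come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Any order this returns places network and data before compute, and compute before cdn. Reversing the list gives a safe teardown order.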


Cost Breakdown

Running this platform on AWS costs less than a mid-tier SaaS subscription. Here's the monthly breakdown for a development environment:

| Service | Monthly Cost (Dev) |
| --- | --- |
| ECS Fargate (1 task, 1 vCPU, 2 GB) | ~$25 |
| NAT Gateway (1 AZ) | ~$30 |
| ALB | ~$15 |
| DynamoDB (pay-per-request) | ~$0 (free tier) |
| CloudFront | ~$0 (free tier) |
| SQS | ~$0 |
| S3 | ~$0 |
| Total | ~$70/month |

The NAT Gateway is the surprising cost leader — not compute, not the LLM calls, but a networking component. In production, you can reduce this by routing more traffic through VPC Endpoints or by using a NAT Instance (cheaper but less managed). The DynamoDB and SQS costs are effectively zero at development-scale traffic.
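The totals are easy to check, and the same few lines show how much of the bill each service accounts for. The figures are the approximate dev numbers from the table; the variable and function names are hypothetical.

```python
# Approximate monthly dev costs in USD, taken from the table in this post.
DEV_MONTHLY_COST_USD = {
    "ECS Fargate": 25,
    "NAT Gateway": 30,
    "ALB": 15,
    "DynamoDB": 0,
    "CloudFront": 0,
    "SQS": 0,
    "S3": 0,
}

TOTAL_USD = sum(DEV_MONTHLY_COST_USD.values())


def cost_share(service: str) -> float:
    """Fraction of the monthly total attributable to one service."""
    return DEV_MONTHLY_COST_USD[service] / TOTAL_USD
```

On these numbers the NAT Gateway accounts for a little over 40% of the monthly total, ahead of the Fargate task that runs the application.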


What's Next

Infrastructure gets you to production. Keeping you there requires security. In Blog 6: Security & Production Hardening, we'll cover Cognito JWT authentication with Google OAuth, WAF rules for rate limiting and bot control, Bedrock Guardrails for content safety and PII filtering, and GitHub OIDC federation for CI/CD without long-lived credentials.


All code is open source: github.com/MinhQuanBuiSco/deep-research-agent