AI Engineering··12 min

Building a Production RAG Q&A API with LangChain and FastAPI: Complete Guide

Learn how to build a production-ready Retrieval-Augmented Generation (RAG) API using LangChain, FastAPI, ChromaDB, and LangSmith. Complete tutorial with code examples and real-world implementation.

Ever wanted to build an AI that can answer questions about your own documents? That's exactly what Retrieval-Augmented Generation (RAG) does, and it's surprisingly straightforward with the right tools.

In this tutorial, I'll show you how to build a production-ready RAG Q&A API that:

  • 📄 Ingests and chunks your documents automatically
  • 🔍 Uses semantic search to find relevant context
  • 🤖 Generates accurate answers with GPT-4o-mini
  • 📊 Tracks everything with LangSmith observability
  • ⚡ Exposes a fast REST API with FastAPI

By the end, you'll have a fully functional RAG system you can deploy and scale.


What is RAG and Why Do You Need It?

The Problem with Pure LLMs

Regular LLMs like GPT-4 are incredibly smart, but they have limitations:

  • ❌ Knowledge cutoff (doesn't know recent information)
  • ❌ No access to your private documents
  • ❌ Can hallucinate when uncertain
  • ❌ Can't cite sources from your data

Example: Ask GPT-4 about your company's internal policies, and it will guess or admit it doesn't know.

The RAG Solution

RAG combines the best of both worlds:

User Question: "What's our return policy?"
        ↓
1. Search your documents for relevant sections
        ↓
2. Retrieve: "Products can be returned within 30 days..."
        ↓
3. Pass context + question to LLM
        ↓
4. LLM answers based on YOUR documents
        ↓
"According to your policy, products can be returned within 30 days of purchase with original receipt."

Benefits:

  • ✅ Answers based on YOUR data
  • ✅ Citable sources
  • ✅ Reduced hallucinations
  • ✅ Up-to-date information
  • ✅ Works with private/proprietary knowledge
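
The four steps of the flow above can be sketched in plain Python. This is a toy illustration only: retrieval here is naive word overlap rather than embeddings, the LLM call itself is left out, the sample documents are made up, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names, not LangChain functions.

```python
import re

# Made-up sample documents for illustration
DOCUMENTS = [
    "Return policy: products can be returned within 30 days of purchase with original receipt.",
    "Shipping takes 3-5 business days within the continental US.",
    "Our support team is available Monday through Friday.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Steps 1-2: rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 3: combine the retrieved context and the question into one prompt."""
    joined = "\n".join(context)
    return f"Context: {joined}\n\nQuestion: {question}"


question = "What's our return policy?"
context = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, context)  # Step 4 would send this prompt to the LLM
print(prompt)
```

The rest of this tutorial replaces the word-overlap step with embeddings and a vector store, which match on meaning rather than exact words.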

Real-World Use Cases

| Use Case | What RAG Does |
|---|---|
| Customer Support | Answer questions from product manuals, FAQs, policies |
| Legal Research | Search case law, contracts, regulations |
| Medical Q&A | Query research papers, clinical guidelines |
| Internal Knowledge Base | Company wikis, documentation, SOPs |
| Technical Documentation | Code repositories, API docs, tutorials |

What We're Building

RAG API Architecture

Documents (./documents/)
        ↓
[Text Chunking & Embedding]
        ↓
    ChromaDB Vector Store
        ↓
User Query → [Semantic Search] → Top 3 Relevant Chunks
        ↓
[GPT-4o-mini + Context] → Answer
        ↓
    FastAPI Response
        ↓
    LangSmith Tracking

Key Components

  1. FastAPI - Fast, modern REST API framework
  2. LangChain - Document processing & LLM orchestration
  3. ChromaDB - Vector database for semantic search
  4. OpenAI Embeddings - Convert text to vectors
  5. GPT-4o-mini - Generate answers from context
  6. LangSmith - Observability and tracing

API Endpoints

  • POST /ingest - Upload and process documents
  • POST /query - Ask questions about your documents
  • GET /health - Health check

Prerequisites

What you need:

  • Python 3.10+
  • OpenAI API key
  • LangSmith API key (optional but recommended)
  • Basic understanding of APIs
  • 15 minutes to follow along

No prior RAG or vector database experience required!


Step 1: Project Setup

Install Dependencies

I use uv for fast dependency management:

# Using uv (recommended - 10-100x faster than pip!)
uv venv --python 3.10
uv sync

# Or using traditional pip
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install fastapi uvicorn langchain langchain-openai langchain-community chromadb python-dotenv langsmith

Project Structure

rag-api/
├── main.py           # FastAPI app + RAG logic
├── documents/        # Your .txt files go here
├── .env             # API keys
└── pyproject.toml   # Dependencies (if using uv)

Configure Environment

Create .env:

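OPENAI_API_KEY=your_openai_api_key_here
LANGSMITH_API_KEY=your_langsmith_key_here
LANGSMITH_PROJECT=rag_project
# Turns tracing on; older LangSmith SDK versions read LANGCHAIN_TRACING_V2 instead
LANGSMITH_TRACING=true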

Security tip: Add .env to .gitignore and never commit API keys!


Step 2: Understanding RAG Components

Document Chunking

Why chunk? Long documents often exceed the LLM's context window, and smaller pieces make retrieval more precise. So we split each document into chunks before embedding it.

RecursiveCharacterTextSplitter:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Maximum chunk length, in characters
    chunk_overlap=200     # Characters shared between neighboring chunks
)

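How it works:

  • Tries to split on paragraphs first (\n\n)
  • Falls back to line breaks (\n)
  • Falls back to spaces (words), then individual characters, if needed
  • The 200-character overlap repeats the end of one chunk at the start of the next, so context that straddles a boundary isn't lost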
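
To make chunk_size and chunk_overlap concrete, here is a simplified fixed-window chunker in plain Python. It's only a sketch of the overlap idea: unlike RecursiveCharacterTextSplitter it ignores paragraph and word boundaries, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration; not how LangChain splits text.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = chunk_text(text)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the shared overlap region
```

With the defaults, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.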

Embeddings convert text to numbers (vectors) that capture semantic meaning:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# "The cat sat on the mat" → [0.23, -0.45, 0.67, ...]
# "A feline rested on the rug" → [0.21, -0.43, 0.69, ...]
# ↑ Similar meaning = similar vectors

ChromaDB stores these vectors and enables similarity search:

from langchain_community.vectorstores import Chroma

vector_store = Chroma.from_documents(chunks, embeddings)

# Find documents similar to query
docs = vector_store.similarity_search("return policy", k=3)
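
Under the hood, similarity search comes down to comparing vectors. Here is a minimal sketch with hand-made 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and a vector database typically uses an index instead of this brute-force loop. The vectors, `cosine_similarity`, and `top_k` are all illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 3) -> list[str]:
    """Brute-force search: score every stored vector and keep the k best."""
    ranked = sorted(store, key=lambda text: cosine_similarity(query, store[text]), reverse=True)
    return ranked[:k]


# Made-up vectors standing in for real embeddings
store = {
    "The cat sat on the mat": [0.23, -0.45, 0.67],
    "A feline rested on the rug": [0.21, -0.43, 0.69],
    "Quarterly revenue grew 12%": [-0.50, 0.40, -0.10],
}
query_vector = [0.22, -0.44, 0.68]  # pretend embedding of "cat on a rug"

print(top_k(query_vector, store, k=2))
```

The two cat sentences point in nearly the same direction as the query, so they come back first; the revenue sentence points the other way and is left out.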

RAG Prompt Template

We give the LLM context and ask it to answer:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer based on the context provided."),
    ("human", "Context: {context}\n\nQuestion: {question}")
])
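
To see what the model actually receives, here is a plain-Python sketch of how the {context} and {question} placeholders get filled. It uses str.format rather than LangChain, so treat `fill_messages` as an illustrative stand-in for the idea, not the library's implementation.

```python
TEMPLATE = [
    ("system", "You are a helpful assistant. Answer based on the context provided."),
    ("human", "Context: {context}\n\nQuestion: {question}"),
]


def fill_messages(template: list[tuple[str, str]], **values: str) -> list[dict[str, str]]:
    """Substitute placeholder values into each (role, text) pair (hypothetical helper)."""
    return [{"role": role, "content": text.format(**values)} for role, text in template]


messages = fill_messages(
    TEMPLATE,
    context="Products can be returned within 30 days.",
    question="What's our return policy?",
)
for message in messages:
    print(f"[{message['role']}] {message['content']}")
```

The retrieved chunks land in the human message ahead of the question, which is why retrieval quality directly shapes the answer.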

Step 3: Building the API

File: main.py

import os
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client as LangSmithClient
from langsmith import traceable
from pydantic import BaseModel

load_dotenv()

# Initialize FastAPI app
app = FastAPI(title="RAG Q&A API")

# Initialize LangSmith for observability
langsmith_client = LangSmithClient()

# Initialize the language model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Global vector store (will be populated during ingestion)
vector_store = None


# Document ingestion function
async def ingest_documents(directory: str):
    global vector_store
    documents = []

    # Load all .txt files from directory
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            loader = TextLoader(os.path.join(directory, filename))
            documents.extend(loader.load())

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = text_splitter.split_documents(documents)

    # Create vector store with embeddings
    vector_store = Chroma.from_documents(chunks, embeddings)
    print(f"Ingested {len(chunks)} document chunks into vector store.")


# Pydantic model for API input
class QueryRequest(BaseModel):
    query: str


# Prompt template for RAG
prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a helpful assistant. Answer the user's question based on the provided context. "
        "If the context doesn't contain relevant information, say so and provide a general answer."
    ),
    ("human", "Context: {context}\n\nQuestion: {question}")
])


# Query processing with LangSmith tracing
@traceable(run_type="chain", project_name=os.getenv("LANGSMITH_PROJECT", "rag_project"))
async def process_query(query: str):
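    if vector_store is None:
        # Return a client error instead of letting a ValueError surface as a bare 500
        raise HTTPException(
            status_code=400,
            detail="Vector store not initialized. Please ingest documents first."
        )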

    try:
        # Semantic search for relevant chunks
        docs = vector_store.similarity_search(query, k=3)
        context = "\n".join([doc.page_content for doc in docs])

        # Format prompt with context
        prompt = prompt_template.format_messages(context=context, question=query)

        # Get LLM response
        response = await llm.ainvoke(prompt)
        return response.content
    except Exception as e:
        print(f"Error processing query: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing query: {str(e)}")


# FastAPI endpoint for querying
@app.post("/query")
async def query_endpoint(request: QueryRequest):
    response = await process_query(request.query)
    return {"answer": response}


# FastAPI endpoint for ingesting documents
@app.post("/ingest")
async def ingest_endpoint(directory: str = "./documents"):
    try:
        await ingest_documents(directory)
        return {"message": f"Successfully ingested documents from {directory}"}
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Error ingesting documents: {str(e)}"
        )


# Health check endpoint
@app.get("/health")
async def health_check():
    return {"status": "healthy"}


# Run the application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Code Breakdown

1. Initialization (top of main.py):

  • Load environment variables
  • Set up FastAPI app
  • Initialize LangSmith for tracing
  • Initialize LLM (GPT-4o-mini)
  • Initialize OpenAI embeddings
  • Create global vector_store variable

2. Document Ingestion (ingest_documents):

  • Load all .txt files from specified directory
  • Split into chunks (1000 chars with 200 overlap)
  • Create ChromaDB vector store with embeddings
  • Store globally for query access

3. Query Processing (process_query):

  • Decorated with @traceable for LangSmith tracking
  • Performs similarity search (k=3 top results)
  • Combines retrieved chunks into context
  • Formats prompt with context + question
  • Calls LLM asynchronously
  • Returns answer

4. API Endpoints:

  • /query - User asks questions, gets answers
  • /ingest - Upload documents to vector store
  • /health - Simple health check

Step 4: Running Your RAG API

Start the Server

python main.py

Output:

INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:8000

Test with curl

1. Ingest Documents

First, create a sample document:

mkdir documents
echo "Our return policy allows returns within 30 days of purchase with original receipt. Refunds are processed within 5-7 business days." > documents/policy.txt

Ingest it:

curl -X POST "http://localhost:8000/ingest"

Response:

{
  "message": "Successfully ingested documents from ./documents"
}

2. Query Your Documents

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the return policy?"}'

Response:

{
  "answer": "According to the provided context, our return policy allows returns within 30 days of purchase with the original receipt. Refunds are processed within 5-7 business days after the return is received."
}

3. Ask Questions Outside Context

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'

Response:

{
  "answer": "The provided context does not contain information about the capital of France. However, the capital of France is Paris."
}

Notice how it acknowledges when context doesn't help!
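
If you prefer Python to curl, the same /query call can be made with the standard library. This is a minimal sketch, assuming the server from this tutorial is running on localhost:8000; `build_query_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_query_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /query request with a JSON body (hypothetical helper)."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> str:
    """Send the question to the API and return the "answer" field."""
    with urllib.request.urlopen(build_query_request(question), timeout=30) as response:
        return json.loads(response.read())["answer"]


# With the server running, call:
# print(ask("What is the return policy?"))
```

Keeping request construction separate from sending makes the client easy to check without a live server.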


Step 5: LangSmith Observability

What is LangSmith?

LangSmith is LangChain's observability platform. It tracks:

  • Every query your RAG system processes
  • Retrieved documents for each query
  • LLM prompts and responses
  • Latency and token usage
  • Errors and exceptions

Viewing Traces

  1. Go to smith.langchain.com
  2. Find your project (e.g., "rag_project")
  3. Click on any trace to see:
    • User query
    • Retrieved chunks
    • Full prompt sent to LLM
    • LLM response
    • Total latency

Example trace:

Process Query (2.3s)
├─ Similarity Search (0.1s)
│  └─ Retrieved 3 documents
├─ Format Prompt (0.01s)
└─ LLM Call (2.2s)
   ├─ Tokens: 450 input, 85 output
   └─ Cost: $0.002

Why This Matters

Production debugging: When users report wrong answers, check:

  • Was the right context retrieved?
  • Did the LLM hallucinate?
  • Was the prompt formatted correctly?

Example debug scenario:

  • User: "Why did my question get a wrong answer?"
  • You: Check LangSmith trace
  • Finding: Similarity search returned unrelated chunks
  • Fix: Adjust chunk_size or improve embeddings

Extending Your RAG System

1. Support More Document Types

import os

# PyPDFLoader needs the pypdf package; UnstructuredHTMLLoader needs unstructured
from langchain_community.document_loaders import (
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
)

def ingest_documents(directory: str):
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".pdf"):
            loader = PyPDFLoader(os.path.join(directory, filename))
        elif filename.endswith(".csv"):
            loader = CSVLoader(os.path.join(directory, filename))
        elif filename.endswith(".html"):
            loader = UnstructuredHTMLLoader(os.path.join(directory, filename))
        elif filename.endswith(".txt"):
            loader = TextLoader(os.path.join(directory, filename))
        else:
            continue
        documents.extend(loader.load())

    # Rest of ingestion logic...

2. Add Metadata Filtering

Store metadata with chunks:

from langchain_core.documents import Document

# Add metadata during ingestion
docs_with_metadata = [
    Document(
        page_content=chunk.page_content,
        metadata={
            **chunk.metadata,     # Keeps the "source" path set by the loader
            "chunk_id": i,
            "category": "policy"  # Custom metadata
        }
    )
    for i, chunk in enumerate(chunks)
]

vector_store = Chroma.from_documents(docs_with_metadata, embeddings)

# Filter during search
docs = vector_store.similarity_search(
    query,
    k=3,
    filter={"category": "policy"}  # Only search policy documents
)

3. Hybrid Search (Keywords + Semantic)

Combine BM25 (keyword search) with vector search:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # Requires the rank_bm25 package

bm25_retriever = BM25Retriever.from_documents(chunks)
vector_retriever = vector_store.as_retriever()

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)

docs = ensemble_retriever.invoke(query)
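
Rank-based fusion is one common way to merge keyword and semantic results, and to my understanding EnsembleRetriever uses a weighted form of Reciprocal Rank Fusion. The sketch below illustrates the idea in plain Python. It is not the library's code, and the function name, document IDs, and the constant c=60 are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Merge ranked lists: each document earns weight / (c + rank) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_results = ["doc_a", "doc_b", "doc_c"]    # made-up keyword ranking
vector_results = ["doc_b", "doc_d", "doc_a"]  # made-up semantic ranking

print(weighted_rrf([bm25_results, vector_results], weights=[0.4, 0.6]))
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever can still make the cut.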

4. Re-ranking for Better Results

Use a re-ranker to improve retrieved document quality:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # pip install langchain-cohere; needs COHERE_API_KEY

# The model name is an example; check Cohere's docs for the current rerank models
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever()
)

docs = compression_retriever.invoke(query)

5. Streaming Responses

Stream answers for better UX:

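from fastapi.responses import StreamingResponse

@app.post("/query-stream")
async def query_stream(request: QueryRequest):
    if vector_store is None:
        raise HTTPException(status_code=400, detail="Please ingest documents first.")

    # Retrieve context and build the prompt before streaming starts
    docs = vector_store.similarity_search(request.query, k=3)
    context = "\n".join(doc.page_content for doc in docs)
    messages = prompt_template.format_messages(context=context, question=request.query)

    async def generate():
        # Yield tokens as the model produces them
        async for chunk in llm.astream(messages):
            yield chunk.content

    return StreamingResponse(generate(), media_type="text/plain")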

6. Persistent Vector Store

Save ChromaDB to disk to avoid re-ingesting:

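# Save: passing persist_directory writes the collection to disk
# (Chroma 0.4+ persists automatically, so no explicit persist() call is needed)
vector_store = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db"
)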

# Load
vector_store = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Production Considerations

1. Error Handling

Add retries and better error messages:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential())
async def process_query_with_retry(query: str):
    return await process_query(query)
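
For intuition, this is roughly what a retry-with-exponential-backoff wrapper does, written with only asyncio. It's a simplified sketch rather than tenacity's implementation; `retry_async`, `flaky_call`, and the delay values are illustrative.

```python
import asyncio


async def retry_async(func, attempts: int = 3, base_delay: float = 0.01):
    """Call an async function, doubling the wait after each failure (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


async def flaky_call() -> str:
    """Fail twice, then succeed, to simulate a transient upstream error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(retry_async(flaky_call))
print(result, calls["count"])
```

In a real service you would likely retry only on transient errors (timeouts, rate limits) rather than on every exception.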

2. Rate Limiting

Prevent API abuse:

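from fastapi import Depends
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter

# FastAPILimiter.init(...) must be awaited once at startup with a Redis connection

@app.post("/query", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
async def query_endpoint(request: QueryRequest):
    # Max 10 requests per minute
    response = await process_query(request.query)
    return {"answer": response}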
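
To show the underlying idea without any infrastructure, here is a single-process sliding-window limiter in plain Python. It's a sketch for illustration, not a drop-in replacement for a shared, Redis-backed limiter; the class name and the injectable clock are my own choices.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `times` requests per `seconds` for each client key (illustrative)."""

    def __init__(self, times: int, seconds: float, clock=time.monotonic):
        self.times = times
        self.seconds = seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        window = self.hits[key]
        # Drop timestamps that have fallen out of the window
        while window and now - window[0] >= self.seconds:
            window.popleft()
        if len(window) >= self.times:
            return False
        window.append(now)
        return True


limiter = SlidingWindowLimiter(times=10, seconds=60)
results = [limiter.allow("client-1") for _ in range(11)]
print(results.count(True), results.count(False))
```

Because the counters live in process memory, this only works for a single worker; multiple workers need a shared store.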

3. Authentication

Add API key auth:

import os

from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

security = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    # Reject requests whose bearer token doesn't match the API_KEY environment variable
    if credentials.credentials != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/query", dependencies=[Depends(verify_token)])
async def query_endpoint(request: QueryRequest):
    response = await process_query(request.query)
    return {"answer": response}
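
One detail worth considering: comparing secrets with != can, in principle, leak timing information. Here is a small sketch of a constant-time check using the standard library. `is_valid_key` is a hypothetical helper, and the API_KEY variable name follows the snippet above.

```python
import hmac
import os
from typing import Optional


def is_valid_key(provided: str, expected: Optional[str] = None) -> bool:
    """Compare the provided key to the expected one in constant time (hypothetical helper)."""
    if expected is None:
        expected = os.getenv("API_KEY")
    if not expected:
        return False  # no key configured: reject everything
    return hmac.compare_digest(provided.encode(), expected.encode())


print(is_valid_key("secret-123", expected="secret-123"))
print(is_valid_key("wrong", expected="secret-123"))
```

Inside verify_token, the comparison would become `if not is_valid_key(credentials.credentials): raise HTTPException(...)`.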

4. Caching

Cache frequent queries:

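from functools import lru_cache

@lru_cache(maxsize=100)
def cached_similarity_search(query: str):
    # Results are keyed on the exact query string
    return vector_store.similarity_search(query, k=3)

# After re-ingesting documents, clear stale results:
# cached_similarity_search.cache_clear()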
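
lru_cache keys on the exact arguments, so "Return policy?" and "  return   POLICY? " would be cached separately. Here is a sketch of one way to handle that, with a stand-in search function so it runs without a vector store; `normalize`, `search_cached`, and `fake_search` are illustrative names.

```python
from functools import lru_cache

search_calls = []  # records every query that actually reaches the "vector store"


def fake_search(query: str) -> tuple[str, ...]:
    """Stand-in for vector_store.similarity_search (assumption for this demo)."""
    search_calls.append(query)
    return (f"chunk about {query}",)


def normalize(query: str) -> str:
    """Collapse whitespace and case so trivially different queries share a cache key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=100)
def _search_normalized(query: str) -> tuple[str, ...]:
    return fake_search(query)


def search_cached(query: str) -> tuple[str, ...]:
    return _search_normalized(normalize(query))


search_cached("Return policy?")
search_cached("  return   POLICY? ")  # served from the cache
print(len(search_calls))

_search_normalized.cache_clear()      # do this after re-ingesting documents
search_cached("return policy?")
print(len(search_calls))
```

Returning a tuple instead of a list keeps callers from accidentally mutating a cached result.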

5. Async Document Ingestion

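Don't block the API during ingestion:

from fastapi import BackgroundTasks

@app.post("/ingest")
async def ingest_endpoint(
    background_tasks: BackgroundTasks,
    directory: str = "./documents"
):
    # The task runs after the response is sent. An async task still runs on the
    # event loop, so make ingest_documents a plain def to have it run in a thread pool.
    background_tasks.add_task(ingest_documents, directory)
    return {"message": "Ingestion started in background"}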

Real-World Use Cases

1. Customer Support Chatbot

# Ingest: Product manuals, FAQs, troubleshooting guides
# Query: "How do I reset my device?"
# Answer: Based on Section 3.2 of the user manual...

2. Legal Research

# Ingest: Case law, contracts, regulations
# Query: "What are the requirements for terminating an employee in California?"
# Answer: According to California Labor Code Section 2922...

3. Research Assistant

# Ingest: Academic papers, research notes
# Query: "Summarize findings on RAG performance improvements"
# Answer: Based on 3 papers in your collection, key findings include...

4. Code Documentation Q&A

# Ingest: README files, API docs, code comments
# Query: "How do I authenticate with the API?"
# Answer: According to the API documentation, authentication requires...

Common Issues and Debugging

Issue 1: No Relevant Context Retrieved

Symptom: LLM says "context doesn't contain information" even though it should

Debug:

# Check what was retrieved, along with each chunk's score
results = vector_store.similarity_search_with_score(query, k=3)
for doc, score in results:
    print(f"Chunk: {doc.page_content[:200]}")
    print(f"Distance: {score:.4f}")  # With Chroma, lower means more similar

Fixes:

  • Increase k (retrieve more chunks)
  • Reduce chunk_size (smaller, more focused chunks)
  • Try a different embedding model
  • Add hybrid search (keywords + semantic)

Issue 2: Slow Queries

Symptom: API takes 5+ seconds to respond

Profile:

import time

# Run this inside an async function (for example, process_query)
start = time.perf_counter()
docs = vector_store.similarity_search(query, k=3)
print(f"Search: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
response = await llm.ainvoke(prompt)
print(f"LLM: {time.perf_counter() - start:.2f}s")

Fixes:

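  • Use a persistent vector store (don't rebuild on startup)
  • Send less text to the LLM (lower k or smaller chunks); gpt-4o-mini is already one of OpenAI's faster models
  • Cache frequent queries
  • Keep blocking calls out of async endpoints (for example, use asimilarity_search or run sync work in a thread pool)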
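
To avoid repeating the start/stop boilerplate while profiling, the timing code can be wrapped in a small context manager. This is a sketch using only the standard library; the `timed` helper and the `timings` dict are my own names, and time.sleep stands in for the real search call.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}  # label -> elapsed seconds


@contextmanager
def timed(label: str):
    """Record how long the wrapped block takes, in seconds (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("search"):
    time.sleep(0.05)  # placeholder for vector_store.similarity_search(...)

print(f"search: {timings['search']:.3f}s")
```

Collecting the numbers in a dict makes it easy to log them together or return them alongside the answer while debugging.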

Issue 3: Hallucinations

Symptom: LLM makes up information not in context

Fix:

prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "ONLY answer based on the context provided. "
        "If the context doesn't contain the answer, say 'I don't have information about that in my documents.' "
        "Never make up information."
    ),
    ("human", "Context: {context}\n\nQuestion: {question}")
])

Next Steps

Beginner Projects

  1. Add web interface: Build a simple chat UI with Streamlit or Gradio
  2. Multi-document support: Upload documents via API instead of file system
  3. Citation tracking: Return which document chunks were used for the answer

Advanced Projects

  1. Multi-modal RAG: Add support for images and tables in documents
  2. Conversational RAG: Maintain chat history and context across queries
  3. Agentic RAG: Let AI decide when to search vs when to use general knowledge
  4. Evaluation pipeline: Automatically test answer quality with metrics

Conclusion

You've built a production-ready RAG Q&A API! Here's what you learned:

  • ✅ Understand RAG architecture and why it's powerful
  • ✅ Implement document ingestion with chunking and embeddings
  • ✅ Build semantic search with ChromaDB vector database
  • ✅ Create FastAPI endpoints for querying and ingestion
  • ✅ Integrate LangSmith for observability and debugging
  • ✅ Extend with advanced features like filtering and streaming

The power of RAG: You can now build AI systems that answer questions about ANY documents, from customer support to legal research!


Resources

Try it yourself: Clone the repository and experiment!

git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
cd AI_CheatSheet/langchain_langsmith_RAG
# Follow setup instructions in README

Questions? The code is open source - issues and PRs welcome!

Happy building! 🚀📚