Building a Production RAG Q&A API with LangChain and FastAPI: Complete Guide#

Ever wanted to build an AI that can answer questions about your own documents? That's exactly what Retrieval-Augmented Generation (RAG) does, and it's surprisingly straightforward with the right tools.

In this tutorial, I'll show you how to build a production-ready RAG Q&A API that:

📄 Ingests and chunks your documents automatically
🔍 Uses semantic search to find relevant context
🤖 Generates accurate answers with GPT-4o-mini
📊 Tracks everything with LangSmith observability
⚡ Exposes a fast REST API with FastAPI

By the end, you'll have a fully functional RAG system you can deploy and scale.

What is RAG and Why Do You Need It?#

The Problem with Pure LLMs#

Regular LLMs like GPT-4 are incredibly smart, but they have limitations:

❌ Knowledge cutoff (doesn't know recent information)
❌ No access to your private documents
❌ Can hallucinate when uncertain
❌ Can't cite sources from your data

Example: Ask GPT-4 about your company's internal policies, and it will guess or admit it doesn't know.

The RAG Solution#

RAG combines the best of both worlds:

User Question: "What's our return policy?"
        ↓
1. Search your documents for relevant sections
        ↓
2. Retrieve: "Products can be returned within 30 days..."
        ↓
3. Pass context + question to LLM
        ↓
4. LLM answers based on YOUR documents
        ↓
"According to your policy, products can be returned within 30 days of purchase with original receipt."

Benefits:

✅ Answers based on YOUR data
✅ Citable sources
✅ Reduced hallucinations
✅ Up-to-date information
✅ Works with private/proprietary knowledge

Real-World Use Cases#

Use Case	What RAG Does
Customer Support	Answer questions from product manuals, FAQs, policies
Legal Research	Search case law, contracts, regulations
Medical Q&A	Query research papers, clinical guidelines
Internal Knowledge Base	Company wikis, documentation, SOPs
Technical Documentation	Code repositories, API docs, tutorials

What We're Building#

RAG API Architecture#

Documents (./documents/)
        ↓
[Text Chunking & Embedding]
        ↓
    ChromaDB Vector Store
        ↓
User Query → [Semantic Search] → Top 3 Relevant Chunks
        ↓
[GPT-4o-mini + Context] → Answer
        ↓
    FastAPI Response
        ↓
    LangSmith Tracking

Key Components#

FastAPI - Fast, modern REST API framework
LangChain - Document processing & LLM orchestration
ChromaDB - Vector database for semantic search
OpenAI Embeddings - Convert text to vectors
GPT-4o-mini - Generate answers from context
LangSmith - Observability and tracing

API Endpoints#

POST /ingest - Load and process documents from a server-side directory
POST /query - Ask questions about your documents
GET /health - Health check

Prerequisites#

What you need:

Python 3.10+
OpenAI API key
LangSmith API key (optional but recommended)
Basic understanding of APIs
15 minutes to follow along

No prior RAG or vector database experience required!

Step 1: Project Setup#

Install Dependencies#

I use uv for fast dependency management:

# Using uv (recommended - 10-100x faster than pip!)
uv venv --python 3.10
uv sync

# Or using traditional pip
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install fastapi uvicorn langchain langchain-openai langchain-community langchain-chroma langchain-text-splitters chromadb python-dotenv langsmith

Project Structure#

rag-api/
├── main.py           # FastAPI app + RAG logic
├── documents/        # Your .txt files go here
├── .env             # API keys
└── pyproject.toml   # Dependencies (if using uv)

Configure Environment#

Create .env:

OPENAI_API_KEY=your_openai_api_key_here
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=your_langsmith_key_here
LANGSMITH_PROJECT=rag_project

Security tip: Add .env to .gitignore and never commit API keys!

Step 2: Understanding RAG Components#

Document Chunking#

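Why chunk? Whole documents are often too long for an LLM's context window, and a single embedding for a long document blurs its topics together. We split documents into smaller pieces so that search can return only the passages relevant to a question.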

RecursiveCharacterTextSplitter:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Each chunk is at most 1000 characters
    chunk_overlap=200     # Adjacent chunks share up to 200 characters
)

How it works:

Tries to split on paragraphs first (\n\n)
Falls back to line breaks (\n)
Falls back to words (spaces), then individual characters if needed
200-character overlap means text cut at a chunk boundary still appears intact in the neighboring chunk
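To make the overlap idea concrete, here is a minimal sketch of a fixed-size sliding window. This is a simpler technique than the recursive splitter (it ignores paragraph and word boundaries), and the function name sliding_window_chunks is my own, not a LangChain API. It only shows how chunk_size and chunk_overlap interact.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into fixed-size chunks where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 2500 characters of dummy text
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(text)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} characters")
```

With the defaults, the window advances 800 characters at a time, so the last 200 characters of one chunk are repeated at the start of the next.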

Embeddings & Vector Search#

Embeddings convert text to numbers (vectors) that capture semantic meaning:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# "The cat sat on the mat" → [0.23, -0.45, 0.67, ...]
# "A feline rested on the rug" → [0.21, -0.43, 0.69, ...]
# ↑ Similar meaning = similar vectors

ChromaDB stores these vectors and enables similarity search:

from langchain_chroma import Chroma  # pip install langchain-chroma

vector_store = Chroma.from_documents(chunks, embeddings)

# Find documents similar to query
docs = vector_store.similarity_search("return policy", k=3)
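If "similar meaning = similar vectors" feels abstract, this sketch shows one common way to measure it: cosine similarity, followed by a brute-force top-k search. The 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and a real vector database uses an index and possibly a different distance metric rather than this linear scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Score every stored vector against the query and return the k best texts."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": hypothetical values, not output from a real model
toy_store = {
    "The cat sat on the mat": [0.23, -0.45, 0.67],
    "A feline rested on the rug": [0.21, -0.43, 0.69],
    "Quarterly revenue grew 12%": [-0.70, 0.55, 0.10],
}

query_vec = [0.22, -0.44, 0.68]  # pretend this is the embedded query
results = top_k(query_vec, toy_store, k=2)
print(results)
```

The two cat sentences point in nearly the same direction as the query, so they are returned; the revenue sentence points elsewhere and is left out.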

RAG Prompt Template#

We give the LLM context and ask it to answer:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer based on the context provided."),
    ("human", "Context: {context}\n\nQuestion: {question}")
])
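To see what ends up in front of the model, here is a plain-Python sketch of roughly what the template produces once the placeholders are filled in. The helper build_messages is hypothetical, and LangChain returns message objects rather than the tuples used here.

```python
SYSTEM_PROMPT = "You are a helpful assistant. Answer based on the context provided."


def build_messages(chunks, question):
    """Join retrieved chunks into one context string and fill the human message."""
    context = "\n".join(chunks)
    human = f"Context: {context}\n\nQuestion: {question}"
    return [("system", SYSTEM_PROMPT), ("human", human)]


messages = build_messages(
    ["Products can be returned within 30 days.", "Refunds take 5-7 business days."],
    "What's our return policy?",
)

for role, content in messages:
    print(f"[{role}] {content}")
```

The model never sees your whole document collection, only the few chunks that were retrieved and pasted into the human message.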

Step 3: Building the API#

File: main.py

import os
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from langchain.chat_models import init_chat_model
from langchain_chroma import Chroma  # pip install langchain-chroma
from langchain_community.document_loaders import TextLoader  # pip install langchain-community
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client as LangSmithClient
from langsmith import traceable
from pydantic import BaseModel

load_dotenv()

# Initialize FastAPI app
app = FastAPI(title="RAG Q&A API")

# Initialize LangSmith for observability
langsmith_client = LangSmithClient()

# Initialize the language model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Global vector store (will be populated during ingestion)
vector_store = None


# Document ingestion function
async def ingest_documents(directory: str):
    global vector_store
    documents = []

    # Load all .txt files from directory
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            loader = TextLoader(os.path.join(directory, filename), encoding="utf-8")
            documents.extend(loader.load())

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = text_splitter.split_documents(documents)

    # Create vector store with embeddings
    vector_store = Chroma.from_documents(chunks, embeddings)
    print(f"Ingested {len(chunks)} document chunks into vector store.")


# Pydantic model for API input
class QueryRequest(BaseModel):
    query: str


# Prompt template for RAG
prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a helpful assistant. Answer the user's question based on the provided context. "
        "If the context doesn't contain relevant information, say so and provide a general answer."
    ),
    ("human", "Context: {context}\n\nQuestion: {question}")
])


# Query processing with LangSmith tracing
@traceable(run_type="chain", project_name=os.getenv("LANGSMITH_PROJECT", "rag_project"))
async def process_query(query: str):
    if vector_store is None:
        # Return a client error instead of an unhandled 500
        raise HTTPException(
            status_code=400,
            detail="Vector store not initialized. Please ingest documents first."
        )

    try:
        # Semantic search for relevant chunks
        docs = vector_store.similarity_search(query, k=3)
        context = "\n".join([doc.page_content for doc in docs])

        # Format prompt with context
        prompt = prompt_template.format_messages(context=context, question=query)

        # Get LLM response
        response = await llm.ainvoke(prompt)
        return response.content
    except Exception as e:
        print(f"Error processing query: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing query: {str(e)}")


# FastAPI endpoint for querying
@app.post("/query")
async def query_endpoint(request: QueryRequest):
    response = await process_query(request.query)
    return {"answer": response}


# FastAPI endpoint for ingesting documents
@app.post("/ingest")
async def ingest_endpoint(directory: str = "./documents"):
    try:
        await ingest_documents(directory)
        return {"message": f"Successfully ingested documents from {directory}"}
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Error ingesting documents: {str(e)}"
        )


# Health check endpoint
@app.get("/health")
async def health_check():
    return {"status": "healthy"}


# Run the application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Code Breakdown#

1. Initialization (top of main.py):

Load environment variables
Set up FastAPI app
Initialize LangSmith for tracing
Initialize LLM (GPT-4o-mini)
Initialize OpenAI embeddings
Create global vector_store variable

2. Document Ingestion (ingest_documents):

Load all .txt files from specified directory
Split into chunks (1000 chars with 200 overlap)
Create ChromaDB vector store with embeddings
Store globally for query access

3. Query Processing (process_query):

Decorated with @traceable for LangSmith tracking
Performs similarity search (k=3 top results)
Combines retrieved chunks into context
Formats prompt with context + question
Calls LLM asynchronously
Returns answer

4. API Endpoints (bottom of main.py):

/query - User asks questions, gets answers
/ingest - Load documents from a directory into the vector store
/health - Simple health check

Step 4: Running Your RAG API#

Start the Server#

python main.py

Output:

INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:8000

Test with curl#

1. Ingest Documents

First, create a sample document:

mkdir documents
echo "Our return policy allows returns within 30 days of purchase with original receipt. Refunds are processed within 5-7 business days." > documents/policy.txt

Ingest it:

curl -X POST "http://localhost:8000/ingest"

Response:

{
  "message": "Successfully ingested documents from ./documents"
}

2. Query Your Documents

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the return policy?"}'

Response:

{
  "answer": "According to the provided context, our return policy allows returns within 30 days of purchase with the original receipt. Refunds are processed within 5-7 business days after the return is received."
}
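If you'd rather call the API from Python than from curl, here is a small client sketch that uses only the standard library. The helper names build_query_request and ask are my own, and the base URL assumes the server from this tutorial is running locally on port 8000.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8000"):
    """Build the same POST request that the curl command above sends."""
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000", timeout=30.0):
    """Send the question to the running API and return the answer text."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


# Building the request does not touch the network
request = build_query_request("What is the return policy?")
print(request.get_method(), request.full_url)
```

With the server running, ask("What is the return policy?") should return the same answer text as the curl call.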

3. Ask Questions Outside Context

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'

Response:

{
  "answer": "The provided context does not contain information about the capital of France. However, the capital of France is Paris."
}

Notice how it acknowledges when context doesn't help!

Step 5: LangSmith Observability#

What is LangSmith?#

LangSmith is LangChain's observability platform. It tracks:

Every query your RAG system processes
Retrieved documents for each query
LLM prompts and responses
Latency and token usage
Errors and exceptions

Viewing Traces#

Go to smith.langchain.com
Find your project (e.g., "rag_project")
Click on any trace to see:
- User query
- Retrieved chunks
- Full prompt sent to LLM
- LLM response
- Total latency

Example trace:

Process Query (2.3s)
├─ Similarity Search (0.1s)
│  └─ Retrieved 3 documents
├─ Format Prompt (0.01s)
└─ LLM Call (2.2s)
   ├─ Tokens: 450 input, 85 output
   └─ Cost: $0.002

Why This Matters#

Production debugging: When users report wrong answers, check:

Was the right context retrieved?
Did the LLM hallucinate?
Was the prompt formatted correctly?

Example debug scenario:

User: "Why did my question get a wrong answer?"
You: Check LangSmith trace
Finding: Similarity search returned unrelated chunks
Fix: Adjust chunk_size or improve embeddings

Extending Your RAG System#

1. Support More Document Types#

import os

from langchain_community.document_loaders import (
    CSVLoader,
    PyPDFLoader,             # needs the pypdf package
    TextLoader,
    UnstructuredHTMLLoader,  # needs the unstructured package
)

def ingest_documents(directory: str):
    documents = []
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        # Pick a loader based on the file extension
        if filename.endswith(".pdf"):
            loader = PyPDFLoader(path)
        elif filename.endswith(".csv"):
            loader = CSVLoader(path)
        elif filename.endswith(".html"):
            loader = UnstructuredHTMLLoader(path)
        elif filename.endswith(".txt"):
            loader = TextLoader(path)
        else:
            continue  # Skip unsupported file types
        documents.extend(loader.load())

    # Rest of ingestion logic...

2. Add Metadata Filtering#

Store metadata with chunks:

from langchain_core.documents import Document

# Add metadata during ingestion
# (loaders already store the file path under "source", so keep chunk.metadata)
docs_with_metadata = [
    Document(
        page_content=chunk.page_content,
        metadata={
            **chunk.metadata,
            "chunk_id": i,
            "category": "policy"  # Custom metadata
        }
    )
    for i, chunk in enumerate(chunks)
]

vector_store = Chroma.from_documents(docs_with_metadata, embeddings)

# Filter during search
docs = vector_store.similarity_search(
    query,
    k=3,
    filter={"category": "policy"}  # Only search policy documents
)

3. Hybrid Search (Keywords + Semantic)#

Combine BM25 (keyword search) with vector search:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

bm25_retriever = BM25Retriever.from_documents(chunks)
vector_retriever = vector_store.as_retriever()

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)

# invoke() replaces the deprecated get_relevant_documents()
docs = ensemble_retriever.invoke(query)
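How do two ranked lists become one? A common approach, and as far as I know the one EnsembleRetriever is based on, is weighted Reciprocal Rank Fusion: each document earns weight / (rank + c) from every list it appears in. The sketch below illustrates that idea; it is not the library's code, and c=60 is the conventional constant, assumed here.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (smaller rank) contribute a larger score
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the two retrievers
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.4, 0.6])
print(fused)
```

Documents that both retrievers agree on (doc_a and doc_b here) rise to the top, while the weights decide whose opinion counts more when they disagree.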

4. Re-ranking for Better Results#

Use a re-ranker to improve retrieved document quality:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # pip install langchain-cohere; needs COHERE_API_KEY

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever()
)

# invoke() replaces the deprecated get_relevant_documents()
docs = compression_retriever.invoke(query)

5. Streaming Responses#

Stream answers for better UX:

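from fastapi.responses import StreamingResponse

@app.post("/query-stream")
async def query_stream(request: QueryRequest):
    if vector_store is None:
        raise HTTPException(status_code=400, detail="Please ingest documents first.")

    # Retrieve context and build the prompt before streaming starts
    docs = vector_store.similarity_search(request.query, k=3)
    context = "\n".join(doc.page_content for doc in docs)
    prompt = prompt_template.format_messages(context=context, question=request.query)

    async def generate():
        async for chunk in llm.astream(prompt):
            yield chunk.content

    return StreamingResponse(generate(), media_type="text/plain")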

6. Persistent Vector Store#

Save ChromaDB to disk to avoid re-ingesting:

# Save (Chroma 0.4+ writes to persist_directory automatically, so no persist() call is needed)
vector_store = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db"
)

# Load
vector_store = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Production Considerations#

1. Error Handling#

Add retries and better error messages:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential())
async def process_query_with_retry(query: str):
    return await process_query(query)
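If you're curious what the decorator is doing for you, here is a stripped-down sketch of retry with exponential backoff in plain asyncio. The helper retry_async and the flaky_query stand-in are hypothetical, and tenacity offers much more (jitter, exception filters, logging).

```python
import asyncio


async def retry_async(func, *args, attempts=3, base_delay=0.01):
    """Call an async function, waiting base_delay, 2x, 4x... between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return await func(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


async def flaky_query(query):
    """Stand-in for process_query that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return f"answer to {query!r}"


result = asyncio.run(retry_async(flaky_query, "What is the return policy?"))
print(result, "after", calls["count"], "attempts")
```

In a real service you would probably retry only on transient errors such as timeouts or rate limits, not on every exception.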

2. Rate Limiting#

Prevent API abuse:

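from fastapi import Depends
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter

# fastapi-limiter keeps its counters in Redis: call
# await FastAPILimiter.init(redis_connection) once at startup.

@app.post("/query", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
async def query_endpoint(request: QueryRequest):
    # Max 10 requests per minute
    pass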
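To show the idea behind "10 requests per 60 seconds" without any infrastructure, here is an in-memory sliding-window sketch. The class SlidingWindowLimiter is hypothetical and only works within a single process; that limitation is why production setups usually keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `times` requests per client within the last `seconds`."""

    def __init__(self, times: int = 10, seconds: float = 60.0):
        self.times = times
        self.seconds = seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of recent requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.seconds:
            hits.popleft()
        if len(hits) >= self.times:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(times=3, seconds=60)
# Simulated request times in seconds; the fourth request exceeds the limit
decisions = [limiter.allow("client-a", now=t) for t in (0, 1, 2, 3, 61)]
print(decisions)
```

Passing now explicitly is just for the demonstration; in an endpoint you would call limiter.allow(client_id) and return HTTP 429 when it comes back False.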

3. Authentication#

Add API key auth:

import os

from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

security = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    # Compare the bearer token against the key configured in the environment
    if credentials.credentials != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/query", dependencies=[Depends(verify_token)])
async def query_endpoint(request: QueryRequest):
    pass
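One hardening step worth considering for the key check: compare secrets in constant time so response timing doesn't leak how many characters matched. Here is a sketch with the standard library's hmac.compare_digest; the helper name is_valid_api_key is my own.

```python
import hmac
from typing import Optional


def is_valid_api_key(presented: str, expected: Optional[str]) -> bool:
    """Constant-time comparison of the presented key against the configured one."""
    if not expected:
        # No key configured: reject everything rather than accept an empty key
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


print(is_valid_api_key("s3cret", "s3cret"))   # matching key
print(is_valid_api_key("guess", "s3cret"))    # wrong key
print(is_valid_api_key("anything", None))     # API_KEY not set
```

Inside verify_token, this would replace the != comparison, with os.getenv("API_KEY") passed as expected.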

4. Caching#

Cache frequent queries:

from functools import lru_cache

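@lru_cache(maxsize=100)
def cached_similarity_search(query: str):
    # Results go stale after re-ingestion: call
    # cached_similarity_search.cache_clear() whenever vector_store changes
    return vector_store.similarity_search(query, k=3)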

5. Async Document Ingestion#

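Don't block the API during ingestion. Starlette awaits async def background tasks on the event loop, so define ingest_documents as a plain def (which runs in a thread pool) for this to actually keep the API responsive: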

from fastapi import BackgroundTasks

@app.post("/ingest")
async def ingest_endpoint(
    background_tasks: BackgroundTasks,
    directory: str = "./documents"
):
    background_tasks.add_task(ingest_documents, directory)
    return {"message": "Ingestion started in background"}

Real-World Use Cases#

1. Customer Support Chatbot#

# Ingest: Product manuals, FAQs, troubleshooting guides
# Query: "How do I reset my device?"
# Answer: Based on Section 3.2 of the user manual...

2. Legal Document Search#

# Ingest: Case law, contracts, regulations
# Query: "What are the requirements for terminating an employee in California?"
# Answer: According to California Labor Code Section 2922...

3. Research Assistant#

# Ingest: Academic papers, research notes
# Query: "Summarize findings on RAG performance improvements"
# Answer: Based on 3 papers in your collection, key findings include...

4. Code Documentation Q&A#

# Ingest: README files, API docs, code comments
# Query: "How do I authenticate with the API?"
# Answer: According to the API documentation, authentication requires...

Common Issues and Debugging#

Issue 1: No Relevant Context Retrieved#

Symptom: LLM says "context doesn't contain information" even though it should

Debug:

# Check what was retrieved, along with each chunk's distance score
results = vector_store.similarity_search_with_score(query, k=3)
for doc, score in results:
    print(f"Chunk: {doc.page_content[:200]}")
    print(f"Distance (lower = more similar): {score}")

Fixes:

Increase k (retrieve more chunks)
Reduce chunk_size (smaller, more focused chunks)
Try different embedding model
Add hybrid search (keywords + semantic)

Issue 2: Slow Queries#

Symptom: API takes 5+ seconds to respond

Profile:

import time

start = time.time()
docs = vector_store.similarity_search(query, k=3)
print(f"Search: {time.time() - start}s")

start = time.time()
response = await llm.ainvoke(prompt)
print(f"LLM: {time.time() - start}s")

Fixes:

Use persistent vector store (don't rebuild on startup)
Send less context (lower k or smaller chunks) and stream the response; gpt-4o-mini is already a small, fast model
Cache frequent queries
Use async/await properly

Issue 3: Hallucinations#

Symptom: LLM makes up information not in context

Fix:

prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "ONLY answer based on the context provided. "
        "If the context doesn't contain the answer, say 'I don't have information about that in my documents.' "
        "Never make up information."
    ),
    ("human", "Context: {context}\n\nQuestion: {question}")
])
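A stricter prompt helps, but you can also avoid calling the LLM at all when retrieval finds nothing close enough. The sketch below assumes you have (text, distance) pairs where lower distance means more similar. The max_distance threshold of 0.5 is a made-up value that you would need to tune for your embedding model and distance metric, and generate stands in for the LLM call.

```python
FALLBACK = "I don't have information about that in my documents."


def select_context(scored_chunks, max_distance=0.5):
    """Keep only chunks whose distance is within the threshold (lower = more similar)."""
    return [text for text, distance in scored_chunks if distance <= max_distance]


def answer_or_fallback(scored_chunks, generate, max_distance=0.5):
    """Call the generator only when at least one chunk is close enough to the query."""
    context = select_context(scored_chunks, max_distance)
    if not context:
        return FALLBACK
    return generate("\n".join(context))


def fake_generate(context):
    """Stand-in for the LLM call (hypothetical)."""
    return f"Answer based on: {context}"


relevant = [("Returns are accepted within 30 days.", 0.21), ("Office hours are 9-5.", 0.83)]
unrelated = [("Office hours are 9-5.", 0.83)]

print(answer_or_fallback(relevant, fake_generate))
print(answer_or_fallback(unrelated, fake_generate))
```

The first call answers from the one chunk that passed the threshold; the second returns the fallback message without spending any tokens.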

Next Steps#

Beginner Projects#

Add web interface: Build a simple chat UI with Streamlit or Gradio
Multi-document support: Upload documents via API instead of file system
Citation tracking: Return which document chunks were used for the answer

Advanced Projects#

Multi-modal RAG: Add support for images and tables in documents
Conversational RAG: Maintain chat history and context across queries
Agentic RAG: Let AI decide when to search vs when to use general knowledge
Evaluation pipeline: Automatically test answer quality with metrics

Conclusion#

You've built a production-ready RAG Q&A API! Here's what you learned:

✅ Understand RAG architecture and why it's powerful
✅ Implement document ingestion with chunking and embeddings
✅ Build semantic search with ChromaDB vector database
✅ Create FastAPI endpoints for querying and ingestion
✅ Integrate LangSmith for observability and debugging
✅ Extend with advanced features like filtering and streaming

The power of RAG: You can now build AI systems that answer questions about ANY documents, from customer support to legal research!

Resources#

Source Code: GitHub - RAG Q&A API
Full AI CheatSheet Collection: AI_CheatSheet Repository
LangChain Docs: Official Documentation
LangSmith: Observability Platform

Try it yourself: Clone the repository and experiment!

git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
cd AI_CheatSheet/langchain_langsmith_RAG
# Follow setup instructions in README

Questions? The code is open source - issues and PRs welcome!

Happy building! 🚀📚