Building a Production RAG Q&A API with LangChain and FastAPI: Complete Guide#
Ever wanted to build an AI that can answer questions about your own documents? That's exactly what Retrieval-Augmented Generation (RAG) does, and it's surprisingly straightforward with the right tools.
In this tutorial, I'll show you how to build a production-ready RAG Q&A API that:
- 📄 Ingests and chunks your documents automatically
- 🔍 Uses semantic search to find relevant context
- 🤖 Generates accurate answers with GPT-4o-mini
- 📊 Tracks everything with LangSmith observability
- ⚡ Exposes a fast REST API with FastAPI
By the end, you'll have a fully functional RAG system you can deploy and scale.
What is RAG and Why Do You Need It?#
The Problem with Pure LLMs#
Regular LLMs like GPT-4 are incredibly smart, but they have limitations:
- ❌ Knowledge cutoff (doesn't know recent information)
- ❌ No access to your private documents
- ❌ Can hallucinate when uncertain
- ❌ Can't cite sources from your data
Example: Ask GPT-4 about your company's internal policies, and it will guess or admit it doesn't know.
The RAG Solution#
RAG combines the best of both worlds:
User Question: "What's our return policy?"
↓
1. Search your documents for relevant sections
↓
2. Retrieve: "Products can be returned within 30 days..."
↓
3. Pass context + question to LLM
↓
4. LLM answers based on YOUR documents
↓
"According to your policy, products can be returned within 30 days of purchase with original receipt."
Benefits:
- ✅ Answers based on YOUR data
- ✅ Citable sources
- ✅ Reduced hallucinations
- ✅ Up-to-date information
- ✅ Works with private/proprietary knowledge
Real-World Use Cases#
| Use Case | What RAG Does |
|---|---|
| Customer Support | Answer questions from product manuals, FAQs, policies |
| Legal Research | Search case law, contracts, regulations |
| Medical Q&A | Query research papers, clinical guidelines |
| Internal Knowledge Base | Company wikis, documentation, SOPs |
| Technical Documentation | Code repositories, API docs, tutorials |
What We're Building#
RAG API Architecture#
Documents (./documents/)
↓
[Text Chunking & Embedding]
↓
ChromaDB Vector Store
↓
User Query → [Semantic Search] → Top 3 Relevant Chunks
↓
[GPT-4o-mini + Context] → Answer
↓
FastAPI Response
↓
LangSmith Tracking
Key Components#
- FastAPI - Fast, modern REST API framework
- LangChain - Document processing & LLM orchestration
- ChromaDB - Vector database for semantic search
- OpenAI Embeddings - Convert text to vectors
- GPT-4o-mini - Generate answers from context
- LangSmith - Observability and tracing
API Endpoints#
POST /ingest- Upload and process documentsPOST /query- Ask questions about your documentsGET /health- Health check
Prerequisites#
What you need:
- Python 3.10+
- OpenAI API key
- LangSmith API key (optional but recommended)
- Basic understanding of APIs
- 15 minutes to follow along
No prior RAG or vector database experience required!
Step 1: Project Setup#
Install Dependencies#
I use uv for fast dependency management:
# Using uv (recommended - 10-100x faster than pip!) uv venv --python 3.10 uv sync # Or using traditional pip python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate pip install fastapi uvicorn langchain langchain-openai chromadb python-dotenv langsmith
Project Structure#
rag-api/
├── main.py # FastAPI app + RAG logic
├── documents/ # Your .txt files go here
├── .env # API keys
└── pyproject.toml # Dependencies (if using uv)
Configure Environment#
Create .env:
OPENAI_API_KEY=your_openai_api_key_here LANGSMITH_API_KEY=your_langsmith_key_here LANGSMITH_PROJECT=rag_project
Security tip: Add .env to .gitignore and never commit API keys!
Step 2: Understanding RAG Components#
Document Chunking#
Why chunk? Documents are too long for LLM context windows. We split them into smaller pieces.
RecursiveCharacterTextSplitter:
from langchain.text_splitter import RecursiveCharacterTextSplitter text_splitter = RecursiveCharacterTextSplitter( chunk_size=1000, # Each chunk ~1000 characters chunk_overlap=200 # Overlap to preserve context )
How it works:
- Tries to split on paragraphs first (
\n\n) - Falls back to sentences (
.) - Falls back to words if needed
- 200-character overlap prevents cutting sentences in half
Embeddings & Vector Search#
Embeddings convert text to numbers (vectors) that capture semantic meaning:
from langchain.embeddings import OpenAIEmbeddings embeddings = OpenAIEmbeddings() # "The cat sat on the mat" → [0.23, -0.45, 0.67, ...] # "A feline rested on the rug" → [0.21, -0.43, 0.69, ...] # ↑ Similar meaning = similar vectors
ChromaDB stores these vectors and enables similarity search:
from langchain.vectorstores import Chroma vector_store = Chroma.from_documents(chunks, embeddings) # Find documents similar to query docs = vector_store.similarity_search("return policy", k=3)
RAG Prompt Template#
We give the LLM context and ask it to answer:
from langchain_core.prompts import ChatPromptTemplate prompt = ChatPromptTemplate.from_messages([ ("system", "You are a helpful assistant. Answer based on the context provided."), ("human", "Context: {context}\n\nQuestion: {question}") ])
Step 3: Building the API#
File: main.py
import os from dotenv import load_dotenv from fastapi import FastAPI, HTTPException from langchain.chat_models import init_chat_model from langchain.document_loaders import TextLoader from langchain.embeddings import OpenAIEmbeddings from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain.vectorstores import Chroma from langchain_core.prompts import ChatPromptTemplate from langsmith import Client as LangSmithClient from langsmith import traceable from pydantic import BaseModel load_dotenv() # Initialize FastAPI app app = FastAPI(title="RAG Q&A API") # Initialize LangSmith for observability langsmith_client = LangSmithClient() # Initialize the language model llm = init_chat_model("gpt-4o-mini", model_provider="openai") # Initialize embeddings embeddings = OpenAIEmbeddings() # Global vector store (will be populated during ingestion) vector_store = None # Document ingestion function async def ingest_documents(directory: str): global vector_store documents = [] # Load all .txt files from directory for filename in os.listdir(directory): if filename.endswith(".txt"): loader = TextLoader(os.path.join(directory, filename)) documents.extend(loader.load()) # Split documents into chunks text_splitter = RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=200 ) chunks = text_splitter.split_documents(documents) # Create vector store with embeddings vector_store = Chroma.from_documents(chunks, embeddings) print(f"Ingested {len(chunks)} document chunks into vector store.") # Pydantic model for API input class QueryRequest(BaseModel): query: str # Prompt template for RAG prompt_template = ChatPromptTemplate.from_messages([ ( "system", "You are a helpful assistant. Answer the user's question based on the provided context. " "If the context doesn't contain relevant information, say so and provide a general answer." ), ("human", "Context: {context}\n\nQuestion: {question}") ]) # Query processing with LangSmith tracing @traceable(run_type="chain", project_name=os.getenv("LANGSMITH_PROJECT", "rag_project")) async def process_query(query: str): if vector_store is None: raise ValueError("Vector store not initialized. Please ingest documents first.") try: # Semantic search for relevant chunks docs = vector_store.similarity_search(query, k=3) context = "\n".join([doc.page_content for doc in docs]) # Format prompt with context prompt = prompt_template.format_messages(context=context, question=query) # Get LLM response response = await llm.ainvoke(prompt) return response.content except Exception as e: print(f"Error processing query: {str(e)}") raise HTTPException(status_code=500, detail=f"Error processing query: {str(e)}") # FastAPI endpoint for querying @app.post("/query") async def query_endpoint(request: QueryRequest): response = await process_query(request.query) return {"answer": response} # FastAPI endpoint for ingesting documents @app.post("/ingest") async def ingest_endpoint(directory: str = "./documents"): try: await ingest_documents(directory) return {"message": f"Successfully ingested documents from {directory}"} except Exception as e: raise HTTPException( status_code=500, detail=f"Error ingesting documents: {str(e)}" ) # Health check endpoint @app.get("/health") async def health_check(): return {"status": "healthy"} # Run the application if __name__ == "__main__": import uvicorn uvicorn.run(app, host="0.0.0.0", port=8000)
Code Breakdown#
1. Initialization (Lines 1-33):
- Load environment variables
- Set up FastAPI app
- Initialize LangSmith for tracing
- Initialize LLM (GPT-4o-mini)
- Initialize OpenAI embeddings
- Create global vector_store variable
2. Document Ingestion (Lines 36-51):
- Load all
.txtfiles from specified directory - Split into chunks (1000 chars with 200 overlap)
- Create ChromaDB vector store with embeddings
- Store globally for query access
3. Query Processing (Lines 71-86):
- Decorated with
@traceablefor LangSmith tracking - Performs similarity search (k=3 top results)
- Combines retrieved chunks into context
- Formats prompt with context + question
- Calls LLM asynchronously
- Returns answer
4. API Endpoints (Lines 89-111):
/query- User asks questions, gets answers/ingest- Upload documents to vector store/health- Simple health check
Step 4: Running Your RAG API#
Start the Server#
python main.py
Output:
INFO: Started server process [12345]
INFO: Uvicorn running on http://0.0.0.0:8000
Test with curl#
1. Ingest Documents
First, create a sample document:
mkdir documents echo "Our return policy allows returns within 30 days of purchase with original receipt. Refunds are processed within 5-7 business days." > documents/policy.txt
Ingest it:
curl -X POST "http://localhost:8000/ingest"
Response:
{ "message": "Successfully ingested documents from ./documents" }
2. Query Your Documents
curl -X POST http://localhost:8000/query \ -H "Content-Type: application/json" \ -d '{"query": "What is the return policy?"}'
Response:
{ "answer": "According to the provided context, our return policy allows returns within 30 days of purchase with the original receipt. Refunds are processed within 5-7 business days after the return is received." }
3. Ask Questions Outside Context
curl -X POST http://localhost:8000/query \ -H "Content-Type: application/json" \ -d '{"query": "What is the capital of France?"}'
Response:
{ "answer": "The provided context does not contain information about the capital of France. However, the capital of France is Paris." }
Notice how it acknowledges when context doesn't help!
Step 5: LangSmith Observability#
What is LangSmith?#
LangSmith is LangChain's observability platform. It tracks:
- Every query your RAG system processes
- Retrieved documents for each query
- LLM prompts and responses
- Latency and token usage
- Errors and exceptions
Viewing Traces#
- Go to smith.langchain.com
- Find your project (e.g., "rag_project")
- Click on any trace to see:
- User query
- Retrieved chunks
- Full prompt sent to LLM
- LLM response
- Total latency
Example trace:
Process Query (2.3s)
├─ Similarity Search (0.1s)
│ └─ Retrieved 3 documents
├─ Format Prompt (0.01s)
└─ LLM Call (2.2s)
├─ Tokens: 450 input, 85 output
└─ Cost: $0.002
Why This Matters#
Production debugging: When users report wrong answers, check:
- Was the right context retrieved?
- Did the LLM hallucinate?
- Was the prompt formatted correctly?
Example debug scenario:
- User: "Why did my question get a wrong answer?"
- You: Check LangSmith trace
- Finding: Similarity search returned unrelated chunks
- Fix: Adjust chunk_size or improve embeddings
Extending Your RAG System#
1. Support More Document Types#
from langchain.document_loaders import PDFLoader, CSVLoader, UnstructuredHTMLLoader def ingest_documents(directory: str): documents = [] for filename in os.listdir(directory): if filename.endswith(".pdf"): loader = PDFLoader(os.path.join(directory, filename)) elif filename.endswith(".csv"): loader = CSVLoader(os.path.join(directory, filename)) elif filename.endswith(".html"): loader = UnstructuredHTMLLoader(os.path.join(directory, filename)) elif filename.endswith(".txt"): loader = TextLoader(os.path.join(directory, filename)) else: continue documents.extend(loader.load()) # Rest of ingestion logic...
2. Add Metadata Filtering#
Store metadata with chunks:
from langchain.schema import Document # Add metadata during ingestion docs_with_metadata = [ Document( page_content=chunk.page_content, metadata={ "source": filename, "chunk_id": i, "category": "policy" # Custom metadata } ) for i, chunk in enumerate(chunks) ] vector_store = Chroma.from_documents(docs_with_metadata, embeddings) # Filter during search docs = vector_store.similarity_search( query, k=3, filter={"category": "policy"} # Only search policy documents )
3. Hybrid Search (Keywords + Semantic)#
Combine BM25 (keyword search) with vector search:
from langchain.retrievers import EnsembleRetriever from langchain.retrievers import BM25Retriever bm25_retriever = BM25Retriever.from_documents(chunks) vector_retriever = vector_store.as_retriever() ensemble_retriever = EnsembleRetriever( retrievers=[bm25_retriever, vector_retriever], weights=[0.4, 0.6] # 40% keyword, 60% semantic ) docs = ensemble_retriever.get_relevant_documents(query)
4. Re-ranking for Better Results#
Use a re-ranker to improve retrieved document quality:
from langchain.retrievers import ContextualCompressionRetriever from langchain.retrievers.document_compressors import CohereRerank compressor = CohereRerank() compression_retriever = ContextualCompressionRetriever( base_compressor=compressor, base_retriever=vector_store.as_retriever() ) docs = compression_retriever.get_relevant_documents(query)
5. Streaming Responses#
Stream answers for better UX:
from fastapi.responses import StreamingResponse @app.post("/query-stream") async def query_stream(request: QueryRequest): async def generate(): async for chunk in llm.astream(prompt): yield chunk.content return StreamingResponse(generate(), media_type="text/plain")
6. Persistent Vector Store#
Save ChromaDB to disk to avoid re-ingesting:
# Save vector_store = Chroma.from_documents( chunks, embeddings, persist_directory="./chroma_db" ) vector_store.persist() # Load vector_store = Chroma( persist_directory="./chroma_db", embedding_function=embeddings )
Production Considerations#
1. Error Handling#
Add retries and better error messages:
from tenacity import retry, stop_after_attempt, wait_exponential @retry(stop=stop_after_attempt(3), wait=wait_exponential()) async def process_query_with_retry(query: str): return await process_query(query)
2. Rate Limiting#
Prevent API abuse:
from fastapi_limiter import FastAPILimiter from fastapi_limiter.depends import RateLimiter @app.post("/query", dependencies=[Depends(RateLimiter(times=10, seconds=60))]) async def query_endpoint(request: QueryRequest): # Max 10 requests per minute pass
3. Authentication#
Add API key auth:
from fastapi import Security, HTTPException from fastapi.security import HTTPBearer security = HTTPBearer() def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)): if credentials.credentials != os.getenv("API_KEY"): raise HTTPException(status_code=401, detail="Invalid API key") @app.post("/query", dependencies=[Depends(verify_token)]) async def query_endpoint(request: QueryRequest): pass
4. Caching#
Cache frequent queries:
from functools import lru_cache @lru_cache(maxsize=100) def cached_similarity_search(query: str): return vector_store.similarity_search(query, k=3)
5. Async Document Ingestion#
Don't block API during ingestion:
from fastapi import BackgroundTasks @app.post("/ingest") async def ingest_endpoint( background_tasks: BackgroundTasks, directory: str = "./documents" ): background_tasks.add_task(ingest_documents, directory) return {"message": "Ingestion started in background"}
Real-World Use Cases#
1. Customer Support Chatbot#
# Ingest: Product manuals, FAQs, troubleshooting guides # Query: "How do I reset my device?" # Answer: Based on Section 3.2 of the user manual...
2. Legal Document Search#
# Ingest: Case law, contracts, regulations # Query: "What are the requirements for terminating an employee in California?" # Answer: According to California Labor Code Section 2922...
3. Research Assistant#
# Ingest: Academic papers, research notes # Query: "Summarize findings on RAG performance improvements" # Answer: Based on 3 papers in your collection, key findings include...
4. Code Documentation Q&A#
# Ingest: README files, API docs, code comments # Query: "How do I authenticate with the API?" # Answer: According to the API documentation, authentication requires...
Common Issues and Debugging#
Issue 1: No Relevant Context Retrieved#
Symptom: LLM says "context doesn't contain information" even though it should
Debug:
# Check what was retrieved docs = vector_store.similarity_search(query, k=3) for doc in docs: print(f"Chunk: {doc.page_content[:200]}") print(f"Similarity score: {doc.metadata.get('score', 'N/A')}")
Fixes:
- Increase
k(retrieve more chunks) - Reduce
chunk_size(smaller, more focused chunks) - Try different embedding model
- Add hybrid search (keywords + semantic)
Issue 2: Slow Queries#
Symptom: API takes 5+ seconds to respond
Profile:
import time start = time.time() docs = vector_store.similarity_search(query, k=3) print(f"Search: {time.time() - start}s") start = time.time() response = await llm.ainvoke(prompt) print(f"LLM: {time.time() - start}s")
Fixes:
- Use persistent vector store (don't rebuild on startup)
- Switch to faster LLM (gpt-3.5-turbo)
- Cache frequent queries
- Use async/await properly
Issue 3: Hallucinations#
Symptom: LLM makes up information not in context
Fix:
prompt_template = ChatPromptTemplate.from_messages([ ( "system", "ONLY answer based on the context provided. " "If the context doesn't contain the answer, say 'I don't have information about that in my documents.'" "Never make up information." ), ("human", "Context: {context}\n\nQuestion: {question}") ])
Next Steps#
Beginner Projects#
- Add web interface: Build a simple chat UI with Streamlit or Gradio
- Multi-document support: Upload documents via API instead of file system
- Citation tracking: Return which document chunks were used for the answer
Advanced Projects#
- Multi-modal RAG: Add support for images and tables in documents
- Conversational RAG: Maintain chat history and context across queries
- Agentic RAG: Let AI decide when to search vs when to use general knowledge
- Evaluation pipeline: Automatically test answer quality with metrics
Conclusion#
You've built a production-ready RAG Q&A API! Here's what you learned:
✅ Understand RAG architecture and why it's powerful ✅ Implement document ingestion with chunking and embeddings ✅ Build semantic search with ChromaDB vector database ✅ Create FastAPI endpoints for querying and ingestion ✅ Integrate LangSmith for observability and debugging ✅ Extend with advanced features like filtering and streaming
The power of RAG: You can now build AI systems that answer questions about ANY documents, from customer support to legal research!
Resources#
- Source Code: GitHub - RAG Q&A API
- Full AI CheatSheet Collection: AI_CheatSheet Repository
- LangChain Docs: Official Documentation
- LangSmith: Observability Platform
Try it yourself: Clone the repository and experiment!
git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git cd AI_CheatSheet/langchain_langsmith_RAG # Follow setup instructions in README
Questions? The code is open source - issues and PRs welcome!
Happy building! 🚀📚