Fine-Tuning LLMs with RLHF: A Production-Ready Guide to DPO, PPO, and Beyond#
Ever wondered how ChatGPT became so helpful and harmless? The secret is RLHF (Reinforcement Learning from Human Feedback). While base models like GPT can generate text, they don't know what humans actually want. RLHF teaches them to be helpful, truthful, and safe.
In this comprehensive guide, I'll show you how to fine-tune any LLM using multiple RLHF methods:
- 🎯 DPO (Direct Preference Optimization) - Easiest and most stable
- 🎮 PPO (Proximal Policy Optimization) - Classic RL approach
- 🔄 Online DPO - Iterative self-improvement
- 👥 GRPO (Group Relative Policy Optimization) - Most efficient
Plus production features: distributed training, memory optimization (QLoRA), experiment tracking, and CLI/Python API.
Why RLHF? The Problem Base Models Have#
Base Models Are Unaligned#
Base LLMs (trained only on next-token prediction) have serious issues:
User: "How do I make a cake?"
Base Model (unaligned):
"How do I make a cake more delicious? How do I make a cake without eggs?
The history of cake making dates back to ancient Egypt..."
Problems:
- ❌ Doesn't directly answer the question
- ❌ Continues the pattern instead of being helpful
- ❌ No sense of completion
- ❌ Can generate harmful content
After RLHF Alignment#
User: "How do I make a cake?"
RLHF Model (aligned):
"Here's a simple vanilla cake recipe:
Ingredients:
- 2 cups flour
- 1.5 cups sugar
- 1/2 cup butter...
Instructions:
1. Preheat oven to 350°F...
2. Mix dry ingredients...
Would you like variations or tips?"
Benefits:
- ✅ Directly helpful
- ✅ Well-structured
- ✅ Knows when to stop
- ✅ Offers follow-up help
- ✅ Safer outputs
What is RLHF?#
RLHF = Training LLMs to produce the outputs humans prefer
The Three-Step Process#
Step 1: Supervised Fine-Tuning (SFT)
├─ Start with base model
├─ Fine-tune on high-quality demonstrations
└─ Creates decent but not optimal model
Step 2: Reward Model Training (optional)
├─ Collect human preference data
├─ Train model to predict human preferences
└─ Used to score model outputs
Step 3: Reinforcement Learning
├─ Generate responses
├─ Score with reward model or preferences
├─ Update policy to maximize rewards
└─ Aligned model! 🎉
Key Insight#
RLHF shifts from "predicting next token" to "maximizing human satisfaction"
Instead of asking "What comes next?", we ask "What would humans prefer?"
Introducing LLM-RL: Production-Ready RLHF Framework#
I built a comprehensive framework that makes RLHF accessible and production-ready.
Core Features#
| Feature | Description |
|---|---|
| 4 RL Methods | DPO, PPO, Online DPO, GRPO |
| Memory Efficient | LoRA, QLoRA (4-bit/8-bit quantization) |
| Distributed Training | DeepSpeed ZeRO-2/3, FSDP support |
| Auto-Download Datasets | 4+ popular preference datasets |
| Experiment Tracking | Weights & Biases, TensorBoard |
| CLI & Python API | Use via command-line or as library |
| Production-Ready | Error handling, logging, validation |
Why This Framework?#
Problem with existing tools:
- Research code that breaks in production
- Hard to switch between RL methods
- Poor memory management (OOM errors)
- No proper configuration management
- Missing experiment tracking
LLM-RL solves this:
- Clean, modular architecture
- Easy method comparison
- Automatic memory optimization
- YAML configs with validation
- Built-in observability
Understanding the 4 RL Methods#
1. DPO (Direct Preference Optimization)#
Best for: Getting started, stable training, most use cases
How it works:
Given: Prompt + Two responses (chosen vs rejected)
Example:
Prompt: "Explain quantum computing"
✅ Chosen: "Quantum computing uses quantum bits (qubits) that can be 0, 1, or both..."
❌ Rejected: "Quantum computing is complicated stuff about particles and waves and..."
DPO: Directly optimizes model to prefer chosen over rejected
└─ No reward model needed!
└─ Most stable convergence
└─ Easiest to implement
Advantages:
- ✅ No reward model training required
- ✅ Most stable and reliable
- ✅ Simpler than PPO
- ✅ Works well for most cases
When to use:
- First time doing RLHF
- Want stable, predictable training
- Have pairwise preference data
- Don't want complexity of PPO
2. PPO (Proximal Policy Optimization)#
Best for: Maximum control, research, custom reward functions
How it works:
Step 1: Train reward model
├─ Learn to score outputs like humans
└─ Predict: "How good is this response?"
Step 2: Generate rollouts
├─ Model generates many responses
└─ Reward model scores each
Step 3: Policy optimization
├─ Update policy to maximize reward
└─ Clip updates to prevent collapse
Advantages:
- ✅ Most flexible (custom rewards)
- ✅ Classic RL approach
- ✅ Works with any reward signal
- ✅ Active research area
Challenges:
- ⚠️ Requires reward model training
- ⚠️ More complex than DPO
- ⚠️ Can be unstable
- ⚠️ Needs careful hyperparameter tuning
When to use:
- Need custom reward functions
- Have computational resources
- Want maximum control
- Doing research
3. Online DPO#
Best for: Iterative improvement, self-improvement, data efficiency
How it works:
Iteration 1:
├─ Generate responses from current model
├─ Rank with reward model (or heuristic)
├─ Create preference pairs
└─ Train with DPO
Iteration 2:
├─ Model improved, generates better responses
├─ Create harder preference pairs
└─ Train again
...repeat until converged
Advantages:
- ✅ Self-improving loop
- ✅ More data-efficient
- ✅ Adapts to model's current level
- ✅ Can bootstrap from small dataset
When to use:
- Want iterative improvement
- Limited initial preference data
- Need active learning
- Want self-improving systems
4. GRPO (Group Relative Policy Optimization)#
Best for: Sample efficiency, batch preference learning
How it works:
Traditional DPO: Compare 2 responses pairwise
Prompt → [Response A vs Response B]
GRPO: Compare multiple responses at once
Prompt → [Response A, B, C, D]
└─ Rank all responses together
└─ Learn from group preferences
└─ More efficient than pairwise
Advantages:
- ✅ More sample-efficient
- ✅ Better use of compute
- ✅ Learns from multiple comparisons
- ✅ Reduces preference data needs
When to use:
- Want maximum efficiency
- Can generate multiple responses per prompt
- Have group preference data
- Limited labeled data
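The "group relative" idea can be sketched in a few lines. In the published GRPO formulation, each response in a group is scored and its advantage is its reward normalized against the group's mean and standard deviation, so no separate value model is needed. The function name and the use of population standard deviation below are illustrative assumptions, not the LLM-RL framework's actual implementation, which may rank groups differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    A response scored above the group mean gets a positive advantage,
    one scored below the mean gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, scored by some reward signal
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
print([round(a, 3) for a in advantages])
```

If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason group size and reward quality matter.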
Quick Comparison: Which Method to Use?#
| Method | Difficulty | Stability | Efficiency | Best For |
|---|---|---|---|---|
| DPO | ⭐ Easy | ⭐⭐⭐ High | ⭐⭐ Good | Most users, getting started |
| PPO | ⭐⭐⭐ Hard | ⭐ Moderate | ⭐⭐ Good | Research, custom rewards |
| Online DPO | ⭐⭐ Medium | ⭐⭐ Good | ⭐⭐⭐ High | Iterative improvement |
| GRPO | ⭐⭐ Medium | ⭐⭐ Good | ⭐⭐⭐ Highest | Maximum efficiency |
My recommendation: Start with DPO, then explore others once you understand the basics.
Prerequisites#
What you need:
- Python 3.10+
- GPU with 8GB+ VRAM (or use QLoRA for 4GB)
- Basic understanding of LLMs and fine-tuning
- OpenAI API key (for some datasets)
- 30 minutes to follow along
Optional but recommended:
- Weights & Biases account (for tracking)
- Multi-GPU setup (for larger models)
Step 1: Installation and Setup#
Using UV (Recommended - 10-100x faster!)#
    # Clone the repository
    git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
    cd AI_CheatSheet/llm-rl

    # Install with UV
    uv sync

    # Activate virtual environment
    source .venv/bin/activate   # Linux/Mac
    # or
    .venv\Scripts\activate      # Windows
Using Traditional pip#
    # Create virtual environment
    python -m venv venv
    source venv/bin/activate

    # Install
    pip install -e .
Verify Installation#
    # Check CLI is available
    llm-rl --help

    # Should show:
    # Usage: llm-rl [OPTIONS] COMMAND [ARGS]...
    # Commands:
    #   train             Train a model with RL
    #   train-reward      Train a reward model
    #   download-dataset  Download a dataset
    #   list-datasets     List available datasets
Check GPU Availability#
Before training, verify your GPU setup:
    # Check if CUDA is available
    python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

    # List all GPUs
    nvidia-smi

    # Check GPU details in Python
    python -c "
    import torch
    print(f'Number of GPUs: {torch.cuda.device_count()}')
    for i in range(torch.cuda.device_count()):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
        print(f'  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB')
    "
Output example:
CUDA available: True
Number of GPUs: 2
GPU 0: NVIDIA RTX 4090
Memory: 24.00 GB
GPU 1: NVIDIA RTX 3090
Memory: 24.00 GB
Select which GPU(s) to use:
    # Use only GPU 0
    export CUDA_VISIBLE_DEVICES=0

    # Use GPU 1 only
    export CUDA_VISIBLE_DEVICES=1

    # Use both GPUs
    export CUDA_VISIBLE_DEVICES=0,1
Step 2: Understanding Preference Datasets#
What is Preference Data?#
Format:
    {
      "prompt": "Explain machine learning to a 5-year-old",
      "chosen": "Machine learning is like teaching a robot by showing it lots of examples...",
      "rejected": "Machine learning is a subset of artificial intelligence that utilizes..."
    }
Key insight: We need pairs of responses (better vs worse) for the same prompt.
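Before training, it can help to sanity-check records against this format. The following is a minimal sketch; `validate_preference` is a hypothetical helper written for this example (it is not part of LLM-RL), and it assumes all three fields are plain strings as in the record shown.

```python
def validate_preference(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair only carries signal if the two responses actually differ
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Explain machine learning to a 5-year-old",
    "chosen": "Machine learning is like teaching a robot by showing it lots of examples...",
    "rejected": "Machine learning is a subset of artificial intelligence that utilizes...",
}
print(validate_preference(example))  # → []
```

Running a check like this over a whole dataset before training is cheap compared with discovering flat reward margins several hours into a run.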
Available Datasets#
Our framework auto-downloads these datasets:
| Dataset | Size | Domain | Best For |
|---|---|---|---|
| ultrafeedback-binarized | 64k | General | Getting started |
| anthropic-hh-rlhf | 160k | Helpfulness/Safety | Production models |
| stack-exchange-preferences | 10M | Technical Q&A | Domain-specific |
| summarize-from-feedback | 90k | Summarization | Specific tasks |
Download a Dataset#
    # List available datasets
    llm-rl list-datasets

    # Download dataset
    llm-rl download-dataset ultrafeedback-binarized --output ./data

    # Or download with Python
    python examples/download_datasets.py
Dataset will be cached in ~/.cache/huggingface/datasets/ for reuse.
Step 3: Your First RLHF Training with DPO#
Let's start with DPO - the easiest and most stable method.
Option 1: Using CLI (Quickest)#
    # Train with default config
    llm-rl train --config configs/training/dpo/base.yaml

    # Output:
    # Loading model: gpt2
    # Loading dataset: ultrafeedback-binarized
    # Starting DPO training...
    # Epoch 1/1: 100%|████████| 625/625 [15:23<00:00]
    # Training complete!
GPU Selection: If you have multiple GPUs, specify which one to use:
    # Use GPU 0
    export CUDA_VISIBLE_DEVICES=0
    llm-rl train --config configs/training/dpo/base.yaml

    # Use GPU 1
    export CUDA_VISIBLE_DEVICES=1
    llm-rl train --config configs/training/dpo/base.yaml

    # Use multiple GPUs (0 and 1)
    export CUDA_VISIBLE_DEVICES=0,1
    llm-rl train --config configs/training/dpo/base.yaml
Option 2: Using Python API (More Control)#
File: train_dpo.py
    from llm_rl.config import DPOConfig
    from llm_rl.trainers import train_dpo

    # Load config from YAML
    config = DPOConfig.from_yaml("configs/training/dpo/base.yaml")

    # Train
    metrics = train_dpo(config)
    print(f"Training complete! Final metrics: {metrics}")
Run it:
    # Default (uses all available GPUs)
    python train_dpo.py

    # Use specific GPU
    export CUDA_VISIBLE_DEVICES=0
    python train_dpo.py
Option 3: Programmatic Config (Maximum Flexibility)#
    from llm_rl.config import (
        DPOConfig,
        ModelConfig,
        PeftConfig,
        DatasetConfig,
        TrainingConfig,
    )
    from llm_rl.trainers import train_dpo  # needed for the train_dpo() call below

    config = DPOConfig(
        method="dpo",
        # Model settings
        model=ModelConfig(
            model_name_or_path="gpt2",
            use_peft=True,
            load_in_4bit=False,  # Enable for low memory
            torch_dtype="bfloat16",
        ),
        # LoRA settings (parameter-efficient fine-tuning)
        peft=PeftConfig(
            r=16,  # LoRA rank
            lora_alpha=32,  # LoRA alpha
            lora_dropout=0.05,
        ),
        # Dataset settings
        dataset=DatasetConfig(
            dataset_name="ultrafeedback-binarized",
            max_train_samples=10000,  # Use subset for fast iteration
            max_length=512,
        ),
        # Training settings
        training=TrainingConfig(
            output_dir="./outputs/my_dpo_model",
            num_train_epochs=1,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,  # Effective batch size = 16
            learning_rate=5e-5,
            bf16=True,  # Use bfloat16 for faster training
            gradient_checkpointing=True,  # Save memory
        ),
        # DPO-specific hyperparameters
        beta=0.1,  # Controls how closely the policy is held to the reference model
    )

    # Train
    metrics = train_dpo(config)
Understanding Key Parameters#
beta (DPO-specific):
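- Controls how strongly the policy is held to the reference model (it scales the implicit KL penalty in the DPO loss)
- Higher = stays closer to the reference model (more conservative updates)
- Lower = free to move further from the reference to fit the preference data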
- Default: 0.1 (works well for most cases)
- Range: 0.01 - 0.5
lora_alpha / r (LoRA):
- r: Rank of LoRA matrices (lower = fewer parameters)
- lora_alpha: Scaling factor
- Rule of thumb: lora_alpha = 2 * r
- Common: r=16, lora_alpha=32
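To make the effect of r concrete, here is a small arithmetic sketch. LoRA adds two low-rank matrices (r × d_in and d_out × r) next to each adapted weight, and the update is scaled by lora_alpha / r. The helper names are illustrative, and 768 is used because it is GPT-2 small's hidden size; which layers get adapted depends on the framework's defaults, which are not shown here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the weight."""
    return lora_alpha / r


d = 768  # GPT-2 small hidden size
full = d * d
added = lora_param_count(d, d, r=16)
print(f"full matrix: {full} params")             # → full matrix: 589824 params
print(f"LoRA r=16:   {added} params")            # → LoRA r=16:   24576 params
print(f"fraction:    {added / full:.2%}")        # → fraction:    4.17%
print(f"scaling:     {lora_scaling(32, 16)}")    # → scaling:     2.0
```

Doubling r doubles the trainable parameters for that layer; keeping lora_alpha = 2 * r keeps the scaling factor fixed at 2 while you vary the rank.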
gradient_accumulation_steps:
- Accumulates gradients over N steps before updating
- Effective batch size = batch_size * gradient_accumulation_steps * num_gpus
- Use to simulate larger batches with limited memory
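As a quick worked example of that formula, the sketch below computes the effective batch size for a few settings. The function name is made up for illustration; the arguments mirror the config fields used in this guide.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


print(effective_batch_size(4, 4))               # → 16
print(effective_batch_size(2, 16))              # → 32
print(effective_batch_size(2, 16, num_gpus=4))  # → 128
```

When moving from one GPU to several, dividing gradient_accumulation_steps by the GPU count keeps the effective batch size (and so the learning-rate tuning) roughly unchanged.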
Step 4: Understanding the Training Process#
What Happens During Training#
Iteration 1:
├─ Load batch of preference pairs
├─ Forward pass: Get log probabilities
│ ├─ Policy model: log P(chosen | prompt), log P(rejected | prompt)
│ └─ Frozen reference model: the same two log probabilities
├─ Compute DPO loss
│ └─ Loss = -log(sigmoid(beta * ((log_pi_chosen - log_ref_chosen) - (log_pi_rejected - log_ref_rejected))))
├─ Backward pass: Compute gradients
└─ Update model weights
Iteration 2:
...repeat
After N iterations:
└─ Model learns to prefer chosen over rejected responses
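For a single preference pair, the standard DPO objective from the DPO paper can be written out in plain Python. This is a sketch for intuition only: a real trainer works on batched tensors of summed token log-probabilities, and the function names here are illustrative rather than part of any library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss and reward margin for one pair of sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -log_sigmoid(margin), margin


# Policy identical to the reference: margin is 0 and loss is log(2)
loss, margin = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(f"{loss:.4f}")  # → 0.6931

# Policy has shifted probability toward the chosen response: loss drops
loss2, margin2 = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(loss2 < loss, round(margin2, 3))
```

This also explains why a freshly initialized run typically starts near a loss of 0.69: the policy starts as a copy of the reference, so every margin is zero.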
Monitoring Training#
TensorBoard (default):
    # In another terminal
    tensorboard --logdir outputs/dpo_gpt2/runs
Metrics to watch:
- train/loss: Should decrease over time
- train/rewards/chosen: Should increase
- train/rewards/rejected: Should decrease
- train/rewards/margins: Should increase (chosen - rejected)
Good training:
Epoch 1: loss=0.65, margin=0.15
Epoch 2: loss=0.52, margin=0.28
Epoch 3: loss=0.41, margin=0.42 ✅ margin increasing
Bad training:
Epoch 1: loss=0.65, margin=0.15
Epoch 2: loss=0.58, margin=0.12
Epoch 3: loss=0.61, margin=0.08 ❌ margin decreasing (model forgetting)
Step 5: Training PPO (Advanced)#
PPO requires a reward model first. Let's train one!
Step 5.1: Train Reward Model#
File: train_reward_model.py
    from llm_rl.config import RewardModelConfig
    from llm_rl.trainers import train_reward_model

    config = RewardModelConfig.from_yaml("configs/reward_model.yaml")

    # Train reward model
    metrics = train_reward_model(config)
    # Saves to: ./outputs/reward_model/
What it does:
- Takes preference data
- Trains model to predict which response is better
- Outputs scalar score: higher = better response
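Reward models of this kind are commonly trained with a pairwise (Bradley-Terry) loss: the model scores both responses, and the loss pushes the chosen score above the rejected one. The sketch below shows that loss for a single pair of scalar scores. It is an assumption that LLM-RL uses this exact objective; the function name is illustrative.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    diff = score_chosen - score_rejected
    # Stable form of -log(sigmoid(diff))
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(f"{pairwise_reward_loss(0.0, 0.0):.4f}")  # → 0.6931 (model cannot tell them apart)
print(pairwise_reward_loss(2.0, -1.0) < pairwise_reward_loss(0.0, 0.0))  # → True
print(pairwise_reward_loss(-1.0, 2.0) > pairwise_reward_loss(0.0, 0.0))  # → True
```

Only the difference between the two scores matters, so the absolute scale of a reward model's output is arbitrary; this is one reason PPO setups often normalize rewards.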
Step 5.2: Train with PPO#
File: train_ppo.py
    from llm_rl.config import ModelConfig, PPOConfig  # ModelConfig is used below
    from llm_rl.trainers import train_ppo

    config = PPOConfig(
        method="ppo",
        model=ModelConfig(
            model_name_or_path="gpt2",
            use_peft=True,
        ),
        # Point to trained reward model
        reward_model_path="./outputs/reward_model/checkpoint-final",
        # PPO-specific parameters
        kl_penalty="kl",  # KL divergence penalty type
        adap_kl_ctrl=True,  # Adaptive KL coefficient
        init_kl_coef=0.2,  # Initial KL coefficient
        target_kl=6.0,  # Target KL divergence
    )

    metrics = train_ppo(config)
PPO Training Loop:
For each batch:
├─ Generate responses from current policy
├─ Score responses with reward model
├─ Compute advantages
├─ Update policy with PPO loss
│ ├─ Policy loss
│ ├─ Value loss
│ └─ KL penalty (stay close to reference)
└─ Repeat
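Two pieces of this loop are easy to show in isolation: the clipped policy objective and the KL-penalized reward. The sketch below works on single scalar values for clarity; real implementations apply the same formulas per token over batched tensors. The function names and the `clip_eps=0.2` default are illustrative assumptions, not LLM-RL's API.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalized_reward(score, logp_policy, logp_ref, kl_coef=0.2):
    """Reward model score minus a penalty for drifting from the reference model."""
    return score - kl_coef * (logp_policy - logp_ref)


# Unchanged policy: ratio is 1, objective equals the advantage
print(ppo_clipped_objective(-1.0, -1.0, advantage=1.0))  # → 1.0

# Ratio of 1.5 with a positive advantage is capped at 1 + clip_eps
print(round(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0), 3))  # → 1.2

# Policy assigns higher log-probability than the reference: reward is reduced
print(round(kl_penalized_reward(1.0, -1.0, -1.5), 3))  # → 0.9
```

The policy loss is the negative of this objective averaged over the batch; the value loss in the loop above is a separate regression term.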
Step 6: Memory Optimization for Large Models#
Problem: Out of Memory (OOM)#
Training LLaMA-2 7B:
# Without optimization
Full precision (fp32): ~28GB VRAM for the weights alone, more once gradients and optimizer states are added ❌ Doesn't fit on consumer GPU
# With optimizations
QLoRA + Gradient Checkpointing: ~6GB VRAM ✅ Fits on RTX 3090!
Technique 1: QLoRA (4-bit Quantization)#
Config:
    model:
      model_name_or_path: "meta-llama/Llama-2-7b-hf"
      load_in_4bit: true  # ✅ Enable 4-bit quantization
      bnb_4bit_compute_dtype: "bfloat16"
      bnb_4bit_quant_type: "nf4"
      use_peft: true
    peft:
      r: 64  # Can use higher rank with QLoRA
      lora_alpha: 128
Memory savings:
- Full-precision (fp32) weights: 4 bytes/parameter
- 4-bit weights: 0.5 bytes/parameter
- 8x reduction in weight memory (quantization constants add a small overhead)
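The arithmetic behind these numbers is simple enough to check directly. This sketch counts weight memory only, using 1 GB = 1e9 bytes and a round 7e9 parameters for a 7B model; activations, gradients, optimizer states, and quantization overhead are deliberately left out, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params_7b = 7e9
print(f"fp32:  {weight_memory_gb(params_7b, 4):.1f} GB")    # → fp32:  28.0 GB
print(f"bf16:  {weight_memory_gb(params_7b, 2):.1f} GB")    # → bf16:  14.0 GB
print(f"4-bit: {weight_memory_gb(params_7b, 0.5):.1f} GB")  # → 4-bit: 3.5 GB
```

The gap between the 3.5 GB of 4-bit weights and the total VRAM a run actually uses is filled by LoRA parameters, their optimizer states, and activations.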
Technique 2: Gradient Checkpointing#
Config:
    training:
      gradient_checkpointing: true  # ✅ Enable
Trade-off:
- Memory: roughly 40% reduction (varies with model size and sequence length)
- Speed: roughly 20% slower (activations are recomputed in the backward pass)
- Worth it for large models!
Technique 3: Gradient Accumulation#
Config:
    training:
      per_device_train_batch_size: 2    # Small batch
      gradient_accumulation_steps: 16   # Accumulate
      # Effective batch size = 2 * 16 = 32
Technique 4: DeepSpeed ZeRO#
For multi-GPU training:
    training:
      deepspeed: "configs/deepspeed/zero3.json"
Run with multiple GPUs:
    # Use GPUs 0, 1, 2, 3
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    llm-rl train --config configs/training/dpo/base.yaml

    # Or specify in one line
    CUDA_VISIBLE_DEVICES=0,1,2,3 llm-rl train --config configs/training/dpo/base.yaml
ZeRO-3 partitions:
- Optimizer states
- Gradients
- Model parameters
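Result: Each GPU holds only its shard of these states, so per-GPU memory for them shrinks roughly in proportion to the number of GPUs and models too large for one GPU become trainable. Very large models (100B+) typically still need CPU/NVMe offload or many more GPUs.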
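A rough estimate of what each ZeRO stage saves can be computed from the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer states (fp32 master weights, momentum, variance). The sketch below applies that accounting; it ignores activations, buffers, and fragmentation, and the function name is made up for this example.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory for model states under a given ZeRO stage."""
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter (mixed-precision Adam)
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2 also partitions gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 also partitions parameters
    return num_params * (params + grads + optim) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(7e9, num_gpus=8, stage=stage)
    print(f"ZeRO-{stage}: {gb:.2f} GB per GPU")
# → ZeRO-0: 112.00 GB per GPU
# → ZeRO-1: 38.50 GB per GPU
# → ZeRO-2: 26.25 GB per GPU
# → ZeRO-3: 14.00 GB per GPU
```

These figures apply to full fine-tuning; with LoRA or QLoRA only the adapter weights have gradients and optimizer states, so the totals are far smaller.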
Step 7: Experiment Tracking with Weights & Biases#
Setup#
    logging:
      wandb_project: "llm-rlhf-experiments"
      wandb_entity: "my-team"
      wandb_run_name: "dpo-llama2-7b"
      report_to: ["wandb", "tensorboard"]
    # Login to W&B
    wandb login
What Gets Tracked#
- Training/validation loss curves
- Reward margins (chosen vs rejected)
- Learning rate schedule
- GPU utilization
- Model checkpoints
- Hyperparameters
- System metrics
Compare Experiments#
    import wandb

    # Load multiple runs; the path is "entity/project", matching the logging config above
    api = wandb.Api()
    runs = api.runs("my-team/llm-rlhf-experiments")

    # Compare DPO vs PPO (assumes at least one run name contains "dpo" and one contains "ppo")
    dpo_run = [r for r in runs if "dpo" in r.name][0]
    ppo_run = [r for r in runs if "ppo" in r.name][0]
    print(f"DPO final loss: {dpo_run.summary['train/loss']}")
    print(f"PPO final reward: {ppo_run.summary['train/reward']}")
Step 8: Evaluation and Testing#
Qualitative Evaluation#
Test your model:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load fine-tuned model
    model = AutoModelForCausalLM.from_pretrained("./outputs/dpo_gpt2/checkpoint-final")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Test prompt
    prompt = "Explain quantum computing to a 5-year-old:"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )
    print(tokenizer.decode(outputs[0]))
A/B Testing#
Compare base vs fine-tuned:
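    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    base_model = AutoModelForCausalLM.from_pretrained("gpt2")
    rlhf_model = AutoModelForCausalLM.from_pretrained("./outputs/dpo_gpt2/checkpoint-final")

    def generate(model, prompt, max_new_tokens=100):
        """Sample one completion for the prompt and return it as text."""
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    prompts = [
        "Write a poem about AI:",
        "Explain recursion:",
        "Write a story about a robot:",
    ]

    for prompt in prompts:
        print(f"\n=== Prompt: {prompt} ===")
        # Base model
        base_out = generate(base_model, prompt)
        print(f"Base: {base_out}")
        # RLHF model
        rlhf_out = generate(rlhf_model, prompt)
        print(f"RLHF: {rlhf_out}")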
Quantitative Metrics#
MT-Bench, AlpacaEval, etc.:
    # Coming soon in framework
    llm-rl evaluate \
      --checkpoint ./outputs/dpo_gpt2/checkpoint-final \
      --benchmark mt-bench \
      --output results.json
Advanced: Online DPO for Self-Improvement#
Iterative training loop:
    from llm_rl.config import DatasetConfig, ModelConfig, OnlineDPOConfig
    from llm_rl.trainers import train_online_dpo

    config = OnlineDPOConfig(
        method="online_dpo",
        # Number of iterations
        num_iterations=3,
        # Generate N responses per prompt
        num_generations_per_prompt=4,
        # Use reward model to rank
        reward_model_path="./outputs/reward_model",
        # Rest same as DPO (fill in the same arguments as in the DPO config)
        model=ModelConfig(...),
        dataset=DatasetConfig(...),
    )

    metrics = train_online_dpo(config)
What happens:
Iteration 1:
├─ Generate responses from current model
├─ Rank with reward model → create preferences
└─ Train with DPO on new preferences
Iteration 2:
├─ Model is better, generates better responses
├─ Create harder preference pairs
└─ Train again (self-improvement!)
Iteration 3:
└─ Repeat...
Benefits:
- Adapts to model's improving capability
- Generates progressively harder examples
- More data-efficient
- Self-improving loop
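The "rank, then create preferences" step can be sketched independently of any model. One simple strategy, assumed here for illustration, is to take the highest-scored generation as chosen and the lowest-scored as rejected, skipping prompts where the scores tie. `build_preference_pair` is a hypothetical helper; LLM-RL's own pairing strategy is not shown in this guide and may differ.

```python
def build_preference_pair(prompt, responses, scores):
    """Turn scored generations for one prompt into a DPO preference record.

    Returns None when all scores are equal, since such a pair carries no signal.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need at least two responses with one score each")
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    if best_score == worst_score:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


pair = build_preference_pair(
    "Explain recursion:",
    ["A function that calls itself.", "idk", "Recursion is recursion."],
    [0.9, 0.1, 0.4],
)
print(pair["chosen"])    # → A function that calls itself.
print(pair["rejected"])  # → idk
```

As the model improves across iterations, the gap between best and worst scores tends to narrow, which is what makes later pairs "harder".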
Advanced: GRPO for Maximum Efficiency#
Group preferences instead of pairwise:
    from llm_rl.config import DatasetConfig, GRPOConfig, ModelConfig
    from llm_rl.trainers import train_grpo

    config = GRPOConfig(
        method="grpo",
        # Generate multiple responses
        group_size=8,  # Compare 8 responses at once
        # Ranking strategy
        ranking_strategy="reward_model",  # or "random", "curriculum"
        # Fill in the same arguments as in the DPO config
        model=ModelConfig(...),
        dataset=DatasetConfig(...),
    )

    metrics = train_grpo(config)
Efficiency gain:
- Pairwise DPO: Need N×(N-1)/2 comparisons for N responses
- GRPO: Single group ranking
- Up to 4x more sample-efficient!
Production Best Practices#
1. Start Small, Scale Up#
    # Iteration 1: Fast prototyping
    config = DPOConfig(
        model=ModelConfig(model_name_or_path="gpt2"),  # Small model
        dataset=DatasetConfig(max_train_samples=1000),  # Small dataset
        training=TrainingConfig(num_train_epochs=1),
    )

    # Iteration 2: Full training
    config = DPOConfig(
        model=ModelConfig(model_name_or_path="meta-llama/Llama-2-7b-hf"),
        dataset=DatasetConfig(max_train_samples=None),  # Full dataset
        training=TrainingConfig(num_train_epochs=3),
    )
2. Checkpoint Management#
    training:
      save_steps: 500                # Save every 500 steps
      save_total_limit: 3            # Keep only 3 checkpoints
      load_best_model_at_end: true   # Load best checkpoint
3. Hyperparameter Tuning#
Key hyperparameters to tune:
| Parameter | Range | Impact |
|---|---|---|
| beta | 0.01-0.5 | How closely the policy is held to the reference model |
| learning_rate | 1e-6 to 1e-4 | Training speed/stability |
| lora_r | 8-128 | Model capacity |
| max_length | 256-2048 | Context size |
Use W&B Sweeps:
    # sweep.yaml
    program: train_dpo.py
    method: bayes
    metric:
      name: eval/loss
      goal: minimize
    parameters:
      beta:
        min: 0.05
        max: 0.3
      learning_rate:
        min: 1e-6
        max: 1e-4
    wandb sweep sweep.yaml
    wandb agent <sweep-id>
4. Error Handling#
Common issues:
| Error | Solution |
|---|---|
| OOM | Enable 4-bit quantization, reduce batch size, enable gradient checkpointing |
| NaN loss | Lower learning rate, add gradient clipping, check data quality |
| Slow convergence | Increase learning rate, reduce beta, check data is balanced |
| Model collapse | Lower learning rate, increase KL penalty (PPO), increase beta (DPO) |
Real-World Applications#
1. Customer Support Bots#
Problem: Generic responses, doesn't follow company tone
Solution: Fine-tune with RLHF on company's historical chats
    # Prepare preference data
    preferences = [
        {
            "prompt": "How do I return a product?",
            "chosen": "I'd be happy to help! Our return policy...",  # Friendly
            "rejected": "Check the return policy on our website.",  # Cold
        },
    ]

    # Train with DPO
    config = DPOConfig(...)
    train_dpo(config)
Result: Bot matches company tone and style
2. Code Assistants#
Problem: Generates code that doesn't follow best practices
Solution: RLHF on code preferences (clean vs messy)
    preferences = [
        {
            "prompt": "Write a function to calculate fibonacci",
            "chosen": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a+b\n    return a",
            # Inefficient recursion
            "rejected": "def f(x):\n    if x==0: return 0\n    if x==1: return 1\n    return f(x-1)+f(x-2)",
        }
    ]
3. Content Moderation#
Problem: Model generates harmful/biased content
Solution: RLHF with safety preferences
    preferences = [
        {
            "prompt": "Write about...",
            "chosen": "[Safe, balanced response]",
            "rejected": "[Biased or harmful response]",
        }
    ]
Plus: Use PPO with custom safety reward model
4. Educational Tutors#
Problem: Explanations too complex or too simple
Solution: RLHF on explanation quality
    preferences = [
        {
            "prompt": "Explain photosynthesis",
            "chosen": "Photosynthesis is how plants make food using sunlight...",  # Right level
            "rejected": "Photosynthesis (from Greek φῶς, phōs...) is a process...",  # Too complex
        }
    ]
Troubleshooting Guide#
Issue 1: Training is Unstable#
Symptoms: Loss spikes, NaN values, model collapse
Solutions:
    training:
      learning_rate: 1e-6   # Lower LR
      max_grad_norm: 0.5    # Stronger gradient clipping
      warmup_ratio: 0.1     # More warmup
    # DPO
    beta: 0.3               # Higher beta (keeps the policy closer to the reference model)
    # PPO
    init_kl_coef: 0.5       # Higher KL penalty
Issue 2: Model Not Learning#
Symptoms: Loss doesn't decrease, margins stay flat
Debug:
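    # Check data quality
    from datasets import load_dataset

    # Hugging Face Hub ID for UltraFeedback Binarized; preference pairs are in the "train_prefs" split
    ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
    print(ds[0])

    # Verify chosen != rejected
    assert ds[0]["chosen"] != ds[0]["rejected"]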
Solutions:
- Check data is properly formatted
- Increase learning rate
- Train longer
- Reduce regularization
Issue 3: OOM on Multi-GPU#
Symptoms: Works on 1 GPU, OOM on multiple
Solution:
    training:
      ddp_find_unused_parameters: false  # Disable this
      gradient_checkpointing: true
      per_device_train_batch_size: 1     # Reduce per-GPU batch
Issue 4: GPU Not Detected or Using Wrong GPU#
Symptoms: Training runs on CPU, or uses wrong GPU
Check available GPUs:
    # List all GPUs
    nvidia-smi

    # Check which GPUs PyTorch sees
    python -c "import torch; print(torch.cuda.device_count()); print(torch.cuda.get_device_name(0))"
Solutions:
    # Select specific GPU
    export CUDA_VISIBLE_DEVICES=0    # Use GPU 0
    python train_dpo.py

    # Select multiple GPUs
    export CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1

    # Make GPU 1 appear as GPU 0 to the script
    export CUDA_VISIBLE_DEVICES=1
    python train_dpo.py              # Script sees this as device 0

    # Check current setting
    echo $CUDA_VISIBLE_DEVICES
Common scenarios:
- Shared server: Other users on GPU 0? Use export CUDA_VISIBLE_DEVICES=1,2,3
- Memory issue: One GPU has less memory? Use export CUDA_VISIBLE_DEVICES=1 (skip GPU 0)
- Testing: Want to test on smaller GPU first? Use export CUDA_VISIBLE_DEVICES=3
Conclusion#
You've learned how to fine-tune LLMs with RLHF from scratch to production!
What You've Mastered#
- ✅ Understand RLHF: Why it's crucial for aligning LLMs
- ✅ 4 RL Methods: DPO, PPO, Online DPO, GRPO
- ✅ Production Features: QLoRA, distributed training, tracking
- ✅ Practical Skills: Train models, optimize memory, debug issues
- ✅ Real-World Applications: Customer support, code assistants, safety
Next Steps#
Beginner Projects:
- Fine-tune GPT-2 on UltraFeedback with DPO
- Compare DPO vs base model outputs
- Add W&B tracking to experiments
Intermediate Projects:
- Train LLaMA-2 7B with QLoRA
- Build custom preference dataset
- Implement Online DPO for self-improvement
Advanced Projects:
- Train reward model + PPO pipeline
- Multi-GPU training with DeepSpeed
- Deploy RLHF model to production
The Power of RLHF#
RLHF transforms generic language models into helpful, harmless, and honest assistants. Whether you're building chatbots, code assistants, or content generators, RLHF is essential for production AI.
Start experimenting today!
Resources#
- Source Code: GitHub - LLM-RL
- Full AI CheatSheet Collection: AI_CheatSheet Repository
- Research Papers:
- InstructGPT - Original RLHF paper
- DPO - Direct Preference Optimization
- PPO - Proximal Policy Optimization
Clone and try it yourself:
    git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
    cd AI_CheatSheet/llm-rl
    uv sync
    source .venv/bin/activate
    llm-rl train --config configs/training/dpo/base.yaml
Questions? Open an issue on GitHub!
Happy fine-tuning! 🚀🤖