Fine-Tuning LLMs with RLHF: A Production-Ready Guide to DPO, PPO, and Beyond#

Ever wondered how ChatGPT became so helpful and harmless? A large part of the answer is RLHF (Reinforcement Learning from Human Feedback). Base models like GPT can generate fluent text, but they don't know what humans actually want. RLHF teaches them to be helpful, truthful, and safe.

In this comprehensive guide, I'll show you how to fine-tune any LLM using multiple RLHF methods:

🎯 DPO (Direct Preference Optimization) - Easiest and most stable
🎮 PPO (Proximal Policy Optimization) - Classic RL approach
🔄 Online DPO - Iterative self-improvement
👥 GRPO (Group Relative Policy Optimization) - Most efficient

Plus production features: distributed training, memory optimization (QLoRA), experiment tracking, and CLI/Python API.

Why RLHF? The Problem Base Models Have#

Base Models Are Unaligned#

Base LLMs (trained only on next-token prediction) have serious issues:

User: "How do I make a cake?"

Base Model (unaligned):
"How do I make a cake more delicious? How do I make a cake without eggs?
The history of cake making dates back to ancient Egypt..."

Problems:

❌ Doesn't directly answer the question
❌ Continues the pattern instead of being helpful
❌ No sense of completion
❌ Can generate harmful content

After RLHF Alignment#

User: "How do I make a cake?"

RLHF Model (aligned):
"Here's a simple vanilla cake recipe:

Ingredients:
- 2 cups flour
- 1.5 cups sugar
- 1/2 cup butter...

Instructions:
1. Preheat oven to 350°F...
2. Mix dry ingredients...

Would you like variations or tips?"

Benefits:

✅ Directly helpful
✅ Well-structured
✅ Knows when to stop
✅ Offers follow-up help
✅ Safer outputs

What is RLHF?#

RLHF = Training LLMs to produce the outputs humans prefer

The Three-Step Process#

Step 1: Supervised Fine-Tuning (SFT)
├─ Start with base model
├─ Fine-tune on high-quality demonstrations
└─ Creates decent but not optimal model

Step 2: Reward Model Training (optional)
├─ Collect human preference data
├─ Train model to predict human preferences
└─ Used to score model outputs

Step 3: Reinforcement Learning
├─ Generate responses
├─ Score with reward model or preferences
├─ Update policy to maximize rewards
└─ Aligned model! 🎉

Key Insight#

RLHF shifts from "predicting next token" to "maximizing human satisfaction"

Instead of asking "What comes next?", we ask "What would humans prefer?"

Introducing LLM-RL: Production-Ready RLHF Framework#

I built a comprehensive framework that makes RLHF accessible and production-ready.

Core Features#

Feature	Description
4 RL Methods	DPO, PPO, Online DPO, GRPO
Memory Efficient	LoRA, QLoRA (4-bit/8-bit quantization)
Distributed Training	DeepSpeed ZeRO-2/3, FSDP support
Auto-Download Datasets	4+ popular preference datasets
Experiment Tracking	Weights & Biases, TensorBoard
CLI & Python API	Use via command-line or as library
Production-Ready	Error handling, logging, validation

Why This Framework?#

Problem with existing tools:

Research code that breaks in production
Hard to switch between RL methods
Poor memory management (OOM errors)
No proper configuration management
Missing experiment tracking

LLM-RL solves this:

Clean, modular architecture
Easy method comparison
Automatic memory optimization
YAML configs with validation
Built-in observability

Understanding the 4 RL Methods#

1. DPO (Direct Preference Optimization)#

Best for: Getting started, stable training, most use cases

How it works:

Given: Prompt + Two responses (chosen vs rejected)

Example:
Prompt: "Explain quantum computing"
✅ Chosen: "Quantum computing uses quantum bits (qubits) that can be 0, 1, or both..."
❌ Rejected: "Quantum computing is complicated stuff about particles and waves and..."

DPO: Directly optimizes model to prefer chosen over rejected
└─ No reward model needed!
└─ Most stable convergence
└─ Easiest to implement

Advantages:

✅ No reward model training required
✅ Most stable and reliable
✅ Simpler than PPO
✅ Works well for most cases

When to use:

First time doing RLHF
Want stable, predictable training
Have pairwise preference data
Don't want complexity of PPO

2. PPO (Proximal Policy Optimization)#

Best for: Maximum control, research, custom reward functions

How it works:

Step 1: Train reward model
├─ Learn to score outputs like humans
└─ Predict: "How good is this response?"

Step 2: Generate rollouts
├─ Model generates many responses
└─ Reward model scores each

Step 3: Policy optimization
├─ Update policy to maximize reward
└─ Clip each update so the policy never moves too far in one step

Advantages:

✅ Most flexible (custom rewards)
✅ Classic RL approach
✅ Works with any reward signal
✅ Active research area

Challenges:

⚠️ Requires reward model training
⚠️ More complex than DPO
⚠️ Can be unstable
⚠️ Needs careful hyperparameter tuning

When to use:

Need custom reward functions
Have computational resources
Want maximum control
Doing research

3. Online DPO#

Best for: Iterative improvement, self-improvement, data efficiency

How it works:

Iteration 1:
├─ Generate responses from current model
├─ Rank with reward model (or heuristic)
├─ Create preference pairs
└─ Train with DPO

Iteration 2:
├─ Model improved, generates better responses
├─ Create harder preference pairs
└─ Train again

...repeat until converged
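
The "create preference pairs" step of one iteration can be sketched roughly as follows. This is an illustration only, not the framework's implementation: `build_preference_pairs` and `toy_score` are hypothetical names, and picking the best- and worst-scored candidates as chosen/rejected is just one common heuristic.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    candidates: Dict[str, List[str]],
    score: Callable[[str, str], float],
    min_gap: float = 0.0,
) -> List[Dict[str, str]]:
    """Turn several sampled responses per prompt into (chosen, rejected) pairs."""
    pairs = []
    for prompt, responses in candidates.items():
        if len(responses) < 2:
            continue  # need at least two candidates to form a pair
        ranked = sorted(responses, key=lambda r: score(prompt, r), reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Skip prompts where the scorer cannot separate the candidates.
        if score(prompt, best) - score(prompt, worst) <= min_gap:
            continue
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model (assumption): longer answers score higher."""
    return float(len(response))


pairs = build_preference_pairs(
    {"Explain recursion": ["It repeats.", "A function that calls itself on a smaller input."]},
    toy_score,
)
print(pairs[0]["chosen"])
```

In a real loop the scorer would be the trained reward model, and the resulting pairs would feed the next DPO round.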

Advantages:

✅ Self-improving loop
✅ More data-efficient
✅ Adapts to model's current level
✅ Can bootstrap from small dataset

When to use:

Want iterative improvement
Limited initial preference data
Need active learning
Want self-improving systems

4. GRPO (Group Relative Policy Optimization)#

Best for: Sample efficiency, batch preference learning

How it works:

Traditional DPO: Compare 2 responses pairwise
Prompt → [Response A vs Response B]

GRPO: Score a group of responses to the same prompt
Prompt → [Response A, B, C, D]
└─ Score every response (reward model or scoring function)
└─ Advantage = each response's reward relative to the group average
└─ No separate value model needed, unlike PPO
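
In the GRPO formulation introduced with DeepSeekMath, "learning from the group" means standardizing each response's reward against the other responses to the same prompt. Here is a minimal sketch of that step. The function name is hypothetical, the use of the population standard deviation is an assumption (implementations vary), and this framework's trainer may do it differently.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 2.0, 3.0, 4.0]  # scores for responses A, B, C, D
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat the group average get a positive advantage and are reinforced; those below it are pushed down.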

Advantages:

✅ More sample-efficient
✅ Better use of compute
✅ Learns from multiple comparisons
✅ Reduces preference data needs

When to use:

Want maximum efficiency
Can generate multiple responses per prompt
Have a reward model or scoring function to rate each response
Limited labeled data

Quick Comparison: Which Method to Use?#

Method	Difficulty	Stability	Efficiency	Best For
DPO	⭐ Easy	⭐⭐⭐ High	⭐⭐ Good	Most users, getting started
PPO	⭐⭐⭐ Hard	⭐ Moderate	⭐⭐ Good	Research, custom rewards
Online DPO	⭐⭐ Medium	⭐⭐ Good	⭐⭐⭐ High	Iterative improvement
GRPO	⭐⭐ Medium	⭐⭐ Good	⭐⭐⭐ Highest	Maximum efficiency

My recommendation: Start with DPO, then explore others once you understand the basics.

Prerequisites#

What you need:

Python 3.10+
GPU with 8GB+ VRAM (or use QLoRA for 4GB)
Basic understanding of LLMs and fine-tuning
OpenAI API key (for some datasets)
30 minutes to follow along

Optional but recommended:

Weights & Biases account (for tracking)
Multi-GPU setup (for larger models)

Step 1: Installation and Setup#

Using UV (Recommended - 10-100x faster!)#

# Clone the repository
git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
cd AI_CheatSheet/llm-rl

# Install with UV
uv sync

# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

Using Traditional pip#

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install
pip install -e .

Verify Installation#

# Check CLI is available
llm-rl --help

# Should show:
# Usage: llm-rl [OPTIONS] COMMAND [ARGS]...
# Commands:
#   train              Train a model with RL
#   train-reward       Train a reward model
#   download-dataset   Download a dataset
#   list-datasets      List available datasets

Check GPU Availability#

Before training, verify your GPU setup:

# Check if CUDA is available
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# List all GPUs
nvidia-smi

# Check GPU details in Python
python -c "
import torch
print(f'Number of GPUs: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
    print(f'  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB')
"

Output example:

CUDA available: True
Number of GPUs: 2
GPU 0: NVIDIA RTX 4090
  Memory: 24.00 GB
GPU 1: NVIDIA RTX 3090
  Memory: 24.00 GB

Select which GPU(s) to use:

# Use only GPU 0
export CUDA_VISIBLE_DEVICES=0

# Use GPU 1 only
export CUDA_VISIBLE_DEVICES=1

# Use both GPUs
export CUDA_VISIBLE_DEVICES=0,1

Step 2: Understanding Preference Datasets#

What is Preference Data?#

Format:

{
  "prompt": "Explain machine learning to a 5-year-old",
  "chosen": "Machine learning is like teaching a robot by showing it lots of examples...",
  "rejected": "Machine learning is a subset of artificial intelligence that utilizes..."
}

Key insight: We need pairs of responses (better vs worse) for the same prompt.
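
Before training, it is worth checking that every record really has this shape. A small validation sketch follows; `find_problems` is a hypothetical helper, not part of the framework, and it assumes plain-string fields as in the example above.

```python
from typing import Dict, List

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def find_problems(record: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the record looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    # A pair only carries signal if the two responses actually differ.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


good = {"prompt": "Explain ML", "chosen": "Like teaching by example.", "rejected": "It is a subset of AI."}
bad = {"prompt": "Explain ML", "chosen": "Same answer", "rejected": "Same answer"}
print(find_problems(good))
print(find_problems(bad))
```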

Available Datasets#

Our framework auto-downloads these datasets:

Dataset	Size	Domain	Best For
ultrafeedback-binarized	64k	General	Getting started
anthropic-hh-rlhf	160k	Helpfulness/Safety	Production models
stack-exchange-preferences	10M	Technical Q&A	Domain-specific
summarize-from-feedback	90k	Summarization	Specific tasks

Download a Dataset#

# List available datasets
llm-rl list-datasets

# Download dataset
llm-rl download-dataset ultrafeedback-binarized --output ./data

# Or download with Python
python examples/download_datasets.py

The dataset is cached in ~/.cache/huggingface/datasets/ for reuse.

Step 3: Your First RLHF Training with DPO#

Let's start with DPO - the easiest and most stable method.

Option 1: Using CLI (Quickest)#

# Train with default config
llm-rl train --config configs/training/dpo/base.yaml

# Output:
# Loading model: gpt2
# Loading dataset: ultrafeedback-binarized
# Starting DPO training...
# Epoch 1/1: 100%|████████| 625/625 [15:23<00:00]
# Training complete!

GPU Selection: If you have multiple GPUs, specify which one to use:

# Use GPU 0
export CUDA_VISIBLE_DEVICES=0
llm-rl train --config configs/training/dpo/base.yaml

# Use GPU 1
export CUDA_VISIBLE_DEVICES=1
llm-rl train --config configs/training/dpo/base.yaml

# Use multiple GPUs (0 and 1)
export CUDA_VISIBLE_DEVICES=0,1
llm-rl train --config configs/training/dpo/base.yaml

Option 2: Using Python API (More Control)#

File: train_dpo.py

from llm_rl.config import DPOConfig
from llm_rl.trainers import train_dpo

# Load config from YAML
config = DPOConfig.from_yaml("configs/training/dpo/base.yaml")

# Train
metrics = train_dpo(config)

print(f"Training complete! Final metrics: {metrics}")

Run it:

# Default (uses all available GPUs)
python train_dpo.py

# Use specific GPU
export CUDA_VISIBLE_DEVICES=0
python train_dpo.py

Option 3: Programmatic Config (Maximum Flexibility)#

from llm_rl.config import (
    DPOConfig,
    ModelConfig,
    PeftConfig,
    DatasetConfig,
    TrainingConfig
)
from llm_rl.trainers import train_dpo

config = DPOConfig(
    method="dpo",

    # Model settings
    model=ModelConfig(
        model_name_or_path="gpt2",
        use_peft=True,
        load_in_4bit=False,  # Enable for low memory
        torch_dtype="bfloat16",
    ),

    # LoRA settings (parameter-efficient fine-tuning)
    peft=PeftConfig(
        r=16,              # LoRA rank
        lora_alpha=32,     # LoRA alpha
        lora_dropout=0.05,
    ),

    # Dataset settings
    dataset=DatasetConfig(
        dataset_name="ultrafeedback-binarized",
        max_train_samples=10000,  # Use subset for fast iteration
        max_length=512,
    ),

    # Training settings
    training=TrainingConfig(
        output_dir="./outputs/my_dpo_model",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch size = 16
        learning_rate=5e-5,
        bf16=True,  # Use bfloat16 for faster training
        gradient_checkpointing=True,  # Save memory
    ),

    # DPO-specific hyperparameters
    beta=0.1,  # Higher = policy stays closer to the reference model
)

# Train
metrics = train_dpo(config)

Understanding Key Parameters#

beta (DPO-specific):

Controls how far the policy may drift from the reference model
Higher = stays closer to the reference model (more conservative)
Lower = follows the preference data more aggressively
Default: 0.1 (works well for most cases)
Range: 0.01 - 0.5

lora_alpha / r (LoRA):

r: Rank of LoRA matrices (lower = fewer parameters)
lora_alpha: Scaling factor
Rule of thumb: lora_alpha = 2 * r
Common: r=16, lora_alpha=32
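
To make the numbers concrete, here is a back-of-the-envelope sketch for a single weight matrix. It assumes standard LoRA, where the update is the product of two factors of rank r scaled by lora_alpha / r, and uses a square 768 x 768 projection (GPT-2 small's hidden size) purely as an illustration. The function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


d = 768  # hidden size of GPT-2 small
full = d * d
lora = lora_trainable_params(d, d, r=16)
print(f"full matrix: {full:,} params, LoRA r=16: {lora:,} params ({lora / full:.1%})")
print(f"scaling with lora_alpha=32, r=16: {lora_scaling(32, 16)}")
```

Keeping lora_alpha = 2 * r holds the scaling factor at 2 when you change the rank.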

gradient_accumulation_steps:

Accumulates gradients over N steps before updating
Effective batch size = batch_size * gradient_accumulation_steps * num_gpus
Use to simulate larger batches with limited memory
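
Here is the same arithmetic as a quick sketch, using the values from the Option 3 config (10,000 samples, batch size 4, 4 accumulation steps, one GPU). The helper names are hypothetical, and the step count assumes the last partial batch is not dropped.

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device * accumulation_steps * num_gpus


def optimizer_steps_per_epoch(num_samples: int, effective_batch: int) -> int:
    """Optimizer updates needed to see every sample once."""
    return math.ceil(num_samples / effective_batch)


batch = effective_batch_size(per_device=4, accumulation_steps=4, num_gpus=1)
steps = optimizer_steps_per_epoch(num_samples=10_000, effective_batch=batch)
print(batch, steps)  # 16 625
```

The 625 steps match the progress bar in the sample CLI output earlier in this step.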

Step 4: Understanding the Training Process#

What Happens During Training#

Iteration 1:
├─ Load batch of preference pairs
├─ Forward pass: Get log probabilities
│  ├─ Policy model: log P(chosen | prompt), log P(rejected | prompt)
│  └─ Reference model (frozen): the same two log probabilities
├─ Compute DPO loss
│  └─ Loss = -log(sigmoid(beta * ((log_pi_chosen - log_ref_chosen) - (log_pi_rejected - log_ref_rejected))))
├─ Backward pass: Compute gradients
└─ Update model weights

Iteration 2:
...repeat

After N iterations:
└─ Model learns to prefer chosen over rejected responses
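
The standard DPO loss for a single preference pair, including the frozen reference model's log-probabilities, can be written out in a few lines. This is a plain-Python sketch on made-up numbers, not the framework's code: the inputs stand for summed token log-probabilities of each response, and all names are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, reward margin) for one preference pair."""
    chosen_reward = beta * (policy_chosen - ref_chosen)        # implicit reward
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -log_sigmoid(margin), margin


# Before training, the policy equals the reference, so the margin is 0.
loss_start, margin_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Later, the policy favors the chosen response more than the reference does.
loss_later, margin_later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"start: loss={loss_start:.4f}, margin={margin_start:.2f}")
print(f"later: loss={loss_later:.4f}, margin={margin_later:.2f}")
```

The starting loss is ln 2 ≈ 0.693, which is why healthy DPO runs begin just under 0.7 and fall from there. The two implicit rewards and their difference correspond to the rewards/chosen, rewards/rejected, and rewards/margins metrics.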

Monitoring Training#

TensorBoard (default):

# In another terminal
tensorboard --logdir outputs/dpo_gpt2/runs

Metrics to watch:

train/loss: Should decrease over time
train/rewards/chosen: Ideally increases (a slow decline is acceptable as long as the margin keeps growing)
train/rewards/rejected: Should decrease
train/rewards/margins: Should increase (chosen - rejected)

Good training:

Epoch 1: loss=0.65, margin=0.15
Epoch 2: loss=0.52, margin=0.28
Epoch 3: loss=0.41, margin=0.42  ✅ margin increasing

Bad training:

Epoch 1: loss=0.65, margin=0.15
Epoch 2: loss=0.58, margin=0.12
Epoch 3: loss=0.61, margin=0.08  ❌ margin decreasing (model forgetting)

Step 5: Training PPO (Advanced)#

PPO requires a reward model first. Let's train one!

Step 5.1: Train Reward Model#

File: train_reward_model.py

from llm_rl.config import RewardModelConfig
from llm_rl.trainers import train_reward_model

config = RewardModelConfig.from_yaml("configs/reward_model.yaml")

# Train reward model
metrics = train_reward_model(config)

# Saves to: ./outputs/reward_model/

What it does:

Takes preference data
Trains model to predict which response is better
Outputs scalar score: higher = better response
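
Reward models of this kind are commonly trained with a pairwise (Bradley-Terry) loss on the two scalar scores. Below is a minimal sketch of that loss for one pair. It illustrates the general technique, assumes nothing about how `train_reward_model` is implemented, and the function name is hypothetical.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    diff = score_chosen - score_rejected
    # log1p(exp(-diff)) equals -log(sigmoid(diff)); the branch keeps exp() from overflowing.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


correct = pairwise_reward_loss(2.0, 0.5)  # chosen scored higher -> small loss
wrong = pairwise_reward_loss(0.5, 2.0)    # chosen scored lower  -> large loss
print(round(correct, 4), round(wrong, 4))
```

Minimizing this loss pushes the chosen response's score above the rejected one's, which is exactly the "higher = better response" behavior described above.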

Step 5.2: Train with PPO#

File: train_ppo.py

from llm_rl.config import ModelConfig, PPOConfig
from llm_rl.trainers import train_ppo

config = PPOConfig(
    method="ppo",

    model=ModelConfig(
        model_name_or_path="gpt2",
        use_peft=True,
    ),

    # Point to trained reward model
    reward_model_path="./outputs/reward_model/checkpoint-final",

    # PPO-specific parameters
    kl_penalty="kl",  # KL divergence penalty type
    adap_kl_ctrl=True,  # Adaptive KL coefficient
    init_kl_coef=0.2,  # Initial KL coefficient
    target_kl=6.0,  # Target KL divergence
)

metrics = train_ppo(config)

PPO Training Loop:

For each batch:
├─ Generate responses from current policy
├─ Score responses with reward model
├─ Compute advantages
├─ Update policy with PPO loss
│  ├─ Policy loss
│  ├─ Value loss
│  └─ KL penalty (stay close to reference)
└─ Repeat
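
Two pieces of this loop are easy to show in isolation: the clipped policy loss and the KL-shaped reward. The sketch below works on single scalar values to keep it readable, leaves out the value loss and advantage estimation, and uses hypothetical function names. Real trainers apply the same formulas per token over batches.

```python
import math


def ppo_clipped_loss(logp_new: float, logp_old: float, advantage: float, clip_range: float = 0.2) -> float:
    """Negative clipped surrogate objective for a single action (token)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return -min(ratio * advantage, clipped_ratio * advantage)


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float, kl_coef: float = 0.2) -> float:
    """Reward-model score minus a penalty for drifting from the reference model."""
    return reward - kl_coef * (logp_policy - logp_ref)


# The ratio e^0.5 (about 1.65) exceeds 1 + 0.2, so the objective is capped at 1.2 * advantage.
policy_loss = ppo_clipped_loss(logp_new=-1.0, logp_old=-1.5, advantage=2.0)
shaped = kl_shaped_reward(reward=1.0, logp_policy=-1.0, logp_ref=-1.5)
print(policy_loss, shaped)
```

The default `kl_coef=0.2` mirrors `init_kl_coef` in the config above. With adaptive KL control that coefficient is adjusted during training.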

Step 6: Memory Optimization for Large Models#

Problem: Out of Memory (OOM)#

Training LLaMA-2 7B:

# Without optimization
Full precision (fp32): ~28GB VRAM for the weights alone, before gradients and optimizer states  ❌ Doesn't fit on a consumer GPU

# With optimizations
QLoRA + Gradient Checkpointing: ~6GB VRAM  ✅ Fits on RTX 3090!

Technique 1: QLoRA (4-bit Quantization)#

Config:

model:
  model_name_or_path: "meta-llama/Llama-2-7b-hf"
  load_in_4bit: true  # ✅ Enable 4-bit quantization
  bnb_4bit_compute_dtype: "bfloat16"
  bnb_4bit_quant_type: "nf4"
  use_peft: true

peft:
  r: 64  # Can use higher rank with QLoRA
  lora_alpha: 128

Memory savings:

Full precision (fp32): 4 bytes/parameter
4-bit: 0.5 bytes/parameter
8x smaller weights than fp32 (4x smaller than bf16/fp16)
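
The bytes-per-parameter arithmetic for a 7B model, as a quick sketch. It counts weights only. Quantization constants, LoRA adapters, activations, and optimizer state all come on top, so treat these as lower bounds. The helper name is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>5}: {weight_memory_gb(7e9, dtype):5.1f} GB")
```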

Technique 2: Gradient Checkpointing#

Config:

training:
  gradient_checkpointing: true  # ✅ Enable

Trade-off:

Memory: ~40% reduction (varies with model size and sequence length)
Speed: ~20% slower (recomputes activations)
Worth it for large models!

Technique 3: Gradient Accumulation#

Config:

training:
  per_device_train_batch_size: 2  # Small batch
  gradient_accumulation_steps: 16  # Accumulate
  # Effective batch size = 2 * 16 = 32

Technique 4: DeepSpeed ZeRO#

For multi-GPU training:

training:
  deepspeed: "configs/deepspeed/zero3.json"

Run with multiple GPUs:

# Use GPUs 0, 1, 2, 3
export CUDA_VISIBLE_DEVICES=0,1,2,3
llm-rl train --config configs/training/dpo/base.yaml

# Or specify in one line
CUDA_VISIBLE_DEVICES=0,1,2,3 llm-rl train --config configs/training/dpo/base.yaml

ZeRO-3 partitions:

Optimizer states
Gradients
Model parameters

Result: Each GPU holds only a 1/N slice of all three, so models that cannot fit on a single GPU become trainable across several.
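
A rough way to see what each ZeRO stage buys you is the per-parameter accounting from the ZeRO paper for mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of gradients, and 12 bytes of optimizer state. The sketch below applies it to full fine-tuning of a 7B model on 8 GPUs. It ignores activations and buffers, the function name is hypothetical, and with LoRA the gradient and optimizer terms are far smaller.

```python
def zero_memory_per_gpu_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU for mixed-precision Adam training."""
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 also shards the parameters
    return num_params * (params + grads + optim) / 1e9


for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_per_gpu_gb(7e9, 8, stage):6.1f} GB per GPU")
```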

Step 7: Experiment Tracking with Weights & Biases#

Setup#

logging:
  wandb_project: "llm-rlhf-experiments"
  wandb_entity: "my-team"
  wandb_run_name: "dpo-llama2-7b"
  report_to: ["wandb", "tensorboard"]

# Login to W&B
wandb login

What Gets Tracked#

Training/validation loss curves
Reward margins (chosen vs rejected)
Learning rate schedule
GPU utilization
Model checkpoints
Hyperparameters
System metrics

Compare Experiments#

import wandb

# Load multiple runs
api = wandb.Api()
runs = api.runs("my-team/llm-rlhf-experiments")  # "<entity>/<project>" from the logging config

# Compare DPO vs PPO
dpo_run = [r for r in runs if "dpo" in r.name][0]
ppo_run = [r for r in runs if "ppo" in r.name][0]

print(f"DPO final loss: {dpo_run.summary['train/loss']}")
print(f"PPO final reward: {ppo_run.summary['train/reward']}")

Step 8: Evaluation and Testing#

Qualitative Evaluation#

Test your model:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./outputs/dpo_gpt2/checkpoint-final")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Test prompt
prompt = "Explain quantum computing to a 5-year-old:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0]))

A/B Testing#

Compare base vs fine-tuned:

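from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
rlhf_model = AutoModelForCausalLM.from_pretrained("./outputs/dpo_gpt2/checkpoint-final")

def generate(model, prompt, max_new_tokens=100):
    """Sample one completion so both models are compared under the same settings."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)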

prompts = [
    "Write a poem about AI:",
    "Explain recursion:",
    "Write a story about a robot:",
]

for prompt in prompts:
    print(f"\n=== Prompt: {prompt} ===")

    # Base model
    base_out = generate(base_model, prompt)
    print(f"Base: {base_out}")

    # RLHF model
    rlhf_out = generate(rlhf_model, prompt)
    print(f"RLHF: {rlhf_out}")

Quantitative Metrics#

MT-Bench, AlpacaEval, etc.:

# Coming soon in framework
llm-rl evaluate \
  --checkpoint ./outputs/dpo_gpt2/checkpoint-final \
  --benchmark mt-bench \
  --output results.json

Advanced: Online DPO for Self-Improvement#

Iterative training loop:

from llm_rl.config import DatasetConfig, ModelConfig, OnlineDPOConfig
from llm_rl.trainers import train_online_dpo

config = OnlineDPOConfig(
    method="online_dpo",

    # Number of iterations
    num_iterations=3,

    # Generate N responses per prompt
    num_generations_per_prompt=4,

    # Use reward model to rank
    reward_model_path="./outputs/reward_model",

    # Rest same as DPO
    model=ModelConfig(...),
    dataset=DatasetConfig(...),
)

metrics = train_online_dpo(config)

What happens:

Iteration 1:
├─ Generate responses from current model
├─ Rank with reward model → create preferences
└─ Train with DPO on new preferences

Iteration 2:
├─ Model is better, generates better responses
├─ Create harder preference pairs
└─ Train again (self-improvement!)

Iteration 3:
└─ Repeat...

Benefits:

Adapts to model's improving capability
Generates progressively harder examples
More data-efficient
Self-improving loop

Advanced: GRPO for Maximum Efficiency#

Group preferences instead of pairwise:

from llm_rl.config import DatasetConfig, GRPOConfig, ModelConfig
from llm_rl.trainers import train_grpo

config = GRPOConfig(
    method="grpo",

    # Generate multiple responses
    group_size=8,  # Compare 8 responses at once

    # Ranking strategy
    ranking_strategy="reward_model",  # or "random", "curriculum"

    model=ModelConfig(...),
    dataset=DatasetConfig(...),
)

metrics = train_grpo(config)

Efficiency gain:

Pairwise DPO: Need N×(N-1)/2 comparisons for N responses
GRPO: Single group ranking
Up to 4x more sample-efficient!
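
To see how quickly the pairwise count grows with group size, here is the N×(N-1)/2 formula evaluated for a few values, including the `group_size=8` used in the config above. The variable names are illustrative.

```python
from math import comb

# Number of distinct response pairs for each group size: N * (N - 1) / 2.
pair_counts = {n: comb(n, 2) for n in (2, 4, 8)}

for n, count in pair_counts.items():
    print(f"{n} responses -> {count} distinct pairs")
```

With 8 responses per prompt, labeling every pair means 28 comparisons, while a group-based method scores each of the 8 responses once.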

Production Best Practices#

1. Start Small, Scale Up#

# Iteration 1: Fast prototyping
config = DPOConfig(
    model=ModelConfig(model_name_or_path="gpt2"),  # Small model
    dataset=DatasetConfig(max_train_samples=1000),  # Small dataset
    training=TrainingConfig(num_train_epochs=1),
)

# Iteration 2: Full training
config = DPOConfig(
    model=ModelConfig(model_name_or_path="meta-llama/Llama-2-7b-hf"),
    dataset=DatasetConfig(max_train_samples=None),  # Full dataset
    training=TrainingConfig(num_train_epochs=3),
)

2. Checkpoint Management#

training:
  save_steps: 500  # Save every 500 steps
  save_total_limit: 3  # Keep only 3 checkpoints
  load_best_model_at_end: true  # Load best checkpoint (needs evaluation enabled so there is a metric to compare)

3. Hyperparameter Tuning#

Key hyperparameters to tune:

Parameter	Range	Impact
`beta`	0.01-0.5	How closely the policy stays to the reference model
`learning_rate`	1e-6 to 1e-4	Training speed/stability
`r` (LoRA rank)	8-128	Adapter capacity
`max_length`	256-2048	Context size

Use W&B Sweeps:

# sweep.yaml
program: train_dpo.py
method: bayes
metric:
  name: eval/loss
  goal: minimize
parameters:
  beta:
    min: 0.05
    max: 0.3
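  learning_rate:
    distribution: log_uniform_values  # sample on a log scale across orders of magnitude
    min: 1.0e-6  # decimal point keeps YAML 1.1 parsers from reading 1e-6 as a string
    max: 1.0e-4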

wandb sweep sweep.yaml
wandb agent <sweep-id>

4. Error Handling#

Common issues:

Error	Solution
OOM	Enable 4-bit quantization, reduce batch size, enable gradient checkpointing
NaN loss	Lower learning rate, add gradient clipping, check data quality
Slow convergence	Increase learning rate, reduce beta, check data is balanced
Model collapse	Lower learning rate, increase KL penalty (PPO), increase beta (DPO)

Real-World Applications#

1. Customer Support Bots#

Problem: Generic responses, doesn't follow company tone

Solution: Fine-tune with RLHF on company's historical chats

# Prepare preference data
preferences = [
    {
        "prompt": "How do I return a product?",
        "chosen": "I'd be happy to help! Our return policy...",  # Friendly
        "rejected": "Check the return policy on our website."  # Cold
    },
]

# Train with DPO
config = DPOConfig(...)
train_dpo(config)

Result: Bot matches company tone and style

2. Code Assistants#

Problem: Generates code that doesn't follow best practices

Solution: RLHF on code preferences (clean vs messy)

preferences = [
    {
        "prompt": "Write a function to calculate fibonacci",
        "chosen": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a+b\n    return a",
        "rejected": "def f(x):\n    if x==0: return 0\n    if x==1: return 1\n    return f(x-1)+f(x-2)"  # Inefficient recursion
    }
]

3. Content Moderation#

Problem: Model generates harmful/biased content

Solution: RLHF with safety preferences

preferences = [
    {
        "prompt": "Write about...",
        "chosen": "[Safe, balanced response]",
        "rejected": "[Biased or harmful response]"
    }
]

Plus: Use PPO with custom safety reward model

4. Educational Tutors#

Problem: Explanations too complex or too simple

Solution: RLHF on explanation quality

preferences = [
    {
        "prompt": "Explain photosynthesis",
        "chosen": "Photosynthesis is how plants make food using sunlight...",  # Right level
        "rejected": "Photosynthesis (from Greek φῶς, phōs...) is a process..."  # Too complex
    }
]

Troubleshooting Guide#

Issue 1: Training is Unstable#

Symptoms: Loss spikes, NaN values, model collapse

Solutions:

training:
  learning_rate: 1e-6  # Lower LR
  max_grad_norm: 0.5  # Stronger gradient clipping
  warmup_ratio: 0.1  # More warmup

# DPO
beta: 0.2  # Higher beta keeps the policy closer to the reference model

# PPO
init_kl_coef: 0.5  # Higher KL penalty

Issue 2: Model Not Learning#

Symptoms: Loss doesn't decrease, margins stay flat

Debug:

# Check data quality
from datasets import load_dataset

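# Hugging Face Hub ID and split name of the binarized UltraFeedback preference set
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")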
print(ds[0])

# Verify chosen != rejected
assert ds[0]["chosen"] != ds[0]["rejected"]

Solutions:

Check data is properly formatted
Increase learning rate
Train longer
Reduce regularization

Issue 3: OOM on Multi-GPU#

Symptoms: Works on 1 GPU, OOM on multiple

Solution:

training:
  ddp_find_unused_parameters: false  # Disable this
  gradient_checkpointing: true
  per_device_train_batch_size: 1  # Reduce per-GPU batch

Issue 4: GPU Not Detected or Using Wrong GPU#

Symptoms: Training runs on CPU, or uses wrong GPU

Check available GPUs:

# List all GPUs
nvidia-smi

# Check which GPUs PyTorch sees
python -c "import torch; print(torch.cuda.device_count()); print(torch.cuda.get_device_name(0))"

Solutions:

# Select specific GPU
export CUDA_VISIBLE_DEVICES=0  # Use GPU 0
python train_dpo.py

# Select multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1

# Make GPU 1 appear as GPU 0 to the script
export CUDA_VISIBLE_DEVICES=1
python train_dpo.py  # Script sees this as device 0

# Check current setting
echo $CUDA_VISIBLE_DEVICES
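
The renumbering is the part that trips people up, so here is a tiny sketch of the mapping between the device index a script sees and the GPU it actually runs on. `logical_to_physical` is a hypothetical helper that only handles the comma-separated integer form used in this guide. Also note that CUDA's default device ordering can differ from nvidia-smi's unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```python
from typing import Dict


def logical_to_physical(visible_devices: str) -> Dict[int, int]:
    """Map the device indices a script sees to the GPU IDs listed in CUDA_VISIBLE_DEVICES."""
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return dict(enumerate(ids))


mapping_single = logical_to_physical("1")
mapping_multi = logical_to_physical("1,2,3")
print(mapping_single)  # {0: 1}
print(mapping_multi)   # {0: 1, 1: 2, 2: 3}
```

So with CUDA_VISIBLE_DEVICES=1,2,3, `cuda:0` inside the script is the GPU listed first, not GPU 0 on the machine.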

Common scenarios:

Shared server: Other users on GPU 0? Use export CUDA_VISIBLE_DEVICES=1,2,3
Memory issue: One GPU has less memory? Use export CUDA_VISIBLE_DEVICES=1 (skip GPU 0)
Testing: Want to test on smaller GPU first? Use export CUDA_VISIBLE_DEVICES=3

Conclusion#

You've learned how to fine-tune LLMs with RLHF from scratch to production!

What You've Mastered#

✅ Understand RLHF: Why it's crucial for aligning LLMs
✅ 4 RL Methods: DPO, PPO, Online DPO, GRPO
✅ Production Features: QLoRA, distributed training, tracking
✅ Practical Skills: Train models, optimize memory, debug issues
✅ Real-World Applications: Customer support, code assistants, safety

Next Steps#

Beginner Projects:

Fine-tune GPT-2 on UltraFeedback with DPO
Compare DPO vs base model outputs
Add W&B tracking to experiments

Intermediate Projects:

Train LLaMA-2 7B with QLoRA
Build custom preference dataset
Implement Online DPO for self-improvement

Advanced Projects:

Train reward model + PPO pipeline
Multi-GPU training with DeepSpeed
Deploy RLHF model to production

The Power of RLHF#

RLHF transforms generic language models into helpful, harmless, and honest assistants. Whether you're building chatbots, code assistants, or content generators, RLHF is essential for production AI.

Start experimenting today!

Resources#

Source Code: GitHub - LLM-RL
Full AI CheatSheet Collection: AI_CheatSheet Repository
Research Papers:
- InstructGPT - Original RLHF paper
- DPO - Direct Preference Optimization
- PPO - Proximal Policy Optimization
Datasets:
- UltraFeedback
- Anthropic HH-RLHF

Clone and try it yourself:

git clone https://github.com/MinhQuanBuiSco/AI_CheatSheet.git
cd AI_CheatSheet/llm-rl
uv sync
source .venv/bin/activate
llm-rl train --config configs/training/dpo/base.yaml

Questions? Open an issue on GitHub!

Happy fine-tuning! 🚀🤖