Website Testing Using LLM

🚨 The Problem

Manual test case creation doesn't scale with application complexity — developers spend hours writing routine scenarios, and human-authored tests systematically miss the edge cases nobody thought to write down. The result is incomplete coverage that only surfaces as production bugs.

💡 The Decision

Converting plain-English requirements into Selenium scripts with an LLM was the straightforward part. The harder decision was which models to use. Commercial LLMs (GPT, Claude, Gemini) reason well out of the box, but they get expensive at the volume a real test suite generates, especially when ambiguous natural-language requirements need several generation attempts to resolve. The default answer would have been "just pay for GPT-4o at scale."

Instead, I fine-tuned open-source models (Llama, Qwen) on domain-specific test-scenario data. The fine-tuned open models ended up outperforming GPT-4o in the specific scenarios they were trained for, at a fraction of the marginal cost. Commercial models stayed in the pipeline for the reasoning-heavy, novel-scenario cases where fine-tuned specialization doesn't help; the fine-tuned open models took the high-volume, in-domain generation.
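A minimal sketch of that split, assuming in-domain requirements can be recognized with a simple keyword match. The term list, the backend labels, and the `route` function are all hypothetical placeholders for illustration, not the pipeline's actual routing logic.

```python
import re
from typing import FrozenSet

# Hypothetical vocabulary covered by the fine-tuning data (an assumption).
IN_DOMAIN_TERMS: FrozenSet[str] = frozenset({"login", "checkout", "search", "signup"})

# Placeholder labels, not real endpoint names.
FINE_TUNED = "fine_tuned_open"  # stands in for a fine-tuned Llama/Qwen model
COMMERCIAL = "commercial"       # stands in for a hosted commercial model


def route(requirement: str, in_domain_terms: FrozenSet[str] = IN_DOMAIN_TERMS) -> str:
    """Pick a backend label for one plain-English requirement.

    Requirements that mention a known in-domain term go to the fine-tuned
    model; everything else is treated as novel and goes to the commercial one.
    """
    words = set(re.findall(r"[a-z]+", requirement.lower()))
    return FINE_TUNED if words & in_domain_terms else COMMERCIAL


if __name__ == "__main__":
    print(route("Verify that login fails with an empty password"))
    print(route("Check that the chart tooltip follows the cursor"))
```

A keyword match is the crudest possible classifier; the point is only that the routing decision is cheap and happens before any model is called.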

🏗️ How It Was Built

Figure: AI-powered website testing pipeline. The big picture, then two independent loops: test generation and cross-browser execution.

Two loops:

Generate. A plain-English requirement goes to the multi-LLM layer for scenario generation, gets converted to a production-ready Selenium Python script, then passes through an AI-powered validation engine before it's trusted — this is what makes the ambiguous-language problem tractable, since a bad interpretation gets caught before it becomes a flaky test.

Execute. Every CI run executes the generated suite across Chrome, Firefox, and Safari, with reporting and coverage metrics feeding back into the pipeline for continuous improvement.
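A rough sketch of the execution loop, assuming local WebDriver sessions and tests written as plain functions that take a driver. The names `make_driver`, `run_matrix`, and `pass_rate` are hypothetical; only `webdriver.Chrome`, `webdriver.Firefox`, and `webdriver.Safari` are actual Selenium classes. Safari's driver runs only on macOS, so a CI setup would need a macOS runner for that leg.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

BROWSERS = ("chrome", "firefox", "safari")


def make_driver(browser: str) -> Any:
    """Start a local WebDriver session. Selenium is imported lazily so the
    module can be loaded (and the runner tested) without it installed."""
    from selenium import webdriver

    factories = {
        "chrome": webdriver.Chrome,
        "firefox": webdriver.Firefox,
        "safari": webdriver.Safari,
    }
    return factories[browser]()


@dataclass
class Result:
    test: str
    browser: str
    passed: bool
    error: str = ""


def run_matrix(
    tests: Dict[str, Callable[[Any], None]],
    browsers: Iterable[str] = BROWSERS,
    driver_factory: Callable[[str], Any] = make_driver,
) -> List[Result]:
    """Run every test in every browser, using a fresh session per test."""
    results: List[Result] = []
    for browser in browsers:
        for name, test in tests.items():
            driver = None
            try:
                driver = driver_factory(browser)
                test(driver)
                results.append(Result(name, browser, True))
            except Exception as exc:  # record the failure and keep going
                results.append(Result(name, browser, False, repr(exc)))
            finally:
                if driver is not None:
                    driver.quit()
    return results


def pass_rate(results: List[Result]) -> Dict[str, float]:
    """Fraction of passing tests per browser, for the reporting step."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for r in results:
        totals[r.browser] = totals.get(r.browser, 0) + 1
        passed[r.browser] = passed.get(r.browser, 0) + int(r.passed)
    return {b: passed[b] / totals[b] for b in totals}
```

Injecting `driver_factory` keeps the runner testable with a fake driver and leaves room to swap in a remote grid later without touching the loop.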

📈 Impact & Results

50% more diverse test cases than manual authoring, 70% less manual effort in scenario creation
40% faster development iteration cycles, with 35% better bug detection from the expanded edge-case coverage
Fine-tuned open-source models outperforming GPT-4o on in-domain test generation, at substantially lower marginal cost; adopted by multiple development teams across enterprise applications
