Test-driven LLM prompt engineering with promptfoo and Ollama
Can we test that an AI made a joke that was funny enough?
As large language models (LLMs) evolve from simple chatbots into complex AI agents, we need a way to evaluate how prompt and model changes affect their effectiveness over time.
In traditional software development, we can write a suite of tests for our work to prevent quality regressions and set a benchmark. For example, we could assert that a sum function, given the inputs 1 and 1, always returns 2. This deterministic testing strategy has served us for decades, but LLM app development presents a new challenge: non-deterministic responses. Instead of checking that 1 + 1 = 2, we can find ourselves in the seemingly impossible position of judging whether the LLM told a funny joke, and the answer can change on every single call.
Promptfoo is a tool for testing and evaluating LLM output quality in a non-deterministic environment.
The benefits:
- ✅ Create side-by-side comparisons of output quality from multiple prompts and inputs
- 🤖 Use a combination of static assertions and LLM-driven judges to assess non-deterministic responses (see the config sketch after this list)
- 💵 Compare both open and closed-source models to…
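To make the assertion idea concrete, here is a minimal sketch of what a promptfooconfig.yaml for the joke example might look like. The model name, topic variable, and rubric wording are illustrative assumptions rather than the article's exact configuration; the `icontains` and `llm-rubric` assertion types and the `ollama:` provider prefix are standard promptfoo features.

```yaml
# promptfooconfig.yaml — a minimal sketch, not the article's exact setup.
# The model name (llama3) and the rubric wording below are assumptions.
description: Joke quality evaluation

prompts:
  - "Tell me a short joke about {{topic}}."

providers:
  # Assumes Ollama is running locally and the model has already been pulled
  - ollama:chat:llama3

tests:
  - vars:
      topic: programmers
    assert:
      # Static assertion: the output must at least mention the topic
      - type: icontains
        value: programmer
      # LLM-driven judge: a grading model scores the output against a plain-English rubric
      - type: llm-rubric
        value: The response is a joke that most people would find at least mildly funny
```

Running `npx promptfoo@latest eval` executes every prompt/test combination defined in the config, and `npx promptfoo@latest view` opens a side-by-side results view. Note that the `llm-rubric` grader calls an OpenAI model by default, so you either need an API key or to reconfigure the grading provider.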