Test-driven LLM prompt engineering with promptfoo and Ollama
Can we test that an AI made a joke that was funny enough?
As large language models (LLMs) evolve from simple chatbots into complex AI agents, we need a way to evaluate how prompt and model changes affect their effectiveness over time.
In traditional software development, we can write a suite of tests for our work to prevent quality regressions and set a benchmark. For example, we could assert that a sum function, given the inputs 1 and 1, always returns 2. This deterministic testing strategy has served us for decades, but LLM app development presents a new challenge: non-deterministic responses. Instead of checking that 1 + 1 = 2, we can find ourselves in the seemingly impossible position of judging whether the LLM told a funny joke, and the answer can change on every single call.
Promptfoo is a tool for testing and evaluating LLM output quality in a non-deterministic environment.
The benefits:
- ✅ Create side-by-side comparisons of output quality from multiple prompts and inputs
- 🤖 Use a combination of static assertions and LLM-driven judges to assess non-deterministic responses (see the config sketch after this list)
- 💵 Compare both open and closed-source models to…
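To make the assertion idea concrete, here is a minimal sketch of what a promptfooconfig.yaml for the joke example might look like. The model name, topic variable, and rubric wording are illustrative assumptions rather than the article's exact configuration; the `icontains` and `llm-rubric` assertion types and the `ollama:` provider prefix are standard promptfoo features.

```yaml
# promptfooconfig.yaml — a minimal sketch, not the article's exact setup.
# The model name (llama3) and the rubric wording below are assumptions.
description: Joke quality evaluation

prompts:
  - "Tell me a short joke about {{topic}}."

providers:
  # Assumes Ollama is running locally and the model has already been pulled
  - ollama:chat:llama3

tests:
  - vars:
      topic: programmers
    assert:
      # Static assertion: the output must at least mention the topic
      - type: icontains
        value: programmer
      # LLM-driven judge: a grading model scores the output against a plain-English rubric
      - type: llm-rubric
        value: The response is a joke that most people would find at least mildly funny
```

Running `npx promptfoo@latest eval` executes every prompt/test combination defined in the config, and `npx promptfoo@latest view` opens a side-by-side results view. Note that the `llm-rubric` grader calls an OpenAI model by default, so you either need an API key or to reconfigure the grading provider.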