Dear readers,

I hope that you are as curious as I am and join me on this learning journey. So, get your curiosity ready and let’s get started. 🙂

Why do we need to approach testing differently in LLM-based software?

Software testing has been my passion for years, so I will dive deep and explore it in the context of LLMs. Using LLMs introduces new challenges into how we approach testing, and it is very important to know how to combine our existing testing knowledge with this new era. In traditional, non-LLM-based software, the output is predictable: for a concrete input we can compare the actual output against a single expected one, and such tests can be built with existing knowledge. Software with an LLM, in contrast, produces nondeterministic output, meaning that for the same input we might receive a different, yet still correct, response every time. As a result, testing with a fixed input and comparing the result against one expected output becomes much more challenging.
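To make this concrete, here is a minimal sketch in Python contrasting the two situations. The refund question and the "30 days" detail are invented examples, and ask_llm is a hypothetical wrapper around whatever model the product integrates:

```python
# Traditional software: deterministic output, so an exact comparison works.
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 3) == 5  # exactly one correct answer

# LLM-based software: the wording changes between runs, so an exact match is too brittle.
# Instead we check for properties every correct answer should have.
def test_refund_answer(ask_llm):  # ask_llm: hypothetical wrapper around the model
    answer = ask_llm("How do I request a refund?")
    assert "refund" in answer.lower()  # the core topic must be addressed
    assert "30 days" in answer         # a key policy detail we always expect
```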

LLM Software Use Cases Define the Test Scope

First, I want to set some boundaries in the context of this post. We differentiate between two major system use cases.

Use Case 1 - Building LLM Software

The first one applies if you are building an LLM system itself. This is the use case where you create the large language model itself, train it on large amounts of data, fine-tune it, etc. This type of software naturally poses different challenges compared to being a user of an LLM in your software solution.

Use Case 2 - Building Software Using LLM(s)

In this setup, you have a particular business use case and you solve it by integrating an existing LLM solution into your software. A typical example is integrating the RAG pattern. Check out my post about RAG for more details.

My major focus will be on the latter group.

Working with Evaluation Testing, or Evals for short, can benefit both use cases. If you are an LLM builder, it can also be useful to use benchmarks. Benchmarks allow you to test your model against specific industry-standard tasks, so they are suitable for model comparison, but they are by far not sufficient; you still need Evals. If you integrate an LLM into your software solution, you will probably not benefit much from benchmarks (depending on the context, of course), as these target generic tasks. Testing context-specific tasks can only be achieved with Evals.

Evaluation Testing

As with traditional software, planning and defining what you actually want to test and what your core use cases are is extremely beneficial in the context of LLMs. Based on this approach, the domain experts of the solution should develop use cases that form the basis for test scenarios. Especially for nondeterministic output, it is very beneficial to define concrete preferred keywords and sentence structures that you want to verify.
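As a sketch of what such test scenarios could look like (the customer-support scenarios and keywords below are invented purely for illustration):

```python
# Hypothetical test scenarios defined together with domain experts.
# Each scenario pairs an input prompt with the keywords and phrases we expect
# in a correct answer, instead of one fixed expected output.
TEST_SCENARIOS = [
    {
        "id": "refund-policy",
        "input": "How long do I have to return a product?",
        "must_contain": ["30 days", "receipt"],
        "must_not_contain": ["no returns"],
    },
    {
        "id": "shipping-cost",
        "input": "How much does standard shipping cost?",
        "must_contain": ["free", "orders over"],
        "must_not_contain": [],
    },
]

def matches_scenario(answer: str, scenario: dict) -> bool:
    """Check a model answer against the keyword expectations of one scenario."""
    text = answer.lower()
    return (all(k.lower() in text for k in scenario["must_contain"])
            and not any(k.lower() in text for k in scenario["must_not_contain"]))
```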

Once you have your test cases, you have defined what you are going to test; this step is crucial. The next step is to decide how you are going to test it. Evals give us the following options:

Human evaluation with prompt engineering

This testing type is manual. It consists of a human entering the prompt and verifying the result. The results can be documented for each test scenario, and later on this data can be used for automation. Although this is a very reliable method, it has the disadvantage of being manual and time-consuming.
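A simple way to capture these manual results so they can later feed an automated suite might look like this; the CSV layout and the example row are assumptions, not a prescribed format:

```python
import csv
from datetime import date

# Hypothetical record of one manual evaluation round. Keeping the prompt, the
# model's answer, and the human verdict together lets the same data seed
# automated evals later.
rows = [
    {"scenario": "refund-policy",
     "prompt": "How long do I have to return a product?",
     "model_answer": "You can return items within 30 days with a receipt.",
     "human_verdict": "helpful",
     "notes": "Correct policy, friendly tone.",
     "date": date.today().isoformat()},
]

with open("human_eval_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```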

LLM self evaluation

This testing type is automated. It consists of using the same LLM that you use in your product solution to evaluate the results. It should be easier to implement than LLM-as-a-judge, but it is less reliable, because you ask the same model to find problems in its own result.
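A minimal sketch of self-evaluation, assuming an OpenAI-style chat client; the model name and the one-word verdict format are illustrative choices:

```python
from openai import OpenAI

client = OpenAI()
PRODUCT_MODEL = "gpt-4o-mini"  # assumption: the same model your product uses

def self_evaluate(question: str, answer: str) -> str:
    """Ask the product model to grade its own answer as 'helpful' or 'not helpful'."""
    prompt = (
        "You produced the answer below for a customer question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: 'helpful' or 'not helpful'."
    )
    resp = client.chat.completions.create(
        model=PRODUCT_MODEL,  # the same model evaluates its own output
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the verdict as stable as possible
    )
    return resp.choices[0].message.content.strip().lower()
```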

LLM-as-a-judge

This is also an automated testing type. It uses a different LLM to evaluate the results from the LLM used in your software. It is in a way more reliable, because there is a lower probability that both models hallucinate on the same task. In addition, reviewing generated text is an easier task for a model than generating it. This makes LLM-as-a-Judge the preferable option if you want to automate your evals.
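The only structural difference from self-evaluation is that the verdict comes from a second model. A sketch, again assuming an OpenAI-style client and an illustrative model name:

```python
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o"  # assumption: a different, capable model than the one being tested

def judge(question: str, product_answer: str) -> str:
    """Have a second model review the product model's answer."""
    prompt = (
        "You are reviewing an answer produced by another AI system.\n"
        f"Question: {question}\n"
        f"Answer: {product_answer}\n"
        "Reply with exactly one word: 'helpful' or 'not helpful'."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,  # different model from the one that produced the answer
        messages=[{"role": "user", "content": prompt}],
        temperature=0,      # keep the verdict stable across runs
    )
    return resp.choices[0].message.content.strip().lower()
```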

Despite the options listed above, we still face challenges with evals. Some of the main ones are: nondeterministic output, no single correct answer, many possible inputs, and model hallucination.

LLM-as-a-Judge

In the next section I am going to explain the different types of LLM-as-a-Judge and how to define a high-quality prompt for the evaluation.

LLM-as-a-Judge Types

Single Output Scoring without reference

The task of the judge is to provide a score for a given input and a single output. The score can be, for example, a scale from 1 to 5 or a binary evaluation such as helpful / not helpful.
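A possible judge prompt for this type; the 1-to-5 scale definition below is an illustrative assumption, not a standard:

```python
# Judge prompt for scoring a single answer without a reference.
SCORING_PROMPT = """You are evaluating an answer from a customer-support assistant.

Question:
{question}

Answer:
{answer}

Rate the answer on a scale from 1 to 5:
1 = irrelevant or wrong, 3 = partially correct but incomplete, 5 = correct, complete and clear.

Respond with only the number."""
```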

Single Output Scoring with reference

Similar to the first type, but here a reference answer is provided to the judge. This can be very useful for precision-oriented tasks such as solving math problems, but it can also be used for a customer-service use case.
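A possible prompt when a reference answer exists, for example for a math task; the wording is again only a sketch:

```python
# Judge prompt for scoring a single answer against a known reference solution.
REFERENCE_PROMPT = """You are evaluating an answer against a known reference solution.

Question:
{question}

Reference answer:
{reference}

Candidate answer:
{answer}

Does the candidate answer reach the same result as the reference?
Respond with 'correct' or 'incorrect' and one sentence of justification."""
```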

Pairwise comparison

The judge is provided with two answers and evaluation criteria that help it choose the better one; its task is to pick that answer. This approach is useful for A/B testing, such as comparing two models' responses to the same task.
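A possible pairwise prompt; the criteria line is only an example of what you might specify:

```python
# Judge prompt for pairwise comparison, e.g. A/B testing two models on the same task.
PAIRWISE_PROMPT = """You are comparing two answers to the same question.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Criteria: prefer the answer that is factually correct, complete and concise.
Respond with only 'A' or 'B'."""
```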

Master Your Prompt

When using the LLM-as-a-judge approach, crafting a high-quality prompt is crucial for raising your chances of getting precise, high-quality results. First, be very precise and clear in your question and instructions about what exactly you want to assess and how. For example, if you want to use binary labels such as helpful or not helpful for a question-answer pair, you need to explain very clearly what you mean by these words. Another perhaps obvious, but very important aspect is to choose your model wisely; ideally, use a capable model that gives you a better chance of high-quality results.

In addition, make use of further capabilities of the model. The first is to ask for reasoning: you don't simply receive the verdict, you also ask the large language model to explain how it arrived at it. The second is to use a low temperature, which reduces the risk of hallucination and increases the probability of reliable results. Crafting your prompt is an essential part of the process, because it is your instruction to the model.
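Putting these ideas together, a judge prompt could spell out what the labels mean and ask for reasoning before the verdict. The definitions below are illustrative assumptions for a customer-support scenario, and the prompt would be sent with a low temperature as in the earlier judge sketch:

```python
# Illustrative judge prompt: explicit label definitions + a request for reasoning.
# Send it with temperature=0 (see the judge sketch above) for more stable verdicts.
JUDGE_PROMPT = """You are evaluating a customer-support answer.

Definitions:
- helpful: directly addresses the customer's question, is factually consistent
  with the provided policy, and proposes a concrete next step.
- not helpful: ignores the question, contradicts the policy, or gives no
  actionable information.

Question:
{question}

Answer:
{answer}

First write one short paragraph explaining your reasoning, then a final line in
the form 'VERDICT: helpful' or 'VERDICT: not helpful'."""
```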

Final Thoughts

Evaluation testing is a powerful new methodology in the age of LLMs. It is good that we continue to think about testing in this new and challenging field. What I like is that more and more solutions are slowly appearing in this direction. On the other hand, I find the current solutions immature compared to what we can achieve today with automation in software without LLMs. Nevertheless, the most reliable method for testing software with LLMs is human-based prompting, which is very time-consuming, and at some point it becomes impossible to find errors if you rely solely on manual testing. So I am curious to see how this will develop in the future and hope that we will soon have more mature solutions on this topic.

Happy learning! 🙂

References