Can We Automate the Testing of AI-generated content? Part 3 (Key Evaluation Metrics)

Chamila Ambahera
Jan 21, 2025


Before we get into more details, let’s discuss how we test AI output today.

How We Test AI Today (What’s Working and What Isn’t)

Currently, we rely on three main strategies:

  1. Human Reviews (Human-in-the-Loop Validation) Having QA Engineers check AI outputs sounds great in theory, but imagine trying to review millions of responses. Also, we’re all biased in our own ways, and what looks good to one reviewer might not satisfy another.
  2. Setting Boundaries (Guardrails for Prompts) In some approaches we try to control AI by limiting what it can say. While this helps avoid disasters, it’s like putting training wheels on a racing bike — you might prevent crashes, but you’re also limiting actual capability.
  3. Safety Nets (Fail-Safe Functions) Building automatic stops and checks helps catch obvious problems. However, these systems often miss the subtle stuff, like an answer that is technically correct but misses the point entirely. If you're a frequent AI user, you're probably familiar with this issue. ;) A minimal sketch of this idea follows the list.
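
To make the "guardrails plus fail-safe" idea concrete, here is a minimal Python sketch. Everything in it (the banned-topic list, the call_llm stub, the length check) is a made-up placeholder, not a real framework:

```python
# Minimal guardrail + fail-safe sketch. All names here are hypothetical placeholders.

BANNED_TOPICS = ["medical diagnosis", "legal advice"]   # guardrail: topics we refuse outright
FALLBACK = "Sorry, I can't help with that. Routing you to a human agent."

def call_llm(prompt: str) -> str:
    # Stand-in for the real model call.
    return "This is a placeholder answer from the model."

def guarded_response(prompt: str) -> str:
    # Guardrail: refuse the request before it ever reaches the model.
    if any(topic in prompt.lower() for topic in BANNED_TOPICS):
        return FALLBACK

    answer = call_llm(prompt)

    # Fail-safe: catch obviously broken outputs (empty, suspiciously short, or echoing the prompt).
    if not answer or len(answer.strip()) < 10 or answer.strip() == prompt.strip():
        return FALLBACK

    return answer

print(guarded_response("Can you give me legal advice on my contract?"))  # -> fallback message
```

Note how neither check says anything about whether a normal-looking answer is actually a good one. That is the gap the metrics below try to close.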

Quick Recap

In the previous article, we discussed implementing a simple Model-Based Testing framework that uses AI/ML models to validate AI outputs.

But that framework only validates how similar an answer is to an expected one, and similarity alone isn't enough to validate AI-generated content.
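
As a reminder, that similarity check boiled down to something like the sketch below (the exact code in the previous article may differ; this assumes spaCy with a vector-bearing model such as en_core_web_md):

```python
import spacy

# A medium or large English model is needed for useful word vectors.
nlp = spacy.load("en_core_web_md")

expected = nlp("The capital of France is Paris.")
actual = nlp("Paris is the capital city of France.")

# Cosine similarity of the averaged token vectors, roughly 0..1.
score = expected.similarity(actual)
print(f"Similarity: {score:.2f}")
assert score > 0.8, "Generated answer drifted too far from the expected one"
```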

There are Key Evaluation Metrics that help us validate the quality of the output.

Key Evaluation Metrics

When checking AI responses, consider these key areas:

1. Does It Answer the Question? (Answer Relevancy)

Just like in a conversation, responses need to actually address what was asked, not just sound smart.
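
One way to turn this into a number is with an LLM-evaluation library. The sketch below uses DeepEval purely as an illustration (the series hasn't committed to a tool yet), with a made-up question and answer; its built-in metrics call an LLM judge under the hood, so an API key such as OPENAI_API_KEY is expected:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Answer relevancy only needs the question and the model's answer.
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
)

metric = AnswerRelevancyMetric(threshold=0.7)  # fail the check below 0.7
metric.measure(test_case)
print(metric.score, metric.reason)
```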

2. Does It Use Context Properly? (Contextual Relevancy)

Good responses show understanding of the bigger picture and use relevant background information.
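
Using the same assumed DeepEval setup, contextual relevancy additionally needs the retrieved context the answer was supposed to be grounded in:

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
    retrieval_context=[
        "Users can reset passwords via the 'Forgot password' link on the login page.",
        "Password reset emails expire after 30 minutes.",
    ],
)

metric = ContextualRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```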

3. Is It Accurate? (Faithfulness)

Checking whether the specific details and facts line up with reality.
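
In the same (assumed) tooling this check is usually called faithfulness: every claim in the answer is verified against the retrieved context rather than against the model's imagination:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is a password reset link valid?",
    actual_output="Reset links are valid for 30 minutes.",
    retrieval_context=["Password reset emails expire after 30 minutes."],
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score, metric.reason)
```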

4. Is It Complete? (Contextual Recall)

Making sure nothing important gets left out of the response.
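
Contextual recall needs one extra ingredient, an expected answer, so the metric can check whether everything the context supports actually made it into the response (still the same assumed DeepEval pattern):

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Use the 'Forgot password' link on the login page.",
    expected_output="Use the 'Forgot password' link; the emailed link expires after 30 minutes.",
    retrieval_context=[
        "Users can reset passwords via the 'Forgot password' link on the login page.",
        "Password reset emails expire after 30 minutes.",
    ],
)

metric = ContextualRecallMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```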

5. Does It Make Things Up? (Hallucination Detection)

Catching when AI invents information that wasn’t in the source material.
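
Hallucination detection compares the answer against the source material you supply. In the assumed DeepEval API that source material is passed as context (separate from the retrieval context), and a lower score means fewer invented facts:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Call our 24/7 hotline and an agent will reset it for you.",  # not in the source
    context=["Users can reset passwords via the 'Forgot password' link on the login page."],
)

metric = HallucinationMetric(threshold=0.5)  # here the score must stay below the threshold
metric.measure(test_case)
print(metric.score, metric.reason)
```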

Finding Better Solutions

What Modern Testing Needs

We need testing that can:

  • Work with unpredictable outputs
  • Keep up with rapid AI growth
  • Give us clear quality measurements
  • Run smoothly without constant babysitting

The Path So Far, and the Road Ahead for This Series

So, to automate AI testing effectively, we need approaches that:

  1. Accept non-deterministic behaviour while ensuring quality (Achieved with spaCy)
  2. Provide quantifiable metrics for evaluation (To Do)
  3. Scale efficiently with increasing AI deployment (To Do)
  4. Operate autonomously while maintaining reliability (Achieved with spaCy)

The subsequent articles in this series will explore:

  1. Implementation of model-based testing using the best available tools to automate metric evaluation
  2. Integration strategies for continuous testing
  3. Best practices for AI quality assurance
