Can We Automate the Testing of AI-generated content? Part 3 (Key Evaluation Metrics)

Chamila Ambahera
Jan 21, 2025


Before we get into more details, let’s discuss how we test AI output today.

How We Test AI Today (What’s Working and What Isn’t)

Currently, we rely on three main strategies:

  1. Human Reviews (Human-in-the-Loop Validation) Having QA Engineers check AI outputs sounds great in theory, but imagine trying to review millions of responses. Also, we’re all biased in our own ways, and what looks good to one reviewer might not satisfy another.
  2. Setting Boundaries (Guardrails for Prompts) In some approaches we try to control AI by limiting what it can say. While this helps avoid disasters, it’s like putting training wheels on a racing bike — you might prevent crashes, but you’re also limiting actual capability.
  3. Safety Nets (Fail-Safe Functions) Building automatic stops and checks helps catch obvious problems. However, these systems often miss the subtle stuff, like an answer that is technically correct but misses the point entirely. If you're a frequent AI user, you're probably familiar with this issue. ;) A minimal sketch of this idea follows the list.
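
To make the "guardrails plus fail-safe" idea concrete, here is a minimal Python sketch. Everything in it (the banned-topic list, the call_llm stub, the length check) is a made-up placeholder, not a real framework:

```python
# Minimal guardrail + fail-safe sketch. All names here are hypothetical placeholders.

BANNED_TOPICS = ["medical diagnosis", "legal advice"]   # guardrail: topics we refuse outright
FALLBACK = "Sorry, I can't help with that. Routing you to a human agent."

def call_llm(prompt: str) -> str:
    # Stand-in for the real model call.
    return "This is a placeholder answer from the model."

def guarded_response(prompt: str) -> str:
    # Guardrail: refuse the request before it ever reaches the model.
    if any(topic in prompt.lower() for topic in BANNED_TOPICS):
        return FALLBACK

    answer = call_llm(prompt)

    # Fail-safe: catch obviously broken outputs (empty, suspiciously short, or echoing the prompt).
    if not answer or len(answer.strip()) < 10 or answer.strip() == prompt.strip():
        return FALLBACK

    return answer

print(guarded_response("Can you give me legal advice on my contract?"))  # -> fallback message
```

Note how neither check says anything about whether a normal-looking answer is actually a good one. That is the gap the metrics below try to close.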

Quick Recap

In the previous article, we discussed implementing a simple Model-Based Testing framework that uses AI/ML models to validate AI outputs.

But that framework only validates how similar an answer is to an expected one, and similarity alone isn't enough to validate AI-generated content.
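
As a reminder, that similarity check boiled down to something like the sketch below (the exact code in the previous article may differ; this assumes spaCy with a vector-bearing model such as en_core_web_md):

```python
import spacy

# A medium or large English model is needed for useful word vectors.
nlp = spacy.load("en_core_web_md")

expected = nlp("The capital of France is Paris.")
actual = nlp("Paris is the capital city of France.")

# Cosine similarity of the averaged token vectors, roughly 0..1.
score = expected.similarity(actual)
print(f"Similarity: {score:.2f}")
assert score > 0.8, "Generated answer drifted too far from the expected one"
```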

There are Key Evaluation Metrics that help us validate the quality of the output.

Key Evaluation Metrics

When checking AI responses, consider these key areas:

1. Does It Answer the Question? (Answer Relevancy)

Just like in a conversation, responses need to actually address what was asked, not just sound smart.
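
One way to turn this into a number is with an LLM-evaluation library. The sketch below uses DeepEval purely as an illustration (the series hasn't committed to a tool yet), with a made-up question and answer; its built-in metrics call an LLM judge under the hood, so an API key such as OPENAI_API_KEY is expected:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Answer relevancy only needs the question and the model's answer.
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
)

metric = AnswerRelevancyMetric(threshold=0.7)  # fail the check below 0.7
metric.measure(test_case)
print(metric.score, metric.reason)
```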

2. Does It Use Context Properly? (Contextual Relevancy)

Good responses show understanding of the bigger picture and use relevant background information.
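
Using the same assumed DeepEval setup, contextual relevancy additionally needs the retrieved context the answer was supposed to be grounded in:

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the emailed link.",
    retrieval_context=[
        "Users can reset passwords via the 'Forgot password' link on the login page.",
        "Password reset emails expire after 30 minutes.",
    ],
)

metric = ContextualRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```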

3. Is It Accurate? (Faithfulness)

Checking whether the specific details and facts line up with reality.
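
In the same (assumed) tooling this check is usually called faithfulness: every claim in the answer is verified against the retrieved context rather than against the model's imagination:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is a password reset link valid?",
    actual_output="Reset links are valid for 30 minutes.",
    retrieval_context=["Password reset emails expire after 30 minutes."],
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score, metric.reason)
```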

4. Is It Complete? (Contextual Recall)

Making sure nothing important gets left out of the response.
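
Contextual recall needs one extra ingredient, an expected answer, so the metric can check whether everything the context supports actually made it into the response (still the same assumed DeepEval pattern):

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Use the 'Forgot password' link on the login page.",
    expected_output="Use the 'Forgot password' link; the emailed link expires after 30 minutes.",
    retrieval_context=[
        "Users can reset passwords via the 'Forgot password' link on the login page.",
        "Password reset emails expire after 30 minutes.",
    ],
)

metric = ContextualRecallMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```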

5. Does It Make Things Up? (Hallucination Detection)

Catching when AI invents information that wasn’t in the source material.
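
Hallucination detection compares the answer against the source material you supply. In the assumed DeepEval API that source material is passed as context (separate from the retrieval context), and a lower score means fewer invented facts:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Call our 24/7 hotline and an agent will reset it for you.",  # not in the source
    context=["Users can reset passwords via the 'Forgot password' link on the login page."],
)

metric = HallucinationMetric(threshold=0.5)  # here the score must stay below the threshold
metric.measure(test_case)
print(metric.score, metric.reason)
```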

Finding Better Solutions

What Modern Testing Needs

We need testing that can:

  • Work with unpredictable outputs
  • Keep up with rapid AI growth
  • Give us clear quality measurements
  • Run smoothly without constant babysitting

The Path So Far, and the Road Ahead for This Series

So, to automate AI testing effectively, we need approaches that:

  1. Accept non-deterministic behaviour while ensuring quality (Achieved with spaCy)
  2. Provide quantifiable metrics for evaluation (To Do)
  3. Scale efficiently with increasing AI deployment (To Do)
  4. Operate autonomously while maintaining reliability (Achieved with spaCy)

The subsequent articles in this series will explore:

  1. Implementation of model-based testing using the best available tools to automate metric evaluation
  2. Integration strategies for continuous testing
  3. Best practices for AI quality assurance
