Can We Automate the Testing of AI-Generated Content? Part 4 (Metric Evaluation Using DeepEval)
--
Spoiler: The world’s first framework to integrate a custom Cloudflare-hosted AI model with DeepEval.
As AI systems tackle increasingly complex prompts, how do we make sure their responses aren’t just good, but actually meaningful, accurate, and fair?
This is where DeepEval steps in.
Quick Recap
In previous parts of this series, we laid the foundation for testing AI-generated content. We explored:
- Challenges with non-deterministic outputs,
- Simple model-based testing with spaCy, and
- Real-world use cases using a basic test framework.
But as prompts get more complex and use cases become more critical, we need testing tools that go beyond string similarity.
Why Metric-Based Evaluation?
Evaluating LLM outputs isn’t just about checking what was said — it’s about how well it answered the prompt. Did it:
- Stay on topic?
- Include all necessary information?
- Remain factually accurate?
- Avoid biased or harmful content?
DeepEval, an open-source framework by Confident AI, helps us do just that. It introduces quantifiable metrics for evaluating generative AI responses and integrates smoothly into your testing workflow.
🧠 What Are Complex Prompts?
“Complex prompts” are multi-dimensional questions where the quality of the answer depends on more than correctness. For example:
[
  {
    "prompt": "Write a detailed report on the impact of AI in healthcare, focusing on patient outcomes and operational efficiency.",
    "expected_result": "AI in healthcare has significantly improved patient outcomes by enabling early diagnosis, personalized treatment plans, and efficient resource allocation. Operational efficiency has also increased through automation of administrative tasks and predictive analytics."
  },
  {
    "prompt": "Explain the ethical challenges of using AI in hiring processes.",
    "expected_result": "AI in hiring raises ethical concerns such as bias in decision-making, lack of transparency, and potential discrimination against certain groups. Ensuring fairness and accountability is critical."
  }
]
These prompts go beyond facts — they demand contextual depth, factual grounding, and fairness — perfect for DeepEval’s metrics.
Integrating DeepEval into an AI Test Framework
NOTE:
You can try this framework for free. No paid AI model or other paid integration is used.
You can find a fully working example on my GitHub:
https://github.com/chamiz/ai_test_framework/tree/master
Check README.md for setup instructions. If you run into issues, feel free to comment.
Let’s walk through how to integrate DeepEval into the framework introduced in earlier parts.
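If you are following along, DeepEval itself installs from PyPI with pip install deepeval; the Cloudflare account details and model configuration are covered in the repo’s README.md.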
Approach
"generation_model": "@cf/meta/llama-3.1-8b-instruct",
"validation_model": "@hf/mistral/mistral-7b-instruct-v0.2"
In my framework, I’m using llama-3.1-8b-instruct to generate content and mistral-7b-instruct-v0.2 as a custom validation model for DeepEval.
Since we are not using OpenAI, we must define a custom validation model:
from deepeval.models.base_model import DeepEvalBaseLLM
from framework.logger import Logger

class CustomValidationModel(DeepEvalBaseLLM):
    def __init__(self, ai_client):
        self.ai_client = ai_client

    def load_model(self):
        return self

    def generate(self, prompt: str) -> str:
        """Generate a validation response using the validation model."""
        try:
            return self.ai_client.run_validation_model(prompt)
        except Exception as e:
            Logger.log_error(f"Validation generation failed: {str(e)}")
            return ""

    async def a_generate(self, prompt: str) -> str:
        # DeepEval's async entry point; delegate to the synchronous path.
        return self.generate(prompt)

    def get_model_name(self):
        return "CustomValidationModel"
Metrics We’re Going to Validate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRecallMetric,
    FaithfulnessMetric,
    BiasMetric,
)

validation_model = CustomValidationModel(ai_client)
metrics = [
    AnswerRelevancyMetric(model=validation_model, threshold=0.8),
    ContextualRecallMetric(model=validation_model, threshold=0.9),
    FaithfulnessMetric(model=validation_model, threshold=0.95),
    BiasMetric(model=validation_model, threshold=0.7),
]
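To tie it together, each prompt/expected_result pair becomes a DeepEval LLMTestCase, and every metric is measured against it. The loop below is a sketch of how the framework might wire this up; the file path, the run_generation_model helper, and reusing expected_result as the retrieval context are assumptions for illustration:

import json
from deepeval.test_case import LLMTestCase

# Hypothetical path; the test data matches the JSON shown earlier.
with open("test_data/prompts.json", encoding="utf-8") as f:
    cases = json.load(f)  # list of {"prompt": ..., "expected_result": ...}

details = []
for case in cases:
    test_case = LLMTestCase(
        input=case["prompt"],
        actual_output=ai_client.run_generation_model(case["prompt"]),  # assumed helper
        expected_output=case["expected_result"],
        # No separate retriever in this framework, so the expected result
        # doubles as retrieval context for the recall/faithfulness metrics.
        retrieval_context=[case["expected_result"]],
    )
    results = {}
    for metric in metrics:
        metric.measure(test_case)  # scores the case via the custom validation model
        results[type(metric).__name__] = {
            "passed": metric.is_successful(),
            "score": metric.score,
            "reason": metric.reason,
        }
    details.append({
        "prompt": case["prompt"],
        "actual_output": test_case.actual_output,
        "results": results,
    })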
Local Test Results
{
  "timestamp": "2025-04-21T14:01:54.983403",
  "total_cases": 2,
  "passed_cases": 1,
  "details": [
    {
      "prompt": "Write a detailed report on the impact of AI in healthcare, focusing on patient outcomes and operational efficiency.",
      "actual_output": "**Report: The Impact of Artificial Intelligence in Healthcare**\n\n**Introduction**\n\nArtificial intelligence (AI) is revolutionizing the healthcare industry by transforming the way medical professionals diagnose, treat, and manage patient care. The integration of AI in healthcare has led to improved patient outcomes, enhanced operational efficiency, and reduced healthcare costs. This report provides an in-depth analysis of the impact of AI in healthcare, focusing on patient outcomes and operational efficiency.\n\n**Patient Outcomes**\n\nThe application of AI in healthcare has led to significant improvements in patient outcomes. Some of the key benefits include:\n\n1. **Early Diagnosis**: AI-powered systems can analyze large amounts of medical data, including images, lab results, and patient histories, to identify potential health issues before they become severe. This enables early intervention, leading to better patient outcomes.\n2. **Personalized Medicine**: AI can help tailor treatment plans to individual patients based on their genetic profiles, medical histories, and lifestyle factors. This approach leads to more effective treatment and improved patient satisfaction.\n3. **Reduced Readmissions**: AI-powered systems can monitor patients remotely, detecting potential complications and alerting healthcare providers to intervene before hospital readmission becomes necessary.\n4. **Improved Patient Engagement**: AI-powered chatbots and virtual assistants can engage patients",
      "results": {
        "AnswerRelevancyMetric": {
          "passed": true,
          "score": 1.0,
          "reason": "The score is 1.00 because the input was directly addressed in the output without any irrelevant statements present."
        },
        "ContextualRecallMetric": {
          "passed": true,
          "score": 1.0,
          "reason": "The high contextual recall score of 1.00 is due to the close match between the expected output's focus on AI's role in healthcare, including early diagnosis, personalized treatment, resource allocation, and operational efficiency, and the retrieval context's discussion of these very topics."
        },
        "FaithfulnessMetric": {
          "passed": true,
          "score": 1.0,
          "reason": "The faithfulness score is 1.00 because there are no contradictions between the actual output and the retrieval context."
        },
        "BiasMetric": {
          "passed": true,
          "score": 0,
          "reason": "The score is 0.00 because the reasons provided for potential bias were none."
        }
      }
    },
    {
      "prompt": "Explain the ethical challenges of using AI in hiring processes.",
      "actual_output": "A very timely and thought-provoking topic. I'm happy to help you explore the ethical challenges of using AI in hiring processes.\n\nUsing AI in hiring can bring about several benefits, such as increased efficiency, reduced bias, and improved accuracy. However, it also raises several ethical concerns that need to be addressed. Here are some of the key challenges:\n\n1. **Bias in AI decision-making**: AI algorithms can perpetuate existing biases if they are trained on biased data or designed with a particular worldview. This can lead to unfair treatment of certain groups, such as women, people of color, or individuals with disabilities.\n2. **Lack of transparency and explainability**: AI decision-making processes can be opaque, making it difficult to understand how the algorithm arrived at a particular decision. This lack of transparency can lead to mistrust and concerns about fairness.\n3. **Job candidate profiling**: AI-powered hiring tools may focus on specific characteristics, such as age, education level, or personality traits, which can lead to biased hiring decisions.\n4. **Misclassification and stereotyping**: AI algorithms may misclassify job candidates based on their resume, cover letter, or online presence, which can result in stereotyping and unfair treatment.\n5. **Data protection and privacy**: AI-powered hiring",
      "results": {
        "AnswerRelevancyMetric": {
          "passed": true,
          "score": 1.0,
          "reason": "The score is 1.00 because the output was fully relevant to the input, addressing the ethical challenges of using AI in hiring processes."
        },
        "ContextualRecallMetric": {
          "passed": true,
          "score": 1.0,
          "reason": "The score is 1.00 because the expected output directly matches the information in the retrieval context."
        },
        "FaithfulnessMetric": {
          "passed": false,
          "score": 0.8571428571428571,
          "reason": "The faithfulness score is 0.86 despite potential contradiction because the retrieval context raises ethical concerns about discrimination against certain groups, while the actual output states that AI-powered hiring tools may focus on specific characteristics. These specific characteristics could potentially be protected characteristics, leading to the possibility of discrimination if not implemented fairly. The retrieval context and the actual output are not directly contradictory, but they do touch on a related issue that requires careful consideration to ensure fairness and equality in hiring practices."
        },
        "BiasMetric": {
          "passed": true,
          "score": 0,
          "reason": "The score is 0.00 because the reasons provided for potential bias were none."
        }
      }
    }
  ]
}
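For reference, a report in that shape can be assembled from the details list built in the evaluation loop above; the file name and layout here are assumptions, not the framework’s exact code:

import json
from datetime import datetime

report = {
    "timestamp": datetime.now().isoformat(),
    "total_cases": len(details),
    # A case passes only if every metric on it passed.
    "passed_cases": sum(
        1 for d in details if all(r["passed"] for r in d["results"].values())
    ),
    "details": details,
}

# Hypothetical output location; adjust to your framework's layout.
with open("results/local_report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)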
I also integrated my framework with the DeepEval dashboard to see the results.
DeepEval integration guide:
https://www.deepeval.com/docs/getting-started
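Per the getting-started guide, the dashboard hookup is, at the time of writing, a one-time CLI login (deepeval login with your Confident AI API key); once logged in, subsequent evaluation runs appear on the dashboard.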
Insights from the Results
I got different results locally and on the DeepEval dashboard, but I decided to leave troubleshooting that discrepancy for the next article.
Final Thoughts
DeepEval fills a critical gap in modern AI testing — bringing measurable, multi-dimensional evaluation into your workflows.
By integrating it into an existing framework, we can:
- Validate real-world prompts,
- Catch issues traditional tests miss, and
- Move one step closer to building trustworthy, production-ready AI systems.
Coming Up in Part 5…
In the next article, we’ll investigate the differing results we saw locally and on the DeepEval dashboard.