Lesson 6: Testing, Evaluation & Quality Assurance
Course: Enterprise AI Implementation Guide | Lesson 6 of 6
What You'll Learn
By the end of this lesson, you will be able to:
- Build an evaluation framework that catches failures before your users do
- Select the right metrics for your AI use case (not just accuracy)
- Design A/B tests that measure business impact, not just model performance
- Implement continuous monitoring that detects degradation in production
Prerequisites
Before starting this lesson, make sure you've completed:
- Lesson 5: Integration Patterns — your integration architecture determines what you can test and how
- Lesson 4: Data Strategy — data quality directly affects evaluation quality
Or have equivalent experience with:
- Deploying ML models or LLM-based systems to production
- Basic understanding of statistical metrics (precision, recall, F1)
The Testing Gap That Kills AI Projects
Your model hits 94% accuracy on the test set. The team celebrates. You deploy to production. Within two weeks, support tickets double, customers complain about wrong answers, and leadership starts asking hard questions.
This happens because most AI teams confuse offline evaluation with production readiness. They test model performance on historical data, declare success, and ship. What they don't test: how the model behaves on edge cases, how it degrades over time, how it handles data that looks nothing like the training set, and whether "94% accuracy" actually translates to business value.
A Gartner prediction warns that over 40% of agentic AI projects will be scrapped by 2027 because they fail to deliver business value — not because the models were bad, but because teams didn't build the evaluation infrastructure to catch problems early.
Here's the uncomfortable truth: testing is where the real engineering happens. The model is the easy part. The evaluation framework that ensures the model keeps working in production — that's what separates expensive demos from production systems.
The Four-Layer Evaluation Framework
Production AI testing isn't one activity. It's four distinct layers, each catching different failure modes. Skip any layer, and you're flying blind in a specific dimension.
Layer 1: Unit Testing (Model Behavior)
Unit tests for AI check that the model produces correct outputs for known inputs. This isn't about aggregate metrics — it's about specific behaviors you need to guarantee.
What to test:
- Golden set: 50-200 curated examples that represent your critical use cases. Every model update must pass these before deployment.
- Edge cases: Inputs at the boundaries of what your model should handle. Empty inputs, extremely long inputs, adversarial inputs, inputs in unexpected formats.
- Invariance tests: Changing irrelevant features shouldn't change the output. If you rephrase "What's your return policy?" as "What is your return policy?", the answer should be the same.
- Directional tests: Changing relevant features should change the output predictably. A customer with 3 overdue invoices should get a higher risk score than one with 0.
```python
# Example: Golden set test for a support classification model
def test_golden_set():
    golden_examples = [
        {"input": "I can't log in to my account", "expected": "authentication"},
        {"input": "When will my order arrive?", "expected": "shipping"},
        {"input": "I want a refund", "expected": "billing"},
        {"input": "How do I export my data?", "expected": "feature_question"},
    ]
    results = [model.classify(ex["input"]) for ex in golden_examples]
    accuracy = sum(
        r == ex["expected"] for r, ex in zip(results, golden_examples)
    ) / len(golden_examples)
    assert accuracy >= 0.95, f"Golden set accuracy {accuracy:.2%} below 95% threshold"
```
```python
# Example: Invariance test
def test_invariance_rephrasing():
    pairs = [
        ("What's your return policy?", "What is your return policy?"),
        ("How do I cancel?", "I need to cancel my subscription"),
        ("Pricing info", "What are your prices?"),
    ]
    for original, rephrased in pairs:
        assert model.classify(original) == model.classify(rephrased), \
            f"Classification changed between '{original}' and '{rephrased}'"
```
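Directional tests follow the same pattern. The sketch below is a minimal, self-contained illustration of the overdue-invoices example above; `score_risk` is a toy stand-in defined here so the test runs on its own — in practice you would replace it with a call to your actual risk model.

```python
# Directional test: adding overdue invoices must increase the risk score.
# score_risk is a hypothetical stand-in, not a real model.
def score_risk(customer: dict) -> float:
    # toy scorer so this example is self-contained
    return min(1.0, 0.1 + 0.2 * customer["overdue_invoices"])

def test_directional_overdue_invoices():
    base = {"overdue_invoices": 0, "tenure_months": 24}
    riskier = {**base, "overdue_invoices": 3}
    assert score_risk(riskier) > score_risk(base), (
        f"Expected higher risk with 3 overdue invoices: "
        f"{score_risk(riskier)} vs {score_risk(base)}"
    )

test_directional_overdue_invoices()
```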
Frequency: Run on every model update, every prompt change, every data pipeline change. Automate in CI/CD.
Layer 2: Slice-Based Evaluation (Fairness and Coverage)
Aggregate metrics hide critical failures. A model with 92% overall accuracy might have 60% accuracy on your highest-value customer segment. Slice-based evaluation breaks performance down by meaningful groups.
Critical slices to evaluate:
| Slice Dimension | Why It Matters | Example |
|---|---|---|
| Customer segment | High-value customers may have different patterns | Enterprise vs SMB support queries |
| Data volume | Models often fail on sparse categories | Rare product returns vs common ones |
| Time period | Data drift hits recent data first | Last 7 days vs last 90 days |
| Input complexity | Edge cases cluster in complex inputs | Multi-topic support tickets |
| Geographic/demographic | Bias detection and fairness | Regional language variations |
```python
# Slice-based evaluation
def evaluate_by_slice(test_data, model):
    slices = {
        "enterprise": [x for x in test_data if x["segment"] == "enterprise"],
        "smb": [x for x in test_data if x["segment"] == "smb"],
        "recent_7d": [x for x in test_data if x["age_days"] <= 7],
        "high_value": [x for x in test_data if x["ltv"] > 10000],
    }
    for slice_name, slice_data in slices.items():
        accuracy = compute_accuracy(model, slice_data)
        print(f"{slice_name}: {accuracy:.2%} (n={len(slice_data)})")
        # Alert if any slice drops below threshold
        if accuracy < 0.85:
            alert(f"DEGRADATION: {slice_name} accuracy at {accuracy:.2%}")
```
This is where you catch the failures that matter most. A 2% drop in overall accuracy might mean a 15% drop in accuracy for your enterprise customers — the ones generating 80% of revenue.
Layer 3: Integration Testing (System Behavior)
AI models don't run in isolation. They're part of a system with APIs, databases, queues, and user interfaces. Integration tests verify the system works end-to-end — not just the model in isolation.
What integration tests cover:
- Latency: The model returns results within your SLA (typically under 2 seconds for user-facing, under 30 seconds for batch). A model that takes 8 seconds per query is useless for real-time support even if it's highly accurate.
- Throughput: The system handles your expected traffic without degradation. Test at 2x your peak load.
- Error handling: What happens when the model fails? Does the system return a graceful fallback, retry, or crash?
- Data flow: Input data reaches the model correctly. Model outputs reach downstream systems correctly. No data corruption in transit.
- Fallback behavior: When the model's confidence is below threshold, does the human handoff work? When the API is down, does the queue-based fallback activate?
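A latency check at the integration level exercises the full request path, not just the model. The sketch below is self-contained: `handle_request` is a hypothetical stand-in for your real endpoint (here it just sleeps briefly), and the SLA threshold is the 2-second user-facing figure from the list above.

```python
import time

# Integration-level latency test: measure p95 over repeated calls to the
# full request path. handle_request is an illustrative stub, not a real API.
def handle_request(query: str) -> str:
    time.sleep(0.05)  # simulate model + retrieval + serialization
    return "answer"

def test_latency_sla(n_requests: int = 20, sla_seconds: float = 2.0) -> float:
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        handle_request("What's your return policy?")
        latencies.append(time.perf_counter() - start)
    p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
    assert p95 < sla_seconds, f"p95 latency {p95:.2f}s exceeds {sla_seconds}s SLA"
    return p95

p95 = test_latency_sla()
```

The same harness, pointed at a staging endpoint and run at 2x peak load, doubles as a throughput test.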
If you built your integration following the API wrapping or RAG patterns from Lesson 5, your test strategy changes. API-wrapped systems need latency and rate-limit tests. RAG systems need retrieval quality tests on top of generation quality tests.
Layer 4: Business Validation (Impact Testing)
This is the layer most teams skip entirely — and it's the one that determines whether your project survives past quarter two.
Business validation answers one question: does the AI system produce the business outcome you promised in the business case?
If you built your business case in Lesson 2 with a projected 40% cost reduction in support, your evaluation framework must measure actual cost reduction — not just ticket deflection rate or model accuracy.
Map model metrics to business metrics:
| Model Metric | Business Metric | Why They Diverge |
|---|---|---|
| Classification accuracy | Customer satisfaction (CSAT) | Wrong classification on high-urgency tickets destroys CSAT even if overall accuracy is high |
| Response generation quality | First-contact resolution rate | A "correct" response that's confusing still generates follow-up tickets |
| Processing speed | Throughput cost per unit | Faster processing only saves money if it reduces headcount or infrastructure |
| Fraud detection rate | Net fraud losses | Catching 99% of fraud but false-flagging 5% of legitimate transactions costs more than the fraud itself |
Selecting the Right Metrics
The biggest mistake in AI evaluation is choosing metrics based on what's easy to measure instead of what matters. Here's how to select metrics that actually drive decisions.
The Metric Selection Framework
Step 1: Start with the business outcome. What did you promise stakeholders? Cost reduction, revenue increase, quality improvement, speed improvement. Write it down.
Step 2: Identify the proxy metric. What measurable quantity correlates with that business outcome? For cost reduction in support, that might be auto-resolution rate. For quality improvement in manufacturing, that might be defect detection rate.
Step 3: Define the counter-metric. Every optimization has a cost. If you optimize for auto-resolution rate, the counter-metric is false resolution rate (tickets marked resolved that come back). If you optimize for fraud detection, the counter-metric is false positive rate.
Step 4: Set thresholds, not targets. "Maximize accuracy" is not actionable. "Accuracy must stay above 90% while false positive rate stays below 3%" is actionable. Thresholds give you clear go/no-go criteria for deployment.
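Thresholds like these translate directly into an automated deployment gate. A minimal sketch, with illustrative metric names and the example thresholds from Step 4:

```python
# Deployment gate: both the primary metric and the counter-metric must pass.
# Metric names and thresholds are illustrative.
def deployment_gate(metrics: dict) -> bool:
    checks = {
        "accuracy": metrics["accuracy"] >= 0.90,        # must stay above 90%
        "false_positive_rate": metrics["fpr"] <= 0.03,  # must stay below 3%
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        print(f"NO-GO: failed checks: {failed}")
        return False
    return True

deployment_gate({"accuracy": 0.93, "fpr": 0.02})  # passes: go
deployment_gate({"accuracy": 0.93, "fpr": 0.05})  # counter-metric fails: no-go
```

Run the gate in CI/CD so a release that improves the primary metric but blows the counter-metric never ships.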
Metric Cheat Sheet by Use Case
| Use Case | Primary Metric | Counter-Metric | Threshold Example |
|---|---|---|---|
| Support classification | Precision per category | Misrouting rate | Precision above 90%, misrouting under 5% |
| Document extraction | Field-level accuracy | Manual review rate | Accuracy above 95%, manual review under 10% |
| Fraud detection | Recall (catch rate) | False positive rate | Recall above 95%, FPR under 2% |
| Demand forecasting | MAPE (error %) | Bias (over vs under-predict) | MAPE under 15%, bias within +/- 3% |
| Content generation | Human preference rate | Hallucination rate | Preference above 70%, hallucination under 2% |
| Voice agents | Task completion rate | Customer escalation rate | Completion above 60%, escalation under 15% |
The Metric Hierarchy
Not all metrics are equal. Structure them in three tiers:
- North star metric (1 metric): The single number that maps to your business case. Report this to executives. Example: cost per resolved support ticket.
- Diagnostic metrics (3-5 metrics): The metrics that explain why the north star moved. Example: auto-resolution rate, average handle time, escalation rate, CSAT.
- Operational metrics (5-10 metrics): The metrics your engineering team watches daily. Example: model latency p95, token usage, retrieval accuracy, cache hit rate, error rate.
When the north star drops, you look at diagnostic metrics to find the cause. When a diagnostic metric drops, you look at operational metrics to find the root cause. This hierarchy prevents alert fatigue — you're not watching 50 dashboards.
A/B Testing AI Systems in Production
Offline evaluation tells you if a model could work. A/B testing tells you if it does work with real users, real data, and real business impact. Statsig's research confirms that F1 score improvements don't always translate to business metric improvements — you need production A/B tests to verify.
The A/B Testing Protocol
Step 1: Define your Overall Evaluation Criterion (OEC).
Pick one metric that captures success. Not accuracy, not latency — the business metric. Revenue per user, cost per ticket resolved, defect escape rate. This is what you're optimizing for.
Step 2: Calculate sample size before you start.
AI A/B tests need larger sample sizes than UI tests because the effect sizes are smaller. A 2% improvement in auto-resolution rate might need 50,000 tickets per variant to reach statistical significance at 95% confidence. If you get 1,000 tickets per day, that's 50 days per variant — plan accordingly.
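The standard two-proportion sample-size formula makes this concrete. The sketch below uses only the Python standard library; exact numbers depend on your baseline rate, the lift you want to detect, and the power you choose, so treat the output as an estimate, not the 50,000 figure above.

```python
from statistics import NormalDist

# Approximate n per variant for a two-proportion z-test
# (detecting a lift from p_base to p_new).
def sample_size_per_variant(p_base: float, p_new: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
        + z_beta * (p_base * (1 - p_base) + p_new * (1 - p_new)) ** 0.5
    ) ** 2
    return int(numerator / (p_new - p_base) ** 2) + 1

# Detecting a 2-point lift in auto-resolution rate (40% -> 42%) at 80% power:
n = sample_size_per_variant(0.40, 0.42)
print(n)  # roughly 9,500 tickets per variant; smaller lifts or higher power need far more
```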
Step 3: Use progressive rollout, not 50/50 splits.
- Days 1-3: 1% traffic to new model (smoke test)
- Days 4-7: 5% traffic (early signal)
- Days 8-14: 25% traffic (significance builds)
- Days 15-21: 50% traffic (full A/B test)
- Day 22+: 100% or rollback (decision)
This catches catastrophic failures before they affect most users. If the new model starts generating harmful content or crashing on a specific input pattern, you catch it at 1% traffic, not 50%.
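One common way to implement the split is deterministic bucketing: hash the user or ticket ID so the same user always lands in the same variant within a stage. A minimal sketch, with an illustrative schedule matching the one above:

```python
import hashlib

# day -> % of traffic routed to the new model (illustrative schedule)
ROLLOUT_SCHEDULE = {1: 1, 4: 5, 8: 25, 15: 50}

def current_percentage(day: int) -> int:
    pct = 0
    for start_day, p in sorted(ROLLOUT_SCHEDULE.items()):
        if day >= start_day:
            pct = p
    return pct

def route(user_id: str, day: int) -> str:
    # Stable hash -> bucket 0-99; the same user always gets the same bucket
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "new_model" if bucket < current_percentage(day) else "production"
```

Deterministic bucketing also keeps the experiment clean: a user never flips between variants mid-conversation as the percentage ramps up within a stage.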
Step 4: Monitor counter-metrics continuously.
Your new model might improve auto-resolution by 5% while increasing CSAT complaints by 8%. Without counter-metric monitoring, you'd declare victory based on the primary metric and create a bigger problem.
Shadow Testing: Test Without Risk
Before A/B testing, run the new model in shadow mode: it processes real production traffic but its outputs are logged, not served to users. Compare shadow outputs to the production model's outputs.
```python
# Shadow testing architecture
import asyncio

async def handle_request(input_data):
    # Production model serves the response
    production_result = await production_model.predict(input_data)

    # Shadow model runs in the background so it never adds user-facing
    # latency; its output is logged for comparison, never served
    async def run_shadow():
        shadow_result = await shadow_model.predict(input_data)
        log_comparison(input_data, production_result, shadow_result)

    asyncio.create_task(run_shadow())

    # Only the production result is returned to the user
    return production_result
```
Shadow testing answers the question "would the new model have been better?" without any user-facing risk. Run shadow tests for at least one full business cycle (typically one week) before starting A/B tests.
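Analyzing the shadow logs can be as simple as an agreement rate plus a hand-review sample of disagreements. A minimal sketch, assuming log rows shaped like the `log_comparison` call above:

```python
# Summarize shadow-test logs: overall agreement plus a sample of
# disagreements for manual review. Row shape is an assumption.
def analyze_shadow_logs(logs: list) -> dict:
    disagreements = [r for r in logs if r["production"] != r["shadow"]]
    return {
        "n": len(logs),
        "agreement_rate": 1 - len(disagreements) / len(logs),
        "sample_disagreements": disagreements[:5],
    }

logs = [
    {"input": "refund please", "production": "billing", "shadow": "billing"},
    {"input": "can't log in", "production": "authentication", "shadow": "authentication"},
    {"input": "cancel now", "production": "billing", "shadow": "churn_risk"},
    {"input": "export data", "production": "feature_question", "shadow": "feature_question"},
]
report = analyze_shadow_logs(logs)
print(report["agreement_rate"])  # 0.75
```

Low agreement isn't automatically bad — the disagreements are exactly where the new model might be better — which is why the sampled rows go to human review.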
Continuous Monitoring in Production
Deploying a model isn't the finish line — it's the starting point for monitoring. Models degrade. Data drifts. User behavior changes. Without continuous monitoring, your 92% accuracy model silently drops to 75% over three months and nobody notices until the business impact becomes visible.
The Three Signals of Model Degradation
1. Input drift: The data coming into your model starts looking different from what it was trained on. New product categories, seasonal patterns, market shifts. Monitor input distributions weekly and alert on significant changes.
2. Output drift: The model's predictions shift even if inputs haven't changed noticeably. This happens with LLMs when API providers update model versions silently. Monitor output distribution and confidence score distributions.
3. Performance drift: The actual accuracy drops as measured against ground truth labels. This requires a labeling pipeline — automatically sample production predictions and get human labels to measure ongoing accuracy.
```python
# Monitoring setup
class ModelMonitor:
    def __init__(self, baseline_stats):
        self.baseline = baseline_stats

    def check_input_drift(self, recent_inputs):
        """Compare recent input distribution to training distribution"""
        drift_score = compute_psi(self.baseline["input_dist"],
                                  compute_distribution(recent_inputs))
        if drift_score > 0.2:  # PSI threshold
            alert("INPUT_DRIFT", f"PSI={drift_score:.3f}")

    def check_output_drift(self, recent_outputs):
        """Compare recent output distribution to baseline"""
        drift_score = compute_psi(self.baseline["output_dist"],
                                  compute_distribution(recent_outputs))
        if drift_score > 0.15:
            alert("OUTPUT_DRIFT", f"PSI={drift_score:.3f}")

    def check_performance(self, recent_labeled):
        """Compare recent accuracy to baseline"""
        current_accuracy = compute_accuracy(recent_labeled)
        if current_accuracy < self.baseline["accuracy"] * 0.95:  # 5% drop
            alert("PERFORMANCE_DRIFT",
                  f"Accuracy dropped from {self.baseline['accuracy']:.2%} "
                  f"to {current_accuracy:.2%}")
```
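The `compute_psi` helper used above is the standard Population Stability Index: for each bin, `(actual% - expected%) * ln(actual% / expected%)`, summed across bins. A minimal implementation, assuming both inputs are bin proportions computed over the same binning:

```python
import math

# Population Stability Index over pre-binned proportions.
# eps guards against log(0) when a bin is empty.
def compute_psi(expected: list, actual: list, eps: float = 1e-4) -> float:
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]
drifted = [0.40, 0.30, 0.20, 0.10]
print(round(compute_psi(baseline, drifted), 3))  # ≈ 0.228 — above the 0.2 alert threshold
```

A common rule of thumb: PSI below 0.1 is stable, 0.1-0.2 warrants a look, and above 0.2 indicates significant drift.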
The Monitoring Dashboard
Every production AI system needs a monitoring dashboard with these views:
| View | Refresh Rate | Audience |
|---|---|---|
| Real-time error rate and latency | Every minute | On-call engineer |
| Daily accuracy on sampled predictions | Daily | ML team |
| Weekly drift scores and distribution plots | Weekly | ML team + PM |
| Monthly business metric impact | Monthly | Leadership |
The tooling landscape has matured significantly. Platforms like Braintrust, Arize, and Evidently AI provide production monitoring out of the box, with trace-level debugging that connects production failures to specific model behaviors. If your team is building MLOps infrastructure from scratch in 2026, you're reinventing a solved problem.
Exercise: Build Your Evaluation Plan
Put your learning into practice:
Task: Pick an AI use case from your organization (or use the customer support classification example from this lesson). Build a complete evaluation plan using the four-layer framework.
Expected Outcome: A one-page document with:
- Golden set: 10 example test cases with expected outputs
- Slice definitions: 3-5 slices relevant to your use case
- Metric hierarchy: north star, diagnostic metrics, operational metrics
- A/B test design: OEC, sample size estimate, rollout schedule
- Monitoring alerts: 3 drift signals with thresholds
Time Required: 2-3 hours
Hint (if you get stuck)
Start with the business outcome, not the model. Ask: "What happens if this AI system fails?" The answer tells you what to test most aggressively. If failure means a customer gets the wrong answer, prioritize accuracy testing on your highest-value segments. If failure means a 10-second delay, prioritize latency testing under load.
Solution (Support Classification Example)
Golden set (10 examples):
- "I can't log in" → authentication (critical)
- "My payment failed" → billing (critical)
- "When does my trial end?" → billing
- "How do I export CSV?" → feature_question
- "Your product is broken" → bug_report (critical)
- "I want to cancel" → churn_risk (critical)
- "" (empty) → should return "unknown" not crash
- "asdfghjkl" → should return "unknown"
- "I can't log in and my payment failed and I want to cancel" → multi-intent handling
- "Estoy tratando de iniciar sesion" → non-English handling
Slices: Enterprise customers, free-tier users, tickets with attachments, non-English tickets, tickets created during business hours vs off-hours.
Metric hierarchy:
- North star: Cost per resolved ticket
- Diagnostic: Auto-resolution rate, escalation rate, re-open rate, CSAT
- Operational: Classification latency p95, confidence distribution, error rate
A/B test: OEC = cost per resolved ticket. Need 30,000 tickets per variant. At 2,000/day, run for 15 days per variant. Progressive rollout: 1% (3 days) → 10% (4 days) → 50% (15 days).
Monitoring alerts: Input drift (PSI above 0.2, check daily). Accuracy on golden set (below 92%, check on every deployment). Cost per ticket (above 10% increase week-over-week, check weekly).
Key Takeaways
- Test in four layers: Unit tests catch model behavior bugs. Slice-based evaluation catches hidden failures in critical segments. Integration tests catch system-level issues. Business validation catches the gap between model metrics and business outcomes.
- Metrics must map to money: Every model metric should trace back to a business metric. If you can't explain why a metric matters in dollar terms, you're measuring the wrong thing.
- A/B test with progressive rollout: Shadow test first, then start at 1% traffic. Catch catastrophic failures before they reach most users. Always monitor counter-metrics alongside your primary metric.
- Monitoring is not optional: Models degrade silently. Build input drift, output drift, and performance drift monitoring before you deploy, not after the first incident.
Quick Reference
| Concept | Definition | Example |
|---|---|---|
| Golden Set | Curated test cases that every model version must pass | 50 critical support tickets with verified correct classifications |
| Slice-Based Evaluation | Breaking performance metrics by meaningful subgroups | Accuracy by customer tier, input language, ticket complexity |
| Shadow Testing | Running a new model on production traffic without serving its outputs | Logging new model predictions alongside production model for comparison |
| Input Drift | Change in the distribution of data entering the model over time | New product category creates tickets the model hasn't seen |
| OEC | Overall Evaluation Criterion — the single business metric for A/B tests | Cost per resolved ticket, revenue per recommendation |
| PSI | Population Stability Index — measures distribution shift between datasets | PSI above 0.2 indicates significant drift requiring investigation |
Course Complete
This is the final lesson in the Enterprise AI Implementation Guide. Over six lessons, you've learned how to:
- Assess your AI readiness across six pillars
- Build a business case that survives CFO scrutiny
- Assemble the right team in the right order
- Design a data strategy that prevents the most common failure mode
- Choose integration patterns that match your use case and budget
- Build the testing and evaluation infrastructure that keeps your AI system working in production
The gap between AI proof-of-concept and production deployment is where most projects die. The companies that cross it aren't the ones with the best models — they're the ones with the best engineering practices around those models. Testing, evaluation, and monitoring are that engineering.
If you're ready to put this into practice and want a team that's done it before, let's talk.
FAQ
How much time should we spend on testing vs building?
Plan for a 40/60 split. 40% of your project timeline should go to evaluation infrastructure, testing, and monitoring. 60% goes to data preparation, model development, and integration. Most teams spend under 10% on testing and pay for it in production incidents. If you followed the 12-week timeline from the POC-to-production guide, weeks 7-10 should be heavy on evaluation.
What if we don't have ground truth labels for monitoring?
You have three options. First, use human-in-the-loop sampling: randomly sample 1-5% of production predictions and get human labels weekly. Second, use proxy metrics: if you can't label accuracy directly, measure downstream indicators like customer re-contacts (if the AI answered wrong, the customer comes back). Third, use LLM-as-judge: use a more capable model to evaluate the production model's outputs. This is imperfect but catches the worst failures. Most mature teams combine all three.
Can we skip A/B testing for internal tools?
You can skip the progressive rollout, but you should still measure before and after. Run the old process and new AI-powered process in parallel for two weeks. Compare on your north star metric. Internal tools have lower risk but the same need for evidence that the AI is actually helping. The number of "AI-powered" internal tools that are slower than the spreadsheet they replaced is embarrassingly high.
Need help with AI implementation?
We build production AI systems that actually ship. Not demos, not POCs—real systems that run your business.
Get in Touch