
Enterprise AI Lesson 06: Testing, Evaluation & Quality Assurance

Most AI teams skip testing until production breaks. Learn the evaluation framework, metric selection, A/B testing patterns, and QA practices that separate production-grade AI from expensive demos.

Lesson 6: Testing, Evaluation & Quality Assurance

Course: Enterprise AI Implementation Guide | Lesson 6 of 6


What You'll Learn

By the end of this lesson, you will be able to:

  • Build an evaluation framework that catches failures before your users do
  • Select the right metrics for your AI use case (not just accuracy)
  • Design A/B tests that measure business impact, not just model performance
  • Implement continuous monitoring that detects degradation in production

Prerequisites

Before starting this lesson, make sure you've completed the earlier lessons in this course (in particular Lesson 2 on the business case and Lesson 5 on integration patterns), or have equivalent experience with:

  • Deploying ML models or LLM-based systems to production
  • Basic understanding of statistical metrics (precision, recall, F1)

The Testing Gap That Kills AI Projects

Your model hits 94% accuracy on the test set. The team celebrates. You deploy to production. Within two weeks, support tickets double, customers complain about wrong answers, and leadership starts asking hard questions.

This happens because most AI teams confuse offline evaluation with production readiness. They test model performance on historical data, declare success, and ship. What they don't test: how the model behaves on edge cases, how it degrades over time, how it handles data that looks nothing like the training set, and whether "94% accuracy" actually translates to business value.

A Gartner prediction warns that over 40% of agentic AI projects will be scrapped by 2027 because they fail to deliver business value — not because the models were bad, but because teams didn't build the evaluation infrastructure to catch problems early.

Here's the uncomfortable truth: testing is where the real engineering happens. The model is the easy part. The evaluation framework that ensures the model keeps working in production — that's what separates expensive demos from production systems.

The Four-Layer Evaluation Framework

Production AI testing isn't one activity. It's four distinct layers, each catching different failure modes. Skip any layer, and you're flying blind in a specific dimension.

Layer 1: Unit Testing (Model Behavior)

Unit tests for AI check that the model produces correct outputs for known inputs. This isn't about aggregate metrics — it's about specific behaviors you need to guarantee.

What to test:

  • Golden set: 50-200 curated examples that represent your critical use cases. Every model update must pass these before deployment.
  • Edge cases: Inputs at the boundaries of what your model should handle. Empty inputs, extremely long inputs, adversarial inputs, inputs in unexpected formats.
  • Invariance tests: Changing irrelevant features shouldn't change the output. If you rephrase "What's your return policy?" as "What is your return policy?", the answer should be the same.
  • Directional tests: Changing relevant features should change the output predictably. A customer with 3 overdue invoices should get a higher risk score than one with 0.
```python
# Example: Golden set test for a support classification model
def test_golden_set():
    golden_examples = [
        {"input": "I can't log in to my account", "expected": "authentication"},
        {"input": "When will my order arrive?", "expected": "shipping"},
        {"input": "I want a refund", "expected": "billing"},
        {"input": "How do I export my data?", "expected": "feature_question"},
    ]

    results = [model.classify(ex["input"]) for ex in golden_examples]
    accuracy = sum(
        r == ex["expected"] for r, ex in zip(results, golden_examples)
    ) / len(golden_examples)

    assert accuracy >= 0.95, f"Golden set accuracy {accuracy} below 95% threshold"

# Example: Invariance test
def test_invariance_rephrasing():
    pairs = [
        ("What's your return policy?", "What is your return policy?"),
        ("How do I cancel?", "I need to cancel my subscription"),
        ("Pricing info", "What are your prices?"),
    ]
    for original, rephrased in pairs:
        assert model.classify(original) == model.classify(rephrased), \
            f"Classification changed between '{original}' and '{rephrased}'"
```
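
The directional expectation from the list above can be tested the same way. A minimal sketch, with a hypothetical `risk_model` stub standing in for your real scorer so the test is self-contained:

```python
# Stand-in scorer so the test runs on its own; replace with your real model.
class risk_model:
    @staticmethod
    def score(customer):
        return 0.1 + 0.2 * customer["overdue_invoices"]

# Example: Directional test
def test_directional_overdue_invoices():
    base_customer = {"overdue_invoices": 0, "account_age_days": 400}
    risky_customer = {**base_customer, "overdue_invoices": 3}
    # More overdue invoices must push the risk score strictly upward
    assert risk_model.score(risky_customer) > risk_model.score(base_customer), \
        "Risk score did not increase with overdue invoices"
```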

Frequency: Run on every model update, every prompt change, every data pipeline change. Automate in CI/CD.

Layer 2: Slice-Based Evaluation (Fairness and Coverage)

Aggregate metrics hide critical failures. A model with 92% overall accuracy might have 60% accuracy on your highest-value customer segment. Slice-based evaluation breaks performance down by meaningful groups.

Critical slices to evaluate:

| Slice Dimension | Why It Matters | Example |
|---|---|---|
| Customer segment | High-value customers may have different patterns | Enterprise vs SMB support queries |
| Data volume | Models often fail on sparse categories | Rare product returns vs common ones |
| Time period | Data drift hits recent data first | Last 7 days vs last 90 days |
| Input complexity | Edge cases cluster in complex inputs | Multi-topic support tickets |
| Geographic/demographic | Bias detection and fairness | Regional language variations |
```python
# Slice-based evaluation
def evaluate_by_slice(test_data, model):
    slices = {
        "enterprise": [x for x in test_data if x["segment"] == "enterprise"],
        "smb": [x for x in test_data if x["segment"] == "smb"],
        "recent_7d": [x for x in test_data if x["age_days"] <= 7],
        "high_value": [x for x in test_data if x["ltv"] > 10000],
    }

    for slice_name, slice_data in slices.items():
        accuracy = compute_accuracy(model, slice_data)
        print(f"{slice_name}: {accuracy:.2%} (n={len(slice_data)})")

        # Alert if any slice drops below threshold
        if accuracy < 0.85:
            alert(f"DEGRADATION: {slice_name} accuracy at {accuracy:.2%}")
```

This is where you catch the failures that matter most. A 2% drop in overall accuracy might mean a 15% drop in accuracy for your enterprise customers — the ones generating 80% of revenue.

Layer 3: Integration Testing (System Behavior)

AI models don't run in isolation. They're part of a system with APIs, databases, queues, and user interfaces. Integration tests verify the system works end-to-end — not just the model in isolation.

What integration tests cover:

  • Latency: The model returns results within your SLA (typically under 2 seconds for user-facing, under 30 seconds for batch). A model that takes 8 seconds per query is useless for real-time support even if it's highly accurate.
  • Throughput: The system handles your expected traffic without degradation. Test at 2x your peak load.
  • Error handling: What happens when the model fails? Does the system return a graceful fallback, retry, or crash?
  • Data flow: Input data reaches the model correctly. Model outputs reach downstream systems correctly. No data corruption in transit.
  • Fallback behavior: When the model's confidence is below threshold, does the human handoff work? When the API is down, does the queue-based fallback activate?
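
The latency and fallback checks above can be written as ordinary integration tests. A minimal sketch, with a hypothetical `classify_ticket` client and `route_to_human` fallback standing in for your real service; the SLA and confidence numbers are illustrative:

```python
import time

# Hypothetical service client and fallback; stand-ins for your real system.
def classify_ticket(text):
    time.sleep(0.05)  # simulate model + network latency
    return {"category": "billing", "confidence": 0.55}

def route_to_human(text):
    return {"category": "human_review", "confidence": None}

SLA_SECONDS = 2.0
CONFIDENCE_FLOOR = 0.7

def test_latency_within_sla():
    start = time.perf_counter()
    classify_ticket("My payment failed")
    assert time.perf_counter() - start < SLA_SECONDS

def test_low_confidence_falls_back_to_human():
    text = "I can't log in and my payment failed"
    result = classify_ticket(text)
    if result["confidence"] < CONFIDENCE_FLOOR:
        result = route_to_human(text)
    assert result["category"] == "human_review"
```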

If you built your integration following the API wrapping or RAG patterns from Lesson 5, your test strategy changes. API-wrapped systems need latency and rate-limit tests. RAG systems need retrieval quality tests on top of generation quality tests.

Layer 4: Business Validation (Impact Testing)

This is the layer most teams skip entirely — and it's the one that determines whether your project survives past quarter two.

Business validation answers one question: does the AI system produce the business outcome you promised in the business case?

If you built your business case in Lesson 2 with a projected 40% cost reduction in support, your evaluation framework must measure actual cost reduction — not just ticket deflection rate or model accuracy.

Map model metrics to business metrics:

| Model Metric | Business Metric | Why They Diverge |
|---|---|---|
| Classification accuracy | Customer satisfaction (CSAT) | Wrong classification on high-urgency tickets destroys CSAT even if overall accuracy is high |
| Response generation quality | First-contact resolution rate | A "correct" response that's confusing still generates follow-up tickets |
| Processing speed | Throughput cost per unit | Faster processing only saves money if it reduces headcount or infrastructure |
| Fraud detection rate | Net fraud losses | Catching 99% of fraud but false-flagging 5% of legitimate transactions costs more than the fraud itself |
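
The key point is that the business metric is computed from operational data, not from the model's own scores. A toy sketch of the north-star metric for the support example; the field names and cost figures are hypothetical:

```python
def cost_per_resolved_ticket(tickets, agent_cost=12.00, ai_cost=0.40):
    """Business metric: total handling cost divided by resolved tickets.

    Each ticket dict uses hypothetical fields: 'handled_by' ('ai' or 'agent')
    and 'resolved' (bool). A re-opened ticket counts as unresolved, so false
    resolutions raise this metric even when model accuracy looks fine.
    """
    total_cost = sum(
        ai_cost if t["handled_by"] == "ai" else agent_cost for t in tickets
    )
    resolved = sum(1 for t in tickets if t["resolved"])
    return total_cost / resolved if resolved else float("inf")

tickets = [
    {"handled_by": "ai", "resolved": True},
    {"handled_by": "ai", "resolved": False},   # false resolution, re-opened
    {"handled_by": "agent", "resolved": True},
]
# (0.40 + 0.40 + 12.00) / 2 resolved = 6.40 per resolved ticket
print(cost_per_resolved_ticket(tickets))
```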

Selecting the Right Metrics

The biggest mistake in AI evaluation is choosing metrics based on what's easy to measure instead of what matters. Here's how to select metrics that actually drive decisions.

The Metric Selection Framework

Step 1: Start with the business outcome. What did you promise stakeholders? Cost reduction, revenue increase, quality improvement, speed improvement. Write it down.

Step 2: Identify the proxy metric. What measurable quantity correlates with that business outcome? For cost reduction in support, that might be auto-resolution rate. For quality improvement in manufacturing, that might be defect detection rate.

Step 3: Define the counter-metric. Every optimization has a cost. If you optimize for auto-resolution rate, the counter-metric is false resolution rate (tickets marked resolved that come back). If you optimize for fraud detection, the counter-metric is false positive rate.

Step 4: Set thresholds, not targets. "Maximize accuracy" is not actionable. "Accuracy must stay above 90% while false positive rate stays below 3%" is actionable. Thresholds give you clear go/no-go criteria for deployment.
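
Thresholds like these can be encoded directly as an automated go/no-go gate in your deployment pipeline. A sketch; the metric names and bounds are illustrative:

```python
# Illustrative gate: each metric gets a floor or ceiling, never a "maximize".
THRESHOLDS = {
    "accuracy":            ("min", 0.90),
    "false_positive_rate": ("max", 0.03),
}

def deployment_gate(metrics):
    """Return (go, violations) for a candidate model's evaluation results."""
    violations = []
    for name, (direction, bound) in THRESHOLDS.items():
        value = metrics[name]
        if direction == "min" and value < bound:
            violations.append(f"{name}={value:.3f} below floor {bound}")
        if direction == "max" and value > bound:
            violations.append(f"{name}={value:.3f} above ceiling {bound}")
    return (not violations, violations)

go, why = deployment_gate({"accuracy": 0.92, "false_positive_rate": 0.05})
# go is False: the FPR ceiling is violated even though accuracy passes
```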

Metric Cheat Sheet by Use Case

| Use Case | Primary Metric | Counter-Metric | Threshold Example |
|---|---|---|---|
| Support classification | Precision per category | Misrouting rate | Precision above 90%, misrouting under 5% |
| Document extraction | Field-level accuracy | Manual review rate | Accuracy above 95%, manual review under 10% |
| Fraud detection | Recall (catch rate) | False positive rate | Recall above 95%, FPR under 2% |
| Demand forecasting | MAPE (error %) | Bias (over vs under-predict) | MAPE under 15%, bias within +/- 3% |
| Content generation | Human preference rate | Hallucination rate | Preference above 70%, hallucination under 2% |
| Voice agents | Task completion rate | Customer escalation rate | Completion above 60%, escalation under 15% |

The Metric Hierarchy

Not all metrics are equal. Structure them in three tiers:

  1. North star metric (1 metric): The single number that maps to your business case. Report this to executives. Example: cost per resolved support ticket.
  2. Diagnostic metrics (3-5 metrics): The metrics that explain why the north star moved. Example: auto-resolution rate, average handle time, escalation rate, CSAT.
  3. Operational metrics (5-10 metrics): The metrics your engineering team watches daily. Example: model latency p95, token usage, retrieval accuracy, cache hit rate, error rate.

When the north star drops, you look at diagnostic metrics to find the cause. When a diagnostic metric drops, you look at operational metrics to find the root cause. This hierarchy prevents alert fatigue — you're not watching 50 dashboards.
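
One lightweight way to keep the hierarchy explicit is a shared config that dashboards and alerts both read, so every metric has a known tier and a known drill-down path. A sketch using the support example's metrics; the names are illustrative:

```python
# Three-tier metric hierarchy for the support-classification example.
METRIC_HIERARCHY = {
    "north_star": "cost_per_resolved_ticket",
    "diagnostic": [
        "auto_resolution_rate",
        "average_handle_time",
        "escalation_rate",
        "csat",
    ],
    "operational": [
        "latency_p95_ms",
        "token_usage",
        "retrieval_accuracy",
        "cache_hit_rate",
        "error_rate",
    ],
}

def drill_down(tier):
    """When a metric at one tier moves, return the tier to inspect next."""
    order = ["north_star", "diagnostic", "operational"]
    i = order.index(tier)
    return order[i + 1] if i + 1 < len(order) else None
```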

A/B Testing AI Systems in Production

Offline evaluation tells you if a model could work. A/B testing tells you if it does work with real users, real data, and real business impact. Statsig's research confirms that F1 score improvements don't always translate to business metric improvements — you need production A/B tests to verify.

The A/B Testing Protocol

Step 1: Define your Overall Evaluation Criterion (OEC).

Pick one metric that captures success. Not accuracy, not latency — the business metric. Revenue per user, cost per ticket resolved, defect escape rate. This is what you're optimizing for.

Step 2: Calculate sample size before you start.

AI A/B tests need larger sample sizes than UI tests because the effect sizes are smaller. A 2% improvement in auto-resolution rate might need 50,000 tickets per variant to reach statistical significance at 95% confidence. If each variant receives 1,000 tickets per day, that's 50 days — plan accordingly.
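
The sample size math is a standard two-proportion power calculation. A sketch using the usual normal approximation, with z values hardcoded for 95% confidence and 80% power; your baseline rate and minimum detectable effect will differ:

```python
import math

def samples_per_variant(p_base, p_new, z_alpha=1.96, z_beta=0.84):
    """Approximate n per variant for a two-proportion test.

    z_alpha: two-sided 95% confidence; z_beta: 80% power.
    """
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    effect = abs(p_new - p_base)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a 1-point lift in auto-resolution rate (30% -> 31%)
n = samples_per_variant(0.30, 0.31)
print(n)  # roughly 33,000 tickets per variant
```

Note how the required n explodes as the effect shrinks: halving the detectable lift roughly quadruples the sample size, which is why small model improvements take weeks to verify.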

Step 3: Use progressive rollout, not 50/50 splits.

```
Day 1-3:    1% traffic to new model (smoke test)
Day 4-7:    5% traffic (early signal)
Day 8-14:   25% traffic (significance builds)
Day 15-21:  50% traffic (full A/B test)
Day 22+:    100% or rollback (decision)
```

This catches catastrophic failures before they affect most users. If the new model starts generating harmful content or crashing on a specific input pattern, you catch it at 1% traffic, not 50%.

Step 4: Monitor counter-metrics continuously.

Your new model might improve auto-resolution by 5% while increasing CSAT complaints by 8%. Without counter-metric monitoring, you'd declare victory based on the primary metric and create a bigger problem.
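
A common way to implement the progressive split is deterministic hashing on a stable ID, so a given ticket or user stays in the same variant as the percentage ramps up. A sketch under that assumption:

```python
import hashlib

def bucket(ticket_id, rollout_percent):
    """Deterministically assign a ticket to 'new' or 'production'.

    Hashing a stable ID keeps assignments sticky: a ticket that saw the
    new model at 5% traffic still sees it at 25%.
    """
    digest = hashlib.sha256(ticket_id.encode()).hexdigest()
    position = int(digest, 16) % 100  # stable value in 0-99
    return "new" if position < rollout_percent else "production"

# Ramping from 5% to 25% only ever adds tickets to the 'new' group
ids = [f"ticket-{i}" for i in range(10_000)]
at_5 = {t for t in ids if bucket(t, 5) == "new"}
at_25 = {t for t in ids if bucket(t, 25) == "new"}
assert at_5 <= at_25  # monotone ramp: no flip-flopping between variants
```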

Shadow Testing: Test Without Risk

Before A/B testing, run the new model in shadow mode: it processes real production traffic but its outputs are logged, not served to users. Compare shadow outputs to the production model's outputs.

```python
# Shadow testing architecture
async def handle_request(input_data):
    # Production model serves the response
    production_result = await production_model.predict(input_data)

    # Shadow model runs in parallel, output logged only
    shadow_result = await shadow_model.predict(input_data)

    # Log both for comparison
    log_comparison(input_data, production_result, shadow_result)

    # Only production result is returned to user
    return production_result
```

Shadow testing answers the question "would the new model have been better?" without any user-facing risk. Run shadow tests for at least one full business cycle (typically one week) before starting A/B tests.

Continuous Monitoring in Production

Deploying a model isn't the finish line — it's the starting point for monitoring. Models degrade. Data drifts. User behavior changes. Without continuous monitoring, your 92% accuracy model silently drops to 75% over three months and nobody notices until the business impact becomes visible.

The Three Signals of Model Degradation

1. Input drift: The data coming into your model starts looking different from what it was trained on. New product categories, seasonal patterns, market shifts. Monitor input distributions weekly and alert on significant changes.

2. Output drift: The model's predictions shift even if inputs haven't changed noticeably. This happens with LLMs when API providers update model versions silently. Monitor output distribution and confidence score distributions.

3. Performance drift: The actual accuracy drops as measured against ground truth labels. This requires a labeling pipeline — automatically sample production predictions and get human labels to measure ongoing accuracy.

```python
# Monitoring setup
class ModelMonitor:
    def __init__(self, baseline_stats):
        self.baseline = baseline_stats

    def check_input_drift(self, recent_inputs):
        """Compare recent input distribution to training distribution"""
        drift_score = compute_psi(self.baseline["input_dist"],
                                   compute_distribution(recent_inputs))
        if drift_score > 0.2:  # PSI threshold
            alert("INPUT_DRIFT", f"PSI={drift_score:.3f}")

    def check_output_drift(self, recent_outputs):
        """Compare recent output distribution to baseline"""
        drift_score = compute_psi(self.baseline["output_dist"],
                                   compute_distribution(recent_outputs))
        if drift_score > 0.15:
            alert("OUTPUT_DRIFT", f"PSI={drift_score:.3f}")

    def check_performance(self, recent_labeled):
        """Compare recent accuracy to baseline"""
        current_accuracy = compute_accuracy(recent_labeled)
        if current_accuracy < self.baseline["accuracy"] * 0.95:  # 5% drop
            alert("PERFORMANCE_DRIFT",
                  f"Accuracy dropped from {self.baseline['accuracy']:.2%} "
                  f"to {current_accuracy:.2%}")
```
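
The `compute_psi` helper used above is not shown; one common implementation compares bucketed proportions between the baseline and recent windows. A sketch, with a small epsilon to guard against empty buckets:

```python
import math

def compute_psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index over pre-bucketed proportions.

    PSI = sum((actual - expected) * ln(actual / expected)) per bucket.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant.
    """
    psi = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi

# Identical distributions score 0; a large shift crosses the 0.2 threshold
print(compute_psi([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 0.0
print(compute_psi([0.7, 0.3], [0.3, 0.7]))            # ~0.68
```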

The Monitoring Dashboard

Every production AI system needs a monitoring dashboard with these views:

| View | Refresh Rate | Audience |
|---|---|---|
| Real-time error rate and latency | Every minute | On-call engineer |
| Daily accuracy on sampled predictions | Daily | ML team |
| Weekly drift scores and distribution plots | Weekly | ML team + PM |
| Monthly business metric impact | Monthly | Leadership |

The tooling landscape has matured significantly. Platforms like Braintrust, Arize, and Evidently AI provide production monitoring out of the box, with trace-level debugging that connects production failures to specific model behaviors. If your team is building MLOps infrastructure from scratch in 2026, you're reinventing a solved problem.

Exercise: Build Your Evaluation Plan

Put your learning into practice:

Task: Pick an AI use case from your organization (or use the customer support classification example from this lesson). Build a complete evaluation plan using the four-layer framework.

Expected Outcome: A one-page document with:

  1. Golden set: 10 example test cases with expected outputs
  2. Slice definitions: 3-5 slices relevant to your use case
  3. Metric hierarchy: north star, diagnostic metrics, operational metrics
  4. A/B test design: OEC, sample size estimate, rollout schedule
  5. Monitoring alerts: 3 drift signals with thresholds

Time Required: 2-3 hours

Hint (if you get stuck)

Start with the business outcome, not the model. Ask: "What happens if this AI system fails?" The answer tells you what to test most aggressively. If failure means a customer gets the wrong answer, prioritize accuracy testing on your highest-value segments. If failure means a 10-second delay, prioritize latency testing under load.

Solution (Support Classification Example)

Golden set (10 examples):

  • "I can't log in" → authentication (critical)
  • "My payment failed" → billing (critical)
  • "When does my trial end?" → billing
  • "How do I export CSV?" → feature_question
  • "Your product is broken" → bug_report (critical)
  • "I want to cancel" → churn_risk (critical)
  • "" (empty) → should return "unknown" not crash
  • "asdfghjkl" → should return "unknown"
  • "I can't log in and my payment failed and I want to cancel" → multi-intent handling
  • "Estoy tratando de iniciar sesion" (Spanish for "I'm trying to log in") → non-English handling

Slices: Enterprise customers, free-tier users, tickets with attachments, non-English tickets, tickets created during business hours vs off-hours.

Metric hierarchy:

  • North star: Cost per resolved ticket
  • Diagnostic: Auto-resolution rate, escalation rate, re-open rate, CSAT
  • Operational: Classification latency p95, confidence distribution, error rate

A/B test: OEC = cost per resolved ticket. Need 30,000 tickets per variant. At 2,000 tickets/day with a 50/50 split, each variant receives 1,000/day, so the final phase runs about 30 days. Progressive rollout: 1% (3 days) → 10% (4 days) → 50% (30 days).

Monitoring alerts: Input drift (PSI above 0.2, check daily). Accuracy on golden set (below 92%, check on every deployment). Cost per ticket (above 10% increase week-over-week, check weekly).

Key Takeaways

  1. Test in four layers: Unit tests catch model behavior bugs. Slice-based evaluation catches hidden failures in critical segments. Integration tests catch system-level issues. Business validation catches the gap between model metrics and business outcomes.
  2. Metrics must map to money: Every model metric should trace back to a business metric. If you can't explain why a metric matters in dollar terms, you're measuring the wrong thing.
  3. A/B test with progressive rollout: Shadow test first, then start at 1% traffic. Catch catastrophic failures before they reach most users. Always monitor counter-metrics alongside your primary metric.
  4. Monitoring is not optional: Models degrade silently. Build input drift, output drift, and performance drift monitoring before you deploy, not after the first incident.

Quick Reference

| Concept | Definition | Example |
|---|---|---|
| Golden Set | Curated test cases that every model version must pass | 50 critical support tickets with verified correct classifications |
| Slice-Based Evaluation | Breaking performance metrics by meaningful subgroups | Accuracy by customer tier, input language, ticket complexity |
| Shadow Testing | Running a new model on production traffic without serving its outputs | Logging new model predictions alongside production model for comparison |
| Input Drift | Change in the distribution of data entering the model over time | New product category creates tickets the model hasn't seen |
| OEC | Overall Evaluation Criterion — the single business metric for A/B tests | Cost per resolved ticket, revenue per recommendation |
| PSI | Population Stability Index — measures distribution shift between datasets | PSI above 0.2 indicates significant drift requiring investigation |

Course Complete

This is the final lesson in the Enterprise AI Implementation Guide. Over six lessons, you've gone from building the business case, through choosing integration patterns, to the testing, evaluation, and monitoring practices that keep an AI system working in production.

The gap between AI proof-of-concept and production deployment is where most projects die. The companies that cross it aren't the ones with the best models — they're the ones with the best engineering practices around those models. Testing, evaluation, and monitoring are that engineering.

If you're ready to put this into practice and want a team that's done it before, let's talk.


FAQ

How much time should we spend on testing vs building?

Plan for a 40/60 split. 40% of your project timeline should go to evaluation infrastructure, testing, and monitoring. 60% goes to data preparation, model development, and integration. Most teams spend under 10% on testing and pay for it in production incidents. If you followed the 12-week timeline from the POC-to-production guide, weeks 7-10 should be heavy on evaluation.

What if we don't have ground truth labels for monitoring?

You have three options. First, use human-in-the-loop sampling: randomly sample 1-5% of production predictions and get human labels weekly. Second, use proxy metrics: if you can't label accuracy directly, measure downstream indicators like customer re-contacts (if the AI answered wrong, the customer comes back). Third, use LLM-as-judge: use a more capable model to evaluate the production model's outputs. This is imperfect but catches the worst failures. Most mature teams combine all three.

Can we skip A/B testing for internal tools?

You can skip the progressive rollout, but you should still measure before and after. Run the old process and new AI-powered process in parallel for two weeks. Compare on your north star metric. Internal tools have lower risk but the same need for evidence that the AI is actually helping. The number of "AI-powered" internal tools that are slower than the spreadsheet they replaced is embarrassingly high.

Need help with AI implementation?

We build production AI systems that actually ship. Not demos, not POCs—real systems that run your business.

Get in Touch