The best voice-agent metrics connect directly to a product or engineering decision. A score that cannot tell the team what to inspect or change creates reporting work without improving the agent.

Task completion measures whether the customer reached the intended outcome. Resolution quality asks whether that outcome was correct and complete. Unsupported-claim detection catches answers that are not grounded in available policy or data.

Fallback quality should distinguish a safe recovery from an unhelpful refusal or repeated loop. Operational signals such as latency, interruptions, silence, and premature termination add the audio context that transcript-only evaluation misses.
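As a rough illustration of how some of these operational signals could be derived, the sketch below computes response latency, interruptions, and long silences from turn timestamps. The `Turn` structure, the `operational_signals` function, and the 3-second silence threshold are assumptions made for this example, not part of any particular platform. Premature termination is left out because it would need a hang-up signal that this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: str   # "user" or "agent" (hypothetical labels)
    start: float   # seconds from call start
    end: float     # seconds from call start


def operational_signals(turns: List[Turn], silence_threshold: float = 3.0) -> dict:
    """Derive timing signals from turns ordered by start time."""
    latencies = []       # gaps between a user turn ending and the agent replying
    interruptions = 0    # a speaker starts before the other speaker has finished
    long_silences = 0    # gaps at or above the (assumed) silence threshold

    for prev, cur in zip(turns, turns[1:]):
        gap = cur.start - prev.end
        if gap < 0 and cur.speaker != prev.speaker:
            interruptions += 1
        elif gap >= silence_threshold:
            long_silences += 1
        if prev.speaker == "user" and cur.speaker == "agent" and gap >= 0:
            latencies.append(gap)

    mean_latency: Optional[float] = (
        sum(latencies) / len(latencies) if latencies else None
    )
    return {
        "mean_response_latency": mean_latency,
        "max_response_latency": max(latencies) if latencies else None,
        "interruptions": interruptions,
        "long_silences": long_silences,
    }
```

None of these values can be recovered from a transcript alone, which is why they are computed from timing data rather than from text.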

Start with a small, stable scorecard. Store the rationale and supporting conversation evidence alongside every score. Revisit thresholds only after comparing evaluator judgments with those of human reviewers on representative calls.
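One possible shape for such a scorecard entry, together with a simple check of evaluator judgments against human review, is sketched below. The `ScoreRecord` fields, the metric names, and the `agreement_rate` and `disagreements` helpers are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    conversation_id: str
    metric: str                      # e.g. "task_completion" (assumed name)
    passed: bool
    rationale: str                   # why the evaluator reached this verdict
    evidence: Tuple[str, ...] = ()   # quoted excerpts supporting the verdict


def _index(records: Iterable[ScoreRecord]) -> dict:
    """Map (conversation_id, metric) to the record for quick lookup."""
    return {(r.conversation_id, r.metric): r for r in records}


def agreement_rate(
    evaluator: Iterable[ScoreRecord], human: Iterable[ScoreRecord]
) -> Optional[float]:
    """Share of jointly scored items where both verdicts match."""
    human_by_key = _index(human)
    shared = [
        (r.passed, human_by_key[(r.conversation_id, r.metric)].passed)
        for r in evaluator
        if (r.conversation_id, r.metric) in human_by_key
    ]
    if not shared:
        return None  # nothing was scored by both sides
    return sum(a == b for a, b in shared) / len(shared)


def disagreements(
    evaluator: Iterable[ScoreRecord], human: Iterable[ScoreRecord]
) -> List[ScoreRecord]:
    """Evaluator records whose verdict differs from the human verdict."""
    human_by_key = _index(human)
    return [
        r
        for r in evaluator
        if (r.conversation_id, r.metric) in human_by_key
        and human_by_key[(r.conversation_id, r.metric)].passed != r.passed
    ]
```

Raw agreement does not correct for chance, so a chance-adjusted statistic such as Cohen's kappa may be a better basis for threshold changes once the sample is large enough. The records returned by `disagreements` keep their rationale and evidence, so each mismatch can be reviewed directly.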

Report trends by agent, use case, and time period, but preserve a direct path back to individual conversations. Aggregates reveal where to look; evidence explains what to fix.
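A minimal sketch of this kind of reporting follows, assuming each score is a dictionary with `agent`, `use_case`, `week`, `conversation_id`, and `passed` keys. Those key names and the `aggregate_scores` function are illustrative only. Each group keeps the IDs of its failing conversations, so a low pass rate leads straight back to the calls behind it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_scores(scores: Iterable[dict]) -> Dict[Tuple[str, str, str], dict]:
    """Group scores by (agent, use_case, week) and keep links to failures."""
    groups = defaultdict(
        lambda: {"total": 0, "passed": 0, "failed_conversations": []}
    )
    for score in scores:
        key = (score["agent"], score["use_case"], score["week"])
        group = groups[key]
        group["total"] += 1
        if score["passed"]:
            group["passed"] += 1
        else:
            # Preserve the path back to the individual conversation.
            group["failed_conversations"].append(score["conversation_id"])

    # Every group has at least one score, so the division is safe.
    return {
        key: {**group, "pass_rate": group["passed"] / group["total"]}
        for key, group in groups.items()
    }
```

The pass rate shows where to look, and the stored conversation IDs identify which calls to open next.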