Voice agents fail in ways that aggregate dashboards hide. A call may complete while the agent gives an unsupported answer, misses an interruption, or leaves the customer without a clear resolution.

A useful QA process starts with production conversations. Import calls continuously, retain transcript and audio context, and evaluate the same small set of business-critical behaviors on every review cycle.

Begin with four metrics: task completion, unsupported claims, fallback quality, and response latency. For each one, define what acceptable evidence looks like, such as a transcript span or an audio timestamp. Review failed conversations with the transcript and audio together, because timing and interruption problems are often invisible in text.
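One way to keep these definitions consistent across review cycles is to write them down as data. The sketch below is illustrative only: `MetricSpec`, the evidence labels (`transcript_span`, `audio_timestamp`), and the wording of each passing condition are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """One business-critical behavior and the evidence a verdict must cite."""
    name: str
    passing_looks_like: str
    evidence: tuple[str, ...]  # hypothetical evidence labels


# The four starting metrics. The descriptions are placeholders to adapt.
METRICS = (
    MetricSpec(
        "task_completion",
        "The caller's stated goal is resolved or handed off explicitly.",
        ("transcript_span",),
    ),
    MetricSpec(
        "unsupported_claims",
        "Every factual statement traces to an approved source.",
        ("transcript_span",),
    ),
    MetricSpec(
        "fallback_quality",
        "When the agent cannot help, it says so and offers a next step.",
        ("transcript_span", "audio_timestamp"),
    ),
    MetricSpec(
        "latency",
        "Responses and interruption handling feel timely to the caller.",
        ("audio_timestamp",),
    ),
)

BY_NAME = {spec.name: spec for spec in METRICS}


def missing_evidence(spec: MetricSpec, provided: set[str]) -> list[str]:
    """Return the evidence kinds a verdict still lacks, in spec order."""
    return [kind for kind in spec.evidence if kind not in provided]
```

A verdict on `latency` that cites only a transcript span would come back from `missing_evidence` with `audio_timestamp` still outstanding, which is a prompt to open the recording before accepting the score.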

Treat scores as prioritization signals rather than absolute truth. Evaluator rationales should point reviewers to evidence, and uncertain or high-impact cases should remain subject to human review.
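The routing rule described here can be sketched in a few lines. Everything in this example is an assumption made for illustration: the `Evaluation` fields, the 0-to-1 score and confidence scales, and the `CONFIDENCE_FLOOR` threshold would all need to match whatever your evaluator actually emits.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    call_id: str
    metric: str
    score: float          # assumed scale: 0.0 = clear failure, 1.0 = clear pass
    confidence: float     # assumed evaluator self-reported confidence, 0.0 to 1.0
    rationale: str        # should point the reviewer to evidence
    evidence_span: str    # e.g. a transcript excerpt or audio timestamp
    high_impact: bool = False


CONFIDENCE_FLOOR = 0.7  # hypothetical threshold; tune against reviewed calls


def needs_human_review(ev: Evaluation) -> bool:
    """Uncertain or high-impact evaluations stay subject to human review."""
    return ev.high_impact or ev.confidence < CONFIDENCE_FLOOR


def review_queue(evals: list[Evaluation]) -> list[Evaluation]:
    """Order flagged evaluations: high-impact first, then lowest score first."""
    flagged = [ev for ev in evals if needs_human_review(ev)]
    return sorted(flagged, key=lambda ev: (not ev.high_impact, ev.score))
```

The score only decides the order in which reviewers see cases. It never closes a flagged case on its own, which keeps the evaluator in the role of a prioritization signal.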

Close the loop by grouping repeated failures, assigning an owner, changing the agent, and checking new production calls for regression. The outcome is not a larger dashboard; it is a shorter path from a bad call to a verified fix.
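As a rough sketch of that loop, the example below groups failures that repeat across calls and then compares the failure rate before and after a change. The record shape (a `call_id` plus a set of `(metric, tag)` failure keys), the function names, and the `min_count` cutoff are all hypothetical; a real check would also need enough post-fix calls for the comparison to mean something.

```python
from collections import defaultdict

FailureKey = tuple[str, str]  # (metric, failure tag), both hypothetical labels


def group_failures(calls: list[dict], min_count: int = 2) -> dict[FailureKey, list[str]]:
    """Map each failure key seen on at least min_count calls to its call IDs."""
    groups: dict[FailureKey, list[str]] = defaultdict(list)
    for call in calls:
        for key in call["failures"]:
            groups[key].append(call["call_id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}


def failure_rate(calls: list[dict], key: FailureKey) -> float:
    """Fraction of calls that show the given failure; 0.0 for an empty batch."""
    if not calls:
        return 0.0
    return sum(1 for call in calls if key in call["failures"]) / len(calls)


def fix_holds(before: list[dict], after: list[dict], key: FailureKey) -> bool:
    """True if the failure is less frequent in calls made after the change."""
    return failure_rate(after, key) < failure_rate(before, key)
```

Each group returned by `group_failures` is a candidate for a single owner and a single change to the agent, and `fix_holds` is the check to rerun on new production calls once that change ships.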