Evaluation Benchmarks
Evaluation Benchmarks helps you calibrate your evaluation criteria by comparing human scoring with Qualiteam’s AI evaluation. It shows how closely the AI matches your intended scoring and whether it interprets your criteria the way you expect; in other words, whether the AI is evaluating customer conversations correctly according to your rules.
How to use it
Create a group (a benchmark set) to organize the conversations you want to test.
Add conversations to the group.
Set your Expectations: manually score each conversation against your criteria (your “ground truth”).
Go to Execute & Results to run the same conversations through Qualiteam’s AI evaluation.
Review the side-by-side comparison: human score vs Qualiteam score (including the match percentage), then refine your criteria and rerun to improve alignment.
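To make the comparison step concrete, here is a minimal sketch of how a match percentage could be worked out by hand. It is not Qualiteam's implementation: the names ScoredConversation, match_percentage, and mismatches are hypothetical, and it assumes that a "match" means the human score and the AI score are exactly equal for a conversation. The product may define a match differently, for example per criterion.

```python
from dataclasses import dataclass


@dataclass
class ScoredConversation:
    # Hypothetical record for one benchmark conversation.
    conversation_id: str
    human_score: int  # your Expectation (the "ground truth")
    ai_score: int     # the score returned by the AI evaluation


def match_percentage(items: list[ScoredConversation]) -> float:
    """Share of conversations where both scores are identical (assumed definition)."""
    if not items:
        return 0.0
    matches = sum(1 for c in items if c.human_score == c.ai_score)
    return 100.0 * matches / len(items)


def mismatches(items: list[ScoredConversation]) -> list[str]:
    """IDs of conversations worth reviewing when refining criteria."""
    return [c.conversation_id for c in items if c.human_score != c.ai_score]


# Illustrative data only, not real results.
benchmark = [
    ScoredConversation("conv-1", human_score=5, ai_score=5),
    ScoredConversation("conv-2", human_score=3, ai_score=3),
    ScoredConversation("conv-3", human_score=4, ai_score=2),
    ScoredConversation("conv-4", human_score=1, ai_score=1),
]

print(match_percentage(benchmark))  # 75.0
print(mismatches(benchmark))        # ['conv-3']
```

The mismatched conversations are the ones to look at first: each points to a criterion the AI may be reading differently from how you intended it.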
This workflow helps you calibrate the AI and feel confident that the evaluation matches your expectations. It’s also a great way to test and fine-tune new criteria and rules before applying them at scale.