What is the agreement score, and how is it used to fine-tune an auto-complete evaluation form?

The agreement score measures how often the answers selected by Virtual Supervisor match those selected by a human evaluator. It is calculated both per question and as an overall metric for the evaluation form. A higher agreement score indicates stronger alignment between AI-generated evaluations and human judgment, while lower agreement highlights areas that may need refinement.

Teams use the agreement score to identify unclear questions, inconsistent scoring logic, or gaps in evaluation guidance, and to improve how effectively the auto-complete evaluation form performs at scale.
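
The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the product's implementation: the function name `agreement_scores`, the input layout, and the question IDs are hypothetical, and the overall score is assumed to pool every answered question across all evaluations.

```python
from collections import defaultdict


def agreement_scores(evaluations):
    """Return (per_question, overall) agreement as percentages.

    `evaluations` holds one dict per evaluated interaction, mapping a
    question ID to an (ai_answer, human_answer) pair. Hypothetical layout.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for answers in evaluations:
        for question, (ai_answer, human_answer) in answers.items():
            totals[question] += 1
            if ai_answer == human_answer:
                matches[question] += 1

    # Per-question score: share of evaluations where both answers match.
    per_question = {q: 100.0 * matches[q] / totals[q] for q in totals}

    # Overall score (assumption): matches pooled across all questions.
    answered = sum(totals.values())
    overall = 100.0 * sum(matches.values()) / answered if answered else None
    return per_question, overall


evaluations = [
    {"greeting": ("Yes", "Yes"), "resolution": ("Yes", "No")},
    {"greeting": ("Yes", "Yes"), "resolution": ("No", "No")},
    {"greeting": ("No", "Yes"), "resolution": ("Yes", "No")},
    {"greeting": ("Yes", "Yes"), "resolution": ("Yes", "Yes")},
]
per_question, overall = agreement_scores(evaluations)
print(per_question)  # {'greeting': 75.0, 'resolution': 50.0}
print(overall)       # 62.5
```

In this made-up sample, the "resolution" question agrees only half the time, so it would be the first candidate for refinement.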

How to calculate and use the agreement score

Evaluate interactions using the form
- Click Menu > Analytics > Analytics Workspace > Interactions.
- Open an interaction and navigate to the Quality Summary tab.
- Click Create Evaluation.
- Select the Agent Auto-Complete evaluation form.
- Choose a human evaluator to manually review the auto-completed answers.
- Click Create to generate the evaluation.
Review and update the evaluation
- Review each question in the evaluation.
- Use the transcript as evidence to update any incorrect automated responses.
- Submit the completed evaluation.
Test the form across multiple interactions
- Repeat this process for at least 20 different interactions to ensure reliable agreement data.
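
One way to see why a larger sample helps, assuming a question's score is simply the percentage of evaluations in which the two answers match: the fewer interactions you evaluate, the more a single disagreement moves the score. The helper `score_step` below is a hypothetical name used only for this illustration.

```python
def score_step(evaluated_interactions):
    """Percentage points by which one disagreement changes a question's score."""
    if evaluated_interactions <= 0:
        raise ValueError("need at least one evaluated interaction")
    return 100.0 / evaluated_interactions


for n in (5, 10, 20, 50):
    print(f"{n:>2} interactions: one disagreement moves the score by {score_step(n):.1f} points")
```

Under that assumption, one disagreement is worth 20 points with 5 interactions but only 5 points with 20.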
Review agreement metrics
- Click Conversation Intelligence > Quality Management > Evaluation Forms.
- Open the latest published version of the evaluation form.
- Review the overall agreement score and the agreement score for each question.
- Agreement scores are color-coded to help you quickly see how closely Virtual Supervisor evaluations align with human evaluator responses:
  - Red (0–59%) – Low agreement. These questions may require review or refinement because AI-generated answers frequently differ from human evaluations.
  - Yellow (60–79%) – Moderate agreement. The question is partially aligned and may benefit from clearer wording, improved scoring guidance, or additional tuning.
  - Green (80–100%) – High agreement. AI-generated evaluations closely match human evaluator responses, indicating strong alignment.
Refine the form
- Identify questions with low agreement scores.
- Update question wording, answer options, or scoring logic based on observed discrepancies.
- Retest the form as needed until the desired agreement score is achieved.
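
If you track scores outside the product, the color bands and the refinement step above can be sketched as follows. The function names and sample scores are hypothetical, and treating any value below 60 as red and any value below 80 as yellow (including fractional scores) is an assumption, since the article lists whole-number ranges only.

```python
def agreement_band(score_percent):
    """Map an agreement score (0-100) to the color band listed in the article."""
    if not 0 <= score_percent <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score_percent}")
    if score_percent >= 80:
        return "green"
    if score_percent >= 60:
        return "yellow"
    return "red"


def questions_to_refine(per_question_scores):
    """List questions that are not green, lowest agreement first."""
    flagged = [
        (score, question)
        for question, score in per_question_scores.items()
        if agreement_band(score) != "green"
    ]
    return [question for score, question in sorted(flagged)]


scores = {"greeting": 92, "empathy": 71, "resolution": 48}
print({q: agreement_band(s) for q, s in scores.items()})
# {'greeting': 'green', 'empathy': 'yellow', 'resolution': 'red'}
print(questions_to_refine(scores))
# ['resolution', 'empathy']
```

Sorting by score puts the weakest questions first, which matches the order in which you would normally revise wording, answer options, or scoring logic.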

Notes:

- When editing an evaluation form, you can view the current agreement score and access the most recently published version directly from the editor.
- The agreement score is not supported for multi-select questions.
- Questions with low (red) or moderate (yellow) agreement scores may be excluded from recent or future evaluations if they do not consistently meet agreement thresholds. Review the current evaluation form to verify which questions are actively included.
