Evaluations

The evaluation module lets you run automated tests on your bots, helping you verify response quality and identify potential issues.

Overview

Evaluations leverage a Large Language Model (LLM) as a technical judge. For each user question in the test set, the module compares the bot's actual response with the expected answer. This approach helps:

  • Prevent regressions.
  • Assess model and prompt performance.
  • Track improvements over time.

The illustration below shows how the LLM Evaluation Metric works (a minimal code sketch follows the list):

  • Input: Test case data (e.g., user query, LLM output, context) is provided.
  • Scoring: The LLM Judge Scorer evaluates the response, generating a score and optional reasoning.
  • Threshold Check: The score is compared to a threshold to determine if the test passes (✅) or fails (❌).

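As a rough illustration of this flow, here is a minimal Python sketch of an LLM-as-judge check. The ask_judge_llm function is a placeholder standing in for whatever prompt and model the platform calls internally; the scoring scale and default threshold shown here are assumptions, not the module's actual settings.

    from dataclasses import dataclass

    @dataclass
    class JudgeVerdict:
        score: float      # judged correctness of the bot's answer, in [0, 1]
        reasoning: str    # optional explanation from the judge LLM

    def ask_judge_llm(question: str, ideal_answer: str, bot_answer: str) -> JudgeVerdict:
        """Placeholder for the judge LLM call the platform performs internally."""
        raise NotImplementedError

    def evaluate_case(question: str, ideal_answer: str, bot_answer: str,
                      threshold: float = 0.7) -> bool:
        """Compare the bot's answer with the ideal answer and apply the threshold check."""
        verdict = ask_judge_llm(question, ideal_answer, bot_answer)
        return verdict.score >= threshold   # True = pass (✅), False = fail (❌)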

Steps to Use the Evaluation Module

1. Create an Evaluation Set
An evaluation set consists of:

  • Questions: User queries to test the bot.
  • Ideal Answers: Expected responses for each query.

You can create as many Q&A pairs as needed. Additionally, you can import or export these pairs in CSV format for easier management.

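For instance, an evaluation set could be exported to, or prepared for import as, a simple two-column CSV. The sketch below is illustrative only; the column names (question, ideal_answer) are assumptions, so check the headers of an actual export before importing your own file.

    import csv

    # Illustrative Q&A pairs; a real set can contain as many rows as needed.
    pairs = [
        {"question": "What are your opening hours?",
         "ideal_answer": "We are open Monday to Friday, 9 AM to 6 PM."},
        {"question": "How do I reset my password?",
         "ideal_answer": "Use the 'Forgot password' link on the login page."},
    ]

    # Write the pairs to a CSV file that could then be imported as an evaluation set.
    with open("evaluation_set.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "ideal_answer"])
        writer.writeheader()
        writer.writerows(pairs)
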
The Test Data section

The Test Cases view is essentially a question/answer list.

You can edit each test case if needed.

Use the Import / Export feature to manage large datasets.


2. Configure an Evaluation
To set up an evaluation, provide the following details:

  • Bot to Evaluate: The specific bot you want to test.
  • Evaluation Set: Select the test set created earlier.
  • AI Deployment: Specify the AI configuration to evaluate.


Once configured, the evaluation will run automatically.
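Conceptually, a run is defined by those three inputs. The sketch below is only a mental model of that configuration, with made-up example values; it is not an API exposed by the module.

    from dataclasses import dataclass

    @dataclass
    class EvaluationConfig:
        bot: str             # the bot to evaluate
        evaluation_set: str  # the Q&A set created in step 1
        ai_deployment: str   # the AI configuration (model/prompt) under test

    # Hypothetical example values, for illustration only.
    config = EvaluationConfig(
        bot="support-bot",
        evaluation_set="faq-regression-set",
        ai_deployment="production-deployment-v2",
    )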

Viewing and Managing Results

After the evaluation completes, you can:

  • View Results: Click the Open button to see the test details. Results include:
    • All test cases.
    • An accuracy score (see the sketch below).
    • Comments from the LLM about each response (hover over the "i" icon to read them).


  • Analyze Trends: View bot accuracy trends in a chart format.
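As a rough illustration of how the accuracy score relates to the per-case verdicts, assuming each test case yields a simple pass/fail outcome from the judge:

    # Illustrative only: the real score is computed by the platform.
    verdicts = [True, True, False, True]       # e.g., 3 of 4 test cases passed

    accuracy = sum(verdicts) / len(verdicts)   # 0.75, reported as 75%
    print(f"Accuracy: {accuracy:.0%}")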


Rerunning Evaluations

You can rerun evaluations at any time. For each rerun:

  • You’ll be prompted to provide a comment describing changes made to the bot or its configuration.
  • These comments serve as a log to track the impact of modifications over time.

While reviewing the detailed results, you can select a date and time to compare past results.

You can select past evaluations to compare or export the results.
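
To make the value of these comments concrete, a purely illustrative way to picture the run history is a log in which each rerun records its accuracy alongside the comment describing what changed, so past results can be compared:

    # Illustrative run log with made-up values; the platform stores this for you.
    history = [
        {"run_at": "2024-05-01 10:00", "accuracy": 0.72, "comment": "baseline"},
        {"run_at": "2024-05-08 14:30", "accuracy": 0.81, "comment": "rewrote system prompt"},
    ]

    previous, latest = history[-2], history[-1]
    delta = latest["accuracy"] - previous["accuracy"]
    print(f"{latest['comment']}: accuracy changed by {delta:+.0%}")
    # -> "rewrote system prompt: accuracy changed by +9%"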