Evaluations
The evaluation module lets you run automated tests on your bots so you can check response quality and catch issues before they reach users.
Overview
Evaluations leverage a Large Language Model (LLM) as a technical judge. For each user question in the test set, the module compares the bot's actual response with the expected answer. This approach helps:
- Prevent regressions.
- Assess model and prompt performance.
- Track improvements over time.
The illustration below shows how the LLM Evaluation Metric works:
- Input: Test case data (e.g., user query, LLM output, context) is provided.
- Scoring: The LLM Judge Scorer evaluates the response, generating a score and optional reasoning.
- Threshold Check: The score is compared to a threshold to determine if the test passes (✅) or fails (❌).
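The three stages above can be sketched in a few lines of Python. This is only an illustration of the flow, not the module's implementation: the names `EvalCase`, `Verdict`, `evaluate_case`, and `toy_judge` are hypothetical, the 0.7 default threshold is an assumed value, and `toy_judge` stands in for the real LLM call with a simple word-overlap heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class EvalCase:
    """Input: one test case (hypothetical structure)."""
    question: str
    actual_answer: str
    expected_answer: str


@dataclass
class Verdict:
    score: float
    passed: bool
    reasoning: Optional[str] = None


# A judge takes a case and returns (score in [0, 1], optional reasoning).
Judge = Callable[[EvalCase], Tuple[float, Optional[str]]]


def evaluate_case(case: EvalCase, judge: Judge, threshold: float = 0.7) -> Verdict:
    """Scoring, then threshold check. The 0.7 default is an assumption."""
    score, reasoning = judge(case)
    return Verdict(score=score, passed=score >= threshold, reasoning=reasoning)


def toy_judge(case: EvalCase) -> Tuple[float, Optional[str]]:
    """Stand-in for the LLM judge: fraction of expected words found in the answer."""
    expected = set(case.expected_answer.lower().split())
    actual = set(case.actual_answer.lower().split())
    if not expected:
        return 0.0, "no expected answer provided"
    overlap = len(expected & actual)
    return overlap / len(expected), f"{overlap} of {len(expected)} expected words present"


if __name__ == "__main__":
    case = EvalCase(
        question="What is the capital of France?",
        actual_answer="The capital is Paris",
        expected_answer="Paris is the capital",
    )
    print(evaluate_case(case, toy_judge))
```

Keeping the judge as a plain callable means the word-overlap stand-in can be swapped for a real LLM call without touching the threshold logic.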
Steps to Use the Evaluation Module
1. Create an Evaluation Set
An evaluation set consists of:
- Questions: User queries to test the bot.
- Ideal Answers: Expected responses for each query.
You can create as many Q&A pairs as needed. Additionally, you can import or export these pairs in CSV format for easier management.
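If you prepare Q&A pairs outside the product, a short script can read and write them as CSV. The sketch below uses Python's standard `csv` module. The column names `question` and `ideal_answer` are assumptions made for illustration; check an exported file for the header the product actually uses.

```python
import csv
import io
from typing import Dict, List, TextIO

# Assumed column names; the product's real CSV header may differ.
FIELDS = ["question", "ideal_answer"]


def export_pairs(pairs: List[Dict[str, str]], stream: TextIO) -> None:
    """Write Q&A pairs to a CSV stream with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(pairs)


def import_pairs(stream: TextIO) -> List[Dict[str, str]]:
    """Read Q&A pairs from a CSV stream, skipping rows without a question."""
    pairs = []
    for row in csv.DictReader(stream):
        question = (row.get("question") or "").strip()
        answer = (row.get("ideal_answer") or "").strip()
        if question:
            pairs.append({"question": question, "ideal_answer": answer})
    return pairs


if __name__ == "__main__":
    pairs = [
        {"question": "What are your opening hours?", "ideal_answer": "Monday to Friday, 9 to 5."},
        {"question": "Do you ship abroad?", "ideal_answer": "Yes, to most countries."},
    ]
    buffer = io.StringIO(newline="")
    export_pairs(pairs, buffer)
    buffer.seek(0)
    print(import_pairs(buffer))
```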
2. Configure an Evaluation
To set up an evaluation, provide the following details:
- Bot to Evaluate: The specific bot you want to test.
- Evaluation Set: Select the test set created earlier.
- AI Deployment: Specify the AI configuration to evaluate.
Once configured, the evaluation will run automatically.
Viewing and Managing Results
After the evaluation completes, you can:
- View Results: Click the Open button to see the test details. Results include:
  - All test cases.
  - An accuracy score.
  - Comments from the LLM about each response (hover over the "i" icon to read them).
- Analyze Trends: View bot accuracy trends in a chart format.
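The page does not state how the accuracy score is computed. A common definition, assumed here purely for illustration, is the share of test cases that passed. The function name `accuracy` is hypothetical.

```python
from typing import List


def accuracy(passed: List[bool]) -> float:
    """Share of passing test cases, from 0.0 to 1.0 (assumed definition)."""
    if not passed:
        return 0.0  # avoid dividing by zero on an empty evaluation set
    return sum(passed) / len(passed)


if __name__ == "__main__":
    # Three of four test cases passed.
    print(accuracy([True, True, False, True]))
```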
Rerunning Evaluations
You can rerun evaluations at any time. For each rerun:
- You’ll be prompted to provide a comment describing changes made to the bot or its configuration.
- These comments serve as a log to track the impact of modifications over time.
While reviewing the detailed results, you can select a date and time to compare against past results.
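To make the idea of a rerun log concrete, the sketch below models each run as a timestamp, a comment, and an accuracy value, then compares two runs selected by date and time. The `EvaluationRun` structure, the `accuracy_delta` helper, and the sample values are all hypothetical and do not reflect how the product stores its results.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class EvaluationRun:
    """One entry in a hypothetical rerun log."""
    ran_at: datetime
    comment: str      # what changed in the bot or its configuration
    accuracy: float   # share of passing test cases, 0.0 to 1.0


def accuracy_delta(runs: List[EvaluationRun], earlier: datetime, later: datetime) -> float:
    """Accuracy change between two runs identified by their timestamps."""
    by_time = {run.ran_at: run for run in runs}
    return by_time[later].accuracy - by_time[earlier].accuracy


if __name__ == "__main__":
    # Sample values invented for the example.
    log = [
        EvaluationRun(datetime(2024, 5, 1, 10, 0), "Baseline prompt", 0.5),
        EvaluationRun(datetime(2024, 5, 8, 10, 0), "Added product FAQ to context", 0.75),
    ]
    change = accuracy_delta(log, log[0].ran_at, log[1].ran_at)
    print(f"{log[1].comment}: {change:+.2f}")
```

Pairing each run with its comment is what makes a change in accuracy traceable to a specific modification.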