Test your custom copilot with Power CAT Copilot Studio Kit
The Microsoft Power Customer Advisory Team (Power CAT) has recently launched the Power CAT Copilot Studio Kit, an extensive suite of tools designed to enhance the capabilities of Microsoft Copilot Studio. Among other things, the kit lets you put your custom copilots through a range of testing scenarios, ensuring they perform as expected in real-world applications.
This article will guide you through the steps to start testing your custom copilots using this innovative toolkit.
Basic concepts
After installing the kit (check the prerequisites first), you can run the Copilot Studio Kit model-driven app, which currently shows the following options:
- Copilots: Basic information about the custom copilot that you want to test.
- Tests: Unit tests on the copilot. These tests can be of the following types: Response Match, Attachments, Topic Match and Generative Answers.
- Test Sets: Groups of unit tests that can be run together.
- Test Runs: List of all tests that have been run.
- Test Results: List of all test results.
You can also watch this excellent video by Dewain Robinson from Microsoft about what the Copilot Studio Kit is, how to install it and how to create a very basic test.
One of the standout features is the Large Language Model (LLM) validation tool. This feature checks the AI-generated content for accuracy and coherence, giving you the confidence that your copilot’s output meets quality and compliance standards. It’s like having an extra set of eyes, always ensuring that the content produced is of high quality and appropriate to the conversation context. Moreover, the upcoming feature to track aggregated key performance indicators is something to look forward to.
In the following paragraphs, we will demonstrate, step by step, how to configure and run tests on an existing copilot.
1. Configure the copilot to test
In the Copilots section we need to enter some basic parameters about the copilot we want to test:
- Base configuration: Name, Region and Token Endpoint of the copilot. You can read the following article on how to get the Token Endpoint URL, while the Region depends on the location of the environment where you deployed (published) the copilot. See the sketch after this list for what the endpoint returns.
- Direct Line Channel Security: If enabled (because your copilot is published on a website and you want to restrict who can access it), you can configure where to store the channel secret. In our case, we only published the chatbot to Teams, so there is no need to configure it.
- Results Enrichment: Recommended for testing Generative Answers results using Application Insights data. We used the recommended values for all fields, and also configured the Dataverse (environment) URL.
- Generative AI Testing: Defines which LLM we want to use to test Generative Answers. Currently, AI Builder is the only Generative AI Provider available to select.
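To give an idea of what the kit does with the Token Endpoint, here is a minimal Python sketch (not part of the kit) that requests a Direct Line token from it. The endpoint URL is a placeholder you would copy from Copilot Studio, and the JSON shape assumed below (a token field) is what the token endpoint currently returns.

```python
import requests

# Placeholder: paste your copilot's Token Endpoint URL here (see the article
# referenced above on how to obtain it from Copilot Studio).
TOKEN_ENDPOINT = "https://<your-token-endpoint>/directline/token"

def get_directline_token(endpoint: str) -> str:
    """Request a short-lived Direct Line token for the copilot."""
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    # The token endpoint returns JSON such as {"token": "...", "expires_in": 3600}
    return response.json()["token"]

if __name__ == "__main__":
    token = get_directline_token(TOKEN_ENDPOINT)
    print("Direct Line token obtained:", token[:20], "...")
```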
We want to test a copilot called Store Operations, and the configuration section looks like the following:
2. Create tests
In the Tests section we can create the different types of tests we already mentioned. Let’s create one of each type:
2.1. Response match test
This is the easiest test to configure, as we want to verify that, given an utterance (question), the copilot returns the expected (and fixed) answer.
As with other tests, we need to configure attributes like the name or the test set, but the most important ones are the following:
- Seconds before getting the answer: Latency in returning an answer may vary based on different factors. If unsure, Microsoft recommends starting with 10 seconds and adjusting based on your results.
- Test Utterance: Question or text to send to the copilot.
- Expected Response: Value expected to be returned by the copilot when the test utterance is sent to the copilot.
The copilot we are testing should always return the same answer when a user asks about customer support times, so we can use a Response Match test:
We could also start the test by sending a startConversation event, which will trigger the Greeting topic, although in this case we don’t need it.
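Conceptually, a Response Match test boils down to sending the utterance over Direct Line, waiting the configured number of seconds and comparing the copilot’s reply with the expected text. The sketch below shows that flow against the public Direct Line 3.0 REST API; it is an illustration of the idea, not the kit’s actual Power Automate implementation, and it reuses the token obtained in the previous snippet.

```python
import time
import requests

DIRECTLINE = "https://directline.botframework.com/v3/directline"

def run_response_match_test(token: str, utterance: str,
                            expected_response: str, wait_seconds: int = 10) -> bool:
    headers = {"Authorization": f"Bearer {token}"}

    # 1. Start a new conversation (a startConversation event could optionally
    #    be posted first to trigger the Greeting topic).
    conv = requests.post(f"{DIRECTLINE}/conversations", headers=headers).json()
    conv_id = conv["conversationId"]

    # 2. Send the test utterance as a message activity.
    activity = {"type": "message", "from": {"id": "test-user"}, "text": utterance}
    requests.post(f"{DIRECTLINE}/conversations/{conv_id}/activities",
                  headers=headers, json=activity)

    # 3. Wait the configured "seconds before getting the answer".
    time.sleep(wait_seconds)

    # 4. Read the copilot's replies and compare them with the expected response.
    activities = requests.get(f"{DIRECTLINE}/conversations/{conv_id}/activities",
                              headers=headers).json()["activities"]
    replies = [a.get("text", "") for a in activities
               if a["type"] == "message" and a["from"]["id"] != "test-user"]
    return any(expected_response.strip() == reply.strip() for reply in replies)
```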
2.2. Topic match test
What if we want to test which topic will be triggered according to the utterance sent by the user? This is what this test is about: it compares the expected topic name with the triggered one (when configuring the copilot, remember to turn the Enrich with Conversation Transcripts setting on).
We created a topic in our copilot, called Track order, to manage order tracking:
This topic should be triggered whenever the user writes phrases like track an order or delivery details, and the copilot will ask the user a question: What is the order number?
Therefore, we could write a test like we show in the following screenshot:
In this case we used the same attributes as in the Response Match test, and also defined the value for the Expected Topic Name attribute.
It is important to note that in this case we are using the classic approach to using generative AI in conversations. This means that we rely on the topics we build to respond to trigger phrases, rather than using generative AI to determine the best topic to trigger (what was known in the past as dynamic chaining). This can be configured in the copilot settings and changed at any time.
2.3. Generative Answers test
In our copilot we are using two different public websites to answer questions about specific devices and billing information:
When a user sends a query and the copilot is unable to find a match among the different topics, the Conversational boosting system topic is triggered (as long as the Allow the AI to use its own general knowledge (preview) setting is turned on). If that setting is off, the triggered system topic is the Fallback one.
What if we want to test what the answer is when a user asks a question about a Microsoft device? Our copilot can use the Microsoft Store website to return an answer using the information found there. Therefore, let’s create a Generative Answers test:
And the most important attributes we need to configure in this case are:
- Expected Response: The response is not deterministic: if the information on the Microsoft website is updated, or the LLM used in Copilot Studio is updated, the response to the same query may differ. As a result, we set a value that should be similar to what we expect as a response.
- Expected Generative Answers Outcome: Depending on the utterance and the knowledge sources we are using, we expect the outcome to be answered or not answered.
In this case, the tool is going to use AI Builder as the Generative AI provider and check whether the AI-generated answer is close to the sample answer or honors the validation instructions. It is important to note that, to run this type of test, we need to enable Generative AI Testing in the copilot configuration.
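Under the hood, this kind of validation boils down to asking an LLM to judge the generated answer. The following Python sketch shows one way such a comparison could be phrased; it is only an illustration of the idea, not the actual AI Builder prompt the kit uses, and the call_llm parameter is a placeholder for whatever model client you have available.

```python
from typing import Callable

# Assumed validation prompt, for illustration only (the kit's AI Builder prompt
# is defined inside the solution and may differ).
VALIDATION_PROMPT = """You are validating a chatbot answer.

Sample (expected) answer:
{expected}

Validation instructions:
{instructions}

Actual answer produced by the copilot:
{actual}

Reply with PASS if the actual answer conveys the same information as the sample
answer (or satisfies the validation instructions), otherwise reply with FAIL.
"""

def validate_generative_answer(actual: str, expected: str,
                               call_llm: Callable[[str], str],
                               instructions: str = "") -> bool:
    """Return True when the LLM judges the generated answer acceptable."""
    prompt = VALIDATION_PROMPT.format(expected=expected,
                                      instructions=instructions or "None",
                                      actual=actual)
    verdict = call_llm(prompt)  # any LLM client that takes a prompt and returns text
    return verdict.strip().upper().startswith("PASS")
```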
2.4. Attachments (Adaptive Cards, etc.) test
Thanks to adaptive cards, a copilot can present query results in a clearer and more user-friendly manner. By leveraging them, a copilot can display information in a structured format, making it easier for users to understand and interact with the data. Wouldn’t it be great if we could also test results that display information using this format?
Within our copilot we created a topic to return some information about specific laptops that are stored in a Dataverse table. The information about the laptop is displayed using an adaptive card, and the topic is designed like the following:
Basically, when the user searches for a product, the copilot asks the user to enter the name of the product, calls a Power Automate flow to search through the database, and returns the information in an adaptive card (if the product is found).
Can we test that logic using Copilot Studio Kit? No! Why?
In this case, there is a dialog: first we need to trigger the topic, then the copilot asks the user to enter the product information, and finally it returns the adaptive card if a product is found. This type of logic can’t be tested at the moment because it involves a back-and-forth interaction between the user and the copilot, and we can only test a single interaction (one query from the user, one response from the copilot).
As a result, we created another topic that searches for a product and returns the adaptive card.
And this is the topic we are going to use to test the adaptive card result. When creating the test, we need to configure the Expected Attachments JSON value (the adaptive card content), among other attributes that we also used in previous tests.
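To get a feel for what the Expected Attachments JSON value contains, here is a minimal Adaptive Card for a laptop result expressed as a Python dictionary. The card structure (TextBlock, FactSet) is standard Adaptive Cards, but the product and its values are made up; the actual card returned by our topic is what you would paste into the test.

```python
import json

# A minimal, hypothetical Adaptive Card similar to what the laptop topic returns.
expected_attachment = {
    "type": "AdaptiveCard",
    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
    "version": "1.5",
    "body": [
        {"type": "TextBlock", "size": "Medium", "weight": "Bolder",
         "text": "Contoso Laptop 15"},                    # hypothetical product
        {"type": "FactSet", "facts": [
            {"title": "Price", "value": "$999"},          # hypothetical values
            {"title": "In stock", "value": "Yes"},
        ]},
    ],
}

# The Expected Attachments JSON value of the test is simply this JSON document;
# at run time the kit compares it with the attachment the copilot actually returns.
print(json.dumps(expected_attachment, indent=2))
```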
Now that we have configured different types of tests, it’s time to put them all together.
3. Create test sets
A test set is a collection of multiple tests grouped together. This allows you to run all the tests in the set simultaneously, making it easier to evaluate the performance and accuracy of your copilot. In the end, by organizing tests into sets, you can efficiently manage and execute comprehensive testing scenarios, ensuring your copilot performs as expected across various situations.
When you create a test, you need to assign it to a specific copilot test set. You therefore have two options to create a test set:
- Create it from the Test Sets section.
- Create it from the New Test screen.
Be careful: if you delete a Test Set, you also delete all the tests it contains!
We created a test set that has one test of each type:
Of course, we can create other test sets with new tests or a combination of existing ones.
4. Create test runs
From the Test Runs section we can create a test run, which is mainly based on selecting a Copilot Test Set and a Copilot Configuration.
As soon as we create the Test Run, all tests start running, and from the main screen we can go into the details of each one and check the results:
Some types of tests, like Generative Answers or Topic Match, can take some time to run, as we can see in the previous screenshot. In the first case, AI Builder is generating an answer in order to compare it with the expected one, while the latter is pending because the conversation transcript needs to be analysed.
5. Analyse test results
If we want to analyse results for each test, we can go into the Test Results section, and then go through the details screen for each test result. Besides the Result attribute, we can also check which was the Triggered Topic, the Latency and the Result Reason.
As we can see in the screenshot, one of the tests (Attachments (Adaptive Cards, etc.)) failed because the expected value was different from the actual result. It would then be time to review the test and the copilot itself to find the reason why it failed.
As mentioned in the official documentation, some tests might take a while to complete. For instance, Topic Match tests depend on the copilot’s Dataverse Enrichment configuration, which in our case delays the analysis by 60 minutes (the value recommended by Microsoft).
If we enable the Copy Full Transcript option, the transcript will also be stored in the result details of each test, which can provide really useful information, but at the same time requires more storage in Dataverse.
6. Rerun tests (if needed)
In some cases it can be very useful to run only some specific tests within a Test Set. Since we can inspect the Copilot Studio Kit solution, we can see that when we run a test, a Power Automate cloud flow called Run Copilot Tests is executed. This flow in turn calls four different child flows:
- Enrich with App Insights: Generative Answers tests use Application Insights and AI Builder. More specifically, this cloud flow runs an analytics query to find events related to generative answers in Application Insights for the tested conversation, and then triggers the Analyse with AI Builder cloud flow (see the sketch after this list).
- Analyse with AI Builder: All Generative Answers tests are executed by running this cloud flow, which uses AI Builder to get an answer (using an LLM) and compares it to the expected result.
- Enrich with Dataverse: All Topic Match tests need access to the conversation transcripts stored in Dataverse. This cloud flow checks that there is a match between the topic triggered in a test and the expected one.
- Update Rollup columns: Cloud flow that updates test run result columns such as the number of successful tests, failed tests, average latency, etc.
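As a rough illustration of what the first child flow does, the Python sketch below runs an analytics query through the Application Insights REST API. The endpoint and request shape are the documented query API, but the KQL itself (the event name and the conversation-id filter) is an assumption for illustration; the kit’s real query lives inside the Enrich with App Insights flow.

```python
import requests

APP_ID = "<application-insights-app-id>"    # placeholder
API_KEY = "<application-insights-api-key>"  # placeholder

# Assumed KQL: the event name and the property used to filter by conversation
# are illustrative, not necessarily what the kit's flow actually queries.
KQL = """
customEvents
| where timestamp > ago(1d)
| where name has "GenerativeAnswers"
| where tostring(customDimensions.conversationId) == "<conversation-id>"
| project timestamp, name, customDimensions
"""

response = requests.post(
    f"https://api.applicationinsights.io/v1/apps/{APP_ID}/query",
    headers={"x-api-key": API_KEY},
    json={"query": KQL},
    timeout=30,
)
response.raise_for_status()
for table in response.json()["tables"]:
    for row in table["rows"]:
        print(row)
```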
We can call any of these workflows from the Test Runs screen:
Finally, if we want to run all tests again instead of using the individual cloud flows, we can use the Duplicate Run action. In this way, another test run will be created with completely new statistics.
Conclusions
As you may have realized, testing copilots is essential, especially when utilizing generative answers, to ensure accuracy, reliability and user trust. The Copilot Studio Kit provides an excellent foundation for this process, offering tools that facilitate thorough testing and refinement. Its user-friendly interface makes it an excellent choice for initial testing phases.
However, when it comes to more complex scenarios, manual testing remains indispensable in order to achieve comprehensive and accurate results. This combined approach leverages the strengths of both automated and manual testing, ultimately leading to more reliable and effective copilots. Hopefully, future versions of the toolkit will include features that allow us to create more complex tests.