Description
All of the samples I've seen are just one-off test methods. Can we have an explicit sample showing how a real app might use the library?
For example, if I have a chat agent that a customer is talking to, how should I evaluate the conversation? Should I run evaluate after every AI response, reusing the same iteration id each time, so that each new evaluation overwrites the previous partial-conversation result and the full conversation is eventually the one that gets evaluated?
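To make the question concrete, here is a rough sketch of the pattern I'm asking about. It is plain Python with made-up stand-ins (`ResultStore`, `evaluate`, `fake_agent_reply`, and the `"chat-session-1"` id are all hypothetical, not the library's API); it only shows the "re-evaluate after each reply, overwrite under the same iteration id" flow.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for wherever evaluation results get persisted.
@dataclass
class ResultStore:
    results: dict = field(default_factory=dict)

    def save(self, iteration_id, result):
        # Saving under the same id replaces the earlier partial-conversation result.
        self.results[iteration_id] = result


def evaluate(conversation):
    # Placeholder evaluator: only records how many messages were scored.
    return {"messages_evaluated": len(conversation)}


def fake_agent_reply(user_message):
    # Placeholder for the real chat agent call.
    return f"echo: {user_message}"


store = ResultStore()
conversation = []
iteration_id = "chat-session-1"  # assumed: one id per customer conversation

for user_message in ["hi", "where is my order?", "thanks"]:
    conversation.append(("user", user_message))
    conversation.append(("assistant", fake_agent_reply(user_message)))
    # Re-evaluate the whole conversation so far after every AI response.
    store.save(iteration_id, evaluate(conversation))
```

With this flow the store ends up holding a single result per conversation, covering all of its messages. Is that the intended usage, or is there a better way to handle multi-turn conversations?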