Built evaluation system to test AI agent responses — 811.8k tokens — Promptbook.gg