I then used LangChain evaluators (with GPT-4 as the judge) and tracked everything in LangSmith. I also created public links to the traces so you can inspect the runs yourself.
I hope you find this helpful, and I am certainly open to feedback, criticism, or suggestions for improvement.