Paper about “A Meeting Assistant Benchmark for Long-Context Language Models” with a remarkable side note:
We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4’s evaluation scores are correlated with human judges’, its ability to differentiate among more than three score levels may be limited.
This is with the original GPT-4 (gpt-4-0613), not GPT-4 Turbo.
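To make that finding concrete, here is a minimal sketch (not the paper's actual pipeline, and with made-up illustrative scores) of how one might compare an LLM judge's scores against human scores, both on the raw scale and after collapsing to three coarse levels. If the judge only reliably separates about three levels, the two correlations should be close:

```python
# Hypothetical sketch: agreement between LLM-judge scores and human scores.
# The score lists below are fabricated for illustration, not ELITR-Bench data.
from scipy.stats import spearmanr

# Scores on a 1-10 scale for the same set of answers.
human_scores = [9, 7, 3, 8, 5, 2, 10, 6, 4, 7]
gpt4_scores  = [8, 8, 4, 9, 6, 3,  9, 7, 5, 6]

# Rank correlation on the raw 10-point scale.
rho_raw, _ = spearmanr(human_scores, gpt4_scores)

def to_three_levels(score: int) -> int:
    """Collapse a 1-10 score into low / medium / high (0, 1, 2)."""
    if score <= 3:
        return 0
    if score <= 7:
        return 1
    return 2

# Rank correlation after collapsing both raters to three levels.
rho_coarse, _ = spearmanr(
    [to_three_levels(s) for s in human_scores],
    [to_three_levels(s) for s in gpt4_scores],
)

print(f"Spearman rho (10-point scale): {rho_raw:.2f}")
print(f"Spearman rho (3 levels):       {rho_coarse:.2f}")
```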
(Keyword: ELITR-Bench)