Paper about “A Meeting Assistant Benchmark for Long-Context Language Models” with a remarkable side note:
We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4’s evaluation scores are correlated with human judges’, its ability to differentiate among more than three score levels may be limited.
This is with the original GPT-4 (gpt-4-0613), not GPT-4 Turbo.
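To make that finding concrete, here is a minimal sketch (not the paper's actual pipeline, and with made-up illustrative scores) of how one might compare an LLM judge's scores against human scores, both on the raw scale and after collapsing to three coarse levels. If the judge only reliably separates about three levels, the two correlations should be close:

```python
# Hypothetical sketch: agreement between LLM-judge scores and human scores.
# The score lists below are fabricated for illustration, not ELITR-Bench data.
from scipy.stats import spearmanr

# Scores on a 1-10 scale for the same set of answers.
human_scores = [9, 7, 3, 8, 5, 2, 10, 6, 4, 7]
gpt4_scores  = [8, 8, 4, 9, 6, 3,  9, 7, 5, 6]

# Rank correlation on the raw 10-point scale.
rho_raw, _ = spearmanr(human_scores, gpt4_scores)

def to_three_levels(score: int) -> int:
    """Collapse a 1-10 score into low / medium / high (0, 1, 2)."""
    if score <= 3:
        return 0
    if score <= 7:
        return 1
    return 2

# Rank correlation after collapsing both raters to three levels.
rho_coarse, _ = spearmanr(
    [to_three_levels(s) for s in human_scores],
    [to_three_levels(s) for s in gpt4_scores],
)

print(f"Spearman rho (10-point scale): {rho_raw:.2f}")
print(f"Spearman rho (3 levels):       {rho_coarse:.2f}")
```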
(Keyword: ELITR-Bench)