arXiv:2509.05440

Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too

Published on Sep 5, 2025

Abstract

AI-generated summary: A new direct-scoring method that uses synthetic summaries matches state-of-the-art pairwise evaluators in sample-level correlation with human judgments across several meta-evaluation benchmarks.

As large language models are increasingly used as automatic raters for free-form content, including document summarization, dialog, and story generation, much work has been dedicated to evaluating such models by measuring their correlation with human judgment. For sample-level performance, methods that rely on pairwise comparisons between machine-generated texts perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method that uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and we release the synthetic in-context summaries as data to facilitate future work.
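The abstract does not spell out how the synthetic summaries are turned into an absolute score, so the snippet below is only a minimal sketch of the general idea under stated assumptions: a candidate summary is compared pairwise against a bank of synthetic anchor summaries by an LLM judge, and the resulting win rate is mapped onto a fixed rating scale. The `judge_prefers` callable, the anchor bank, and the linear win-rate-to-score mapping are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the paper's implementation): score a
# candidate summary by pairwise comparison against synthetic anchor summaries
# and convert the win rate into an absolute score on a fixed scale.
from typing import Callable, Sequence


def direct_score(
    source_document: str,
    candidate: str,
    synthetic_anchors: Sequence[str],
    judge_prefers: Callable[[str, str, str], bool],  # hypothetical LLM-judge call
    scale: tuple[float, float] = (1.0, 5.0),
) -> float:
    """Map the candidate's win rate over the synthetic anchors onto `scale`."""
    if not synthetic_anchors:
        raise ValueError("need at least one synthetic anchor summary")
    # Count how many synthetic anchors the candidate beats in pairwise judging.
    wins = sum(
        judge_prefers(source_document, candidate, anchor)
        for anchor in synthetic_anchors
    )
    win_rate = wins / len(synthetic_anchors)
    low, high = scale
    return low + win_rate * (high - low)
```

Because every candidate lands on the same fixed scale, the resulting scores can be thresholded directly, and sample-level correlation with human judgments can then be computed per item and averaged, as in the meta-evaluation benchmarks cited above.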
