CLEAR: Error Analysis via LLM-as-a-Judge Made Easy Paper • 2507.18392 • Published Jul 24, 2025 • 19 • 2
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving Paper • 2504.02605 • Published Apr 3, 2025 • 48 • 3
Survey on Evaluation of LLM-based Agents Paper • 2503.16416 • Published Mar 20, 2025 • 95 • 2
WildIFEval: Instruction Following in the Wild Paper • 2503.06573 • Published Mar 9, 2025 • 14 • 4
Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models Paper • 2502.08130 • Published Feb 12, 2025 • 9 • 2