Differences in the results of the reproduction test on lm-evaluation-harness

#8
by ThreeGold116 - opened

Ouro-1.4B R4 evaluation results: reproduction vs. paper

  1. The reproduction evaluation settings follow the paper.

  2. Result comparison:

    | Benchmark | Paper Result | Reproduction Result |
    |-----------|--------------|---------------------|
    | mmlu      | 67.35        | 66.74               |
    | bbh       | 71.02        | 60.77               |
    | gsm8k     | 78.92        | 60.80               |
**ByteDance org** replied:

Sorry for the delay. I recently finished my internship at ByteDance, so I lost control of the repository for a period of time. Regarding the results, the MMLU scores are actually quite consistent (67.35 vs 66.74), likely because the paper reports log-prob results while we used a standard 5-shot setting in lm-eval. For the discrepancies in generate-until tasks like BBH and GSM8K, we used vLLM as the backend to speed up the evaluation since standard generation is quite slow. I suspect the performance drop is due to vLLM-specific behaviors rather than the model itself, so I wanted to ask if the original paper measurements were done without vLLM.

Hi @ThreeGold116 and @ridger ,

The reported performance drops might be due to using the chat template during the standard base evals. When using the chat template (apply_chat_template) in lm-eval, performance degrades significantly. When running the evaluations without the chat template (using raw text generation formatting), the results perfectly align with the paper.

Furthermore, I was able to reproduce the paper results using both the vllm and hf backends. This confirms that the backend choice wasn't the problem, only the prompt formatting.
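For anyone else reproducing this, the two settings above differ only in whether lm-eval's chat-template flag is passed. A minimal sketch of the two invocations follows; the model ID and task name are illustrative placeholders, not the exact commands used for the numbers in this thread:

```shell
# Raw-text formatting (no chat template) -- the setting that
# matched the paper in the comparison above.
lm_eval --model vllm \
  --model_args pretrained=ByteDance/Ouro-1.4B-R4 \
  --tasks gsm8k_cot \
  --num_fewshot 3 \
  --batch_size auto

# Same run with the chat template applied -- the setting that
# degraded the generate-until tasks (BBH, GSM8K) above.
lm_eval --model vllm \
  --model_args pretrained=ByteDance/Ouro-1.4B-R4 \
  --tasks gsm8k_cot \
  --num_fewshot 3 \
  --batch_size auto \
  --apply_chat_template
```

Swapping `--model vllm` for `--model hf` exercises the other backend; per the results above, the backend choice does not change the outcome, only the prompt formatting does.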

Here is a subset of my reproduction experiments (vLLM results):

| Benchmark (Metric / Setting) | Paper (1.4B R4) | Ours (with chat template) | Ours (NO chat template) |
|------------------------------|-----------------|---------------------------|-------------------------|
| MMLU (acc, 5-shot)           | 67.35           | 66.54                     | 67.46                   |
| BBH (exact_match flexible, 3-shot CoT) | 71.02 | 61.36                     | 71.06                   |
| ARC-C (acc_norm, 25-shot)    | 60.92           | 57.34                     | 60.41                   |
| GSM8K (exact_match flexible, 3-shot CoT) | 78.92 | 60.80                   | 79.38                   |

For reference, here are the environment versions I used for these runs:

  • vllm: 0.16.0
  • transformers: 4.57.6
  • lm-eval: 0.4.11
