How to reproduce the results in your blog?

#7
by 141forever - opened

At present, my student model is LLaMA3.2-1B, and the teacher model is Qwen3-4B. The training and testing data I am using is the CountDown Qwen3-4B version (27.7K).
I am training on two GPUs, and the training hyperparameters are as follows:
```python
training_args = GOLDConfig(
    save_strategy="steps",
    save_steps=500,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    per_device_train_batch_size=16,
    max_completion_length=512,
    teacher_model_name_or_path=teacher_name,
    teacher_tokenizer_name_or_path=teacher_name,
    bf16=True,
    use_uld_loss=True,
    uld_use_hybrid_loss=True,
    push_to_hub=False,
    report_to=[],
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    max_steps=3000,
    logging_steps=10,
    gradient_accumulation_steps=1,
    lmbda=1.0,
    beta=0.0,
    uld_crossentropy_weight=0.0,
    uld_distillation_weight=1.0,
)
```
Currently, the loss decreases from 1.8 to around 0.1. However, the trained model is unable to generate outputs in the required format. Moreover, its responses are almost irrelevant to the questions and do not form complete sentences.

141forever changed discussion title to "How to reproduce the results in your blog?"
Hugging Face H4 org

Are you using the model for tasks different from Countdown? Just curious about your setup.

> Are you using the model for tasks different from Countdown? Just curious about your setup.

Nope, just this Countdown task.

Hugging Face H4 org

Do you see any spikes in the loss function? We saw that spikes during training made the model unusable, even if the loss decreased afterward. That's why we set low learning rates (1e-7) during training. Anecdotally, learning rates above 1e-6 caused spikes and made training the model difficult.

Also, what is your effective batch size? That's also an important parameter to consider. Ours was 32, but we've seen that a larger batch size helps in more recent experiments.

> Do you see any spikes in the loss function? We saw that spikes during training made the model unusable, even if the loss decreased afterward. That's why we set low learning rates (1e-7) during training. Anecdotally, learning rates above 1e-6 caused spikes and made training the model difficult.
>
> Also, what is your effective batch size? That's also an important parameter to consider. Ours was 32, but we've seen that a larger batch size helps in more recent experiments.

Distillation works quite well when the teacher and student share the same vocabulary, but in the cross-vocabulary setting we really couldn’t get it to train properly. I’m also not sure how you and Thomasip managed to make it work.

Our learning rate is 1e-5. We used two GPUs, with a per-GPU batch size of 16, so the effective batch size is 32.

I suspect the issue might be here: different models use different padding tokens. For example, `<eos>` is a single token for LLaMA, but Qwen tokenizes it into five tokens: `<`, `e`, `o`, `s`, `>` (included in the answer segments). This can significantly increase the max_length within a batch, which may hurt performance.

I am now fixing this bug to find out.

I have already fixed the bug caused by the misalignment of special tokens across vocabularies.
However, after setting lambda to 0.25 and the learning rate to 1e-7, I observe relatively large fluctuations in the loss, and it occasionally drops to 0.

May I ask whether I could have access to all the training hyperparameters you used for cross-vocabulary training (with meta-llama/Llama-3.2-1B-Instruct as the student and Qwen/Qwen3-4B-Instruct-2507 as the teacher)?
I already know your learning rate and effective batch size, but I would like to know the values of max_completion_length, max_length, lambda, and beta. Also, could you confirm whether your Llama-3.2-1B-Instruct is the original version?

BTW, after cross-vocabulary training, Gemma3-1B achieves a result of 0.08, which is slightly higher than the 0.03 in your report. This suggests that the algorithm and the code implementation themselves are likely correct, so the issue of Llama-3.2-1B-Instruct is probably related to the hyperparameters.

Hugging Face H4 org

Here is the config we used to distill to the Llama model.

```yaml
# GBS of 32 with 2 nodes for training and 1 for vLLM server
model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
student_model_revision: main
teacher_model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
teacher_tokenizer_name_or_path: Qwen/Qwen3-4B-Instruct-2507
dataset_name: HuggingFaceTB/Countdown-Task-GOLD
dataset_config: verified_Qwen2.5-7B-Instruct
eos_token: <|eot_id|>
attn_implementation: flash_attention_2

learning_rate: 1e-7
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
warmup_ratio: 0.05
per_device_train_batch_size: 1
gradient_accumulation_steps: 2

num_train_epochs: 5
max_length: 4096
max_new_tokens: 2048
use_uld_loss: true
use_extended_uld: true
uld_use_hybrid_loss: true
uld_crossentropy_weight: 0.0
uld_distillation_weight: 1.0
uld_student_temperature: 1.0
uld_teacher_temperature: 1.0
lmbda: 1.0
beta: 0.0
bf16: true
```

Let me know if this helps with reproducing the results.
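As a sanity check of the "GBS of 32" comment at the top of the config: with `per_device_train_batch_size: 1` and `gradient_accumulation_steps: 2`, a global batch size of 32 implies 16 training GPUs. The 8-GPUs-per-node figure below is an assumption inferred from the "2 nodes for training" comment, not stated explicitly in the config:

```python
# Global batch size (GBS) implied by the config above.
# Node/GPU counts are assumptions inferred from the config comment.
num_nodes = 2
gpus_per_node = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 2

gbs = (num_nodes * gpus_per_node
       * per_device_train_batch_size
       * gradient_accumulation_steps)
print(gbs)  # 32
```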

Also, could you please share how you fixed the misalignment of the special tokens across vocabularies?

Thank you very much for your help. I will give it a try.

  1. I would also like to ask what the loss curve looks like under this set of hyperparameters. In my case, the loss only decreases when I enable off-policy training.
  2. In your report, you mentioned that the performance improves significantly after 1000 steps. With gradient_accumulation_steps = 2, does that mean it would take 2000 steps to observe a similarly significant improvement?

I fixed the bug in the `compute_loss` function, right after `completion_texts` is assigned:

```python
completion_texts_unpadded = []
for text_now in completion_texts:
    # Strip trailing student pad tokens before handing the text to the teacher
    pad = self.student_tokenizer.pad_token
    while text_now.endswith(pad):
        text_now = text_now[:-len(pad)]
    completion_texts_unpadded.append(text_now)

build_teacher_inputs_from_texts(
    ...,
    completion_texts_unpadded,
)
```

We must remove the suffix padding from the student outputs, because these special tokens cannot be recognized by the teacher model and would otherwise be treated as part of the on-policy response when computing the loss.
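This trailing-pad stripping can be exercised in isolation with a plain-string sketch (the pad-token value here is illustrative; in the trainer it comes from `self.student_tokenizer.pad_token`):

```python
def strip_trailing_pad(text: str, pad: str) -> str:
    """Remove any number of trailing pad-token strings from a decoded completion."""
    while text.endswith(pad):
        text = text[: -len(pad)]
    return text

# Padded and unpadded completions both come out clean
assert strip_trailing_pad("42<|eot_id|><|eot_id|>", "<|eot_id|>") == "42"
assert strip_trailing_pad("42", "<|eot_id|>") == "42"
```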

Hugging Face H4 org

> I would also like to ask what the loss curve looks like under this set of hyperparameters. In my case, the loss only decreases when I enable off-policy training.

The loss and grad norms throughout training are the ones below:

*(image: training loss and grad-norm curves)*

> does that mean it would take 2000 steps to observe a similarly significant improvement?

The results on the blog are with gradient_accumulation_steps = 2, so the conclusions are based on the config I sent previously.

Thank you. It seems our effective batch size (EBS) and gradient accumulation are consistent.

At around 2,000 steps, using the Llama-3.2-1B-Instruct model we can only reach about 0.17 pass@1 on the 10k test set, but from your plot it looks like it can reach around 0.3.

So I’d like to ask:

  1. What evaluation settings/parameters did you use? We use this script for evaluation, but we’re not sure what the evaluation parameters are: https://gist.github.com/cmpatino/2270db038f93e8714f8fb213ff60f48f

  2. Where did you get your Llama-3.2-1B model from? Before training, you mentioned getting a score of around 0.02, but on my side it is always 0 (I downloaded it from ModelScope).

  3. In your blog (https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation), is the evaluation set the full 10k version, or is it a split from the training set?

Hugging Face H4 org

> What evaluation settings/parameters did you use? We use this script for evaluation, but we're not sure what the evaluation parameters are.

I think the only parameters that are not included in this script are the sampling params.

```python
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    max_tokens=4096,
    n=4,
)
```
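With `n=4` samples per prompt, one common way to estimate pass@1 is to average the per-prompt success rates; the exact metric used for the blog numbers is not stated here, so this is a sketch of that common convention:

```python
def pass_at_1(per_prompt_results):
    """per_prompt_results: one list of booleans per prompt, one entry per sample.
    Estimates pass@1 as the mean per-prompt success rate."""
    rates = [sum(r) / len(r) for r in per_prompt_results]
    return sum(rates) / len(rates)

# 2 prompts, n=4 samples each: one solved once, the other never
print(pass_at_1([[True, False, False, False],
                 [False, False, False, False]]))  # 0.125
```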

> Where did you get your Llama-3.2-1B model from? Before training, you mentioned getting a score of around 0.02, but on my side it is always 0 (I downloaded it from ModelScope).

I'm getting the model directly from Hugging Face through transformers. I'm not modifying the model in any way.

> In your blog (https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation), is the evaluation set the full 10k version, or is it a split from the training set?

Yes, I'm using the full 10k samples in the test set for the eval.

Thank you for your help. We eventually identified the issue as an inconsistency between training and inference. Specifically, when running inference with the same model weights, identical parameters, and the same chat template, Hugging Face's native inference framework and vLLM can produce substantially different outputs, likely due to internal engine differences, or because this kind of white-box distillation is overly sensitive to the padding format.

We also found that the version of gold_trainer.py from last November was more stable than the current one (the probability aggregation part differs quite a lot). This suggests the current version still has significant room for optimization.

Hugging Face H4 org

Thank you for reporting this.

FWIW, we are aware that the method has room for improvement. If you're interested in the topic, we recently had a discussion on GitHub about a potential way to improve it.

Sure, I will have a look next week. Thank you!
