Can someone please explain what the short SFT step is?

#9
by gshasiri - opened

According to the recipe image below, it seems like we take the LC model and do a short SFT with non-thinking data.

But according to the following description from the blog, it seems like the short SFT is done on the mid-trained checkpoint.

"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."

[recipe image]