Can someone please explain what the short SFT step is?
#9, opened by gshasiri
According to the following recipe image, it seems that we take the LC model and do a short SFT with non-thinking data.
But according to the following description from the blog, it seems that the short SFT is done on the mid-training checkpoint.
"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."

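To make sure I am reading the quote correctly, here is how I understand the linear merge it describes. This is only a minimal sketch under my own assumptions: the function name `linear_merge`, the toy parameter names, and the plain-list "checkpoints" are all hypothetical and are not taken from the blog or from any library. Only the 0.9 and 0.1 weights come from the quote.

```python
# Minimal sketch of a linear (weighted-average) merge of two checkpoints.
# Assumption: each checkpoint is a dict mapping parameter names to flat
# lists of floats. Real checkpoints hold tensors, but the arithmetic is the same.

def linear_merge(soup, mid, w_soup=0.9, w_mid=0.1):
    """Return w_soup * soup + w_mid * mid, parameter by parameter."""
    if soup.keys() != mid.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in soup:
        a, b = soup[name], mid[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted average of the two parameter vectors.
        merged[name] = [w_soup * x + w_mid * y for x, y in zip(a, b)]
    return merged


# Toy stand-ins for the APO model soup and the mid-training checkpoint.
apo_soup = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
mid_ckpt = {"layer.weight": [3.0, 4.0], "layer.bias": [10.0]}

merged = linear_merge(apo_soup, mid_ckpt)
print(merged)
```

If this reading is right, the mid-training checkpoint only enters at the merge step with a weight of 0.1, which is why I am unsure which model the short SFT is applied to.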