Can someone please explain what the short SFT step is?

#9
by gshasiri - opened

According to the recipe image below, it seems like we take the LC model and do a short SFT with non-thinking data.

But according to the following description from the blog, it seems like the short SFT is done on the mid-trained checkpoint.

"Combine the model soup with a mid-training checkpoint that has strong long-content performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and mid-training checkpoint, respectively, achieved the best performance. We were able to recover the base model’s RULER score on contexts up to 128k tokens."

[recipe image]