LoRA Fine-Tuning - Several Issues

#25
by aetherforge - opened

Has anyone successfully trained this model? I have tried several approaches, including Unsloth, LoRA, etc. Error after error. Any tips or a training script that works on an L40S-180 would be awesome. Free access to an L40S-90, 180, or 360 (reasonable access, nothing crazy; you want to fine-tune a model or two, I'm good with that) for anyone who can help.

Quick context:

- Setup: 2× NVIDIA L40S on OVH (44 GB each, single node)
- PyTorch 2.5.1 + cu121, transformers 4.57.2, peft + trl installed from pip this week (can post exact versions if needed)
- QLoRA, 4-bit, LoRA rank 32, target_modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- Training on 372 train examples + 93 val examples, JSONL, already formatted to chat / SFT style

The model and data load fine, tokenization runs, and the GPUs spin up, but as soon as the training loop starts it dies cleanly:

- The progress bar sits at 0/72, then the process exits
- nvidia-smi goes back to showing no Python processes
- There is no obvious Python traceback, just the usual warnings about `use_cache=True is incompatible with gradient checkpointing` and `torch.utils.checkpoint: use_reentrant should be passed explicitly`

Same basic behavior whether I use Unsloth's SFTTrainer or a plain HF QLoRA trainer. If anyone has a minimal LoRA / QLoRA training script that actually runs this checkpoint on a single 2×L40S box, I'm happy to adapt it and then share back a cleaned-up version here so others don't have to fight this.
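For reference, this is roughly the stripped-down "plain HF QLoRA" path I'm running. The model ID and file paths are placeholders, and the alpha/dropout/batch-size/epoch numbers are just my guesses, nothing carefully tuned:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "org/model-name"   # placeholder for this checkpoint
TRAIN_JSONL = "train.jsonl"   # 372 chat-formatted examples
VAL_JSONL = "val.jsonl"       # 93 chat-formatted examples

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # shards the model across both L40S cards
)
model.config.use_cache = False  # to go with gradient checkpointing

# LoRA rank 32 on the attention + MLP projections
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,        # guess, not from any reference config
    lora_dropout=0.05,    # guess
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset(
    "json", data_files={"train": TRAIN_JSONL, "validation": VAL_JSONL}
)

training_args = SFTConfig(
    output_dir="qlora-out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=5,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()
```

I launch it with a plain `python train_qlora.py` (no accelerate/torchrun), so `device_map="auto"` splits the model across the two cards and there's only one process. Happy to post the Unsloth variant too if that's more useful.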
