LoRA Fine-Tuning - Several Issues

#25
by aetherforge - opened

Has anyone successfully trained this model? I have tried several approaches, including Unsloth, LoRA, etc. Error after error. Any tips or a training script that works on an L40S-180 would be awesome. Free access to an L40S-90, 180, or 360 (reasonable access, nothing crazy; you want to fine-tune a model or two, I'm good with that) for anyone who can help.

Quick context:

- Setup: 2× NVIDIA L40S on OVH (44 GB each, single node)
- PyTorch 2.5.1 + cu121, transformers 4.57.2, peft + trl installed from pip this week (can post exact versions if needed)
- QLoRA, 4-bit, LoRA rank 32, target_modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- Training on 372 train examples + 93 val examples, JSONL, already formatted to chat / SFT style

The model and data load fine, tokenization runs, and the GPUs spin up, but as soon as the training loop starts it dies cleanly:

- The progress bar sits at 0/72, then the process exits
- nvidia-smi goes back to showing no Python processes
- There is no obvious Python traceback, just the usual warnings about `use_cache=True is incompatible with gradient checkpointing` and `torch.utils.checkpoint: use_reentrant should be passed explicitly`

Same basic behavior whether I use Unsloth's SFTTrainer or a plain HF QLoRA trainer. If anyone has a minimal LoRA / QLoRA training script that actually runs this checkpoint on a single 2×L40S box, I'm happy to adapt it and then share back a cleaned-up version here so others don't have to fight this.
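For reference, this is roughly the stripped-down "plain HF QLoRA" path I'm running. The model ID and file paths are placeholders, and the alpha/dropout/batch-size/epoch numbers are just my guesses, nothing carefully tuned:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "org/model-name"   # placeholder for this checkpoint
TRAIN_JSONL = "train.jsonl"   # 372 chat-formatted examples
VAL_JSONL = "val.jsonl"       # 93 chat-formatted examples

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # shards the model across both L40S cards
)
model.config.use_cache = False  # to go with gradient checkpointing

# LoRA rank 32 on the attention + MLP projections
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,        # guess, not from any reference config
    lora_dropout=0.05,    # guess
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset(
    "json", data_files={"train": TRAIN_JSONL, "validation": VAL_JSONL}
)

training_args = SFTConfig(
    output_dir="qlora-out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=5,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()
```

I launch it with a plain `python train_qlora.py` (no accelerate/torchrun), so `device_map="auto"` splits the model across the two cards and there's only one process. Happy to post the Unsloth variant too if that's more useful.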
