Training supported

#6
by tastelikefeet - opened

Qwen3 embedding models can be fine-tuned with SWIFT:

pip install ms-swift -U
INFONCE_MASK_FAKE_NEGATIVE=true \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
    --model Qwen/Qwen3-Embedding-4B \
    --task_type embedding \
    --model_type qwen3_emb \
    --train_type full \
    --dataset sentence-transformers/stsb:positive \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --output_dir output \
    --eval_steps 20 \
    --num_train_epochs 5 \
    --save_steps 70 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 6e-6 \
    --loss_type infonce \
    --label_names labels \
    --dataloader_drop_last true \
    --deepspeed zero3

We use --loss_type infonce, which is also the loss type used to train the original model. Other loss types such as --loss_type cosine_similarity can be used as well. InfoNCE is a contrastive-learning loss: by default, the other samples in the batch are treated as negatives, which is controlled by the INFONCE_USE_BATCH environment variable (default True). The script above also sets an additional environment variable, INFONCE_MASK_FAKE_NEGATIVE=true, which ignores negatives with excessively high similarity (e.g., negatives whose similarity exceeds the positive's similarity + 0.1), preventing interference from duplicated data or false negatives during training.
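For intuition, here is a minimal sketch (not SWIFT's actual implementation) of in-batch InfoNCE with fake-negative masking; the function name, temperature, and the 0.1 margin are illustrative assumptions:

import torch
import torch.nn.functional as F

def infonce_with_fake_negative_mask(query_emb, pos_emb, temperature=0.05, margin=0.1):
    """Illustrative in-batch InfoNCE sketch; not SWIFT's actual code.

    query_emb, pos_emb: (batch, dim) L2-normalized embeddings.
    Row i of pos_emb is the positive for query i; all other rows act as
    in-batch negatives (the INFONCE_USE_BATCH=true behavior).
    """
    sim = query_emb @ pos_emb.T                    # (batch, batch) cosine similarities
    pos_sim = sim.diagonal().unsqueeze(1)          # similarity of each query to its positive

    # INFONCE_MASK_FAKE_NEGATIVE=true: drop negatives whose similarity exceeds
    # the positive similarity plus a small margin (likely duplicates / false negatives).
    not_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    fake_negative = (sim > pos_sim + margin) & not_diag
    sim = sim.masked_fill(fake_negative, float("-inf"))

    labels = torch.arange(sim.size(0), device=sim.device)  # positives sit on the diagonal
    return F.cross_entropy(sim / temperature, labels)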

The dataset format corresponding to InfoNCE loss is as follows:

{"query": "sentence1", "response": "sentence1-pos", "rejected_response": ["sentence1-neg1", "sentence1-neg2"]}

The negatives of this sample, together with the positives and negatives of the other samples in the batch, will all be used as negatives for this sample.
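For example, a small training file in this format could be produced as follows (the file name and sentences are placeholders):

import json

# Hypothetical example rows; replace with your own queries, positives, and negatives.
samples = [
    {"query": "how do I reset my password",
     "response": "Steps to reset a forgotten account password",
     "rejected_response": ["How to change a username", "Password strength guidelines"]},
    {"query": "capital of France",
     "response": "Paris is the capital of France",
     "rejected_response": ["Lyon is a large city in France"]},
]

with open("dataset_qwen_format.jsonl", "w", encoding="utf-8") as f:
    for row in samples:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")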

Documentation here:

https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html

When I train using LoRA, the eval loss declines to four decimal places, but when I test with the merged model, the results do not improve.

I have trained the model with the params below:

#!/bin/bash

# Simple training without DeepSpeed (more stable)
# This version avoids the Triton/DeepSpeed compatibility issues

# Fix for any potential torch compile issues
export TORCH_COMPILE_DISABLE=1

# Set number of GPUs
# For 561 samples: Use 1 GPU to avoid empty batches
# For 5000+ samples: Use 4 GPUs
nproc_per_node=4

# --dataset /colossus/CL_TEST_TRAINER/dataset/output/dataset_qwen_format.jsonl \

export INFONCE_USE_BATCH=false

# optional
export INFONCE_HARD_NEGATIVES=4
export INFONCE_TEMPERATURE=0.07

NPROC_PER_NODE=$nproc_per_node \
nohup /colossus/miniconda3/envs/trainer_2/bin/swift sft \
    --dataset /colossus/CL_TEST_TRAINER/generated_data/dataset_qwen_format.jsonl \
    --output_dir /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542 \
    --model Qwen/Qwen3-Embedding-8B \
    --task_type embedding \
    --train_type lora \
    --model_type qwen3_emb \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --eval_steps 50 \
    --num_train_epochs 5 \
    --save_steps 50 \
    --save_total_limit 15 \
    --metric_for_best_model "eval_loss" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing true \
    --learning_rate 1e-4 \
    --loss_type infonce \
    --torch_dtype float16 \
    --lora_rank 128 \
    --lora_alpha 256 \
    --attn_impl "flash_attention_2" \
    --max_length 32768 \
    --lr_scheduler_type cosine \
    --weight_decay 0.1 \
    --warmup_ratio 0.15 \
    --resume_from_checkpoint /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/checkpoint-250 \
    --resume_only_model true \
    --seed 42 > training_fixed.log 2>&1
    # --target_modules auto
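Before testing, the LoRA checkpoint is merged into the base weights. For reference, this is typically done with SWIFT's export command, roughly as below; the flags follow the SWIFT documentation and should be checked against the installed ms-swift version:

swift export \
    --adapters /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/checkpoint-500 \
    --merge_lora true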
Eval accuracy is good:

{'eval_loss': 0.00051994, 'eval_margin': 0.87741113, 'eval_mean_neg': -0.41950089, 'eval_mean_pos': 0.66871929, 'eval_runtime': 2108.8905, 'eval_samples_per_second': 1.391, 'eval_steps_per_second': 0.348, 'epoch': 1.15, 'global_step/max_steps': '500/2180', 'percentage': '22.94%', 'elapsed_time': '21h 53m 29s', 'remaining_time': '3d 1h 33m 19s', 'memory(GiB)': 114.47, 'train_speed(iter/s)': 0.006344}

Train: 23%|β–ˆβ–ˆβ–Ž | 500/2180 [21:53:29<128:04:34, 274.45s/it]
Val: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 734/734 [35:05<00:00, 3.07s/it]
/colossus/miniconda3/envs/trainer_2/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
warnings.warn( # warn only once
[INFO:swift] Saving model checkpoint to /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/v0-20251115-083354/checkpoint-500

Train: 23%|β–ˆβ–ˆβ–Ž | 501/2180 [21:57:59<422:28:34, 905.85s/it]
Train: 23%|β–ˆβ–ˆβ–Ž | 502/2180 [22:02:26<332:52:54, 714.17s/it]
Train: 23%|β–ˆβ–ˆβ–Ž | 503/2180 [22:06:52<270:05:03, 579.79s/it]
Train: 23%|β–ˆβ–ˆβ–Ž | 504/2180 [22:11:12<225:09:51, 483.65s/it]
Train: 23%|β–ˆβ–ˆβ–Ž | 505/2180 [22:15:51<196:35:58, 422.54s/it]

{'loss': 0.02854861, 'grad_norm': 0.12833126, 'learning_rate': 7.798e-05, 'epoch': 1.16, 'global_step/max_steps': '505/2180', 'percentage': '23.17%', 'elapsed_time': '22h 15m 51s', 'remaining_time': '3d 1h 50m 50s', 'memory(GiB)': 114.47, 'train_speed(iter/s)': 0.006301}

But when tested with the code below, accuracy on the same eval data is lower:

import os

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

MAX_LENGTH = 32768  # assumed to match the --max_length used in training

class MergedEmbeddingModel:  # class name is illustrative; the original snippet omits it
    def __init__(self, merged_model_path):
        print(f"Loading merged LoRA model from: {merged_model_path}")
        self.model_name = merged_model_path

        if not os.path.exists(merged_model_path):
            raise FileNotFoundError(f"Merged model not found at: {merged_model_path}")

        print("Loading tokenizer...")
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(
                merged_model_path,
                padding_side='left',
                trust_remote_code=True
            )
        except Exception:
            # Fall back to the base model's tokenizer if the merged folder has none.
            self.tokenizer = AutoTokenizer.from_pretrained(
                "Qwen/Qwen3-Embedding-8B",
                padding_side='left',
                trust_remote_code=True
            )

        print("Loading model weights...")
        self.model = AutoModel.from_pretrained(
            merged_model_path,
            torch_dtype=torch.float16,
            trust_remote_code=True,
            device_map='cuda'
        )
        self.model.eval()
        print("βœ“ LoRA model loaded")

    def last_token_pool(self, last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        """Last-token pooling: take the hidden state of the final non-padded token."""
        left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
        if left_padding:
            return last_hidden_states[:, -1]
        else:
            sequence_lengths = attention_mask.sum(dim=1) - 1
            batch_size = last_hidden_states.shape[0]
            return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

    def get_embedding(self, text, is_query=False):
        """Tokenize, encode, pool the last token, and L2-normalize."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=MAX_LENGTH
        )
        inputs = {k: v.cuda() for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)

        embedding = self.last_token_pool(outputs.last_hidden_state, inputs['attention_mask'])
        embedding = F.normalize(embedding, p=2, dim=1)

        if torch.isnan(embedding).any():
            embedding = torch.nan_to_num(embedding, nan=0.0)

        return embedding.cpu().numpy()
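For reference, a query-vs-document comparison using this class might look like the following; the class name, model path, and texts are placeholders:

import numpy as np

# Hypothetical usage of the (illustratively named) class above.
embedder = MergedEmbeddingModel("/path/to/merged_model")

query_emb = embedder.get_embedding("how do I reset my password", is_query=True)
doc_embs = np.vstack([
    embedder.get_embedding("Steps to reset a forgotten account password"),
    embedder.get_embedding("Password strength guidelines"),
])

# Embeddings are L2-normalized, so a dot product gives cosine similarity.
scores = doc_embs @ query_emb[0]
print(scores)  # higher = more similar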

How can I correct the result?
