Training supported
Qwen3 embedding models can be fine-tuned with SWIFT. First, install the latest version:
pip install ms-swift -U
INFONCE_MASK_FAKE_NEGATIVE=true \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
    --model Qwen/Qwen3-Embedding-4B \
    --task_type embedding \
    --model_type qwen3_emb \
    --train_type full \
    --dataset sentence-transformers/stsb:positive \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --output_dir output \
    --eval_steps 20 \
    --num_train_epochs 5 \
    --save_steps 70 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 6e-6 \
    --loss_type infonce \
    --label_names labels \
    --dataloader_drop_last true \
    --deepspeed zero3
We use --loss_type infonce, which is also the loss type used to train the original model. Other loss types are available as well, such as --loss_type cosine_similarity. InfoNCE is a contrastive learning loss. The script above defaults to treating the other samples in a batch as negative examples; this is controlled by the INFONCE_USE_BATCH environment variable, which defaults to True. The script also sets an extra environment variable, INFONCE_MASK_FAKE_NEGATIVE=true, which ignores negatives with excessively high similarity (e.g., negatives whose similarity exceeds the positive's similarity + 0.1), preventing interference from duplicated data or false negatives during training.
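For comparison, the cosine_similarity loss trains on scored pairs; per the SWIFT embedding documentation linked below, its dataset format is roughly as follows (a sketch; the label field holds the target similarity):

{"query": "sentence1", "response": "sentence2", "label": 0.8}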
The dataset format corresponding to InfoNCE loss is as follows:
{"query": "sentence1", "response": "sentence1-pos", "rejected_response": ["sentence1-neg1", "sentence1-neg2"]}
The negatives listed for a sample, together with the positives and negatives of every other sample in the batch, are all used as that sample's negatives; the sketch below makes this concrete.
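Here is a minimal sketch of in-batch InfoNCE with fake-negative masking, written against the description above. It is not ms-swift's actual implementation; the temperature value, shapes, and the assumption of L2-normalized embeddings are illustrative:

import torch
import torch.nn.functional as F

def infonce_in_batch(query, positive, hard_negatives, temperature=0.05):
    """Sketch of in-batch InfoNCE with fake-negative masking.

    query: (B, D), positive: (B, D), hard_negatives: (N, D); rows L2-normalized.
    Candidate i is the positive for query i; every other candidate is a negative.
    """
    candidates = torch.cat([positive, hard_negatives], dim=0)  # (B + N, D)
    sims = query @ candidates.T                                # cosine sims, (B, B + N)
    labels = torch.arange(query.shape[0], device=query.device)
    # INFONCE_MASK_FAKE_NEGATIVE: drop negatives whose similarity exceeds
    # the positive's similarity + 0.1 (likely duplicates / false negatives)
    pos_sim = sims.gather(1, labels.unsqueeze(1))              # (B, 1)
    cols = torch.arange(sims.shape[1], device=sims.device)
    fake = (sims > pos_sim + 0.1) & (cols != labels.unsqueeze(1))
    sims = sims.masked_fill(fake, float("-inf"))
    return F.cross_entropy(sims / temperature, labels)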
Documentation here:
https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
When I train using LoRA, the eval loss declines to the fourth decimal place, but when I test with the merged model, the results do not improve.
I trained the model with the parameters below:
#!/bin/bash
# Simple training without DeepSpeed (more stable).
# This version avoids the Triton/DeepSpeed compatibility issues.

# Fix for any potential torch compile issues
export TORCH_COMPILE_DISABLE=1

# Set number of GPUs:
# for 561 samples, use 1 GPU to avoid empty batches;
# for 5000+ samples, use 4 GPUs.
nproc_per_node=4

# --dataset /colossus/CL_TEST_TRAINER/dataset/output/dataset_qwen_format.jsonl \
export INFONCE_USE_BATCH=false

# optional
export INFONCE_HARD_NEGATIVES=4
export INFONCE_TEMPERATURE=0.07

NPROC_PER_NODE=$nproc_per_node \
nohup /colossus/miniconda3/envs/trainer_2/bin/swift sft \
    --dataset /colossus/CL_TEST_TRAINER/generated_data/dataset_qwen_format.jsonl \
    --output_dir /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542 \
    --model Qwen/Qwen3-Embedding-8B \
    --task_type embedding \
    --train_type lora \
    --model_type qwen3_emb \
    --split_dataset_ratio 0.05 \
    --eval_strategy steps \
    --eval_steps 50 \
    --num_train_epochs 5 \
    --save_steps 50 \
    --save_total_limit 15 \
    --metric_for_best_model "eval_loss" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing true \
    --learning_rate 1e-4 \
    --loss_type infonce \
    --torch_dtype float16 \
    --lora_rank 128 \
    --lora_alpha 256 \
    --attn_impl "flash_attention_2" \
    --max_length 32768 \
    --lr_scheduler_type cosine \
    --weight_decay 0.1 \
    --warmup_ratio 0.15 \
    --resume_from_checkpoint /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/checkpoint-250 \
    --resume_only_model true \
    --seed 42 > training_fixed.log 2>&1
    # --target_modules auto
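For reference, the LoRA checkpoint was then merged before testing; with ms-swift that step looks roughly like the sketch below. The --adapters flag is an assumption based on the ms-swift 3.x docs, and the checkpoint path is taken from the log further down:

swift export \
    --adapters /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/v0-20251115-083354/checkpoint-500 \
    --merge_lora true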
The eval metrics during training look good:
{'eval_loss': 0.00051994, 'eval_margin': 0.87741113, 'eval_mean_neg': -0.41950089, 'eval_mean_pos': 0.66871929, 'eval_runtime': 2108.8905, 'eval_samples_per_second': 1.391, 'eval_steps_per_second': 0.348, 'epoch': 1.15, 'global_step/max_steps': '500/2180', 'percentage': '22.94%', 'elapsed_time': '21h 53m 29s', 'remaining_time': '3d 1h 33m 19s', 'memory(GiB)': 114.47, 'train_speed(iter/s)': 0.006344}
Train: 23%|███ | 500/2180 [21:53:29<128:04:34, 274.45s/it]
Val: 100%|██████████| 734/734 [35:05<00:00, 2.87s/it]
/colossus/miniconda3/envs/trainer_2/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
warnings.warn( # warn only once
[INFO:swift] Saving model checkpoint to /colossus/CL_TEST_TRAINER/output/simple_run_10percent/v138-20251114-055542/v0-20251115-083354/checkpoint-500
Train: 23%|███ | 501/2180 [21:57:59<422:28:34, 905.85s/it]
Train: 23%|███ | 502/2180 [22:02:26<332:52:54, 714.17s/it]
Train: 23%|███ | 503/2180 [22:06:52<270:05:03, 579.79s/it]
Train: 23%|███ | 504/2180 [22:11:12<225:09:51, 483.65s/it]
Train: 23%|███ | 505/2180 [22:15:51<196:35:58, 422.54s/it]
{'loss': 0.02854861, 'grad_norm': 0.12833126, 'learning_rate': 7.798e-05, 'epoch': 1.16, 'global_step/max_steps': '505/2180', 'percentage': '23.17%', 'elapsed_time': '22h 15m 51s', 'remaining_time': '3d 1h 50m 50s', 'memory(GiB)': 114.47, 'train_speed(iter/s)': 0.006301}
But when tested with the script below, accuracy is lower on the same eval data:
import os

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

MAX_LENGTH = 32768  # matches --max_length used during training


class MergedEmbeddingModel:  # class wrapper reconstructed; the original name was omitted
    def __init__(self, merged_model_path):
        print(f"Loading merged LoRA model from: {merged_model_path}")
        self.model_name = merged_model_path
        if not os.path.exists(merged_model_path):
            raise FileNotFoundError(f"Merged model not found at: {merged_model_path}")

        print("Loading tokenizer...")
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(
                merged_model_path,
                padding_side='left',
                trust_remote_code=True
            )
        except Exception:
            # Fall back to the base model's tokenizer if the merged dir lacks one
            self.tokenizer = AutoTokenizer.from_pretrained(
                "Qwen/Qwen3-Embedding-8B",
                padding_side='left',
                trust_remote_code=True
            )

        print("Loading model weights...")
        self.model = AutoModel.from_pretrained(
            merged_model_path,
            dtype=torch.float16,
            trust_remote_code=True,
            device_map='cuda'
        )
        self.model.eval()
        print("✓ LoRA model loaded")

    def last_token_pool(self, last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        """Last-token pooling: take the hidden state of each sequence's final real token."""
        # With left padding, the last position is always a real token
        left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
        if left_padding:
            return last_hidden_states[:, -1]
        else:
            sequence_lengths = attention_mask.sum(dim=1) - 1
            batch_size = last_hidden_states.shape[0]
            return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

    def get_embedding(self, text, is_query=False):
        """Get an L2-normalized embedding (note: is_query is currently unused)."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=MAX_LENGTH
        )
        inputs = {k: v.cuda() for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model(**inputs)
            embedding = self.last_token_pool(outputs.last_hidden_state, inputs['attention_mask'])
            embedding = F.normalize(embedding, p=2, dim=1)
        if torch.isnan(embedding).any():
            embedding = torch.nan_to_num(embedding, nan=0.0)
        return embedding.cpu().numpy()
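One detail worth checking when eval loss and test accuracy disagree: Qwen3-Embedding is instruction-aware, and its model card formats queries as "Instruct: {task}\nQuery: {query}". If the training data carried that prefix, test-time queries should use it too. A usage sketch, where the path, task description, and texts are placeholders:

# Usage sketch; path, task description, and texts are placeholders.
model = MergedEmbeddingModel("/path/to/merged-checkpoint")
task = "Given a web search query, retrieve relevant passages that answer the query"
q = model.get_embedding(f"Instruct: {task}\nQuery: how does InfoNCE loss work?", is_query=True)
d = model.get_embedding("InfoNCE is a contrastive loss over positives and in-batch negatives.")
score = (q @ d.T).item()  # cosine similarity; embeddings are already L2-normalized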
How can I correct this result?