Environment and Installation

Qwen2-VL-7B-CML-SFT is fully supported by the latest Hugging Face Transformers codebase.
To ensure compatibility with the most recent vision–language features and model implementations, we strongly recommend installing Transformers from source instead of using the PyPI release.

pip install git+https://github.com/huggingface/transformers accelerate

This ensures access to the newest multimodal model definitions and inference utilities required by Qwen2-VL-7B-CML-SFT.
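
As a quick sanity check, you can print the installed Transformers version to confirm the source build is active:

python -c "import transformers; print(transformers.__version__)"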

Visual Input Utilities

To simplify multimodal input handling, we provide an auxiliary toolkit that abstracts different types of visual inputs into a unified interface. The toolkit supports Base64-encoded images, image and video URLs, and interleaved image–video inputs. The API is designed to minimize boilerplate code when working with complex multimodal inputs. For video-based scenarios, we highly recommend enabling the decord backend for faster and more efficient video decoding.

pip install qwen-vl-utils[decord]==0.0.8

Please refer to our provided requirements.txt for the complete environment configuration.
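
As a minimal sketch of this unified interface, the snippet below mixes several input types in one conversation and normalizes them with process_vision_info; the remote URL and the file:// paths are placeholders to replace with your own data, and the base64 entry reuses the local image from the chat example further down.

import base64
from qwen_vl_utils import process_vision_info

# Build a base64 data URI from a local file (reusing the image from the chat example below)
with open("bo_10070.png", "rb") as f:
    b64_image = "data:image;base64," + base64.b64encode(f.read()).decode()

# Each visual entry may be a plain path, a file:// URI, an http(s) URL, or a base64 data URI;
# process_vision_info normalizes all of them into model-ready inputs.
mixed_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": b64_image},                       # base64-encoded image
            {"type": "image", "image": "file:///path/to/local.png"},     # local file URI (placeholder)
            {"type": "image", "image": "https://example.com/demo.jpg"},  # remote URL (placeholder)
            {"type": "video", "video": "file:///path/to/clip.mp4"},      # video, decoded with decord when installed
            {"type": "text", "text": "Describe the images and the video."},
        ],
    }
]

image_inputs, video_inputs = process_vision_info(mixed_messages)

The returned image_inputs and video_inputs are passed directly to the processor, exactly as in the chat example below.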

Chat Inference with 🤗 Transformers

Below is a minimal example demonstrating how to load and use a Qwen2-VL-7B-CML-SFT chat model with 🤗 Transformers. The example integrates qwen_vl_utils to preprocess and normalize visual inputs before inference.


import random

import numpy as np
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "shajiu/Qwen2-VL-7B-CML-SFT"

# 1) Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    # attn_implementation="flash_attention_2",  # optional: requires installing flash-attn
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 2) Build the multimodal conversation (image + text); the text part below is a Tibetan prompt
image_path = "bo_10070.png"  # note: the official examples use a file:/// prefix for local files
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "《རྨ་བྱའི་རྒྱན་གོས་ཏེ།ལུས་ཐོག་གྱོན་པའི་ལོ་ངོ་སྟོང་གི་ལོ་རྒྱུས།》ཡི་རྩ་རྒྱུད་ལ་གཞིགས་ནས་བརྗོད་བྱ་གཙོ་བོར་ཞིབ་ཚགས་བྱས་ནས་རྒྱས་བཤད་བྱེད་པ་མ་ཟད།པར་རིས་ཁྲོད་ཀྱི་གཙོ་གནད་ལ་བརྟེན་ནས་གསལ་བཤད་བྱེད་དགོས།"},
        ],
    }
]

# 3) Preprocess into model-ready tensors
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the model's device (more robust with multi-GPU / auto device_map)
inputs = {k: v.to(model.device) for k, v in inputs.items()}


# Fix random seeds so that sampled generations are reproducible
seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

gen_kwargs = dict(
    max_new_tokens=4000,
    do_sample=True,            # enable sampling
    temperature=0.2,           # lower is more deterministic (0.2-0.4 is a common range)
    top_p=0.9,                 # nucleus sampling
    top_k=50,                  # limit the candidate set to curb off-topic drift
    repetition_penalty=1.08,   # slightly stronger suppression of repetition
)

# 4) Run generation
with torch.inference_mode():
    generated_ids = model.generate(**inputs, **gen_kwargs)



# 5) Decode only the newly generated tokens
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
]
out_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print("推理结果:\n",out_text)