---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: apache-2.0
base_model: Qwen/Qwen3-VL-235B-A22B-Instruct
---

# Qwen3-VL-235B-A22B-Instruct-NVFP4

## Model Overview
- **Model Architecture:** Qwen/Qwen3-VL-235B-A22B-Instruct
  - **Input:** Text/Image
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 10/29/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details. A minimal serving sketch is included at the end of this card.

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.
```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# NOTE: Requires a minimum of transformers 4.57.0
MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

# Load model.
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess_function(example):
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != "pixel_values"
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }


# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp4 with group-wise quantization
#   * quantize the activations to fp4 with dynamic group-wise scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
    ],
)

# Apply quantization.
oneshot(
    model=model,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    dataset=ds,
    data_collator=data_collator,
)

print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
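As a quick sanity check after the script above completes, the saved checkpoint can be inspected to confirm that the quantization config was serialized and that the on-disk size reflects the roughly 75% reduction mentioned under Model Optimizations. The snippet below is a minimal sketch, assuming the script's `SAVE_DIR` output directory and the usual compressed-tensors layout (quantization metadata stored under `quantization_config` in `config.json`); it uses only the Python standard library.

```python
import json
import os

# Output directory produced by the quantization script above.
SAVE_DIR = "Qwen3-VL-235B-A22B-Instruct-NVFP4"

# Print the serialized quantization metadata (truncated for readability).
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)
print(json.dumps(config.get("quantization_config", {}), indent=2)[:2000])

# Sum the sizes of the saved weight shards. A 235B-parameter model is roughly
# 470 GB in BF16 (2 bytes/parameter), so the NVFP4 checkpoint should come in
# at roughly a quarter of that.
total_bytes = sum(
    os.path.getsize(os.path.join(SAVE_DIR, name))
    for name in os.listdir(SAVE_DIR)
    if name.endswith(".safetensors")
)
print(f"checkpoint size: {total_bytes / 1e9:.1f} GB")
```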
## Evaluation

This model was evaluated on the OpenLLM v1 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness) and on the vision benchmarks MMMU and ChartQA using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

### Accuracy
| Category | Metric | Qwen/Qwen3-VL-235B-A22B-Instruct | RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model) | Recovery (%) |
|----------|--------|----------------------------------|----------------------------------------------------------|--------------|
| OpenLLM | arc_challenge | 72.95 | 71.59 | 98.13 |
| OpenLLM | gsm8k | 90.37 | 88.25 | 97.65 |
| OpenLLM | hellaswag | 87.94 | 86.80 | 98.70 |
| OpenLLM | mmlu | 87.12 | 86.22 | 98.97 |
| OpenLLM | truthfulqa_mc2 | 63.31 | 62.37 | 98.52 |
| OpenLLM | winogrande | 81.93 | 80.43 | 98.17 |
| OpenLLM | **Average** | **80.60** | **79.28** | **98.35** |
| Vision | mmmu_val | 63.56 | 62.11 | 97.71 |
| Vision | chartqa | 90.52 | 89.00 | 98.32 |
| Vision | **Average** | **77.04** | **75.56** | **98.08** |
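Recovery in the table above is the quantized score expressed as a percentage of the baseline score; small differences from the reported values can come from rounding of the displayed scores. A minimal illustration using the arc_challenge row:

```python
# Recovery = 100 * quantized_score / baseline_score
baseline = 72.95   # Qwen/Qwen3-VL-235B-A22B-Instruct, arc_challenge
quantized = 71.59  # this NVFP4 checkpoint, arc_challenge
print(f"recovery: {100 * quantized / baseline:.2f}%")  # close to the 98.13 reported above
```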
### Reproduction

The results were obtained using the following commands:
#### OpenLLM
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### Vision
```
python3 -m lmms_eval \
  --model vllm \
  --model_args model=RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4,tensor_parallel_size=4,max_model_len=20000 \
  --tasks chartqa,mmmu_val \
  --batch_size 1
```
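#### Serving sketch

As noted in the Deployment section, vLLM also exposes an OpenAI-compatible server. The example below is a minimal sketch rather than an official serving recipe: the `--tensor-parallel-size` value mirrors the vision evaluation command above and should be adapted to the available GPUs, and the port and `api_key` placeholder assume vLLM's defaults.

```python
# Start the server first (shell):
#   vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 --tensor-parallel-size 4
from openai import OpenAI

# vLLM serves an OpenAI-compatible API at /v1 on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```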