RedHatAI/Llama-3.3-70B-Instruct-FP8-block

 ---
 license: apache-2.0
+pipeline_tag: text-generation
+tags:
+- fp8
+- quantized
+- llm-compressor
+- compressed-tensors
+- red hat
+base_model:
+- meta-llama/Llama-3.3-70B-Instruct
 ---
+# Llama-3.3-70B-Instruct-FP8-block
+## Model Overview
+- **Model Architecture:** LlamaForCausalLM
+  - **Input:** Text
+  - **Output:** Text
+- **Model Optimizations:**
+  - **Weight quantization:** FP8
+  - **Activation quantization:** FP8
+- **Release Date:**
+- **Version:** 1.0
+- **Model Developers:** Red Hat
+Quantized version of [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
+### Model Optimizations
+This model was obtained by quantizing the weights and activations of [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to FP8 data type.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
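The 50% figure follows from simple arithmetic on the bit width. The sketch below illustrates it, assuming roughly 70 billion parameters (a round number taken from the model name, not a measured count). The helper `weight_memory_gib` is hypothetical and introduced only for this example. It ignores quantization scales, the unquantized `lm_head` and embeddings, and runtime memory such as the KV cache, so real savings will be somewhat below 50%.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: about 70e9 parameters, a round number taken from the model name.
NUM_PARAMS = 70e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # original 16-bit weights
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)    # FP8 weights
reduction = 1 - fp8_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:    {fp8_gib:.1f} GiB")
print(f"Reduction:      {reduction:.0%}")
```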
+## Deployment
+### Use with vLLM
+1. Initialize vLLM server:
+```
+vllm serve nm-testing/Llama-3.3-70B-Instruct-FP8-block --tensor_parallel_size 4
+```
+2. Send requests to the server:
+```python
+from openai import OpenAI
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://<your-server-host>:8000/v1"
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+model = "nm-testing/Llama-3.3-70B-Instruct-FP8-block"
+# Text-only chat request.
+messages = [
+    {"role": "user", "content": "Explain FP8 quantization in two sentences."},
+]
+outputs = client.chat.completions.create(
+    model=model,
+    messages=messages,
+)
+generated_text = outputs.choices[0].message.content
+print(generated_text)
+```
+## Creation
+This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.
+<details>
+  <summary>Creation details</summary>
+```python
+from transformers import AutoProcessor, LlamaForCausalLM
+from llmcompressor import oneshot
+from llmcompressor.modeling import replace_modules_for_calibration
+from llmcompressor.modifiers.quantization import QuantizationModifier
+MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
+# Load model.
+model = LlamaForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+processor = AutoProcessor.from_pretrained(MODEL_ID)
+model = replace_modules_for_calibration(model)
+# Configure the quantization algorithm and scheme.
+# In this case, we:
+#   * quantize the weights to fp8 with per-block quantization
+#   * quantize the activations to fp8 with dynamic token activations
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="FP8_BLOCK",
+    ignore=["lm_head"],
+)
+# Apply quantization.
+oneshot(model=model, recipe=recipe)
+# Save to disk in compressed-tensors format.
+SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
+model.save_pretrained(SAVE_DIR)
+processor.save_pretrained(SAVE_DIR)
+```
+</details>
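To give an intuition for the per-block weight quantization named in the recipe, here is a minimal pure-Python sketch of how one scale per tile can be derived. It is not llm-compressor's implementation. It assumes the FP8 E4M3 format (largest finite magnitude 448) and uses a toy 2x2 tile size for readability; the tile size used by the `FP8_BLOCK` scheme is not stated here. The helpers `block_scales` and `scale_to_fp8_range` are hypothetical names, and rounding to actual FP8 values is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def block_scales(weight, block_rows, block_cols):
    """Compute one scale per (block_rows x block_cols) tile of a 2-D weight."""
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            abs_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            # Map the tile's largest magnitude onto the FP8 range.
            # An all-zero tile gets a neutral scale of 1.0.
            row_scales.append(abs_max / FP8_E4M3_MAX if abs_max > 0 else 1.0)
        scales.append(row_scales)
    return scales


def scale_to_fp8_range(weight, scales, block_rows, block_cols):
    """Divide each element by its tile's scale and clamp to the FP8 range."""
    return [
        [
            max(
                -FP8_E4M3_MAX,
                min(FP8_E4M3_MAX, w / scales[r // block_rows][c // block_cols]),
            )
            for c, w in enumerate(row)
        ]
        for r, row in enumerate(weight)
    ]


# Toy 4x4 weight split into 2x2 tiles (illustrative values only).
weight = [
    [0.5, -1.0, 10.0, 2.0],
    [0.25, 0.75, -20.0, 4.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, -1.0, 0.5],
]
scales = block_scales(weight, 2, 2)
scaled = scale_to_fp8_range(weight, scales, 2, 2)
print(scales)
```

Because each tile carries its own scale, an outlier such as `-20.0` only reduces the resolution of its own tile rather than of the whole matrix.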
+## Evaluation
+The model was evaluated on the OpenLLMv1 leaderboard tasks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/huggingface/lighteval).
+[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.
+<details>
+  <summary>Evaluation details</summary>
+  **lm-evaluation-harness**
+  ```
+  lm_eval \
+    --model vllm \
+    --model_args pretrained="nm-testing/Llama-3.3-70B-Instruct-FP8-block",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
+    --tasks openllm \
+    --write_out \
+    --batch_size auto \
+    --output_path output_dir \
+    --show_config
+  ```
+  **lighteval**
+  lighteval_model_arguments.yaml
+  ```yaml
+  model_parameters:
+    model_name: nm-testing/Llama-3.3-70B-Instruct-FP8-block
+    dtype: auto
+    gpu_memory_utilization: 0.9
+    generation_parameters:
+      temperature: 0.6
+      min_p: 0.0
+      top_p: 0.95
+      top_k: 20
+      max_new_tokens: 32768
+  ```
+  ```
+  lighteval vllm \
+    --model_args lighteval_model_arguments.yaml \
+    --tasks "lighteval|aime25|0"
+  ```
+</details>
 ### Accuracy
 <table>
   <thead>