---
license: apache-2.0
pipeline_tag: text-generation
tags:
- fp8
- quantized
- llm-compressor
- compressed-tensors
- red hat
base_model:
- meta-llama/Llama-3.3-70B-Instruct
---


# Llama-3.3-70B-Instruct-FP8-block

## Model Overview
- **Model Architecture:** LlamaForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to the FP8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
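
As a rough back-of-the-envelope illustration of that ~50% figure (a sketch added here, not part of the original card; it uses an approximate parameter count and ignores quantization scales and the unquantized layers):

```python
# Approximate weight-memory footprint for a ~70B-parameter model.
NUM_PARAMS = 70e9  # rough parameter count for Llama-3.3-70B

bf16_gb = NUM_PARAMS * 2 / 1e9  # 16-bit weights: 2 bytes per parameter
fp8_gb = NUM_PARAMS * 1 / 1e9   # FP8 weights: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%} smaller)")
```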

## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve nm-testing/Llama-3.3-70B-Instruct-FP8-block --tensor_parallel_size 4
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "nm-testing/Llama-3.3-70B-Instruct-FP8-block"

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
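
The same endpoint also supports token streaming through the standard OpenAI API. A minimal sketch, reusing the `client`, `model`, and `messages` objects defined above (not part of the original card):

```python
# Stream the completion token-by-token instead of waiting for the full response.
stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```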

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
<summary>Creation details</summary>

```python
from transformers import AutoProcessor, LlamaForCausalLM

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"

# Load model.
model = LlamaForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-block quantization
#   * quantize the activations to fp8 with dynamic quantization (scales computed at runtime)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
</details>
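
As a quick sanity check (not part of the original recipe), the saved checkpoint can be inspected to confirm that a compressed-tensors `quantization_config` was written into `config.json`:

```python
import json
import os

# Directory produced by the creation script above.
SAVE_DIR = "Llama-3.3-70B-Instruct-FP8-block"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

# compressed-tensors records the quantization scheme (targets, ignore list,
# weight/activation settings) under this key.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```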

## Evaluation

The model was evaluated on the OpenLLM v1 leaderboard tasks using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/huggingface/lighteval).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

**lm-evaluation-harness**
```
lm_eval \
  --model vllm \
  --model_args pretrained="nm-testing/Llama-3.3-70B-Instruct-FP8-block",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=4,gpu_memory_utilization=0.8,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

**lighteval**

lighteval_model_arguments.yaml
```yaml
model_parameters:
  model_name: nm-testing/Llama-3.3-70B-Instruct-FP8-block
  dtype: auto
  gpu_memory_utilization: 0.9
  generation_parameters:
    temperature: 0.6
    min_p: 0.0
    top_p: 0.95
    top_k: 20
    max_new_tokens: 32768
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0"
```

</details>

### Accuracy
<table>
<thead>