Llama-3.3-70B-FP8-Instruct-Neuron / README.md

Update README.md

bce8528 verified about 1 month ago

7.07 kB

	---
	license: llama3.3
	base_model:
	- meta-llama/Llama-3.3-70B-Instruct
	tags:
	- Neuron
	- Inferentia2
	- AWS
	- text-generation
	- fp8
	- quantized
	- vllm
	pipeline_tag: text-generation
	language:
	- en
	---

	# Llama-3.3-70B-FP8-Neuron

	This is an FP8-quantized version of Meta's Llama 3.3 70B model, specifically optimized for efficient inference on AWS Neuron accelerators (Inferentia2). The model has been compiled and quantized using AWS Neuron SDK to leverage the specialized AI acceleration capabilities of AWS Neuron chips.

	## Model Details

	### Model Description

	This model is a deployment-optimized version of Llama 3.3 70B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
	For better performance set Tp_degree=24 on Inf2.48xlarge [Total Token Throughput = ~600 tokens/sec]

	### Key Features

	* Reduced memory footprint through FP8 quantization (~50% reduction from FP16)
	* Optimized for AWS Inferentia2 instances
	* Pre-compiled for tensor parallelism across 2 NeuronCores
	* Maintains instruction-following capabilities of the base model
	* Cost-effective LLM serving with improved throughput

	### Model Specifications

	\| Specification \| Value \|
	\|--------------\|-------\|
	\| Base Model [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) \|
	\| Quantization \| FP8 E4M3 (IEEE-754 FP8_EXP4 format) \|
	\| Optimization Target \| [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) NeuronCores \|
	\| Tensor Parallelism Degree \| 24 \|
	\| Recommended Hardware \| AWS inf2.48xlarge \|
	\| Max Sequence Length \| 8192 tokens \|
	\| Developed by \| [Fraser Sequeira](https://www.linkedin.com/in/fraser-sequeira) \|

	## Quick Start

	### Prerequisites

	1. Launch an inf2.48xlarge Ubuntu EC2 instance on AWS
	2. Select the 'Deep Learning AMI Neuron (Ubuntu 22.04)' AMI

	### Installation & Setup

	#### 1. Launch Docker Container

	```bash
	docker run \
	-it \
	--device=/dev/neuron0 \
	--device=/dev/neuron1 \
	--device=/dev/neuron2 \
	--device=/dev/neuron3 \
	--device=/dev/neuron4 \
	--device=/dev/neuron5 \
	--device=/dev/neuron6 \
	--device=/dev/neuron7 \
	--device=/dev/neuron8 \
	--device=/dev/neuron9 \
	--device=/dev/neuron10 \
	--device=/dev/neuron11 \
	--cap-add SYS_ADMIN \
	--cap-add IPC_LOCK \
	-p 8080:8080 \
	--name llama3-3-70B \
	public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
	bash
	```

	2. Install Dependencies
	# Install required dependencies
	```bash
	pip install -U "huggingface_hub[cli]==0.36.0"
	```
	# Optional dependencies for benchmarking
	```bash
	pip install pandas datasets
	```

	3. Configure Hugging Face Access
	```bash
	export HF_TOKEN=<your-huggingface-token>
	```
	4. Download the Model
	```bash
	hf download fraseque/Llama-3.3-70B-FP8-Instruct-Neuron
	```

	5. Set Model Path
	The model is typically saved to:
	```bash
	/root/.cache/huggingface/hub/models--fraseque--llama-3.3-70B-FP8-Instruct-Neuron/snapshots/{{uuid}}
	```
	- Replace {{uuid}} with the actual snapshot ID
	```bash
	export MODEL_PATH=/root/.cache/huggingface/hub/models--fraseque--llama-3.3-70B-FP8-Instruct-Neuron/snapshots/{{uuid}}
	```

	7. Serve the Model
	Note: First time Compilation takes 20-30 minutes. You can set the NEURON_COMPILED_ARTIFACTS variable to skip compilation the second time.
	```bash
	VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
	--model "$MODEL_PATH" \
	--device "neuron" \
	--tensor-parallel-size 24 \
	--max-num-seqs 16 \
	--max-model-len 8192 \
	--port 8080 \
	--override-neuron-config "{\"enable_bucketing\": true, \"context_encoding_buckets\": [128,512,1024,2048,4096,8192], \"token_generation_buckets\": [128,512,1024,2048,4096,8192], \"max_context_length\": 8192, \"use-v2-block-manager\": true, \"seq_len\": 8192, \"quantization_dtype\":\"f8e4m3\", \"quantization_type\": \"per_channel_symmetric\", \"quantized_checkpoints_path\":\"$MODEL_PATH\", \"quantized\": true, \"batch_size\": 1, \"ctx_batch_size\": 1, \"tkg_batch_size\": 1, \"attn_kernel_enabled\": true, \"is_continuous_batching\": true}"

	```
	Making Inference Requests
	Once the server is running on Port 8080, you can make requests as follows:
	_Open another terminal and fire the below CURL request_

	```bash
	curl http://localhost:8080/v1/completions \
	-H "Content-Type: application/json" \
	-d '{
	"prompt": "<\|system\|>You are a helpful AI assistant.<\|user\|>What is the capital of France?<\|assistant\|>",
	"max_tokens": 100,
	"temperature": 0.1,
	"top_p": 0.9,
	"stop": ["<\|system\|>", "<\|user\|>", "<\|assistant\|>", "<\|end\|>", "\n\n"]
	}'
	```


	Benchmarking Performance
	_Open another terminal set the MODEL_PATH and fire the below benchmark command_
	```bash
	cd /opt/vllm/benchmarks
	python3 benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8080 --dataset-name=random --model $MODEL_PATH --num-prompts 20 --max-concurrency 5 --request-rate inf --random-input-len 4000 --random-output-len 500 --seed 12345
	```


	Quantization Details
	Quantization Format FP8 E4M3 (8-bit floating point)
	Quantization Type Per-channel symmetric
	Tensor Parallelism (TP) 24
	Target Accelerator AWS Inferentia2
	Instance Type inf2.48xlarge
	Sequence Length 8192 tokens
	Use Cases
	Intended Use
	This model is optimized for:

	✅ Production inference deployments on AWS Inferentia2 instances
	✅ Cost-effective LLM serving with reduced computational requirements
	✅ Conversational AI applications requiring instruction-following
	✅ Text generation tasks (Q&A, summarization, creative writing)
	✅ Low-latency inference requirements

	Benefits of FP8 Quantization
	~50% memory reduction compared to FP16
	Improved throughput on Neuron accelerators
	Lower inference costs on AWS infrastructure
	Maintained accuracy with minimal degradation

	Out-of-Scope Use
	This model is NOT suitable for:
	❌ Deployment on non-Neuron hardware (GPUs, CPUs) without recompilation

	Limitations and Considerations
	Quantization artifacts: FP8 quantization may introduce minor accuracy degradation compared to full-precision models
	Hardware dependency: Compiled specifically for Neuron devices; requires recompilation for other hardware
	Max Sequence Length 8192 tokens

	Citation
	@misc{llama32-1b-fp8-neuron,
	author = {Sequeira, Fraser},
	title = {Llama-3.3-70B-FP8-Instruct-Neuron},
	year = {2025},
	publisher = {HuggingFace},
	howpublished = {\url{https://huggingface.co/fraseque/llama-3.3-70B-FP8-Instruct-Neuron}}
	}

	Model Card Authors
	- Fraser Sequeira

	Acknowledgments
	Base model: Meta's Llama 3.3 70B
	Quantization and compilation: AWS Neuron SDK [NEURONX_DISTRIBUTED_INFERENCE]
	Inference framework: vLLM with Neuron support
	License
	This model inherits the Llama 3.3 license from Meta. Please refer to the official license for terms and conditions.

	---