# GPT-OSS 20B ONNX (RTN Quantized for CPU)
This model is an optimized version of gpt-oss-20b, quantized with RTN (Round-to-Nearest) to enable local inference on CPUs.
## Model Summary
- Developed by: Microsoft
- Model Type: ONNX
- License: Apache-2.0
- Optimization: RTN Quantization for efficient CPU memory usage
- Target Hardware: Optimized for local inference on CPUs
## Technical Description
This repository contains an ONNX conversion of the gpt-oss-20b model tailored for local inference on CPUs. RTN quantization substantially reduces the model's RAM footprint while preserving the core capabilities of the base architecture, making it suitable for environments where GPU acceleration is not available.
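To illustrate the idea behind RTN: each weight is simply scaled and rounded to the nearest representable integer, with no calibration data needed. The sketch below assumes 4-bit symmetric per-row quantization for illustration; the actual bit width, group size, and scheme used in this conversion are not specified here.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Round-to-nearest quantization sketch: symmetric, per-row scales.
    (Illustrative only; the real model's RTN config may differ.)"""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q, scale):
    # Recover approximate float weights from integers + per-row scales.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = rtn_quantize(w)
w_hat = rtn_dequantize(q, scale)
print(np.max(np.abs(w - w_hat)))  # small rounding error, bounded by scale/2
```

Storing 4-bit integers plus one scale per row is what drives the memory savings relative to 16- or 32-bit floats, at the cost of the rounding error shown above.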
## Base Model Information
For detailed information regarding the architecture, training data, and intended use cases, please refer to the original gpt-oss-20b model on Azure AI Foundry.
## Deployment
Compatible with ONNX Runtime (ORT) using the default CPU Execution Provider (`CPUExecutionProvider`).