# MS MARCO MiniLM-L6-v2 ONNX QInt8 for CPU
This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-6-v2.
It is intended for CPU inference with ONNX Runtime and ships the tokenizer and configuration files needed to score query-document pairs, without the original full-precision weights.
## Base model

- Upstream model: cross-encoder/ms-marco-MiniLM-L-6-v2
- Upstream revision: c5ee24cb16019beea0893ab7796b1df96625c6b8
- Upstream license: Apache-2.0
## Quantization

- Runtime: ONNX Runtime 1.22.1
- Method: dynamic quantization
- Weight type: QInt8
- Intended provider: CPUExecutionProvider
## Files

- onnx/model.onnx
- config.json
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- vocab.txt
## Benchmark

Benchmarked locally on an Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

### 20 documents per request
| variant | size (MB) | avg latency (ms) | p95 latency (ms) | throughput (req/s) | pairs/s |
|---|---|---|---|---|---|
| upstream ONNX | 86.80 | 293.09 | 402.45 | 3.41 | 68.24 |
| upstream quint8_avx2 | 22.13 | 241.01 | 268.70 | 4.15 | 82.98 |
| this QInt8 repo | 22.11 | 210.12 | 248.74 | 4.76 | 95.18 |
### 50 documents per request
| variant | size (MB) | avg latency (ms) | p95 latency (ms) | throughput (req/s) | pairs/s |
|---|---|---|---|---|---|
| upstream ONNX | 86.80 | 773.54 | 884.32 | 1.29 | 64.64 |
| upstream quint8_avx2 | 22.13 | 644.78 | 721.12 | 1.55 | 77.55 |
| this QInt8 repo | 22.11 | 577.89 | 644.01 | 1.73 | 86.52 |
Relative to the upstream ONNX release on this host:
- 20 docs/request throughput: +39.6%
- 50 docs/request throughput: +34.1%
- artifact size: -74.5%
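These relative figures follow directly from the tables above; a quick arithmetic check:

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Values taken from the benchmark tables above.
print(round(pct_change(4.76, 3.41), 1))    # 20 docs/request throughput: 39.6
print(round(pct_change(1.73, 1.29), 1))    # 50 docs/request throughput: 34.1
print(round(pct_change(22.11, 86.80), 1))  # artifact size: -74.5
```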
## Validation notes

- Top-5 overlap with the upstream ONNX ranking on a small synthetic benchmark:
  - 20 docs/query: 4/5
  - 50 docs/query: 2/5
- Ranking quality should be validated on your own corpus before replacing the upstream model.
- This repository is CPU-oriented. GPU users should benchmark the non-quantized ONNX release separately.
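One way to run the same sanity check on your own corpus is to compare the top-k document indices produced by the two models for each query. A minimal sketch, assuming you have already collected per-document scores from both models (the helper name and example scores are illustrative):

```python
def top_k_overlap(scores_a, scores_b, k=5):
    """Count how many of the top-k documents (by score) two rankings share."""
    top_a = set(sorted(range(len(scores_a)), key=lambda i: scores_a[i], reverse=True)[:k])
    top_b = set(sorted(range(len(scores_b)), key=lambda i: scores_b[i], reverse=True)[:k])
    return len(top_a & top_b)


# Example: two rankings that agree on 4 of their top-5 documents.
upstream_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1]
quantized_scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.5]
print(top_k_overlap(upstream_scores, quantized_scores))  # 4
```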
## Usage
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-6-v2-onnx-cpu-qint8"

# The tokenizer files live alongside the ONNX model in this repository.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."

# Cross-encoders score each (query, document) pair jointly in one forward pass.
encoded = tokenizer([[query, document]], return_tensors="np", truncation=True, padding=True, max_length=128)

# Keep only the inputs the ONNX graph actually declares, cast to int64.
model_input_names = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in model_input_names}

# One relevance logit per pair; higher means more relevant.
scores = session.run(None, inputs)[0]
print(scores)
```
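The model emits unnormalized logits. If you want a 0-1 relevance score or a ranking over several candidate documents, a small post-processing step suffices; a sketch with illustrative scores (the helper name and example values are assumptions, not part of this repository):

```python
import numpy as np


def rank_documents(documents, logits):
    """Sort documents by cross-encoder logit (descending) and attach a
    sigmoid-squashed relevance score in [0, 1]."""
    logits = np.asarray(logits, dtype=np.float64).reshape(-1)
    probs = 1.0 / (1.0 + np.exp(-logits))
    order = np.argsort(-logits)
    return [(documents[i], float(probs[i])) for i in order]


docs = ["renewal requirements", "unrelated cooking tips"]
ranked = rank_documents(docs, [2.0, -4.0])
print(ranked[0][0])  # "renewal requirements"
```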