MS MARCO MiniLM-L6-v2 ONNX QInt8 for CPU

This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-6-v2.

It is intended for CPU inference with ONNX Runtime and bundles the tokenizer and config files needed to score query-document pairs, without shipping the original full-precision weights.

Base model

  • cross-encoder/ms-marco-MiniLM-L-6-v2

Quantization

  • Runtime: ONNX Runtime 1.22.1
  • Method: dynamic quantization
  • Weight type: QInt8
  • Intended provider: CPUExecutionProvider

Files

  • onnx/model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt

Benchmark

Benchmarked locally on an Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

20 documents per request

  variant               size (MB)   avg latency (ms)   p95 latency (ms)   throughput (req/s)   pairs/s
  upstream ONNX             86.80             293.09             402.45                 3.41     68.24
  upstream quint8_avx2      22.13             241.01             268.70                 4.15     82.98
  this QInt8 repo           22.11             210.12             248.74                 4.76     95.18

50 documents per request

  variant               size (MB)   avg latency (ms)   p95 latency (ms)   throughput (req/s)   pairs/s
  upstream ONNX             86.80             773.54             884.32                 1.29     64.64
  upstream quint8_avx2      22.13             644.78             721.12                 1.55     77.55
  this QInt8 repo           22.11             577.89             644.01                 1.73     86.52

Relative to the upstream ONNX release on this host:

  • 20 docs/request throughput: +39.6%
  • 50 docs/request throughput: +34.1%
  • artifact size: -74.5%
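
These deltas follow directly from the benchmark tables and can be recomputed:

```python
# Recompute the relative deltas quoted above from the benchmark tables.
def pct_delta(new, old):
    """Percentage change from old to new, rounded to one decimal."""
    return round((new - old) / old * 100, 1)

tp20 = pct_delta(4.76, 3.41)    # throughput, 20 docs/request -> 39.6
tp50 = pct_delta(1.73, 1.29)    # throughput, 50 docs/request -> 34.1
size = pct_delta(22.11, 86.80)  # artifact size -> -74.5

print(tp20, tp50, size)  # → 39.6 34.1 -74.5
```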

Validation notes

  • Top-5 overlap with the upstream ONNX ranking on a small synthetic benchmark:
    • 20 docs/query: 4/5
    • 50 docs/query: 2/5
  • Ranking quality should be validated on your own corpus before replacing the upstream model.
  • This repository is CPU-oriented. GPU users should benchmark the non-quantized ONNX release separately.
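
The top-5 overlap metric above can be sketched as follows: score the same documents with both model variants, take each variant's top-k indices, and count the shared ones. The score lists below are hypothetical, not taken from the actual benchmark:

```python
# Minimal sketch of a top-k overlap check between two score lists.
def top_k_overlap(scores_a, scores_b, k=5):
    """Number of document indices shared by the two top-k rankings."""
    top_a = {i for i, _ in sorted(enumerate(scores_a), key=lambda p: p[1], reverse=True)[:k]}
    top_b = {i for i, _ in sorted(enumerate(scores_b), key=lambda p: p[1], reverse=True)[:k]}
    return len(top_a & top_b)

# Hypothetical relevance scores for 8 documents from two model variants:
upstream = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.3, 0.5]
quantized = [0.85, 0.15, 0.6, 0.75, 0.1, 0.9, 0.4, 0.3]
print(top_k_overlap(upstream, quantized))  # → 4
```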

Usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-6-v2-onnx-cpu-qint8"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Score a single query-document pair; a higher score means more relevant.
query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."
encoded = tokenizer(
    [(query, document)],
    return_tensors="np",
    truncation=True,
    padding=True,
    max_length=128,
)
# Keep only the tensors the ONNX graph actually declares as inputs.
session_input_names = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in session_input_names}
scores = session.run(None, inputs)[0]
print(scores)
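
The model emits a raw relevance logit per pair. If you want a bounded score for thresholding (an assumption about your use case, not something this repo requires), a sigmoid maps the logit to (0, 1):

```python
# Map a raw cross-encoder logit to (0, 1) with a sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(2.0), 4))  # → 0.8808
```

Note that for pure ranking the sigmoid is unnecessary, since it is monotonic and preserves the order of scores.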