MS MARCO MiniLM-L6-v2 ONNX QInt8 for CPU

This repository contains a QInt8 ONNX Runtime derivative of cross-encoder/ms-marco-MiniLM-L-6-v2.

It is intended for CPU inference with ONNX Runtime and bundles the tokenizer and config files needed to score query-document pairs, without shipping the original full-precision weights.

Base model

  • cross-encoder/ms-marco-MiniLM-L-6-v2

Quantization

  • Runtime: ONNX Runtime 1.22.1
  • Method: dynamic quantization
  • Weight type: QInt8
  • Intended provider: CPUExecutionProvider

Files

  • onnx/model.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt

Benchmark

Benchmarked locally on an Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

20 documents per request

  variant               size (MB)   avg latency (ms)   p95 latency (ms)   throughput (req/s)   pairs/s
  upstream ONNX             86.80             293.09             402.45                 3.41     68.24
  upstream quint8_avx2      22.13             241.01             268.70                 4.15     82.98
  this QInt8 repo           22.11             210.12             248.74                 4.76     95.18

50 documents per request

  variant               size (MB)   avg latency (ms)   p95 latency (ms)   throughput (req/s)   pairs/s
  upstream ONNX             86.80             773.54             884.32                 1.29     64.64
  upstream quint8_avx2      22.13             644.78             721.12                 1.55     77.55
  this QInt8 repo           22.11             577.89             644.01                 1.73     86.52

Relative to the upstream ONNX release on this host:

  • 20 docs/request throughput: +39.6%
  • 50 docs/request throughput: +34.1%
  • artifact size: -74.5%
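
These deltas follow directly from the benchmark tables and can be recomputed:

```python
# Recompute the relative deltas quoted above from the benchmark tables.
def pct_delta(new, old):
    """Percentage change from old to new, rounded to one decimal."""
    return round((new - old) / old * 100, 1)

tp20 = pct_delta(4.76, 3.41)    # throughput, 20 docs/request -> 39.6
tp50 = pct_delta(1.73, 1.29)    # throughput, 50 docs/request -> 34.1
size = pct_delta(22.11, 86.80)  # artifact size -> -74.5

print(tp20, tp50, size)  # → 39.6 34.1 -74.5
```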

Validation notes

  • Top-5 overlap with the upstream ONNX ranking on a small synthetic benchmark:
    • 20 docs/query: 4/5
    • 50 docs/query: 2/5
  • Ranking quality should be validated on your own corpus before replacing the upstream model.
  • This repository is CPU-oriented. GPU users should benchmark the non-quantized ONNX release separately.
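
The top-5 overlap metric above can be sketched as follows: score the same documents with both model variants, take each variant's top-k indices, and count the shared ones. The score lists below are hypothetical, not taken from the actual benchmark:

```python
# Minimal sketch of a top-k overlap check between two score lists.
def top_k_overlap(scores_a, scores_b, k=5):
    """Number of document indices shared by the two top-k rankings."""
    top_a = {i for i, _ in sorted(enumerate(scores_a), key=lambda p: p[1], reverse=True)[:k]}
    top_b = {i for i, _ in sorted(enumerate(scores_b), key=lambda p: p[1], reverse=True)[:k]}
    return len(top_a & top_b)

# Hypothetical relevance scores for 8 documents from two model variants:
upstream = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.3, 0.5]
quantized = [0.85, 0.15, 0.6, 0.75, 0.1, 0.9, 0.4, 0.3]
print(top_k_overlap(upstream, quantized))  # → 4
```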

Usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "temsa/ms-marco-MiniLM-L-6-v2-onnx-cpu-qint8"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Score a single query-document pair; a higher score means more relevant.
query = "how do I renew my driving licence in ireland"
document = "You can renew your driving licence online if you meet the identity requirements."
encoded = tokenizer(
    [(query, document)],
    return_tensors="np",
    truncation=True,
    padding=True,
    max_length=128,
)
# Keep only the tensors the ONNX graph actually declares as inputs.
session_input_names = {inp.name for inp in session.get_inputs()}
inputs = {name: value.astype(np.int64) for name, value in encoded.items() if name in session_input_names}
scores = session.run(None, inputs)[0]
print(scores)
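
The model emits a raw relevance logit per pair. If you want a bounded score for thresholding (an assumption about your use case, not something this repo requires), a sigmoid maps the logit to (0, 1):

```python
# Map a raw cross-encoder logit to (0, 1) with a sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(2.0), 4))  # → 0.8808
```

Note that for pure ranking the sigmoid is unnecessary, since it is monotonic and preserves the order of scores.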