# SayText — Multilingual Text Normalization for TTS

A multilingual neural text normalization model for TTS (text-to-speech) pipelines. It converts written text to spoken form across 8 European languages using a fine-tuned ByT5-Base (580M parameters).

> "Das kostet 12,50 €." → "Das kostet zwölf Euro fünfzig."
## Key Features
- 8 languages: German, English, French, Italian, Portuguese, Spanish, Turkish, Swedish
- 24 semiotic classes: cardinals, money, dates, time, phone numbers, percentages, units, passthrough, and more
- Passthrough-aware: learns when NOT to normalize (plain text, already-spoken forms, technical identifiers)
- Voice-agent optimized: designed for LLM → TN → TTS pipelines where input is always well-formatted
- Byte-level: ByT5 processes raw UTF-8 bytes — no tokenizer vocabulary limitations; handles `€`, `₺`, `°C` natively
## Quick Start

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained("smaoai/saytext")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model.eval()

def normalize(text: str, language: str) -> str:
    """Normalize text for TTS. Language: de, en, fr, it, pt, es, tr, sv"""
    input_text = f"<{language}> {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512, num_beams=1)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Examples
print(normalize("Das kostet 12,50 €.", "de"))
# → Das kostet zwölf Euro fünfzig.

print(normalize("The flight departs at 8:45 AM.", "en"))
# → The flight departs at eight forty-five AM.

print(normalize("Le prix est de 327,67 €.", "fr"))
# → Le prix est de trois cent vingt-sept euros et soixante-sept centimes.

print(normalize("Ich helfe Ihnen gerne weiter.", "de"))
# → Ich helfe Ihnen gerne weiter. (passthrough — no normalization needed)

print(normalize("Wir verwenden Python 3.10 in unserem Projekt.", "de"))
# → Wir verwenden Python 3.10 in unserem Projekt. (technical identifier preserved)
```
## Available Formats

This repo includes the model in three formats:

| Format | Path | Size | Use case |
|---|---|---|---|
| PyTorch (default) | `model.safetensors` | 2.2 GB | Development, fine-tuning, HuggingFace `pipeline()` |
| CTranslate2 FP16 | `ct2_float16/` | 1.1 GB | GPU production inference |
| CTranslate2 INT8 | `ct2_int8/` | 556 MB | CPU production inference, edge deployment |
```
repo/
  model.safetensors        # PyTorch weights
  config.json              # Model architecture
  tokenizer_config.json    # Tokenizer config
  added_tokens.json        # Additional tokens
  generation_config.json   # Default generation parameters
  handler.py               # HuggingFace Inference API handler
  ct2_float16/             # CTranslate2 FP16 (GPU)
    model.bin
    config.json
    shared_vocabulary.json
  ct2_int8/                # CTranslate2 INT8 (CPU)
    model.bin
    config.json
    shared_vocabulary.json
```
## Model Description

### Why ByT5?

Text normalization deals with symbols, digits, and special characters (`€12,50`, `+49(0)30`, `info@web.de`) that subword tokenizers fragment unpredictably. ByT5 operates on raw UTF-8 bytes — every character is processed individually with no tokenizer artifacts. This is critical for:

- Locale-sensitive formats: `1.500` means "one thousand five hundred" in German but "one point five" in English
- Special symbols: `€`, `£`, `₺`, `%`, `°C` are processed byte-by-byte, not as fragmented subwords
- Phone numbers: `+49 (0)30 12345678` stays intact byte-by-byte
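A known detail of the ByT5 tokenizer is that its vocabulary is essentially the 256 byte values plus a few reserved ids: each UTF-8 byte `b` maps to token id `b + 3` (ids 0–2 are special tokens). A minimal sketch of this mapping, without loading the model (`byt5_byte_ids` is our illustrative helper, not part of the library):

```python
def byt5_byte_ids(text: str) -> list[int]:
    # ByT5 reserves ids 0-2 for special tokens; each UTF-8 byte b becomes id b + 3
    return [b + 3 for b in text.encode("utf-8")]

print(byt5_byte_ids("A"))             # → [68]
print(len(byt5_byte_ids("€")))        # → 3 (the euro sign is three UTF-8 bytes)
print(len(byt5_byte_ids("12,50 €")))  # → 9
```

Because every input resolves to bytes this way, there is no out-of-vocabulary symbol the model cannot represent.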
### Architecture
| Component | Details |
|---|---|
| Base model | google/byt5-base |
| Parameters | 580M |
| Type | Encoder-decoder (seq2seq) |
| Tokenization | Byte-level UTF-8 (no SentencePiece) |
| Max input length | 512 bytes (~250 characters) |
| Max output length | 512 bytes |
| Language conditioning | Prefix token: `<de>`, `<en>`, `<fr>`, etc. |
### How It Works

The model takes text with a language prefix and outputs the spoken form:

```
Input:  <de> Am 03.04.2026 um 14:30 kostet der Flug 249,99 €.
Output: Am dritte April zweitausendsechsundzwanzig um vierzehn Uhr dreißig
        kostet der Flug zweihundertneunundvierzig Euro neunundneunzig.
```

For text that doesn't need normalization, the model learns to pass it through unchanged:

```
Input:  <en> Sure, I can help you with that.
Output: Sure, I can help you with that.
```
## Training

The model was trained on 3M+ pairs across 8 languages using a custom two-layer data pipeline (deterministic entity sampling plus LLM-powered sentence generation), enriched with real-world data:

1. **Entity Sampler (deterministic)** — generates verified (written, spoken) pairs per semiotic class using locale-aware libraries. Every pair is programmatically verified correct.
2. **Sentence Generator (LLM-powered)** — natural sentence templates with placeholders, filled with entity sampler pairs at assembly time.
3. **Real-world enrichment** — additional real-world text sources, auto-labeled and validated, then merged with the synthetic data.

The training data covers 24 semiotic classes, including passthrough (unchanged text), already-normalized forms, and technical identifiers that should not be normalized.
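The sampler-plus-template assembly described above can be illustrated with a toy version. This sketch uses a hard-coded German cardinal lookup in place of the locale-aware libraries the pipeline actually relies on; the names `sample_cardinal_de` and `fill_template` are illustrative, not from the real pipeline:

```python
import random

# Toy lookup standing in for a locale-aware number-spelling library
DE_CARDINALS = {
    0: "null", 1: "eins", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf",
    6: "sechs", 7: "sieben", 8: "acht", 9: "neun", 10: "zehn",
}

def sample_cardinal_de(rng: random.Random) -> tuple[str, str]:
    """Return a verified (written, spoken) pair for the CARDINAL class."""
    n = rng.randint(0, 10)
    return str(n), DE_CARDINALS[n]

def fill_template(template: str, pair: tuple[str, str]) -> tuple[str, str]:
    """Assemble a (source, target) training example from a sentence template."""
    written, spoken = pair
    return template.format(written), template.format(spoken)

rng = random.Random(0)
src, tgt = fill_template("Es gibt {} Optionen.", sample_cardinal_de(rng))
print(src, "->", tgt)
```

Because the pair is generated from a verified lookup rather than labeled by a model, every assembled example is correct by construction, which is the point of the deterministic sampler layer.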
### Training Configuration
| Parameter | Value |
|---|---|
| Base model | google/byt5-base |
| Effective batch size | 128 |
| Learning rate | 3e-4 (cosine schedule) |
| Precision | bf16 |
| Hardware | NVIDIA A100 80GB |
| Best eval loss | 0.000897 |
| Framework | HuggingFace Transformers + PyTorch |
## Evaluation

94.2% sentence-level exact-match accuracy on a held-out test set (a 1,900-sentence sample stratified across all languages and classes).
| Language | Accuracy |
|---|---|
| Swedish | 96.7% |
| French | 95.4% |
| Turkish | 94.8% |
| Spanish | 94.2% |
| English | 93.8% |
| Italian | 93.3% |
| German | 92.9% |
| Portuguese | 92.6% |
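Sentence-level exact match is a strict metric: a prediction only counts if it equals the reference string exactly, so a single wrong word fails the whole sentence. A minimal sketch of how such a score is computed (the helper name is ours, not from the model's evaluation code):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference exactly."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Das kostet zwölf Euro fünfzig.", "Am dritte April."]
refs  = ["Das kostet zwölf Euro fünfzig.", "Am dritten April."]
print(exact_match_accuracy(preds, refs))  # → 0.5
```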
### Per-Class Accuracy (selected)
| Class | Accuracy |
|---|---|
| Cardinal | 97.5% |
| Money | 97.5% |
| Year | 97.5% |
| Passthrough (plain) | 100% |
| Don't normalize | 97.5% |
| Phone numbers | 90.0% |
| Multi-entity (MIXED) | 51.2% |
## Inference

### Input Format

The model expects a language prefix followed by the text:

```
<de> Das kostet 12,50 €.
<en> The price is $99.99.
<fr> Le prix est de 327,67 €.
```

The language prefix is required — it tells the model which normalization rules to apply.
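Since the prefix is required, a small guard against unsupported language codes can fail fast before inference; this helper is a sketch of our own, not part of the model's API:

```python
SUPPORTED_LANGUAGES = {"de", "en", "fr", "it", "pt", "es", "tr", "sv"}

def with_language_prefix(text: str, language: str) -> str:
    """Prepend the <lang> prefix the model expects, validating the code first."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    return f"<{language}> {text}"

print(with_language_prefix("Das kostet 12,50 €.", "de"))
# → <de> Das kostet 12,50 €.
```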
### CPU Inference (PyTorch)
| Metric | Value |
|---|---|
| Average latency | ~600ms per sentence |
| Passthrough (no entities) | ~300ms |
| Model size | 2.2 GB |
### CPU Inference (CTranslate2 INT8)
| Metric | Value |
|---|---|
| Average latency (single) | ~248ms |
| Average latency (batch of 10) | ~120ms per sentence |
| Model size | 556 MB |
```shell
# Convert to CTranslate2 INT8
pip install ctranslate2
ct2-transformers-converter --model smaoai/saytext \
    --output_dir ct2_int8 --quantization int8
```

```python
# Usage
import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("ct2_int8", device="cpu", intra_threads=4)
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

text = "<de> Das kostet 12,50 €."
# CTranslate2 expects token strings, not ids
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
result = translator.translate_batch([tokens], beam_size=1, max_decoding_length=512)
output_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
### GPU Inference
For production TTS pipelines, a small GPU is recommended for real-time latency:
| Setup | Single sentence | Batch of 8 |
|---|---|---|
| CTranslate2 FP16, GPU | ~15-50ms | ~8-15ms/sent |
| PyTorch FP32, GPU | ~280ms | ~63ms/sent |
## Supported Languages

| Code | Language | Example Input | Example Output |
|---|---|---|---|
| `de` | German | Das kostet 12,50 €. | Das kostet zwölf Euro fünfzig. |
| `en` | English | The flight departs at 8:45 AM. | The flight departs at eight forty-five AM. |
| `fr` | French | Le prix est de 327,67 €. | Le prix est de trois cent vingt-sept euros et soixante-sept centimes. |
| `it` | Italian | La riunione è alle 14:30. | La riunione è alle quattordici e trenta. |
| `pt` | Portuguese | Em 2023, o crescimento foi notável. | Em dois mil e vinte e três, o crescimento foi notável. |
| `es` | Spanish | La fecha es el 20 de enero de 2025. | La fecha es el veinte de enero de dos mil veinticinco. |
| `tr` | Turkish | Toplam 250 kişi katıldı. | Toplam ikiyüzelli kişi katıldı. |
| `sv` | Swedish | Det väger 2,5 kg. | Det väger två komma fem kilogram. |
## Limitations & Known Issues
- Multi-entity accuracy is 51% — sentences with 2-3 different entity types sometimes produce errors on one of the entities.
- Date format variants — handles standard formats well but struggles with abbreviated months, slash dates, and 2-digit years.
- German declension — sometimes produces nominative case instead of the correct accusative/dative.
- CPU latency — 250-600ms per sentence on CPU. Use GPU for real-time applications.
- Training not complete — trained for 1.29 epochs out of 3 planned. Further training would improve edge cases.
### Not Designed For

- Raw user input — optimized for well-formatted LLM output, not OCR text or social media
- Acronym spelling — `BMW → B M W` is not included (TTS engines handle this natively via SSML)
- Hashtags, IBANs, Roman numerals, sports scores — dropped as low-priority for voice-agent use cases
## Recommended Post-Processing

```python
import re

def post_validate(input_text: str, output_text: str) -> bool:
    """Check if the model output is safe for TTS."""
    if re.search(r"\d", output_text):
        return False  # Fall back to rule-based normalization
    ratio = len(output_text) / max(len(input_text), 1)
    if ratio > 5.0 or ratio < 0.3:
        return False
    return True
```
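A quick check of this heuristic in action (the function is repeated here so the snippet runs standalone). One caveat worth noting: the digit check also flags legitimate digit passthroughs such as `Python 3.10`, so the rule-based fallback path should preserve those rather than force-normalize them:

```python
import re

def post_validate(input_text: str, output_text: str) -> bool:
    """Check if the model output is safe for TTS."""
    if re.search(r"\d", output_text):
        return False  # leftover digits: fall back to rule-based normalization
    ratio = len(output_text) / max(len(input_text), 1)
    return 0.3 <= ratio <= 5.0

# Incomplete normalization (a digit survived) is rejected:
assert not post_validate("Das kostet 12,50 €.", "Das kostet 12 Euro fünfzig.")
# Fully spoken output passes:
assert post_validate("Das kostet 12,50 €.", "Das kostet zwölf Euro fünfzig.")
# Caveat: a correct digit passthrough is also routed to the fallback
assert not post_validate("Wir verwenden Python 3.10.", "Wir verwenden Python 3.10.")
```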
## Intended Use

This model is designed for text-to-speech preprocessing in voice agent / conversational AI pipelines:

```
User speaks → ASR → LLM generates response → Text Normalizer → TTS speaks
```
## License
This model is released under CC BY-NC 4.0 — free for research and non-commercial use.
For commercial licensing, contact us at business@smao.ai.
## About
Built by SMAO — Michael Müller and team.
For questions, issues, or commercial inquiries: business@smao.ai
## Citation

```bibtex
@misc{saytext-2026,
  title={SayText: Multilingual Text Normalization for TTS},
  author={Michael Müller and SMAO AI},
  year={2026},
  url={https://huggingface.co/smaoai/saytext}
}
```