SayText — Multilingual Text Normalization for TTS

A multilingual neural text normalization model for TTS (text-to-speech) pipelines. Converts written text to spoken form across 8 European languages using a fine-tuned ByT5-Base (580M parameters).

"Das kostet 12,50 €." → "Das kostet zwölf Euro fünfzig."

Key Features

  • 8 languages: German, English, French, Italian, Portuguese, Spanish, Turkish, Swedish
  • 24 semiotic classes: cardinals, money, dates, time, phone numbers, percentages, units, passthrough, and more
  • Passthrough-aware: learns when NOT to normalize (plain text, already-spoken forms, technical identifiers)
  • Voice-agent optimized: designed for LLM → TN → TTS pipelines where input is always well-formatted
  • Byte-level: ByT5 processes raw UTF-8 bytes with no tokenizer vocabulary limitations; symbols such as €, $, and °C are handled natively

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained("smaoai/saytext")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model.eval()

def normalize(text: str, language: str) -> str:
    """Normalize text for TTS. Language: de, en, fr, it, pt, es, tr, sv"""
    input_text = f"<{language}> {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512, num_beams=1)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Examples
print(normalize("Das kostet 12,50 €.", "de"))
# → Das kostet zwölf Euro fünfzig.

print(normalize("The flight departs at 8:45 AM.", "en"))
# → The flight departs at eight forty-five AM.

print(normalize("Le prix est de 327,67 €.", "fr"))
# → Le prix est de trois cent vingt-sept euros et soixante-sept centimes.

print(normalize("Ich helfe Ihnen gerne weiter.", "de"))
# → Ich helfe Ihnen gerne weiter.  (passthrough — no normalization needed)

print(normalize("Wir verwenden Python 3.10 in unserem Projekt.", "de"))
# → Wir verwenden Python 3.10 in unserem Projekt.  (technical identifier preserved)

Available Formats

This repo includes the model in three formats:

| Format | Path | Size | Use case |
|---|---|---|---|
| PyTorch (default) | model.safetensors | 2.2 GB | Development, fine-tuning, HuggingFace pipeline() |
| CTranslate2 FP16 | ct2_float16/ | 1.1 GB | GPU production inference |
| CTranslate2 INT8 | ct2_int8/ | 556 MB | CPU production inference, edge deployment |

repo/
  model.safetensors          # PyTorch weights
  config.json                # Model architecture
  tokenizer_config.json      # Tokenizer config
  added_tokens.json          # Additional tokens
  generation_config.json     # Default generation parameters
  handler.py                 # HuggingFace Inference API handler
  ct2_float16/               # CTranslate2 FP16 (GPU)
    model.bin
    config.json
    shared_vocabulary.json
  ct2_int8/                  # CTranslate2 INT8 (CPU)
    model.bin
    config.json
    shared_vocabulary.json

Model Description

Why ByT5?

Text normalization deals with symbols, digits, and special characters (€12,50, +49(0)30, info@web.de) that subword tokenizers fragment unpredictably. ByT5 operates on raw UTF-8 bytes — every character is processed individually with no tokenizer artifacts. This is critical for:

  • Locale-sensitive formats: 1.500 means "one thousand five hundred" in German but "one point five" in English
  • Special symbols: €, £, $, %, °C are processed byte-by-byte, never fragmented into unpredictable subword pieces
  • Phone numbers: +49 (0)30 12345678 stays intact byte-by-byte
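The byte-level view is easy to inspect with plain Python: ByT5's input ids are simply UTF-8 byte values offset by its three special tokens (pad, EOS, UNK), so no transformers install is needed for this sketch:

```python
text = "12,50 €"

# ByT5 sees raw UTF-8 bytes: 7 characters become 9 bytes ("€" alone is 3 bytes)
raw = text.encode("utf-8")
print(len(text), len(raw))  # 7 9

# Input ids are byte values shifted by 3 (ids 0-2 are pad/EOS/UNK)
ids = [b + 3 for b in raw]
print(ids[:3])  # ids for the bytes "1", "2", "," → [52, 53, 47]
```

This is why locale-sensitive punctuation like the German decimal comma reaches the model exactly as written, with no tokenizer merge rules in between.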

Architecture

| Component | Details |
|---|---|
| Base model | google/byt5-base |
| Parameters | 580M |
| Type | Encoder-decoder (seq2seq) |
| Tokenization | Byte-level UTF-8 (no SentencePiece) |
| Max input length | 512 bytes (~250 characters) |
| Max output length | 512 bytes |
| Language conditioning | Prefix token: <de>, <en>, <fr>, etc. |
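Because the input limit counts bytes rather than characters, accented or symbol-heavy text consumes the budget faster. A small helper (the name and the +1 for the EOS token are illustrative assumptions, not part of the model card's API) to check whether a prefixed input fits:

```python
def fits_input_budget(text: str, language: str, budget: int = 512) -> bool:
    """True if the language-prefixed input fits ByT5's byte budget."""
    prefixed = f"<{language}> {text}"
    # ByT5 consumes one token per UTF-8 byte, plus one EOS token
    return len(prefixed.encode("utf-8")) + 1 <= budget

print(fits_input_budget("Das kostet 12,50 €.", "de"))  # True
```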

How It Works

The model takes text with a language prefix and outputs the spoken form:

Input:  <de> Am 03.04.2026 um 14:30 kostet der Flug 249,99 €.
Output: Am dritten April zweitausendsechsundzwanzig um vierzehn Uhr dreißig
        kostet der Flug zweihundertneunundvierzig Euro neunundneunzig.

For text that doesn't need normalization, the model learns to pass it through unchanged:

Input:  <en> Sure, I can help you with that.
Output: Sure, I can help you with that.

Training

The model was trained on 3M+ pairs across 8 languages using a custom three-stage data pipeline:

  1. Entity Sampler (deterministic) — Generates verified (written, spoken) pairs per semiotic class using locale-aware libraries. Every pair is programmatically verified correct.

  2. Sentence Generator (LLM-powered) — Natural sentence templates with placeholders, filled with entity sampler pairs at assembly time.

  3. Real-world enrichment — Additional real-world text sources, auto-labeled and validated, merged with synthetic data.

The training data covers 24 semiotic classes including passthrough (unchanged text), already-normalized forms, and technical identifiers that should not be normalized.
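A minimal sketch of the template-assembly step described above. The class names, entity pairs, and templates here are invented for illustration; the real pipeline draws on locale-aware libraries and covers all 24 classes:

```python
import random

# Stage 1 (illustrative): verified (written, spoken) pairs per semiotic class
ENTITY_PAIRS = {
    "MONEY": [("12,50 €", "zwölf Euro fünfzig")],
    "TIME": [("14:30", "vierzehn Uhr dreißig")],
}

# Stage 2 (illustrative): sentence templates with placeholders
TEMPLATES = [
    ("MONEY", "Das kostet {written}.", "Das kostet {spoken}."),
    ("TIME", "Der Termin ist um {written}.", "Der Termin ist um {spoken}."),
]

def sample_training_pair(rng: random.Random) -> tuple[str, str]:
    """Assemble one (input, target) training pair with a <de> prefix."""
    cls, src_tpl, tgt_tpl = rng.choice(TEMPLATES)
    written, spoken = rng.choice(ENTITY_PAIRS[cls])
    return f"<de> {src_tpl.format(written=written)}", tgt_tpl.format(spoken=spoken)

src, tgt = sample_training_pair(random.Random(0))
print(src, "->", tgt)
```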

Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/byt5-base |
| Effective batch size | 128 |
| Learning rate | 3e-4 (cosine schedule) |
| Precision | bf16 |
| Hardware | NVIDIA A100 80GB |
| Best eval loss | 0.000897 |
| Framework | HuggingFace Transformers + PyTorch |

Evaluation

94.2% sentence-level exact-match accuracy on a held-out test set (a stratified sample of 1,900 sentences across all languages and classes).

| Language | Accuracy |
|---|---|
| Swedish | 96.7% |
| French | 95.4% |
| Turkish | 94.8% |
| Spanish | 94.2% |
| English | 93.8% |
| Italian | 93.3% |
| German | 92.9% |
| Portuguese | 92.6% |

Per-Class Accuracy (selected)

| Class | Accuracy |
|---|---|
| Cardinal | 97.5% |
| Money | 97.5% |
| Year | 97.5% |
| Passthrough (plain) | 100% |
| Don't normalize | 97.5% |
| Phone numbers | 90.0% |
| Multi-entity (MIXED) | 51.2% |
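Sentence-level exact match means the whole normalized string must equal the reference. A minimal scorer for reproducing these numbers on your own data (whitespace stripping is an assumption about the evaluation setup):

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference string."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Das kostet zwölf Euro fünfzig.", "Am dritte April"]
refs  = ["Das kostet zwölf Euro fünfzig.", "Am dritten April"]
print(exact_match(preds, refs))  # 0.5
```

Note that this metric is strict: a single wrong character (for example a declension error) fails the whole sentence.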

Inference

Input Format

The model expects a language prefix followed by the text:

<de> Das kostet 12,50 €.
<en> The price is $99.99.
<fr> Le prix est de 327,67 €.

The language prefix is required — it tells the model which normalization rules to apply.
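A tiny guard (the helper name is illustrative) that builds the prefixed input and rejects unsupported language codes early, rather than letting the model silently mis-normalize:

```python
SUPPORTED_LANGUAGES = {"de", "en", "fr", "it", "pt", "es", "tr", "sv"}

def with_language_prefix(text: str, language: str) -> str:
    """Prepend the required language tag, e.g. '<de> ...'."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return f"<{language}> {text}"

print(with_language_prefix("Das kostet 12,50 €.", "de"))
# → <de> Das kostet 12,50 €.
```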

CPU Inference (PyTorch)

| Metric | Value |
|---|---|
| Average latency | ~600ms per sentence |
| Passthrough (no entities) | ~300ms |
| Model size | 2.2 GB |

CPU Inference (CTranslate2 INT8)

| Metric | Value |
|---|---|
| Average latency (single) | ~248ms |
| Average latency (batch of 10) | ~120ms per sentence |
| Model size | 556 MB |

# Convert to CTranslate2 INT8
pip install ctranslate2
ct2-transformers-converter --model smaoai/saytext \
    --output_dir ct2_int8 --quantization int8

# Usage
import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("ct2_int8", device="cpu", intra_threads=4)
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

text = "<de> Das kostet 12,50 €."
# CTranslate2 expects token strings; convert_ids_to_tokens keeps multi-byte
# UTF-8 characters (like €) intact, unlike decoding ids one at a time
token_strs = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

result = translator.translate_batch([token_strs], beam_size=1, max_decoding_length=512)
output_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))

GPU Inference

For production TTS pipelines, a small GPU is recommended for real-time latency:

| Setup | Single sentence | Batch of 8 |
|---|---|---|
| CTranslate2 FP16, GPU | ~15-50ms | ~8-15ms/sent |
| PyTorch FP32, GPU | ~280ms | ~63ms/sent |

Supported Languages

| Code | Language | Example Input | Example Output |
|---|---|---|---|
| de | German | Das kostet 12,50 €. | Das kostet zwölf Euro fünfzig. |
| en | English | The flight departs at 8:45 AM. | The flight departs at eight forty-five AM. |
| fr | French | Le prix est de 327,67 €. | Le prix est de trois cent vingt-sept euros et soixante-sept centimes. |
| it | Italian | La riunione è alle 14:30. | La riunione è alle quattordici e trenta. |
| pt | Portuguese | Em 2023, o crescimento foi notável. | Em dois mil e vinte e três, o crescimento foi notável. |
| es | Spanish | La fecha es el 20 de enero de 2025. | La fecha es el veinte de enero de dos mil veinticinco. |
| tr | Turkish | Toplam 250 kişi katıldı. | Toplam ikiyüzelli kişi katıldı. |
| sv | Swedish | Det väger 2,5 kg. | Det väger två komma fem kilogram. |

Limitations & Known Issues

  1. Multi-entity accuracy is 51% — sentences with 2-3 different entity types sometimes produce errors on one of the entities.
  2. Date format variants — handles standard formats well but struggles with abbreviated months, slash dates, and 2-digit years.
  3. German declension — sometimes produces nominative case instead of the correct accusative/dative.
  4. CPU latency — 250-600ms per sentence on CPU. Use GPU for real-time applications.
  5. Training not complete — trained for 1.29 epochs out of 3 planned. Further training would improve edge cases.

Not Designed For

  • Raw user input — optimized for well-formatted LLM output, not OCR text or social media
  • Acronym spelling: BMW → B M W is not included (TTS engines handle this natively via SSML)
  • Hashtags, IBANs, Roman numerals, sports scores — dropped as low-priority for voice agent use cases

Recommended Post-Processing

import re

def post_validate(input_text: str, output_text: str) -> bool:
    """Check if the model output is safe for TTS."""
    if re.search(r"\d", output_text):
        return False  # Fall back to rule-based normalization
    ratio = len(output_text) / max(len(input_text), 1)
    if ratio > 5.0 or ratio < 0.3:
        return False
    return True

Intended Use

This model is designed for text-to-speech preprocessing in voice agent / conversational AI pipelines:

User speaks → ASR → LLM generates response → Text Normalizer → TTS speaks

License

This model is released under CC BY-NC 4.0 — free for research and non-commercial use.

For commercial licensing, contact us at business@smao.ai.

About

Built by SMAO — Michael Müller and team.

For questions, issues, or commercial inquiries: business@smao.ai

Citation

@misc{saytext-2026,
  title={SayText: Multilingual Text Normalization for TTS},
  author={Michael Müller and SMAO AI},
  year={2026},
  url={https://huggingface.co/smaoai/saytext}
}