SayText — Multilingual Text Normalization for TTS

A multilingual neural text normalization model for TTS (text-to-speech) pipelines. Converts written text to spoken form across 8 European languages using a fine-tuned ByT5-Base (580M parameters).

"Das kostet 12,50 €." → "Das kostet zwölf Euro fünfzig."

Key Features

  • 8 languages: German, English, French, Italian, Portuguese, Spanish, Turkish, Swedish
  • 24 semiotic classes: cardinals, money, dates, time, phone numbers, percentages, units, passthrough, and more
  • Passthrough-aware: learns when NOT to normalize (plain text, already-spoken forms, technical identifiers)
  • Voice-agent optimized: designed for LLM → TN → TTS pipelines where input is always well-formatted
  • Byte-level: ByT5 processes raw UTF-8 bytes with no tokenizer vocabulary limitations; symbols such as €, $, and °C are handled natively

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained("smaoai/saytext")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
model.eval()

def normalize(text: str, language: str) -> str:
    """Normalize text for TTS. Language: de, en, fr, it, pt, es, tr, sv"""
    input_text = f"<{language}> {text}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512, num_beams=1)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Examples
print(normalize("Das kostet 12,50 €.", "de"))
# → Das kostet zwölf Euro fünfzig.

print(normalize("The flight departs at 8:45 AM.", "en"))
# → The flight departs at eight forty-five AM.

print(normalize("Le prix est de 327,67 €.", "fr"))
# → Le prix est de trois cent vingt-sept euros et soixante-sept centimes.

print(normalize("Ich helfe Ihnen gerne weiter.", "de"))
# → Ich helfe Ihnen gerne weiter.  (passthrough — no normalization needed)

print(normalize("Wir verwenden Python 3.10 in unserem Projekt.", "de"))
# → Wir verwenden Python 3.10 in unserem Projekt.  (technical identifier preserved)

Available Formats

This repo includes the model in three formats:

| Format | Path | Size | Use case |
|---|---|---|---|
| PyTorch (default) | model.safetensors | 2.2 GB | Development, fine-tuning, HuggingFace pipeline() |
| CTranslate2 FP16 | ct2_float16/ | 1.1 GB | GPU production inference |
| CTranslate2 INT8 | ct2_int8/ | 556 MB | CPU production inference, edge deployment |

repo/
  model.safetensors          # PyTorch weights
  config.json                # Model architecture
  tokenizer_config.json      # Tokenizer config
  added_tokens.json          # Additional tokens
  generation_config.json     # Default generation parameters
  handler.py                 # HuggingFace Inference API handler
  ct2_float16/               # CTranslate2 FP16 (GPU)
    model.bin
    config.json
    shared_vocabulary.json
  ct2_int8/                  # CTranslate2 INT8 (CPU)
    model.bin
    config.json
    shared_vocabulary.json

Model Description

Why ByT5?

Text normalization deals with symbols, digits, and special characters (€12,50, +49(0)30, info@web.de) that subword tokenizers fragment unpredictably. ByT5 operates on raw UTF-8 bytes — every character is processed individually with no tokenizer artifacts. This is critical for:

  • Locale-sensitive formats: 1.500 means "one thousand five hundred" in German but "one point five" in English
  • Special symbols: €, £, $, %, °C are processed byte-by-byte, never fragmented into unpredictable subword pieces
  • Phone numbers: +49 (0)30 12345678 stays intact byte-by-byte
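The byte-level view is easy to inspect with plain Python: ByT5's input ids are simply UTF-8 byte values offset by its three special tokens (pad, EOS, UNK), so no transformers install is needed for this sketch:

```python
text = "12,50 €"

# ByT5 sees raw UTF-8 bytes: 7 characters become 9 bytes ("€" alone is 3 bytes)
raw = text.encode("utf-8")
print(len(text), len(raw))  # 7 9

# Input ids are byte values shifted by 3 (ids 0-2 are pad/EOS/UNK)
ids = [b + 3 for b in raw]
print(ids[:3])  # ids for the bytes "1", "2", "," → [52, 53, 47]
```

This is why locale-sensitive punctuation like the German decimal comma reaches the model exactly as written, with no tokenizer merge rules in between.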

Architecture

| Component | Details |
|---|---|
| Base model | google/byt5-base |
| Parameters | 580M |
| Type | Encoder-decoder (seq2seq) |
| Tokenization | Byte-level UTF-8 (no SentencePiece) |
| Max input length | 512 bytes (~250 characters) |
| Max output length | 512 bytes |
| Language conditioning | Prefix token: <de>, <en>, <fr>, etc. |
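Because the input limit counts bytes rather than characters, accented or symbol-heavy text consumes the budget faster. A small helper (the name and the +1 for the EOS token are illustrative assumptions, not part of the model card's API) to check whether a prefixed input fits:

```python
def fits_input_budget(text: str, language: str, budget: int = 512) -> bool:
    """True if the language-prefixed input fits ByT5's byte budget."""
    prefixed = f"<{language}> {text}"
    # ByT5 consumes one token per UTF-8 byte, plus one EOS token
    return len(prefixed.encode("utf-8")) + 1 <= budget

print(fits_input_budget("Das kostet 12,50 €.", "de"))  # True
```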

How It Works

The model takes text with a language prefix and outputs the spoken form:

Input:  <de> Am 03.04.2026 um 14:30 kostet der Flug 249,99 €.
Output: Am dritten April zweitausendsechsundzwanzig um vierzehn Uhr dreißig
        kostet der Flug zweihundertneunundvierzig Euro neunundneunzig.

For text that doesn't need normalization, the model learns to pass it through unchanged:

Input:  <en> Sure, I can help you with that.
Output: Sure, I can help you with that.

Training

The model was trained on 3M+ pairs across 8 languages using a custom three-stage data pipeline:

  1. Entity Sampler (deterministic) — Generates verified (written, spoken) pairs per semiotic class using locale-aware libraries. Every pair is programmatically verified correct.

  2. Sentence Generator (LLM-powered) — Natural sentence templates with placeholders, filled with entity sampler pairs at assembly time.

  3. Real-world enrichment — Additional real-world text sources, auto-labeled and validated, merged with synthetic data.

The training data covers 24 semiotic classes including passthrough (unchanged text), already-normalized forms, and technical identifiers that should not be normalized.
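A minimal sketch of the template-assembly step described above. The class names, entity pairs, and templates here are invented for illustration; the real pipeline draws on locale-aware libraries and covers all 24 classes:

```python
import random

# Stage 1 (illustrative): verified (written, spoken) pairs per semiotic class
ENTITY_PAIRS = {
    "MONEY": [("12,50 €", "zwölf Euro fünfzig")],
    "TIME": [("14:30", "vierzehn Uhr dreißig")],
}

# Stage 2 (illustrative): sentence templates with placeholders
TEMPLATES = [
    ("MONEY", "Das kostet {written}.", "Das kostet {spoken}."),
    ("TIME", "Der Termin ist um {written}.", "Der Termin ist um {spoken}."),
]

def sample_training_pair(rng: random.Random) -> tuple[str, str]:
    """Assemble one (input, target) training pair with a <de> prefix."""
    cls, src_tpl, tgt_tpl = rng.choice(TEMPLATES)
    written, spoken = rng.choice(ENTITY_PAIRS[cls])
    return f"<de> {src_tpl.format(written=written)}", tgt_tpl.format(spoken=spoken)

src, tgt = sample_training_pair(random.Random(0))
print(src, "->", tgt)
```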

Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/byt5-base |
| Effective batch size | 128 |
| Learning rate | 3e-4 (cosine schedule) |
| Precision | bf16 |
| Hardware | NVIDIA A100 80GB |
| Best eval loss | 0.000897 |
| Framework | HuggingFace Transformers + PyTorch |

Evaluation

94.2% sentence-level exact-match accuracy on a held-out test set (a stratified sample of 1,900 sentences across all languages and classes).

| Language | Accuracy |
|---|---|
| Swedish | 96.7% |
| French | 95.4% |
| Turkish | 94.8% |
| Spanish | 94.2% |
| English | 93.8% |
| Italian | 93.3% |
| German | 92.9% |
| Portuguese | 92.6% |

Per-Class Accuracy (selected)

| Class | Accuracy |
|---|---|
| Cardinal | 97.5% |
| Money | 97.5% |
| Year | 97.5% |
| Passthrough (plain) | 100% |
| Don't normalize | 97.5% |
| Phone numbers | 90.0% |
| Multi-entity (MIXED) | 51.2% |
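Sentence-level exact match means the whole normalized string must equal the reference. A minimal scorer for reproducing these numbers on your own data (whitespace stripping is an assumption about the evaluation setup):

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference string."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Das kostet zwölf Euro fünfzig.", "Am dritte April"]
refs  = ["Das kostet zwölf Euro fünfzig.", "Am dritten April"]
print(exact_match(preds, refs))  # 0.5
```

Note that this metric is strict: a single wrong character (for example a declension error) fails the whole sentence.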

Inference

Input Format

The model expects a language prefix followed by the text:

<de> Das kostet 12,50 €.
<en> The price is $99.99.
<fr> Le prix est de 327,67 €.

The language prefix is required — it tells the model which normalization rules to apply.
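A tiny guard (the helper name is illustrative) that builds the prefixed input and rejects unsupported language codes early, rather than letting the model silently mis-normalize:

```python
SUPPORTED_LANGUAGES = {"de", "en", "fr", "it", "pt", "es", "tr", "sv"}

def with_language_prefix(text: str, language: str) -> str:
    """Prepend the required language tag, e.g. '<de> ...'."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return f"<{language}> {text}"

print(with_language_prefix("Das kostet 12,50 €.", "de"))
# → <de> Das kostet 12,50 €.
```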

CPU Inference (PyTorch)

| Metric | Value |
|---|---|
| Average latency | ~600ms per sentence |
| Passthrough (no entities) | ~300ms |
| Model size | 2.2 GB |

CPU Inference (CTranslate2 INT8)

| Metric | Value |
|---|---|
| Average latency (single) | ~248ms |
| Average latency (batch of 10) | ~120ms per sentence |
| Model size | 556 MB |

# Convert to CTranslate2 INT8
pip install ctranslate2
ct2-transformers-converter --model smaoai/saytext \
    --output_dir ct2_int8 --quantization int8

# Usage
import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("ct2_int8", device="cpu", intra_threads=4)
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

text = "<de> Das kostet 12,50 €."
# CTranslate2 expects token strings; convert_ids_to_tokens keeps multi-byte
# UTF-8 characters (like €) intact, unlike decoding ids one at a time
token_strs = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

result = translator.translate_batch([token_strs], beam_size=1, max_decoding_length=512)
output_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))

GPU Inference

For production TTS pipelines, a small GPU is recommended for real-time latency:

| Setup | Single sentence | Batch of 8 |
|---|---|---|
| CTranslate2 FP16, GPU | ~15-50ms | ~8-15ms/sent |
| PyTorch FP32, GPU | ~280ms | ~63ms/sent |

Supported Languages

| Code | Language | Example Input | Example Output |
|---|---|---|---|
| de | German | Das kostet 12,50 €. | Das kostet zwölf Euro fünfzig. |
| en | English | The flight departs at 8:45 AM. | The flight departs at eight forty-five AM. |
| fr | French | Le prix est de 327,67 €. | Le prix est de trois cent vingt-sept euros et soixante-sept centimes. |
| it | Italian | La riunione è alle 14:30. | La riunione è alle quattordici e trenta. |
| pt | Portuguese | Em 2023, o crescimento foi notável. | Em dois mil e vinte e três, o crescimento foi notável. |
| es | Spanish | La fecha es el 20 de enero de 2025. | La fecha es el veinte de enero de dos mil veinticinco. |
| tr | Turkish | Toplam 250 kişi katıldı. | Toplam ikiyüzelli kişi katıldı. |
| sv | Swedish | Det väger 2,5 kg. | Det väger två komma fem kilogram. |

Limitations & Known Issues

  1. Multi-entity accuracy is 51% — sentences with 2-3 different entity types sometimes produce errors on one of the entities.
  2. Date format variants — handles standard formats well but struggles with abbreviated months, slash dates, and 2-digit years.
  3. German declension — sometimes produces nominative case instead of the correct accusative/dative.
  4. CPU latency — 250-600ms per sentence on CPU. Use GPU for real-time applications.
  5. Training not complete — trained for 1.29 epochs out of 3 planned. Further training would improve edge cases.

Not Designed For

  • Raw user input — optimized for well-formatted LLM output, not OCR text or social media
  • Acronym spelling: BMW → B M W is not included (TTS engines handle this natively via SSML)
  • Hashtags, IBANs, Roman numerals, sports scores — dropped as low-priority for voice agent use cases

Recommended Post-Processing

import re

def post_validate(input_text: str, output_text: str) -> bool:
    """Check if the model output is safe for TTS."""
    if re.search(r"\d", output_text):
        return False  # Fall back to rule-based normalization
    ratio = len(output_text) / max(len(input_text), 1)
    if ratio > 5.0 or ratio < 0.3:
        return False
    return True

Intended Use

This model is designed for text-to-speech preprocessing in voice agent / conversational AI pipelines:

User speaks → ASR → LLM generates response → Text Normalizer → TTS speaks

License

This model is released under CC BY-NC 4.0 — free for research and non-commercial use.

For commercial licensing, contact us at business@smao.ai.

About

Built by SMAO — Michael Müller and team.

For questions, issues, or commercial inquiries: business@smao.ai

Citation

@misc{saytext-2026,
  title={SayText: Multilingual Text Normalization for TTS},
  author={Michael Müller and SMAO AI},
  year={2026},
  url={https://huggingface.co/smaoai/saytext}
}