Works, but with some errors
Many thanks for the German Mira variant. It works and is very fast! During my tests, I encountered some errors in the generated output: for some words, it speaks the last letter of the word a second time. This sounds a bit strange.
This is the story I generated: "Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh‑Sternchen hatte."
Hugging Face does not accept my WAV file as an attachment, so I uploaded it.
I used the following Mira server:
https://github.com/Si-ris-B/MiraTTS-FastAPI-Docker
My example for this post was generated with a simple curl call:
```shell
curl -X POST http://xxx.xxx.xxx.xxx:8234/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"mira-tts","input":"Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh‑Sternchen hatte.","voice":"Margit"}' \
  --output Miraspeech.wav
```
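The same request can also be sent from Python. This is a minimal sketch using only the standard library; the host placeholder is carried over from the curl call above and must be replaced with your server's address:

```python
import json
import urllib.request

API_URL = "http://xxx.xxx.xxx.xxx:8234/v1/audio/speech"  # replace with your server

def build_speech_request(text: str, voice: str = "Margit") -> urllib.request.Request:
    # Same JSON payload as the curl call above
    payload = {"model": "mira-tts", "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually fetch and save the audio (requires a running server):
# with urllib.request.urlopen(build_speech_request("Hallo Welt")) as resp:
#     open("Miraspeech.wav", "wb").write(resp.read())
```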
Can you also share the reference WAV? Do you see the same thing happening in the arena (https://huggingface.co/spaces/SebastianBodza/Kartoffel-MiraToffel-TTS) with the Mira (1 Epoch) model? It is the same model.
Hey hey,
This is my reference audio file.
Using my reference audio in the arena with the same text does not produce this behaviour.
EDIT: I tested other voices without luck. I even tested on different GPUs (I had the crazy idea that maybe the GPU has a flaw).
All running on my LLM rig: Ubuntu 24, different RTX 3090s.
When I start the container and generate something, this is the log:
==========
== CUDA ==
CUDA Version 12.8.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
INFO: Started server process [1]
INFO: Waiting for application startup.
2026-01-08 07:45:29,871 [INFO] Service starting up...
2026-01-08 07:45:29,871 [INFO] Initializing MiraTTS Model from: SebastianBodza/MiraToffel_miraTTS_german
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 162279.62it/s]
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 32768.
2026-01-08 07:45:32,649 - lmdeploy - WARNING - turbomind.py:239 - get 219 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
Fetching 22 files: 100%|██████████| 22/22 [00:00<00:00, 29499.58it/s]
2026-01-08 07:45:35.202445997 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:35.202467168 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
WeightNorm.apply(module, name, dim)
torch_dtype is deprecated! Use dtype instead!
2026-01-08 07:45:37.432100230 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-01-08 07:45:37.433639687 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:37.433647382 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-01-08 07:45:37.580475960 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:37.580496800 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-01-08 07:45:37,607 [INFO] Model loaded successfully.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2026-01-08 07:46:16,985 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:18,324 [INFO] Generated 5.46s audio in 2.36s (RTF: 2.32x)
INFO: 172.31.0.1:34128 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:18,537 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:19,737 [INFO] Generated 9.24s audio in 1.20s (RTF: 7.70x)
INFO: 172.31.0.1:40382 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:20,037 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:21,497 [INFO] Generated 11.42s audio in 1.46s (RTF: 7.82x)
INFO: 172.31.0.1:40398 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:22,181 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:23,239 [INFO] Generated 8.18s audio in 1.06s (RTF: 7.73x)
INFO: 172.31.0.1:40400 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:23,922 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:25,142 [INFO] Generated 9.50s audio in 1.22s (RTF: 7.78x)
INFO: 172.31.0.1:40414 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:25,976 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:26,552 [INFO] Generated 4.22s audio in 0.58s (RTF: 7.34x)
The MiraTTS server Docker image installed the following dependencies (maybe there is an issue with incorrect versions, perhaps in torch, torchaudio, torchvision, etc.):
```dockerfile
# Use the requested CUDA 12.8.1 Devel image
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    DEBIAN_FRONTEND=noninteractive \
    CUDA_HOME=/usr/local/cuda \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH} \
    PATH=/usr/local/cuda/bin:${PATH}

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3-dev \
    git \
    ffmpeg \
    libsndfile1 \
    build-essential \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3 /usr/bin/python

WORKDIR /app

# Upgrade pip
RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel

# 1. Install the PyTorch 2.8.0 stack for CUDA 12.8
#    This satisfies the CVE-2025-32434 security requirement (>2.6.0)
RUN pip3 install --no-cache-dir \
    torch==2.8.0 \
    torchaudio==2.8.0 \
    torchvision==0.23.0 \
    --index-url https://download.pytorch.org/whl/cu128

# 2. Install ONNX Runtime and model dependencies
#    Note: Using version 1.20+ for compatibility with CUDA 12.8
RUN pip3 install --no-cache-dir \
    onnxruntime-gpu \
    "transformers>=4.48.0" \
    accelerate \
    omegaconf \
    einops \
    lmdeploy \
    librosa \
    fastapi \
    uvicorn \
    pydantic \
    soundfile \
    numpy

# 3. Install MiraTTS-specific git dependencies
RUN pip3 install --no-cache-dir \
    "fastaudiosr @ git+https://github.com/ysharma3501/FlashSR.git" \
    "ncodec @ git+https://github.com/ysharma3501/FastBiCodec.git"

# 4. Copy and install the MiraTTS project
COPY mira /app/mira
COPY README.md /app/
COPY pyproject.toml /app/
RUN pip3 install --no-cache-dir --no-deps -e .

# 5. Copy API code
COPY app /app/app

# Ensure the voices directory exists inside the container
RUN mkdir -p /app/models /app/data/voices

EXPOSE 8000

# Health check to ensure GPU is visible
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```python
from dotenv import load_dotenv
load_dotenv()

from mira.model import MiraTTS
import torchaudio

mira_tts = MiraTTS("SebastianBodza/MiraToffel_miraTTS_german")

file = "Xwp6YdR7lpwxWsM4a0n7t.wav"
text = (
    "Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh Sternchen hatte."
)

mira_tts.set_params(top_p=0.95, top_k=50, temperature=0.9, max_new_tokens=2024, repetition_penalty=1.05, min_p=0.015)

context_tokens = mira_tts.encode_audio(file)
audio = mira_tts.generate(text, context_tokens)
torchaudio.save("output.wav", audio.float().unsqueeze(0), 48000)
```
This produces the following audio:
So it seems to be the FastAPI layer. It does a lot of weird stuff under the hood: it splits the text, runs inference per sentence, and then overlays the audio files. The problem is the splitter:
https://github.com/ysharma3501/MiraTTS/compare/main...Si-ris-B:MiraTTS-FastAPI-Docker:main#diff-536e7dc82ade018715d694320623297938da67aad6683dd48f849dce2c0e2227
I then ran the weird splitter on your text, and it produces this:
```
Chunk 1 (175 chars):
→ Ein kleiner Fuchs entdeckte im m Wald einen leuchtenden n Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige e Lichtpunkte um ihn herum. Der Fuchs schnupperte die
Chunk 2 (91 chars):
→ e Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes s Glüh Sternchen hatte.
```
Claude found it ... the logic is interesting. It looks for sentence boundaries using capitalization instead of punctuation like ".!?".
FOUND IT! The bug is in this section of `_split_into_sentences`:
```python
if (i < len(text) - 2 and char.islower() and text[i + 1] == ' ' and
        text[i + 2].isupper() and len(current_sentence.split()) > 3):
```
This logic tries to detect sentence boundaries by looking for:
- lowercase letter → space → uppercase letter
The problem: In German, ALL NOUNS ARE CAPITALIZED! So this logic is splitting in the middle of sentences:
- "entdeckte im Wald" → splits after "im"
- "einen leuchtenden Pilz" → splits after "einen"
- "winzige Lichtpunkte" → splits after "winzige"
This logic was designed for English where mid-sentence capitals are rare, but it's completely wrong for German.
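A punctuation-based splitter avoids this failure mode entirely. Here is a minimal sketch (not the project's actual fix, just an illustration) that splits on sentence-final punctuation followed by whitespace, so German capitalized nouns are never mistaken for sentence starts:

```python
import re

def split_into_sentences(text: str) -> list[str]:
    # Split after sentence-final punctuation (. ! ?) followed by
    # whitespace, instead of lowercase -> space -> uppercase
    # transitions, which misfire on German capitalized nouns.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = ("Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. "
        "Neugierig stupste er ihn an und plötzlich tanzten winzige "
        "Lichtpunkte um ihn herum.")
print(split_into_sentences(text))  # two sentences, no mid-sentence splits
```

"entdeckte im Wald" stays intact here because "im Wald" contains no sentence-final punctuation, even though "Wald" is capitalized.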
Thank you so much for finding this out!
I created a fork and, with the help of GPT5.2, adjusted processor.py to make it work with German:
https://github.com/maglat/MiraTTS-FastAPI-Docker-German
I mentioned you, of course, to give you credit!
Thanks for your Docker image. When I try to generate TTS audio via an API call, I get the error below. I have been trying to fix this for two days.
I can see GPU usage via nvidia-smi on the Docker host, but after a few seconds it stops and fails.
I also tried a local installation without Docker, with the same issue. Do you have any hints or ideas?
INFO: 172.18.0.1:41082 - "POST /v1/audio/speech HTTP/1.1" 500 Internal Server Error
2026-01-20 12:00:13,534 [INFO] Processing 1 chunks for voice 'german_voice'...
2026-01-20 12:00:18.312716209 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
2026-01-20 12:00:18,312 [ERROR] Generation error: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
Traceback (most recent call last):
File "/app/app/main.py", line 55, in create_speech
audio_data = await service.generate_audio(request.input, request.voice)
File "/app/app/service.py", line 106, in generate_audio
audio_tensor = self.model.codec.decode(response.text, context)
File "/usr/local/lib/python3.10/dist-packages/ncodec/codec.py", line 46, in decode
wav = self.audio_decoder.detokenize(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ncodec/decoder/model.py", line 35, in detokenize
x = self.processor_detokenizer.run(["preprocessed_output"], {"context_tokens": context_tokens, "speech_tokens": speech_tokens})
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 287, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
INFO: 172.18.0.1:45362 - "POST /v1/audio/speech HTTP/1.1" 500 Internal Server Error
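If the overflow is triggered by very long inputs (the Conv node's output size overflowing for long token sequences), one unverified client-side workaround is to split the text into shorter per-request chunks before calling the API. A minimal sketch, where `MAX_CHARS` is a hypothetical limit you would need to tune for your setup:

```python
import re

MAX_CHARS = 300  # hypothetical per-request limit; tune for your setup

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    # Split on sentence boundaries, then pack whole sentences into
    # chunks no longer than max_chars, so each request stays short.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept as one chunk rather than cut mid-sentence, so pathological inputs would still need separate handling.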