Works, but with some errors
Many thanks for the German Mira variant. It works and is very fast! During my tests, I encountered some errors in the generated output: for some words, it speaks the last letter of the word a second time. This sounds a bit strange.
This is the story I generated: "Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh‑Sternchen hatte."
Hugging Face does not accept my WAV file as an attachment, so I uploaded it.
I used the following Mira server:
https://github.com/Si-ris-B/MiraTTS-FastAPI-Docker
My example for this post was generated with a simple curl call:
```shell
curl -X POST http://xxx.xxx.xxx.xxx:8234/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"mira-tts","input":"Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh‑Sternchen hatte.","voice":"Margit"}' \
  --output Miraspeech.wav
```
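The same request can also be sent from Python. This is a minimal sketch using only the standard library; the host placeholder is carried over from the curl call above and must be replaced with your server's address:

```python
import json
import urllib.request

API_URL = "http://xxx.xxx.xxx.xxx:8234/v1/audio/speech"  # replace with your server

def build_speech_request(text: str, voice: str = "Margit") -> urllib.request.Request:
    # Same JSON payload as the curl call above
    payload = {"model": "mira-tts", "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually fetch and save the audio (requires a running server):
# with urllib.request.urlopen(build_speech_request("Hallo Welt")) as resp:
#     open("Miraspeech.wav", "wb").write(resp.read())
```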
Can you also share the reference WAV? Do you see the same thing happening in the arena (https://huggingface.co/spaces/SebastianBodza/Kartoffel-MiraToffel-TTS) with the Mira (1 Epoch) model? It is the same model.
Hey hey,
This is my reference audio file.
Using my reference audio in the arena with the same text does not produce this behaviour.
EDIT: I tested other voices without luck. I even tested on different GPUs (I had the crazy idea that maybe the GPU has a flaw).
All running on my LLM rig: Ubuntu 24, different RTX 3090s.
When I start the container and generate something, this is the log:
==========
== CUDA ==
CUDA Version 12.8.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
INFO: Started server process [1]
INFO: Waiting for application startup.
2026-01-08 07:45:29,871 [INFO] Service starting up...
2026-01-08 07:45:29,871 [INFO] Initializing MiraTTS Model from: SebastianBodza/MiraToffel_miraTTS_german
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 162279.62it/s]
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 32768.
2026-01-08 07:45:32,649 - lmdeploy - WARNING - turbomind.py:239 - get 219 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
Fetching 22 files: 100%|██████████| 22/22 [00:00<00:00, 29499.58it/s]
2026-01-08 07:45:35.202445997 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:35.202467168 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
WeightNorm.apply(module, name, dim)
torch_dtype is deprecated! Use dtype instead!
2026-01-08 07:45:37.432100230 [W:onnxruntime:, transformer_memcpy.cc:111 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2026-01-08 07:45:37.433639687 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:37.433647382 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-01-08 07:45:37.580475960 [W:onnxruntime:, session_state.cc:1316 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2026-01-08 07:45:37.580496800 [W:onnxruntime:, session_state.cc:1318 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2026-01-08 07:45:37,607 [INFO] Model loaded successfully.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2026-01-08 07:46:16,985 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:18,324 [INFO] Generated 5.46s audio in 2.36s (RTF: 2.32x)
INFO: 172.31.0.1:34128 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:18,537 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:19,737 [INFO] Generated 9.24s audio in 1.20s (RTF: 7.70x)
INFO: 172.31.0.1:40382 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:20,037 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:21,497 [INFO] Generated 11.42s audio in 1.46s (RTF: 7.82x)
INFO: 172.31.0.1:40398 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:22,181 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:23,239 [INFO] Generated 8.18s audio in 1.06s (RTF: 7.73x)
INFO: 172.31.0.1:40400 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:23,922 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:25,142 [INFO] Generated 9.50s audio in 1.22s (RTF: 7.78x)
INFO: 172.31.0.1:40414 - "POST /v1/audio/speech HTTP/1.1" 200 OK
2026-01-08 07:46:25,976 [INFO] Processing 1 chunks for voice 'Margit'...
2026-01-08 07:46:26,552 [INFO] Generated 4.22s audio in 0.58s (RTF: 7.34x)
The MiraTTS server Docker image installed the following dependencies (maybe there is an issue with incorrect versions, perhaps in torch, torchaudio, torchvision, etc.):
```dockerfile
# Use the requested CUDA 12.8.1 Devel image
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    DEBIAN_FRONTEND=noninteractive \
    CUDA_HOME=/usr/local/cuda \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH} \
    PATH=/usr/local/cuda/bin:${PATH}

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    python3-dev \
    git \
    ffmpeg \
    libsndfile1 \
    build-essential \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3 /usr/bin/python

WORKDIR /app

# Upgrade pip
RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel

# 1. Install the PyTorch 2.8.0 stack for CUDA 12.8
#    This satisfies the CVE-2025-32434 security requirement (>2.6.0)
RUN pip3 install --no-cache-dir \
    torch==2.8.0 \
    torchaudio==2.8.0 \
    torchvision==0.23.0 \
    --index-url https://download.pytorch.org/whl/cu128

# 2. Install ONNX Runtime and model dependencies
#    Note: Using version 1.20+ for compatibility with CUDA 12.8
RUN pip3 install --no-cache-dir \
    onnxruntime-gpu \
    "transformers>=4.48.0" \
    accelerate \
    omegaconf \
    einops \
    lmdeploy \
    librosa \
    fastapi \
    uvicorn \
    pydantic \
    soundfile \
    numpy

# 3. Install MiraTTS-specific git dependencies
RUN pip3 install --no-cache-dir \
    "fastaudiosr @ git+https://github.com/ysharma3501/FlashSR.git" \
    "ncodec @ git+https://github.com/ysharma3501/FastBiCodec.git"

# 4. Copy and install the MiraTTS project
COPY mira /app/mira
COPY README.md /app/
COPY pyproject.toml /app/
RUN pip3 install --no-cache-dir --no-deps -e .

# 5. Copy API code
COPY app /app/app

# Ensure the voices directory exists inside the container
RUN mkdir -p /app/models /app/data/voices

EXPOSE 8000

# Health check to ensure GPU is visible
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```python
from dotenv import load_dotenv
load_dotenv()

from mira.model import MiraTTS
import torchaudio

mira_tts = MiraTTS("SebastianBodza/MiraToffel_miraTTS_german")

file = "Xwp6YdR7lpwxWsM4a0n7t.wav"
text = (
    "Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige Lichtpunkte um ihn herum. Der Fuchs schnupperte die Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes Glüh Sternchen hatte."
)

mira_tts.set_params(top_p=0.95, top_k=50, temperature=0.9, max_new_tokens=2024, repetition_penalty=1.05, min_p=0.015)

context_tokens = mira_tts.encode_audio(file)
audio = mira_tts.generate(text, context_tokens)
torchaudio.save("output.wav", audio.float().unsqueeze(0), 48000)
```
This produces the following audio:
So it seems to be the FastAPI layer. It does a lot of weird stuff under the hood: it splits the text, runs inference per sentence, and then overlays the audio files. The problem is the splitter:
https://github.com/ysharma3501/MiraTTS/compare/main...Si-ris-B:MiraTTS-FastAPI-Docker:main#diff-536e7dc82ade018715d694320623297938da67aad6683dd48f849dce2c0e2227
I then ran the weird splitter on your text, and it produces this:
```
Chunk 1 (175 chars):
→ Ein kleiner Fuchs entdeckte im m Wald einen leuchtenden n Pilz. Neugierig stupste er ihn an und plötzlich tanzten winzige e Lichtpunkte um ihn herum. Der Fuchs schnupperte die
Chunk 2 (91 chars):
→ e Luft, lachte und sprang fröhlich weiter, weil er nun sein eigenes s Glüh Sternchen hatte.
```
Claude found it ... the logic is interesting. It looks for sentence boundaries using capitalization instead of punctuation like ".!?".
FOUND IT! The bug is in this section of `_split_into_sentences`:
```python
if (i < len(text) - 2 and char.islower() and text[i + 1] == ' ' and
        text[i + 2].isupper() and len(current_sentence.split()) > 3):
```
This logic tries to detect sentence boundaries by looking for:
- lowercase letter → space → uppercase letter
The problem: In German, ALL NOUNS ARE CAPITALIZED! So this logic is splitting in the middle of sentences:
- "entdeckte im Wald" → splits after "im"
- "einen leuchtenden Pilz" → splits after "einen"
- "winzige Lichtpunkte" → splits after "winzige"
This logic was designed for English where mid-sentence capitals are rare, but it's completely wrong for German.
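A punctuation-based splitter avoids this failure mode entirely. Here is a minimal sketch (not the project's actual fix, just an illustration) that splits on sentence-final punctuation followed by whitespace, so German capitalized nouns are never mistaken for sentence starts:

```python
import re

def split_into_sentences(text: str) -> list[str]:
    # Split after sentence-final punctuation (. ! ?) followed by
    # whitespace, instead of lowercase -> space -> uppercase
    # transitions, which misfire on German capitalized nouns.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = ("Ein kleiner Fuchs entdeckte im Wald einen leuchtenden Pilz. "
        "Neugierig stupste er ihn an und plötzlich tanzten winzige "
        "Lichtpunkte um ihn herum.")
print(split_into_sentences(text))  # two sentences, no mid-sentence splits
```

"entdeckte im Wald" stays intact here because "im Wald" contains no sentence-final punctuation, even though "Wald" is capitalized.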
Thank you so much for finding this out!
I created a fork and, with the help of GPT5.2, adjusted processor.py to make it work with German:
https://github.com/maglat/MiraTTS-FastAPI-Docker-German
I mentioned you, of course, to give you credit!
Thanks for your Docker image. When I try to generate TTS audio via an API call, I get the error below. I have been trying to fix this for two days.
I can see GPU usage via nvidia-smi on the Docker host, but after a few seconds it stops and fails.
I also tried a local installation without Docker, with the same issue. Do you have any hints or ideas?
INFO: 172.18.0.1:41082 - "POST /v1/audio/speech HTTP/1.1" 500 Internal Server Error
2026-01-20 12:00:13,534 [INFO] Processing 1 chunks for voice 'german_voice'...
2026-01-20 12:00:18.312716209 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
2026-01-20 12:00:18,312 [ERROR] Generation error: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
Traceback (most recent call last):
File "/app/app/main.py", line 55, in create_speech
audio_data = await service.generate_audio(request.input, request.voice)
File "/app/app/service.py", line 106, in generate_audio
audio_tensor = self.model.codec.decode(response.text, context)
File "/usr/local/lib/python3.10/dist-packages/ncodec/codec.py", line 46, in decode
wav = self.audio_decoder.detokenize(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ncodec/decoder/model.py", line 35, in detokenize
x = self.processor_detokenizer.run(["preprocessed_output"], {"context_tokens": context_tokens, "speech_tokens": speech_tokens})
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 287, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/out_project/Conv' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandleronnxruntime::OnnxRuntimeException::SafeIntOnOverflow() Integer overflow
INFO: 172.18.0.1:45362 - "POST /v1/audio/speech HTTP/1.1" 500 Internal Server Error
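If the overflow is triggered by very long inputs (the Conv node's output size overflowing for long token sequences), one unverified client-side workaround is to split the text into shorter per-request chunks before calling the API. A minimal sketch, where `MAX_CHARS` is a hypothetical limit you would need to tune for your setup:

```python
import re

MAX_CHARS = 300  # hypothetical per-request limit; tune for your setup

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    # Split on sentence boundaries, then pack whole sentences into
    # chunks no longer than max_chars, so each request stays short.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept as one chunk rather than cut mid-sentence, so pathological inputs would still need separate handling.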