mrfakename zhouyx1998 committed on
Commit
cc97f0e
·
verified ·
0 Parent(s):

Duplicate from openbmb/VoxCPM1.5

Co-authored-by: Yixuan Zhou <zhouyx1998@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/voxcpm_model.png filter=lfs diff=lfs merge=lfs -text
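The patterns above route matching files through Git LFS instead of storing them directly in Git. As a rough illustration only (this is not Git's own matcher, and gitattributes rules for `**` and directory patterns differ from `fnmatch`), the sketch below approximates which filenames a handful of these patterns would catch. `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names introduced here.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns listed in the .gitattributes above.
# Assumption: we only compare against the file's basename.
LFS_PATTERNS = ["*.bin", "*.pth", "*.safetensors", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the basename match any of the LFS patterns?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


for candidate in ["audiovae.pth", "model.safetensors", "config.json"]:
    print(candidate, is_lfs_tracked(candidate))
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the filter Git actually applies.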
README.md ADDED
@@ -0,0 +1,272 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ base_model:
7
+ - openbmb/MiniCPM4-0.5B
8
+ pipeline_tag: text-to-speech
9
+ library_name: voxcpm1.5
10
+ tags:
11
+ - text-to-speech
12
+ - speech
13
+ - speech generation
14
+ - voice cloning
15
+ ---
16
+
17
+ ## 🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
18
+
19
+
20
+ [![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/OpenBMB/VoxCPM/) [![Technical Report](https://img.shields.io/badge/Technical%20Report-Arxiv-red)](https://arxiv.org/abs/2509.24650)[![Live Playground](https://img.shields.io/badge/Live%20PlayGround-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) [![Samples](https://img.shields.io/badge/Audio%20Samples-Page-green)](https://openbmb.github.io/VoxCPM-demopage)
21
+
22
+ - VoxCPM1.5
23
+
24
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow)](https://huggingface.co/openbmb/VoxCPM1.5) [![ModelScope](https://img.shields.io/badge/ModelScope-OpenBMB-purple)](https://modelscope.cn/models/OpenBMB/VoxCPM1.5)
25
+
26
+
27
+ <div align="center">
28
+ <img src="assets/voxcpm_logo.png" alt="VoxCPM Logo" width="40%">
29
+ </div>
30
+
31
+ ## 🎉 VoxCPM1.5 Updates
32
+
33
+ **Release Date:** December 5, 2025
34
+
35
+ VoxCPM1.5 brings improvements in audio quality and efficiency:
36
+
37
+ | Feature | VoxCPM | VoxCPM1.5 |
38
+ |---------|------------|------------|
39
+ | **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
40
+ | **LM Token Rate** | 12.5Hz | 6.25Hz |
41
+ | **Patch Size** | 2 | 4 |
42
+ | **SFT Support** | ✅ | ✅ |
43
+ | **LoRA Support** | ✅ | ✅ |
44
+
45
+ **Key Improvements:**
46
+ - 🔊 **Higher Quality**: 44.1kHz sampling rate preserves more high-frequency details for better voice cloning
47
+ - ⚡ **More Efficient**: Reduced token rate (6.25Hz) lowers computational cost while maintaining performance
48
+ - 🎓 **Fine-tuning Support**: Train personalized voice models with SFT or LoRA
49
+
50
+ **Note**: Output quality depends on the prompt speech quality. VoxCPM-0.5B remains fully supported with backward compatibility.
51
+
52
+
53
+ ## 📚 Model Overview
54
+
55
+
56
+ VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.
57
+
58
+ Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the [MiniCPM-4](https://huggingface.co/openbmb/MiniCPM4-0.5B) backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.
59
+
60
+
61
+ <div align="center">
62
+ <img src="assets/voxcpm_model.png" alt="VoxCPM Model Architecture" width="90%">
63
+ </div>
64
+
65
+
66
+ ### 🚀 Key Features
67
+ - **Context-Aware, Expressive Speech Generation** - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. Trained on a massive 1.8 million-hour bilingual corpus, it spontaneously adapts its speaking style to the content and produces highly fitting vocal expression.
68
+ - **True-to-Life Voice Cloning** - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker’s timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.
69
+ - **High-Efficiency Synthesis** - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making real-time applications possible.
70
+
71
+
72
+
73
+
74
+ ## Quick Start
75
+
76
+ ### 🔧 Install from PyPI
77
+ ```sh
78
+ pip install voxcpm
79
+ ```
80
+ ### 1. Model Download (Optional)
81
+ By default, the model is downloaded automatically the first time you run the script; you can also download it in advance.
82
+ - Download VoxCPM1.5
83
+ ```python
84
+ from huggingface_hub import snapshot_download
85
+ snapshot_download("openbmb/VoxCPM1.5")
86
+ ```
87
+
88
+ - Or Download VoxCPM-0.5B
89
+ ```python
90
+ from huggingface_hub import snapshot_download
91
+ snapshot_download("openbmb/VoxCPM-0.5B")
92
+ ```
93
+ - Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo.
94
+ ```python
95
+ from modelscope import snapshot_download
96
+ snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
97
+ snapshot_download('iic/SenseVoiceSmall')
98
+ ```
99
+
100
+ ### 2. Basic Usage
101
+ ```python
102
+ import soundfile as sf
103
+ import numpy as np
104
+ from voxcpm import VoxCPM
105
+
106
+ model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
107
+
108
+ # Non-streaming
109
+ wav = model.generate(
110
+     text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
111
+     prompt_wav_path=None,  # optional: path to a prompt speech clip for voice cloning
112
+     prompt_text=None,  # optional: reference text
113
+     cfg_value=2.0,  # LM guidance on LocDiT; higher values follow the prompt more closely but may sound worse
114
+     inference_timesteps=10,  # LocDiT inference timesteps; higher for better results, lower for faster generation
115
+     normalize=False,  # enable the external text-normalization (TN) tool; this disables native raw-text support
116
+     denoise=False,  # enable the external denoising tool; it may cause some distortion and restricts the sampling rate to 16kHz
117
+     retry_badcase=True,  # retry when a bad case is detected (e.g., generation that fails to stop)
118
+     retry_badcase_max_times=3,  # maximum number of retries
119
+     retry_badcase_ratio_threshold=6.0,  # maximum length ratio used for bad-case detection (simple but effective); adjust it for slow-paced speech
120
+ )
121
+
122
+ sf.write("output.wav", wav, model.tts_model.sample_rate)
123
+ print("saved: output.wav")
124
+
125
+ # Streaming
126
+ chunks = []
127
+ for chunk in model.generate_streaming(
128
+     text="Streaming text to speech is easy with VoxCPM!",
129
+     # supports same args as above
130
+ ):
131
+     chunks.append(chunk)
132
+ wav = np.concatenate(chunks)
133
+
134
+ sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
135
+ print("saved: output_streaming.wav")
136
+ ```
137
+
138
+ ### 3. CLI Usage
139
+
140
+ After installation, the entry point is `voxcpm` (or use `python -m voxcpm.cli`).
141
+
142
+ ```bash
143
+ # 1) Direct synthesis (single text)
144
+ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." --output out.wav
145
+
146
+ # 2) Voice cloning (reference audio + transcript)
147
+ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
148
+ --prompt-audio path/to/voice.wav \
149
+ --prompt-text "reference transcript" \
150
+ --output out.wav \
151
+ # --denoise
152
+
153
+ # (Optional) Voice cloning (reference audio + transcript file)
154
+ voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
155
+ --prompt-audio path/to/voice.wav \
156
+ --prompt-file "/path/to/text-file" \
157
+ --output out.wav \
158
+ # --denoise
159
+
160
+ # 3) Batch processing (one text per line)
161
+ voxcpm --input examples/input.txt --output-dir outs
162
+ # (optional) Batch + cloning
163
+ voxcpm --input examples/input.txt --output-dir outs \
164
+ --prompt-audio path/to/voice.wav \
165
+ --prompt-text "reference transcript" \
166
+ # --denoise
167
+
168
+ # 4) Inference parameters (quality/speed)
169
+ voxcpm --text "..." --output out.wav \
170
+ --cfg-value 2.0 --inference-timesteps 10 --normalize
171
+
172
+ # 5) Model loading
173
+ # Prefer local path
174
+ voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
175
+ # Or from Hugging Face (auto download/cache)
176
+ voxcpm --text "..." --output out.wav \
177
+ --hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only
178
+
179
+ # 6) Denoiser control
180
+ voxcpm --text "..." --output out.wav \
181
+ --no-denoiser --zipenhancer-path iic/speech_zipenhancer_ans_multiloss_16k_base
182
+
183
+ # 7) Help
184
+ voxcpm --help
185
+ python -m voxcpm.cli --help
186
+ ```
187
+
188
+ ### 4. Start web demo
189
+
190
+ You can start the web UI by running `python app.py`, which allows you to perform Voice Cloning and Voice Creation.
191
+
192
+ ### 5. Fine-tuning
193
+
194
+ VoxCPM1.5 supports both full fine-tuning (SFT) and LoRA fine-tuning, allowing you to train personalized voice models on your own data. See the [Fine-tuning Guide](docs/finetune.md) for detailed instructions.
195
+
196
+ **Quick Start:**
197
+ ```bash
198
+ # Full fine-tuning
199
+ python scripts/train_voxcpm_finetune.py \
200
+ --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
201
+
202
+ # LoRA fine-tuning
203
+ python scripts/train_voxcpm_finetune.py \
204
+ --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
205
+ ```
206
+
207
+
208
+ ## 👩‍🍳 A Voice Chef's Guide
209
+ Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let’s begin.
210
+
211
+ ---
212
+ ### 🥚 Step 1: Prepare Your Base Ingredients (Content)
213
+
214
+ First, choose how you’d like to input your text:
215
+ 1. Regular Text (Classic Mode)
216
+ - ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.
217
+ 2. Phoneme Input (Native Mode)
218
+ - ❌ Turn "Text Normalization" OFF. Enter phoneme text like {HH AH0 L OW1} (EN) or {ni3}{hao3} (ZH) for precise pronunciation control. In this mode, VoxCPM also supports native understanding of other complex non-normalized text—try it out!
219
+ - **Phoneme Conversion**: For Chinese, phonemes are converted using pinyin. For English, phonemes are converted using CMUDict. Please refer to the relevant documentation for more details.
220
+
221
+
222
+ ---
223
+ ### 🍳 Step 2: Choose Your Flavor Profile (Voice Style)
224
+
225
+ This is the secret sauce that gives your audio its unique sound.
226
+
227
+ #### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
228
+ - A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
229
+ - **For a Clean, Denoised Voice:**
230
+ - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. However, this will limit the audio sampling rate to 16kHz, restricting the cloning quality ceiling.
231
+ - **For High-Quality Audio Cloning (Up to 44.1kHz):**
232
+ - ❌ Disable "Prompt Speech Enhancement" to preserve all original audio information, including background atmosphere, and support audio cloning up to 44.1kHz sampling rate.
233
+
234
+ #### 2. Cooking au Naturel (Letting the Model Improvise)
235
+ - If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style based on the text itself, thanks to the text-smartness of its foundation model, MiniCPM-4.
236
+ - **Pro Tip**: Challenge VoxCPM with any text—poetry, song lyrics, dramatic monologues—it may deliver some interesting results!
237
+
238
+ ---
239
+ ### 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)
240
+
241
+ You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.
242
+
243
+ #### CFG Value (How Closely to Follow the Recipe)
244
+ - **Default**: A great starting point.
245
+ - **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
246
+ - **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
247
+ - **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
248
+ - **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.
249
+
250
+ #### Inference Timesteps (Simmering Time: Quality vs. Speed)
251
+ - **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
252
+ - **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
253
+
254
+ ---
255
+ Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!
256
+
257
+
258
+ ---
259
+
260
+
261
+ ## ⚠️ Risks and limitations
262
+ - General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.
263
+ - Potential for Misuse of Voice Cloning: VoxCPM's powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. This technology could be misused for creating convincing deepfakes for purposes of impersonation, fraud, or spreading disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purposes. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.
264
+ - Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. Furthermore, the current version offers limited direct control over specific speech attributes like emotion or speaking style.
265
+ - Bilingual Model: VoxCPM is trained primarily on Chinese and English data. Performance on other languages is not guaranteed and may result in unpredictable or low-quality audio.
266
+ - This model is released for research and development purposes only. We do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.
267
+
268
+
269
+
270
+ ## 📄 License
271
+ The VoxCPM model weights and code are open-sourced under the Apache-2.0 license.
272
+
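The Real-Time Factor quoted in the README's feature list is conventionally the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. The README does not spell out its exact measurement, so the following is a minimal sketch under that conventional definition; `real_time_factor` is a hypothetical helper and the timing numbers are made up for illustration.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio must have a positive duration")
    return synthesis_seconds / audio_seconds


# Illustrative numbers: 10 seconds of 44.1 kHz audio produced in 1.7 seconds.
rtf = real_time_factor(synthesis_seconds=1.7, num_samples=441_000, sample_rate=44_100)
print(f"RTF = {rtf:.2f}")
```

In practice, `synthesis_seconds` would come from timing a call such as `model.generate(...)` with `time.perf_counter()`, and `num_samples` from the length of the returned waveform.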
assets/voxcpm_logo.png ADDED
assets/voxcpm_model.png ADDED

Git LFS Details

  • SHA256: 49f6eb7998135ad49f5dd0ee1fa2c099d79a016ab59fe29fc039f7f32ef8f5ca
  • Pointer size: 131 Bytes
  • Size of remote file: 145 kB
audiovae.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8160987a7f10acd4c49b72a9a3e974c0b5c1c786538cd4bcaac5f2df388c0f14
3
+ size 346023712
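The three lines above are a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the real file lives in LFS storage. As a hedged sketch, a downloaded file could be checked against such a pointer as follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that the bytes have the size and hash recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:8160987a7f10acd4c49b72a9a3e974c0b5c1c786538cd4bcaac5f2df388c0f14\n"
    "size 346023712\n"
)
print(pointer["algo"], pointer["size"])
```

For a file of this size (about 346 MB), hashing in chunks from disk would be kinder to memory than loading all bytes at once; the in-memory version is kept here only for brevity.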
config.json ADDED
@@ -0,0 +1,60 @@
1
+ {
2
+ "architecture": "voxcpm",
3
+ "lm_config": {
4
+ "bos_token_id": 1,
5
+ "eos_token_id": 2,
6
+ "hidden_size": 1024,
7
+ "intermediate_size": 4096,
8
+ "max_position_embeddings": 32768,
9
+ "num_attention_heads": 16,
10
+ "num_hidden_layers": 24,
11
+ "num_key_value_heads": 2,
12
+ "rms_norm_eps": 1e-05,
13
+ "rope_theta": 10000,
14
+ "rope_scaling": {
15
+ "type": "longrope",
16
+ "long_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
17
+ "short_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
18
+ "original_max_position_embeddings": 32768
19
+ },
20
+ "vocab_size": 73448,
21
+ "scale_emb": 12,
22
+ "dim_model_base": 256,
23
+ "scale_depth": 1.4,
24
+ "use_mup": false
25
+ },
26
+ "patch_size": 4,
27
+ "feat_dim": 64,
28
+ "scalar_quantization_latent_dim": 256,
29
+ "scalar_quantization_scale": 9,
30
+ "residual_lm_num_layers": 8,
31
+ "encoder_config": {
32
+ "hidden_dim": 1024,
33
+ "ffn_dim": 4096,
34
+ "num_heads": 16,
35
+ "num_layers": 8
36
+ },
37
+ "dit_config": {
38
+ "hidden_dim": 1024,
39
+ "ffn_dim": 4096,
40
+ "num_heads": 16,
41
+ "num_layers": 8,
42
+ "cfm_config": {
43
+ "sigma_min": 1e-06,
44
+ "solver": "euler",
45
+ "t_scheduler": "log-norm",
46
+ "inference_cfg_rate": 2.0
47
+ }
48
+ },
49
+ "audio_vae_config": {
50
+ "encoder_dim": 64,
51
+ "encoder_rates": [2, 3, 6, 7, 7],
52
+ "latent_dim": 64,
53
+ "decoder_dim": 2048,
54
+ "decoder_rates": [7, 7, 6, 3, 2],
55
+ "sample_rate": 44100
56
+ },
57
+ "max_length": 8192,
58
+ "device": "cuda",
59
+ "dtype": "bfloat16"
60
+ }
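The values in this config appear to line up with the 6.25Hz LM token rate quoted in the README table. The sketch below works through that arithmetic under two assumptions that the source does not state explicitly: that `encoder_rates` are per-stage downsampling strides of the audio VAE, and that the language model consumes one token per group of `patch_size` latent frames.

```python
from math import prod

# Values copied from config.json above.
audio_vae_config = {"encoder_rates": [2, 3, 6, 7, 7], "sample_rate": 44100}
patch_size = 4

# Assumption: total downsampling is the product of the per-stage rates.
samples_per_latent_frame = prod(audio_vae_config["encoder_rates"])

# Latent frames produced per second of audio.
latent_frame_rate = audio_vae_config["sample_rate"] / samples_per_latent_frame

# Assumption: one LM token covers `patch_size` consecutive latent frames.
lm_token_rate = latent_frame_rate / patch_size

print(samples_per_latent_frame, latent_frame_rate, lm_token_rate)
```

Under these assumptions the product of the strides is 1764 samples per latent frame, giving 25 latent frames per second and 6.25 tokens per second, which agrees with the README figure.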
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e97ee2cd2ca891662f7da368921df08368a616410afe67461253e4dcd3823cc
3
+ size 1603516600
special_tokens_map.json ADDED
@@ -0,0 +1,81 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<|im_end|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "<|im_start|>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<|tool_call|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ {
25
+ "content": "<|execute_start|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ {
32
+ "content": "<|execute_end|>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ {
39
+ "content": "<|fim_prefix|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ },
45
+ {
46
+ "content": "<|fim_middle|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false
51
+ },
52
+ {
53
+ "content": "<|fim_suffix|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false
58
+ }
59
+ ],
60
+ "bos_token": {
61
+ "content": "<s>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false
66
+ },
67
+ "eos_token": {
68
+ "content": "</s>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false
73
+ },
74
+ "unk_token": {
75
+ "content": "<unk>",
76
+ "lstrip": false,
77
+ "normalized": false,
78
+ "rstrip": false,
79
+ "single_word": false
80
+ }
81
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,212 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "</s>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "101": {
30
+ "content": "<|audio_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "102": {
38
+ "content": "<|audio_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "103": {
46
+ "content": "<|audio_prompt_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "104": {
54
+ "content": "<|audio_prompt_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "105": {
62
+ "content": "<|background|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "106": {
70
+ "content": "<|/background|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "107": {
78
+ "content": "<|characters|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "108": {
86
+ "content": "<|/characters|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "109": {
94
+ "content": "<|speaker_id|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "110": {
102
+ "content": "<|/speaker_id|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "111": {
110
+ "content": "<|span|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "112": {
118
+ "content": "<|/span|>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ },
125
+ "73440": {
126
+ "content": "<|im_end|>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": true
132
+ },
133
+ "73441": {
134
+ "content": "<|im_start|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": true
140
+ },
141
+ "73442": {
142
+ "content": "<|tool_call|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": true
148
+ },
149
+ "73443": {
150
+ "content": "<|execute_start|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": true
156
+ },
157
+ "73444": {
158
+ "content": "<|execute_end|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": true
164
+ },
165
+ "73445": {
166
+ "content": "<|fim_prefix|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": true
172
+ },
173
+ "73446": {
174
+ "content": "<|fim_middle|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": true
180
+ },
181
+ "73447": {
182
+ "content": "<|fim_suffix|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ }
189
+ },
190
+ "additional_special_tokens": [
191
+ "<|im_end|>",
192
+ "<|im_start|>",
193
+ "<|tool_call|>",
194
+ "<|execute_start|>",
195
+ "<|execute_end|>",
196
+ "<|fim_prefix|>",
197
+ "<|fim_middle|>",
198
+ "<|fim_suffix|>"
199
+ ],
200
+ "bos_token": "<s>",
201
+ "clean_up_tokenization_spaces": false,
202
+ "eos_token": "<|im_end|>",
203
+ "legacy": true,
204
+ "model_max_length": 1000000000000000019884624838656,
205
+ "pad_token": null,
206
+ "sp_model_kwargs": {},
207
+ "spaces_between_special_tokens": false,
208
+ "tokenizer_class": "LlamaTokenizer",
209
+ "unk_token": "<unk>",
210
+ "use_default_system_prompt": false,
211
+ "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
212
+ }
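The `chat_template` entry above is a Jinja template inherited from the tokenizer configuration; whether VoxCPM uses it at inference time is not stated here. To make its output concrete, here is a plain-Python equivalent of the same formatting logic. `render_chat` is a hypothetical helper written for illustration, not a function of the tokenizer or of `voxcpm`.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Mirror the template: wrap each message in <|im_start|> ... <|im_end|> markers."""
    rendered = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # The template appends an open assistant turn when a reply is expected.
        rendered += "<|im_start|>assistant\n"
    return rendered


example = render_chat(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(example)
```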