Commit 20e62b0
Parent(s): d15ad0d
Add SAC model 16k 37.5Hz, README, config

- README.md +102 -0
- SAC-16k-37_5Hz.pt +3 -0
- config.json +9 -0

README.md CHANGED
@@ -1,3 +1,105 @@
---
license: mit
---

<div align="center">
<h1>
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
</h1>

<p>
<a href="https://github.com/Soul-AILab/SAC">
<img src="https://img.shields.io/badge/SAC-GitHub-black?logo=github&logoColor=white" alt="GitHub Repo">
</a>
<a href="https://sac-codec.github.io/">
<img src="https://img.shields.io/badge/🌐%20Demo-Page-brightgreen" alt="Demo Page">
</a>
<a href="https://arxiv.org/abs/2510.00000">
<img src="https://img.shields.io/badge/arXiv-2510.00000-blueviolet?logo=arxiv&logoColor=white" alt="arXiv">
</a>
<a href="https://huggingface.co/Soul/SAC">
<img src="https://img.shields.io/badge/🤗%20SAC-Models-yellow" alt="Hugging Face">
</a>
</p>

<p align="center">
<i>A semantic–acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.</i>
</p>
</div>

## 🛠️ Environment Setup
```bash
conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # pip version == 24.0
```

## 🧩 Model Checkpoints

To use SAC, first prepare the pretrained dependencies: the [GLM-4-Voice-Tokenizer](https://huggingface.co/zai-org/glm-4-voice-tokenizer) for semantic tokenization and the [ERes2Net](https://modelscope.cn/models/iic/speech_eres2net_sv_en_voxceleb_16k) speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are set correctly in your configuration file (e.g., `configs/xxx.yaml`).

The following table lists the available SAC checkpoints:

| Model Name | Hugging Face | Sample Rate | Token Rate | Bitrate (bps) |
|:-----------:|:------------:|:------------:|:-----------:|:---:|
| SAC | [🤗 Soul/SAC-16k-37.5hz](https://huggingface.co/Soul/SAC-16k-37.5hz) | 16 kHz | 37.5 Hz | 525 |
| SAC | [🤗 Soul/SAC-16k-62.5hz](https://huggingface.co/Soul/SAC-16k-62.5hz) | 16 kHz | 62.5 Hz | 875 |

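The token rate and bitrate columns in the checkpoint table are related by simple arithmetic. The sketch below works that out, assuming the bitrate column is in bits per second. The function names are illustrative and not part of the SAC codebase, and how the bits are divided between the semantic and acoustic streams is not stated here.

```python
# Figures copied from the checkpoint table above.
SAMPLE_RATE_HZ = 16_000
CHECKPOINTS = {
    "SAC-16k-37.5hz": {"token_rate_hz": 37.5, "bitrate_bps": 525},
    "SAC-16k-62.5hz": {"token_rate_hz": 62.5, "bitrate_bps": 875},
}


def bits_per_frame(token_rate_hz: float, bitrate_bps: float) -> float:
    """Total bits spent on each token frame (all streams combined)."""
    return bitrate_bps / token_rate_hz


def samples_per_frame(sample_rate_hz: float, token_rate_hz: float) -> float:
    """Number of waveform samples covered by one token frame."""
    return sample_rate_hz / token_rate_hz


if __name__ == "__main__":
    for name, row in CHECKPOINTS.items():
        bits = bits_per_frame(row["token_rate_hz"], row["bitrate_bps"])
        hop = samples_per_frame(SAMPLE_RATE_HZ, row["token_rate_hz"])
        print(f"{name}: {bits:g} bits/frame, {hop:.1f} samples/frame")
```

Both checkpoints come out at the same number of bits per frame, so under this reading the higher bitrate of the 62.5 Hz model comes entirely from its higher frame rate.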
## 🎧 Inference

To perform audio reconstruction, you can use the following command:

```bash
python -m bins.infer
```

We also provide batch scripts for [audio reconstruction](./scripts/batch/reconstruct.sh), [encoding](./scripts/batch/encode.sh), [decoding](./scripts/batch/decode.sh), and [embedding extraction](./scripts/batch/extract_embeddings.sh) in the `scripts/batch` directory as references (you can refer to the [batch scripts guide](./docs/batch_scripts_guide.md) for details).

## 🧪 Evaluation

You can run the following command to perform evaluation:

```bash
bash scripts/eval.sh
```

Before running it, refer to the [evaluation guide](./docs/evaluation_guide.md) for details on dataset preparation and evaluation setup.

## 🚀 Training
### Step 1: Prepare training data
Before training, organize your dataset in **JSONL** format. You can refer to `example/training_data.jsonl`. Each entry should include:
- **utt**: unique utterance ID (customizable)
- **wav_path**: path to raw audio
- **ssl_path**: path to offline-extracted Whisper features (for semantic supervision)
- **semantic_token_path**: path to offline-extracted semantic tokens

To accelerate training, **extract semantic tokens and Whisper features offline** before starting. Refer to the [feature extraction guide](./docs/feature_extraction_guide.md) for detailed instructions.

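As an illustration of the JSONL layout described in Step 1, the sketch below writes and reads a manifest with the four listed fields and rejects entries that lack any of them. The helper names, the file paths, and the `.npy` extensions are hypothetical; the authoritative sample is `example/training_data.jsonl` in the repository.

```python
import json

# The four fields named in Step 1 of the README.
REQUIRED_KEYS = ("utt", "wav_path", "ssl_path", "semantic_token_path")


def validate_entry(entry: dict) -> list[str]:
    """Return the required keys that are missing or empty in one entry."""
    return [key for key in REQUIRED_KEYS if not entry.get(key)]


def write_manifest(entries: list[dict], out_path: str) -> None:
    """Write one JSON object per line, refusing incomplete entries."""
    with open(out_path, "w", encoding="utf-8") as f:
        for entry in entries:
            missing = validate_entry(entry)
            if missing:
                raise ValueError(f"{entry.get('utt', '<no utt>')}: missing {missing}")
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path: str) -> list[dict]:
    """Read a JSONL manifest back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entry: every path and extension here is a placeholder.
EXAMPLE_ENTRY = {
    "utt": "spk001_utt0001",
    "wav_path": "data/wav/spk001_utt0001.wav",
    "ssl_path": "data/whisper_feats/spk001_utt0001.npy",
    "semantic_token_path": "data/semantic_tokens/spk001_utt0001.npy",
}
```

Validating at write time keeps a missing feature path from surfacing only partway through a training run.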
### Step 2: Modify configuration files
You can adjust training and DeepSpeed configurations by editing:
- [`configs/xxx.yaml`](./configs): main training configuration
- [`configs/ds_stage2.json`](./configs/ds_stage2.json): DeepSpeed configuration

### Step 3: Start training
Run the following script to start SAC training:

```bash
bash scripts/train.sh
```

## 🙏 Acknowledgement
Our codebase builds upon the awesome [SparkVox](https://github.com/SparkAudio/SparkVox) and [DAC](https://github.com/descriptinc/descript-audio-codec). We thank the authors for their excellent work.

## 🔖 Citation
If you find this work useful in your research, please consider citing:
```bibtex
```

## 📜 License
This project is licensed under the Apache 2.0 License.
SAC-16k-37_5Hz.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:16ff29d557d7cd6be36358d2694bbcd83a3d79c3766bff4d9a9e99ee5523fae2
size 2553952728
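The three lines added for `SAC-16k-37_5Hz.pt` are a Git LFS pointer, not the checkpoint itself: they record the SHA-256 digest and byte size of the real file. A minimal sketch of how a downloaded checkpoint could be checked against that pointer follows; the function names are illustrative, and only the Python standard library is used.

```python
import hashlib

# Pointer text exactly as committed for SAC-16k-37_5Hz.pt.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:16ff29d557d7cd6be36358d2694bbcd83a3d79c3766bff4d9a9e99ee5523fae2
size 2553952728
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare digest and size with the pointer."""
    hasher = hashlib.new(pointer["hash_algo"])
    size = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size_bytes"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(f"expected size: {pointer['size_bytes'] / 2**30:.2f} GiB")
```

Reading in chunks matters here because the checkpoint is over 2.5 billion bytes and should not be loaded into memory just to be hashed.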
config.json ADDED
@@ -0,0 +1,9 @@
{
  "_name_or_path": "Soul-AILab/SAC-16k-37_5Hz",
  "architectures": [
    "CustomModel"
  ],
  "model_type": "custom",
  "description": "SAC model at 37.5 Hz",
  "sample_rate": 16000
}
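The `sample_rate` field in `config.json` can be used to check input audio before it reaches the codec. The sketch below parses the committed config and reports whether a clip would need resampling. Whether the SAC inference code reads this file at all is not shown in the commit, so both helper functions are hypothetical.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "_name_or_path": "Soul-AILab/SAC-16k-37_5Hz",
  "architectures": ["CustomModel"],
  "model_type": "custom",
  "description": "SAC model at 37.5 Hz",
  "sample_rate": 16000
}
"""


def load_config(text: str) -> dict:
    """Parse the config and make sure it carries an integer sample rate."""
    config = json.loads(text)
    if not isinstance(config.get("sample_rate"), int):
        raise ValueError("config is missing an integer 'sample_rate'")
    return config


def needs_resampling(config: dict, audio_sample_rate: int) -> bool:
    """True when the audio rate differs from the rate the model expects."""
    return audio_sample_rate != config["sample_rate"]


if __name__ == "__main__":
    config = load_config(CONFIG_TEXT)
    for rate in (16_000, 44_100):
        print(rate, "needs resampling:", needs_resampling(config, rate))
```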