SII-WenxiChen committed
Commit 20e62b0 · 1 Parent(s): d15ad0d

Add SAC model 16k 37.5Hz, README, config

Files changed (3):
  1. README.md +102 -0
  2. SAC-16k-37_5Hz.pt +3 -0
  3. config.json +9 -0
README.md CHANGED
@@ -1,3 +1,105 @@
---
license: mit
---

<div align="center">
<h1>
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
</h1>

<p>
<a href="https://github.com/Soul-AILab/SAC">
<img src="https://img.shields.io/badge/SAC-GitHub-black?logo=github&logoColor=white" alt="GitHub Repo">
</a>
<a href="https://sac-codec.github.io/">
<img src="https://img.shields.io/badge/🌐%20Demo-Page-brightgreen" alt="Demo Page">
</a>
<a href="https://arxiv.org/abs/2510.00000">
<img src="https://img.shields.io/badge/arXiv-2510.00000-blueviolet?logo=arxiv&logoColor=white" alt="arXiv">
</a>
<a href="https://huggingface.co/Soul/SAC">
<img src="https://img.shields.io/badge/🤗%20SAC-Models-yellow" alt="Hugging Face">
</a>
</p>

<p align="center">
<i>A semantic–acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.</i>
</p>
</div>


## 🛠️ Environment Setup
```bash
conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # pip version == 24.0
```


## 🧩 Model Checkpoints

To use SAC, first prepare the pretrained dependencies: the [GLM-4-Voice-Tokenizer](https://huggingface.co/zai-org/glm-4-voice-tokenizer) for semantic tokenization and the [ERes2Net](https://modelscope.cn/models/iic/speech_eres2net_sv_en_voxceleb_16k) speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are set correctly in your configuration file (e.g., `configs/xxx.yaml`).

The following table lists the available SAC checkpoints:

| Model Name | Hugging Face | Sample Rate | Token Rate | BPS |
|:----------:|:------------:|:-----------:|:----------:|:---:|
| SAC | [🤗 Soul/SAC-16k-37.5hz](https://huggingface.co/Soul/SAC-16k-37.5hz) | 16 kHz | 37.5 Hz | 525 |
| SAC | [🤗 Soul/SAC-16k-62.5hz](https://huggingface.co/Soul/SAC-16k-62.5hz) | 16 kHz | 62.5 Hz | 875 |

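The BPS column follows directly from the token rate: both checkpoints work out to 14 bits per frame (525 / 37.5 = 875 / 62.5 = 14). A quick sanity check of that arithmetic (the per-frame bit budget is inferred from the table, not stated explicitly):

```python
def bits_per_second(token_rate_hz: float, bits_per_frame: int) -> float:
    """Bitrate of a codec emitting bits_per_frame bits at token_rate_hz frames/s."""
    return token_rate_hz * bits_per_frame

# Inferred from the table above: both models carry 14 bits per frame.
assert bits_per_second(37.5, 14) == 525   # SAC-16k-37.5hz
assert bits_per_second(62.5, 14) == 875   # SAC-16k-62.5hz
```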

## 🎧 Inference

To perform audio reconstruction, run:

```bash
python -m bins.infer
```

We also provide batch scripts for [audio reconstruction](./scripts/batch/reconstruct.sh), [encoding](./scripts/batch/encode.sh), [decoding](./scripts/batch/decode.sh), and [embedding extraction](./scripts/batch/extract_embeddings.sh) in the `scripts/batch` directory; see the [batch scripts guide](./docs/batch_scripts_guide.md) for details.


## 🧪 Evaluation

Run the following command to perform evaluation:

```bash
bash scripts/eval.sh
```

For details on dataset preparation and evaluation setup, first refer to the [evaluation guide](./docs/evaluation_guide.md).


## 🚀 Training
### Step 1: Prepare training data
Before training, organize your dataset in **JSONL** format (see `example/training_data.jsonl` for reference). Each entry should include:
- **utt** — a unique utterance ID (customizable)
- **wav_path** — path to the raw audio
- **ssl_path** — path to offline-extracted Whisper features (for semantic supervision)
- **semantic_token_path** — path to offline-extracted semantic tokens

To accelerate training, **extract the semantic tokens and Whisper features offline** before starting. Refer to the [feature extraction guide](./docs/feature_extraction_guide.md) for detailed instructions.

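For illustration, one line of the JSONL manifest could be built like this (the field names follow the list above; the concrete IDs and paths are hypothetical placeholders):

```python
import json

# Hypothetical example entry; only the four field names come from the README.
entry = {
    "utt": "spk001_utt0001",                                  # unique utterance ID
    "wav_path": "data/wavs/spk001_utt0001.wav",               # raw audio
    "ssl_path": "data/ssl/spk001_utt0001.npy",                # Whisper features
    "semantic_token_path": "data/tokens/spk001_utt0001.npy",  # semantic tokens
}

# JSONL: one JSON object per line.
with open("training_data.jsonl", "w") as f:
    f.write(json.dumps(entry) + "\n")

with open("training_data.jsonl") as f:
    loaded = [json.loads(line) for line in f]
assert loaded[0]["utt"] == "spk001_utt0001"
```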
### Step 2: Modify configuration files
You can adjust the training and DeepSpeed configurations by editing:
- [`configs/xxx.yaml`](./configs) — main training configuration
- [`configs/ds_stage2.json`](./configs/ds_stage2.json) — DeepSpeed configuration

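For orientation, a minimal ZeRO stage-2 DeepSpeed configuration typically looks like the sketch below. This is an illustrative example of the format, not the actual contents of `configs/ds_stage2.json`; the batch-size and precision values are placeholder assumptions.

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true }
}
```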
### Step 3: Start training
Run the following script to start SAC training:

```bash
bash scripts/train.sh
```


## 🙏 Acknowledgement
Our codebase builds upon the excellent [SparkVox](https://github.com/SparkAudio/SparkVox) and [DAC](https://github.com/descriptinc/descript-audio-codec) projects. We thank the authors for their great work.

## 🔖 Citation
If you find this work useful in your research, please consider citing:
```bibtex
```

## 📜 License
This project is licensed under the Apache 2.0 License.
SAC-16k-37_5Hz.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:16ff29d557d7cd6be36358d2694bbcd83a3d79c3766bff4d9a9e99ee5523fae2
size 2553952728
config.json ADDED
@@ -0,0 +1,9 @@
{
  "_name_or_path": "Soul-AILab/SAC-16k-37_5Hz",
  "architectures": [
    "CustomModel"
  ],
  "model_type": "custom",
  "description": "SAC model at 37.5 Hz",
  "sample_rate": 16000
}
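The configuration above is plain JSON and can be read with the standard library, e.g. to pick up the sample rate before preparing audio (a minimal sketch; the parsed text mirrors the file added in this commit):

```python
import json

# Contents of the config.json added in this commit.
config_text = """
{
  "_name_or_path": "Soul-AILab/SAC-16k-37_5Hz",
  "architectures": ["CustomModel"],
  "model_type": "custom",
  "description": "SAC model at 37.5 Hz",
  "sample_rate": 16000
}
"""

config = json.loads(config_text)
assert config["sample_rate"] == 16000   # 16 kHz audio
assert config["model_type"] == "custom"
```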