TurkishCodeMan committed on
Commit fe328ff · verified · 1 Parent(s): 76fc900

Add README.md

Files changed (1)
  1. README.md +112 -58
README.md CHANGED
@@ -1,72 +1,126 @@
  ---
  base_model: unsloth/csm-1b
  tags:
- - text-to-speech
- - transformers
- - unsloth
- - csm
- - trl
- - lora
- - finetuning
- license: apache-2.0
- language:
- - en
- datasets:
- - TurkishCodeMan/tts-medium-clean
- pipeline_tag: text-to-speech
  ---

- # TurkishCodeMan - CSM-1B (LoRA Fine-tuned)

- ## 📌 Model Summary
- This is a **LoRA fine-tuned** version of [unsloth/csm-1b](https://huggingface.co/unsloth/csm-1b), trained for **text-to-speech (TTS)** tasks.
- The model was trained using [Unsloth](https://github.com/unslothai/unsloth) for 2x faster finetuning and Hugging Face’s [TRL](https://huggingface.co/docs/trl/index) library.

- - **Base Model:** `unsloth/csm-1b`
- - **Fine-tuning Method:** LoRA
- - **Training Frameworks:** Unsloth, TRL
- - **Dataset:** [TurkishCodeMan/tts-medium-clean](https://huggingface.co/datasets/TurkishCodeMan/tts-medium-clean)
- - **Languages:** English, Turkish
- - **License:** Apache-2.0

- ---

- ## 🚀 Intended Use
- - Convert text to high-quality speech.
- - Research and experimentation in TTS models.
- - Transfer learning and downstream fine-tuning.

- ⚠️ **Not intended** for harmful or malicious use (hate speech, deepfakes, etc.).

- ---
-
- ## 🛠️ Training Details
- - **Method:** LoRA low-rank adaptation on transformer layers.
- - **Batch Size:** 16 (8 × gradient_accumulation=2).
- - **Epochs:** 3
- - **Trainable Parameters:** ~29M of 1.66B (≈1.75% trained).
- - **Hardware:** 1x GPU.
- - **Optimizer:** AdamW.
- - **Learning Rate Schedule:** Linear decay with warmup.
-
- ---

- ## 📊 Dataset
- The model was fine-tuned on **[TurkishCodeMan/tts-medium-clean](https://huggingface.co/datasets/TurkishCodeMan/tts-medium-clean)**.
- This dataset contains clean speech-text pairs suitable for TTS tasks.
-
- ---
-
- ## 🔧 How to Use

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained("TurkishCodeMan/csm-1b-tts-lora")
- tokenizer = AutoTokenizer.from_pretrained("TurkishCodeMan/csm-1b-tts-lora")
-
- text = "Hi !"
- inputs = tokenizer(text, return_tensors="pt")
-
- outputs = model.generate(**inputs, max_new_tokens=200)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ---
+ license: apache-2.0
  base_model: unsloth/csm-1b
  tags:
+ - unsloth
+ - peft
+ - lora
+ - text-to-speech
+ - speech
+ - audio
+ library_name: transformers
  ---
+
+ # TurkishCodeMan/csm-1b-lora-fft
+
+ This repo is an adapter created with **LoRA (PEFT) fine-tuning** on top of the `unsloth/csm-1b` base model.
+ Training was done with Unsloth + Transformers.
+
+ ## Model Summary
+ - **Base model:** `unsloth/csm-1b`
+ - **Fine-tuning:** LoRA (PEFT)
+ - **Example use:** speech generation from reference audio + text (CSM chat template)
+
+ > Note: This repo contains only the LoRA adapter. For inference, load the base model together with this adapter.
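+
+ If you prefer a standalone checkpoint that does not need `peft` at load time, the adapter can be folded into the base weights. A minimal sketch using peft's `merge_and_unload()` (the `merged-csm-1b` output path is just an example):
+
+ ```python
+ from peft import PeftModel
+ from transformers import CsmForConditionalGeneration
+
+ # Load base + adapter, then merge the LoRA deltas into the base weights.
+ base = CsmForConditionalGeneration.from_pretrained("unsloth/csm-1b", torch_dtype="auto")
+ model = PeftModel.from_pretrained(base, "TurkishCodeMan/csm-1b-lora-fft")
+ merged = model.merge_and_unload()  # returns a plain CsmForConditionalGeneration
+ merged.save_pretrained("merged-csm-1b")  # example output directory
+ ```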
+
+ ## Installation
+
+ ```bash
+ pip install -U "transformers>=4.52.0" accelerate peft soundfile
+ ```
+
+ ## Inference (with reference audio)
+
+ The example below generates audio from a `wav` reference recording and a target text.
  ```python
+ import numpy as np
+ import torch
+ import soundfile as sf
+ from peft import PeftModel
+ from transformers import AutoProcessor, CsmForConditionalGeneration
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ sampling_rate = 24_000  # CSM works on 24 kHz audio
+
+ base_id = "unsloth/csm-1b"
+ adapter_id = "TurkishCodeMan/csm-1b-lora-fft"
+
+ processor = AutoProcessor.from_pretrained(base_id)
+ base = CsmForConditionalGeneration.from_pretrained(base_id, torch_dtype="auto").to(device)
+ model = PeftModel.from_pretrained(base, adapter_id).to(device)
+ model.eval()
+
+ def _resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
+     """Lightweight linear resampler (np.interp) to avoid an extra dependency."""
+     if orig_sr == target_sr:
+         return audio
+     if audio.ndim == 2:  # downmix stereo to mono
+         audio = audio.mean(axis=1)
+     n = audio.shape[0]
+     new_n = int(round(n * (target_sr / orig_sr)))
+     if new_n <= 1:
+         return audio[:1].astype(np.float32)
+     x_old = np.linspace(0.0, 1.0, num=n, endpoint=True)
+     x_new = np.linspace(0.0, 1.0, num=new_n, endpoint=True)
+     return np.interp(x_new, x_old, audio).astype(np.float32)
+
+ # Reference audio (wav path)
+ ref_path = "reference.wav"
+ ref_audio, ref_sr = sf.read(ref_path, dtype="float32")
+ if ref_audio.ndim == 2:
+     ref_audio = ref_audio.mean(axis=1).astype(np.float32)
+ if ref_sr != sampling_rate:
+     ref_audio = _resample_linear(ref_audio, ref_sr, sampling_rate)
+
+ ref_text = "Reference transcript (optional)."
+ target_text = "We extend the standard NIAH task to investigate model behavior in previously underexplored settings."
+
+ # CSM chat template: the first turn supplies the reference audio (plus its
+ # transcript), the second turn holds the text to synthesize for the same speaker.
+ speaker_role = "0"
+ conversation = [
+     {
+         "role": speaker_role,
+         "content": [
+             {"type": "text", "text": "Please speak english\n\n" + ref_text},
+             {"type": "audio", "audio": ref_audio},
+         ],
+     },
+     {
+         "role": speaker_role,
+         "content": [
+             {"type": "text", "text": target_text},
+         ],
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(device)
+
+ with torch.no_grad():
+     out = model.generate(
+         **inputs,
+         output_audio=True,
+         max_new_tokens=200,
+         # depth_decoder_* control sampling in CSM's audio-token depth decoder
+         depth_decoder_temperature=0.6,
+         depth_decoder_top_k=0,
+         depth_decoder_top_p=0.7,
+         temperature=0.3,
+         top_k=50,
+         top_p=1.0,
+     )
+
+ generated_audio = out[0].detach().cpu().to(torch.float32).numpy()
+ sf.write("generated_audio.wav", generated_audio, samplerate=sampling_rate)
+ print("Wrote generated_audio.wav")
+ ```
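+
+ If you are working in a notebook, you can audition the result inline (assumes IPython is available):
+
+ ```python
+ from IPython.display import Audio
+
+ # Play the generated waveform inline (notebook only).
+ Audio(generated_audio, rate=sampling_rate)
+ ```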
+
+ ## Training Notes
+ - Audio inputs should be prepared at 24 kHz (mono recommended).
+ - In the dataset pipeline, clips are padded/trimmed to a fixed length to avoid batching errors; a sketch of this step follows below.
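+
+ A minimal sketch of that pad/trim step, assuming 24 kHz mono float32 arrays (the 10-second `target_seconds` default is illustrative, not necessarily the value used in training):
+
+ ```python
+ import numpy as np
+
+ def pad_or_trim(audio: np.ndarray, target_seconds: float = 10.0, sr: int = 24_000) -> np.ndarray:
+     """Force every clip to the same sample count so examples batch cleanly."""
+     target_len = int(target_seconds * sr)
+     if audio.shape[0] >= target_len:
+         return audio[:target_len]  # trim long clips
+     pad = np.zeros(target_len - audio.shape[0], dtype=audio.dtype)
+     return np.concatenate([audio, pad])  # zero-pad short clips
+ ```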
+
+ ## License
+ The base model license and the dataset license apply. This repo contains only the adapter weights.