Add README.md
README.md
CHANGED
@@ -1,72 +1,126 @@
Removed (old README):

---
base_model: unsloth/csm-1b
tags:
license: apache-2.0
language:
- en
datasets:
- TurkishCodeMan/tts-medium-clean
pipeline_tag: text-to-speech
---

# TurkishCodeMan

The model was trained using [Unsloth](https://github.com/unslothai/unsloth) (for 2x faster finetuning) and Hugging Face's [TRL](https://huggingface.co/docs/trl/index) library.

- **Languages:** English, Turkish
- **License:** Apache-2.0

## Use Cases

- Convert text to high-quality speech.
- Research and experimentation in TTS models.
- Transfer learning and downstream fine-tuning.

## 🛠️ Training Details

- **Method:** LoRA low-rank adaptation on transformer layers.
- **Batch Size:** 16 (8 × gradient_accumulation=2).
- **Epochs:** 3
- **Trainable Parameters:** ~29M of 1.66B (≈1.75% trained).
- **Hardware:** 1× GPU.
- **Optimizer:** AdamW.
- **Learning Rate Schedule:** Linear decay with warmup.
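For reference, a PEFT/Transformers configuration consistent with these settings could look like the sketch below (illustrative only: the LoRA rank, alpha, target modules, learning rate, and warmup ratio are assumptions, not values reported here):

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Assumed LoRA setup; r/alpha/target_modules are guesses chosen to land near
# the reported ~29M trainable parameters, not the author's actual values.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
)
# model = get_peft_model(base_model, lora_cfg)  # attach adapters before training

# Mirrors the list above: effective batch 16 = 8 x grad_accum 2, 3 epochs,
# AdamW, linear decay with warmup; learning_rate/warmup_ratio are assumptions.
args = TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    output_dir="outputs",
)
```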

---

The model was fine-tuned on **[TurkishCodeMan/tts-medium-clean](https://huggingface.co/datasets/TurkishCodeMan/tts-medium-clean)**.
This dataset contains clean speech-text pairs suitable for TTS tasks.
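For a quick look at the data, a minimal loading sketch (the `train` split name is an assumption; check the dataset card):

```python
from datasets import load_dataset

# "train" is an assumed split name; check the dataset card for the real ones.
ds = load_dataset("TurkishCodeMan/tts-medium-clean", split="train")
print(ds[0])  # expected: one audio + transcript pair
```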

---

## 🔧 How to Use

```python
```
Added (new README):

---
license: apache-2.0
base_model: unsloth/csm-1b
tags:
- unsloth
- peft
- lora
- text-to-speech
- speech
- audio
library_name: transformers
---

# TurkishCodeMan/csm-1b-lora-fft

This repo contains an adapter built by **LoRA (PEFT) fine-tuning** on top of the `unsloth/csm-1b` base model.
Training was done with Unsloth + Transformers.

## Model Summary

- **Base model:** `unsloth/csm-1b`
- **Fine-tuning:** LoRA (PEFT)
- **Example use:** speech generation from reference audio + text (CSM chat template)

> Note: this repo contains only the LoRA adapter. For inference, load the base model together with this adapter.
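You can keep the base model and adapter separate (as in the inference example below), or merge the adapter into the base model once and save a standalone checkpoint. A minimal sketch using PEFT's `merge_and_unload` (the output directory name is arbitrary):

```python
from peft import PeftModel
from transformers import CsmForConditionalGeneration

# Load base weights, attach the adapter, then fold the LoRA deltas in.
base = CsmForConditionalGeneration.from_pretrained("unsloth/csm-1b", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "TurkishCodeMan/csm-1b-lora-fft").merge_and_unload()
merged.save_pretrained("csm-1b-merged")  # arbitrary local output directory
```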

## Installation

```bash
pip install -U "transformers>=4.52.0" accelerate peft soundfile
```

## Inference (with reference audio)

The example below generates audio from a `wav` reference clip and a target text.

```python
import numpy as np
import torch
import soundfile as sf
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
sampling_rate = 24_000  # CSM generates 24 kHz audio

base_id = "unsloth/csm-1b"
adapter_id = "TurkishCodeMan/csm-1b-lora-fft"

# Load the base model, then attach the LoRA adapter on top of it.
processor = AutoProcessor.from_pretrained(base_id)
base = CsmForConditionalGeneration.from_pretrained(base_id, torch_dtype="auto").to(device)
model = PeftModel.from_pretrained(base, adapter_id).to(device)
model.eval()

def _resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Dependency-free linear resampling; downmixes stereo to mono first."""
    if orig_sr == target_sr:
        return audio
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    n = audio.shape[0]
    new_n = int(round(n * (target_sr / orig_sr)))
    if new_n <= 1:
        return audio[:1].astype(np.float32)
    x_old = np.linspace(0.0, 1.0, num=n, endpoint=True)
    x_new = np.linspace(0.0, 1.0, num=new_n, endpoint=True)
    return np.interp(x_new, x_old, audio).astype(np.float32)

# Reference audio (wav path): downmix to mono and resample to 24 kHz if needed.
ref_path = "reference.wav"
ref_audio, ref_sr = sf.read(ref_path, dtype="float32")
if ref_audio.ndim == 2:
    ref_audio = ref_audio.mean(axis=1).astype(np.float32)
if ref_sr != sampling_rate:
    ref_audio = _resample_linear(ref_audio, ref_sr, sampling_rate)

ref_text = "Reference transcript (optional)."
target_text = "We extend the standard NIAH task to investigate model behavior in previously underexplored settings."

# CSM chat template: the first turn carries the reference audio plus its
# transcript; the second turn (same speaker) carries the text to synthesize.
speaker_role = "0"
conversation = [
    {
        "role": speaker_role,
        "content": [
            {"type": "text", "text": "Please speak English\n\n" + ref_text},
            {"type": "audio", "audio": ref_audio},
        ],
    },
    {
        "role": speaker_role,
        "content": [
            {"type": "text", "text": target_text},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        output_audio=True,
        max_new_tokens=200,
        depth_decoder_temperature=0.6,
        depth_decoder_top_k=0,
        depth_decoder_top_p=0.7,
        temperature=0.3,
        top_k=50,
        top_p=1.0,
    )

# generate(..., output_audio=True) returns waveforms; save the first one.
generated_audio = out[0].detach().cpu().to(torch.float32).numpy()
sf.write("generated_audio.wav", generated_audio, samplerate=sampling_rate)
print("Wrote generated_audio.wav")
```
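To audition the result in a notebook (optional; assumes IPython is available):

```python
from IPython.display import Audio

# Play the file written by the script above.
Audio("generated_audio.wav")
```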

## Training Notes

- Audio inputs should be prepared at 24 kHz (mono is recommended).
- In the dataset pipeline, clips are padded/trimmed to a fixed length to avoid batching errors; a sketch follows below.
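The fixed-length note could be implemented with a helper like the following (an illustrative sketch; `pad_or_trim` and the 10-second target are hypothetical, not the repo's actual pipeline):

```python
import numpy as np

def pad_or_trim(audio: np.ndarray, target_len: int) -> np.ndarray:
    """Pad with zeros or truncate so every clip has exactly target_len samples."""
    if audio.shape[0] >= target_len:
        return audio[:target_len]
    return np.pad(audio, (0, target_len - audio.shape[0]))

clip = pad_or_trim(np.zeros(50_000, dtype=np.float32), target_len=10 * 24_000)
assert clip.shape[0] == 240_000  # 10 s at 24 kHz
```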

## License

The base model's license and the dataset's license apply. This repo contains only the adapter weights.