DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Paper: arXiv:2305.10005
A self-supervised speech representation learning model that combines masked language modeling with self-distillation and online clustering. Achieves state-of-the-art performance on a range of downstream speech processing tasks.
Model type: Self-supervised speech representation learning (Wav2Vec2 architecture variant)
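As a rough sketch of how the pretraining objective fits together (illustrative pseudocode only, not the released training code; all names and shapes here are assumptions): a teacher network is maintained as an exponential moving average of the student, the teacher's hidden states are discretized by online-clustered codebooks (per layer for the top transformer layers in the paper; a single codebook is used below for brevity), and the student is trained to predict the resulting cluster indices at masked positions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(student, teacher, decay=0.999):
    # Teacher weights track an exponential moving average of the student's
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1 - decay)

@torch.no_grad()
def cluster_targets(teacher_states, codebook, decay=0.9):
    # Assign each teacher frame to its nearest codeword to get discrete targets,
    # then refresh each used codeword with an EMA of its assigned frames
    flat = teacher_states.reshape(-1, teacher_states.size(-1))  # [B*T, D]
    targets = torch.cdist(flat, codebook).argmin(dim=-1)        # [B*T]
    for v in targets.unique():
        codebook[v].mul_(decay).add_(flat[targets == v].mean(dim=0), alpha=1 - decay)
    return targets

def masked_prediction_loss(student_logits, targets, mask):
    # Cross-entropy between student predictions and teacher-derived cluster
    # indices, computed only at masked positions
    logits = student_logits.reshape(-1, student_logits.size(-1))
    keep = mask.reshape(-1)
    return F.cross_entropy(logits[keep], targets[keep])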
from transformers import Wav2Vec2ForPreTraining, Wav2Vec2FeatureExtractor
import torch
import librosa

# Load model components
model = Wav2Vec2ForPreTraining.from_pretrained("MohammadJRanjbar/DinoSR")
processor = Wav2Vec2FeatureExtractor.from_pretrained("MohammadJRanjbar/DinoSR")

# Process audio
audio, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(audio, return_tensors="pt", sampling_rate=16000)

# Extract frame-level representations
with torch.no_grad():
    outputs = model(**inputs)
speech_features = outputs.projected_states  # [batch_size, seq_len, 256]
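For utterance-level downstream tasks (e.g., speaker or emotion classification), a common recipe, independent of this particular checkpoint, is to mean-pool the frame-level features over time:

# Mean-pool over the time axis to get one embedding per utterance
utterance_embedding = speech_features.mean(dim=1)  # [batch_size, 256]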
# Fine-tuning setup for CTC-based ASR
from transformers import Wav2Vec2ForCTC

# Note: the CTC head is newly initialized on top of the pretrained encoder;
# vocab_size and pad_token_id should be set to match your CTC tokenizer
model = Wav2Vec2ForCTC.from_pretrained(
    "MohammadJRanjbar/DinoSR",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    layerdrop=0.1,
    ctc_loss_reduction="mean"
)

# Freeze the convolutional feature encoder so only the Transformer is updated
model.freeze_feature_encoder()
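A minimal sketch of one supervised training step, assuming a Wav2Vec2CTCTokenizer built from your own vocab.json (the tokenizer file, audio path, and transcript are placeholders; the checkpoint itself does not ship a tokenizer):

import torch
import librosa
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Hypothetical character-level tokenizer; its vocabulary must match the
# model's vocab_size
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("MohammadJRanjbar/DinoSR")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# One labeled example: raw audio plus its transcript
audio, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="hello world", return_tensors="pt").input_ids

# Passing labels makes the forward call return the CTC loss
outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()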
@article{liu2023dinosr,
  title={DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning},
  author={Liu, Alexander H and Chang, Heng-Jui and Auli, Michael and Hsu, Wei-Ning and Glass, James},
  journal={arXiv preprint arXiv:2305.10005},
  year={2023}
}