Model Card for esp-aves2-sl-beats-bio

Model Details

Model Description

esp-aves2-sl-beats-bio is an audio representation learning model (bioacoustic encoder) built by supervised post-training a self-supervised BEATs backbone on a curated Bio bioacoustic corpus. It is described in What Matters for Bioacoustic Encoding.

  • Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
  • Funded by: More info at https://www.earthspecies.org/about-us#support
  • Shared by: Earth Species Project
  • Model type: Audio representation learning model (Transformer; BEATs backbone)
  • License: CC-BY-NC-SA
  • Finetuned from model: BEATs pretrained on AudioSet (see Parent Models)

Model Sources

Parent Models

  1. BEATs (pretrained on AudioSet)
    • Source: https://github.com/microsoft/unilm/tree/master/beats
    • Description: Self-supervised transformer audio encoder used as the base SSL checkpoint.
    • License: See upstream repository

Uses

Direct Use

esp-aves2-sl-beats-bio can be used directly for bioacoustic tasks such as species classification and detection, retrieval and clustering, and as a feature extractor for individual ID and repertoire analysis.

Downstream Use

Use as a frozen encoder with linear probes, or fine-tune on target datasets (taxa-, habitat-, or device-specific). It can also be integrated into monitoring pipelines as an embedding model.
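
As a concrete illustration of the frozen-encoder workflow, the sketch below embeds a query clip and a set of reference clips and ranks the references by cosine similarity. This is a minimal sketch, not part of the official API examples: query_audio and reference_audio are placeholder 16 kHz waveform tensors, and the load_model call is the one described under "How to Get Started" below.

import torch
import torch.nn.functional as F

from avex import load_model

# Placeholder 16 kHz waveforms; replace with real audio (shapes assumed here)
query_audio = torch.randn(1, 16000, device="cuda")      # one query clip
reference_audio = torch.randn(8, 16000, device="cuda")   # eight reference clips

backbone = load_model("esp_aves2_sl_beats_bio", device="cuda", return_features_only=True)

with torch.no_grad():
    query_emb = backbone(query_audio).mean(dim=1)          # (1, 768)
    reference_emb = backbone(reference_audio).mean(dim=1)  # (8, 768)

# Rank reference clips by cosine similarity to the query
similarity = F.cosine_similarity(query_emb, reference_emb)  # (8,)
ranking = similarity.argsort(descending=True)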

Out-of-Scope Use

This is not a generative model; it does not output text. Stand-alone classification without probes or fine-tuning is out of scope.

Bias, Risks, and Limitations

  • Bias: Citizen-science corpora can bias toward frequently recorded taxa and regions.
  • Risks: Potential misuse in sensitive wildlife contexts; apply safeguards and access control.
  • Limitations: The paper uses 16 kHz for standardized comparison; performance may shift with different bandwidth and recording conditions.

Recommendations

Validate on in-domain data before deployment; for sensitive taxa, follow conservation and data-governance practices.

How to Get Started with the Model

Loading this model requires the AVEX (Animal Vocalization Encoder) library, distributed as the avex package, to be installed.

Installation

pip install avex

Or with uv:

uv add avex

For more details, see https://github.com/earthspecies/avex.

Loading the Model

from avex import load_model

model = load_model("esp_aves2_sl_beats_bio", device="cuda")

Using the Model

import torch

from avex import load_model

# audio_tensor: a batch of 16 kHz waveforms, e.g. shape (batch, samples)
# (see the loading sketch below for one way to prepare it)

# Case 1: embedding extraction (features only)
backbone = load_model("esp_aves2_sl_beats_bio", device="cuda", return_features_only=True)

with torch.no_grad():
    embeddings = backbone(audio_tensor)
    # Shape: (batch, time_steps, 768) for BEATs

# Pool to get a fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)

# Case 2: supervised predictions (logits over label IDs; see label_map.json)
model = load_model("esp_aves2_sl_beats_bio", device="cuda")

with torch.no_grad():
    logits = model(audio_tensor)
    predicted_class = logits.argmax(dim=-1).item()  # .item() assumes a batch of one clip
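
The snippets above assume audio_tensor is already a waveform tensor on the right device. One way to prepare it is sketched below; this is an assumption-laden example (torchaudio, a placeholder file path, and a mono mixdown), and the exact input shape the model expects should be checked against the avex documentation. The 16 kHz target rate follows the rate used in the paper.

import torch
import torchaudio

# Load a clip and resample to 16 kHz (the rate used in the paper); path is a placeholder
waveform, sample_rate = torchaudio.load("clip.wav")   # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)          # mix down to mono: (1, samples)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

audio_tensor = waveform.to("cuda")                     # batch of one clip: (1, samples)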

Transfer Learning with Probes

from avex import load_model
from avex.models.probes import build_probe_from_config
from avex.configs import ProbeConfig

# Load backbone for feature extraction
base = load_model("esp_aves2_sl_beats_bio", return_features_only=True, device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
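
A minimal training-loop sketch for the probe follows. It assumes the object returned by build_probe_from_config behaves like a standard torch.nn.Module that maps an audio batch to logits, and that train_loader is a hypothetical DataLoader yielding (audio, label) pairs; check the avex documentation for the exact probe interface.

import torch

# Optimize only the trainable (probe) parameters; the backbone is frozen
optimizer = torch.optim.AdamW(
    [p for p in probe.parameters() if p.requires_grad], lr=1e-3
)
criterion = torch.nn.CrossEntropyLoss()

probe.train()
for audio_batch, labels in train_loader:  # placeholder DataLoader
    audio_batch, labels = audio_batch.to("cuda"), labels.to("cuda")
    logits = probe(audio_batch)           # assumes the probe consumes raw audio batches
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()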

Class Label Mapping

The class label mapping for this supervised learning model can be found at label_map.json in the Hugging Face repository.
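
To turn a predicted class index (from the supervised example above) into a label name, the mapping can be loaded as sketched below. This assumes label_map.json maps label names to integer class IDs; invert or adapt the lookup if the file is keyed the other way.

import json

# Assumes label_map.json maps label names to integer class IDs
with open("label_map.json") as f:
    label_map = json.load(f)

id_to_label = {int(class_id): name for name, class_id in label_map.items()}
predicted_label = id_to_label[predicted_class]  # predicted_class from the supervised example
print(predicted_label)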

Training Details

Training Data

This -bio checkpoint is produced by supervised post-training on bioacoustic data only (the Bio mix).

The starting point is a BEATs backbone pretrained on AudioSet (see Parent Models).

Training Data Sources

Dataset              | Description    | Source | License                     | Size
---------------------|----------------|--------|-----------------------------|------------
Xeno-canto           | birds          | Link   | CC (varies)                 | 10416 hours
iNaturalist          | diverse taxa   | Link   | CC (varies)                 | 1539 hours
Watkins              | marine mammals | Link   | licensing agreement (paper) | 27 hours
Animal Sound Archive | diverse taxa   | Link   | See archive terms           | 78 hours

Training Procedure

  • Supervised post-training (SL): on Bio only.
  • Augmentations: random additive noise (p=0.5, SNR in [-10, 20] dB); mixup-style within-batch mixing (p=0.5) with union of labels. A sketch of these augmentations follows below.
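
The following is a minimal sketch of the two augmentations under stated assumptions (multi-hot labels, a uniform random mixing weight, and a generic noise source); it is not the authors' exact implementation.

import torch

def add_noise_at_random_snr(audio, noise, p=0.5, snr_range=(-10.0, 20.0)):
    """Mix in additive noise at an SNR drawn uniformly from snr_range (dB)."""
    if torch.rand(()) > p:
        return audio
    snr_db = torch.empty(()).uniform_(*snr_range)
    signal_power = audio.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    # Scale noise so that signal_power / (scale**2 * noise_power) matches the target SNR
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise

def mixup_within_batch(audio, labels, p=0.5):
    """Mix each clip with a randomly permuted batch member; multi-hot labels take the union."""
    if torch.rand(()) > p:
        return audio, labels
    perm = torch.randperm(audio.size(0))
    lam = torch.rand(())  # mixing weight (an assumption; not specified in this card)
    mixed_audio = lam * audio + (1 - lam) * audio[perm]
    mixed_labels = torch.clamp(labels + labels[perm], max=1.0)  # union of label sets
    return mixed_audio, mixed_labels

# Toy usage with placeholder waveforms and random multi-hot labels
audio = torch.randn(4, 16000)
labels = (torch.rand(4, 10) > 0.8).float()
audio = add_noise_at_random_snr(audio, torch.randn(4, 16000))
audio, labels = mixup_within_batch(audio, labels)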

Training Hyperparameters

Training hyperparameters are specified in train_config.yaml.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The paper evaluates on:

  • BEANS (classification and detection): https://github.com/earthspecies/beans
  • BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
  • Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
  • Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale

Metrics

  • Linear probing: accuracy / mAP
  • Retrieval: ROC AUC
  • Clustering: NMI
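
These metrics can be computed with scikit-learn. The sketch below only illustrates the metric calls on tiny placeholder arrays; the paper's exact protocol (how retrieval pairs are formed and how clusters are produced) is defined in the paper and the benchmark repositories.

import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    normalized_mutual_info_score,
    roc_auc_score,
)

# Placeholder arrays standing in for real evaluation outputs
y_true = np.array([0, 1, 1, 0])                                # single-label ground truth
y_pred = np.array([0, 1, 0, 0])                                # probe predictions
y_true_multi = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])      # multi-label targets
y_scores_multi = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6], [0.3, 0.1]])

probe_accuracy = accuracy_score(y_true, y_pred)                # classification probes
probe_map = average_precision_score(y_true_multi, y_scores_multi, average="macro")  # detection probes

# Retrieval: pairs of clips scored by embedding similarity, labeled same/different class
pair_same = np.array([1, 0, 1, 0])
pair_similarity = np.array([0.8, 0.3, 0.6, 0.4])
retrieval_auc = roc_auc_score(pair_same, pair_similarity)

# Clustering: compare cluster assignments against ground-truth labels
cluster_ids = np.array([0, 1, 1, 0])
clustering_nmi = normalized_mutual_info_score(y_true, cluster_ids)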

Results

Aggregate results for esp-aves2-sl-beats-bio used as a frozen base model (linear probing, retrieval, and clustering):

Benchmark            | Task       | Metric   | Score
---------------------|------------|----------|------
BEANS Classification | Probe      | Accuracy | 0.840
BEANS Classification | Retrieval  | ROC AUC  | 0.811
BEANS Classification | Clustering | NMI      | 0.594
BEANS Detection      | Probe      | mAP      | 0.390
BEANS Detection      | Retrieval  | ROC AUC  | 0.719
BirdSet              | Probe      | mAP      | 0.288
BirdSet              | Retrieval  | ROC AUC  | 0.726
Individual ID        | Probe      | Accuracy | 0.484
Individual ID        | Retrieval  | ROC AUC  | 0.681
Vocal Repertoire     | Retrieval  | ROC AUC  | 0.789
Vocal Repertoire     | Clustering | NMI      | 0.516

Environmental Impact

Not specified in the provided excerpt.

Technical Specifications

Model Architecture and Objective

esp-aves2-sl-beats-bio uses a BEATs transformer encoder with SSL pretraining (AudioSet) and supervised post-training on Bio to learn robust bioacoustic representations.

Key components:

  • Encoder: BEATs transformer
  • Feature extraction: time-series embeddings pooled for evaluation
  • Output: embeddings (dimension depends on backbone configuration)

Compute Infrastructure

Not specified in the provided excerpt.

Model Configuration

Model configuration is available in train_config.yaml.

Citation

BibTeX:

@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

APA:

Miron, M., Robinson, D., Alizadeh, M., et al. (2025). What Matters for Bioacoustic Encoding. arXiv preprint arXiv:2508.11845.

More Information

  • Issue tracker: https://github.com/earthspecies/avex/issues

Model Card Contact

Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org
