Model Card for esp-aves2-effnetb0-audioset

Model Details

Model Description

esp-aves2-effnetb0-audioset is a supervised audio encoder based on the EfficientNet-B0 CNN architecture, trained on AudioSet to produce transferable embeddings for downstream audio and bioacoustic tasks, as part of the experiments in What Matters for Bioacoustic Encoding.

  • Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
  • Funded by: More info at https://www.earthspecies.org/about-us#support
  • Shared by: Earth Species Project
  • Model type: Audio representation learning model (CNN; EfficientNet-B0 backbone)
  • License: CC-BY-NC-SA
  • Finetuned from model: EfficientNet-B0 pretrained on ImageNet (see Parent Models)

Model Sources

Parent Models

  1. EfficientNet-B0 (ImageNet)
    • Source: https://docs.pytorch.org/vision/main/models/generated/torchvision.models.efficientnet_b0.html
    • Description: ImageNet-pretrained EfficientNet-B0 initialization used before supervised training on AudioSet.
    • License: See upstream repository

Uses

Direct Use

esp-aves2-effnetb0-audioset can be used directly as an embedding model for audio and bioacoustic tasks such as species classification/detection, retrieval, clustering, individual ID, and repertoire analysis, especially when AudioSet-style general audio coverage is desirable.

Downstream Use

Use frozen embeddings with linear probes, or fine-tune on your target dataset. This model is also suitable as a general-purpose audio encoder within acoustic monitoring or audio understanding pipelines.

Out-of-Scope Use

  • Not a generative model; does not output text or waveforms.
  • Stand-alone decision-making in safety-critical contexts without additional validation is out of scope.

Bias, Risks, and Limitations

  • Bias: AudioSet is a web-scale, weakly labeled dataset and may reflect biases in online media (geography, species, recording conditions, noise types).
  • Risks: Potential misuse for surveillance or sensitive species monitoring without appropriate safeguards.
  • Limitations: Training/evaluation in the paper is standardized at 16 kHz; some taxa and high-frequency vocalizations may not be fully captured.

Recommendations

Validate on in-domain data before deployment. For long-term or conservation-sensitive monitoring, track performance over time and apply data governance and access control where appropriate.

How to Get Started with the Model

Loading this model requires the AVEX (Animal Vocalization Encoder) library, avex, to be installed.

Installation

pip install avex

Or with uv:

uv add avex

For more details, see https://github.com/earthspecies/avex.

Loading the Model

from avex import load_model

model = load_model("esp_aves2_effnetb0_audioset", device="cuda")

Using the Model

import torch

# Case 1: embedding extraction (features only)
backbone = load_model("esp_aves2_effnetb0_audioset", device="cuda", return_features_only=True)

# audio_tensor: a batch of 16 kHz audio on the same device as the model (see the preparation sketch below)
with torch.no_grad():
    embeddings = backbone(audio_tensor)
    # Shape: (batch, channels, height, width) for EfficientNet

# Pool over the spatial dimensions to get a fixed-size embedding
embedding = embeddings.mean(dim=(2, 3))  # Shape: (batch, channels)

# Case 2: supervised predictions (logits over label IDs; see label_map.json)
model = load_model("esp_aves2_effnetb0_audioset", device="cuda")

with torch.no_grad():
    logits = model(audio_tensor)
    predicted_class = logits.argmax(dim=-1).item()  # .item() assumes a single-example batch
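
The snippets above assume an audio_tensor of 16 kHz audio on the same device as the model. Below is a minimal preparation sketch using torchaudio; recording.wav is a placeholder, and the exact input layout expected by the avex models should be confirmed against the avex documentation.

import torch
import torchaudio

# Load a recording and resample to 16 kHz, the rate used throughout the paper.
waveform, sample_rate = torchaudio.load("recording.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)             # mix down to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Treat the mono clip as a batch of one and move it to the model's device.
audio_tensor = waveform.to("cuda")                         # (1, samples)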

Transfer Learning with Probes

from avex.models.probes import build_probe_from_config
from avex.configs import ProbeConfig

# Load backbone for feature extraction
base = load_model("esp_aves2_effnetb0_audioset", return_features_only=True, device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
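
The probe returned above can then be trained on labeled examples. The following is a minimal sketch, assuming the probe behaves like a standard PyTorch nn.Module that maps audio batches to logits and that train_dataset (a placeholder) yields (audio, label) pairs; check the avex documentation for its own training utilities.

import torch
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()  # single-label classification head

probe.train()
for audio, labels in train_loader:
    optimizer.zero_grad()
    logits = probe(audio.to("cuda"))             # (batch, num_classes)
    loss = criterion(logits, labels.to("cuda"))
    loss.backward()
    optimizer.step()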

Class Label Mapping

The class label mapping for this supervised learning model can be found at label_map.json in the Hugging Face repository.
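
As an illustration, the label map can be fetched with huggingface_hub and used to turn a predicted class index into a label name. The JSON structure assumed below (string class indices mapped to label names) should be confirmed by inspecting the file itself.

import json
from huggingface_hub import hf_hub_download

label_map_path = hf_hub_download(
    repo_id="EarthSpeciesProject/esp-aves2-effnetb0-audioset",
    filename="label_map.json",
)
with open(label_map_path) as f:
    label_map = json.load(f)

# predicted_class comes from the supervised-prediction example above.
print(label_map[str(predicted_class)])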

Training Details

Training Data

esp-aves2-effnetb0-audioset follows the paper’s supervised recipe using AudioSet only, starting from an ImageNet-pretrained EfficientNet-B0 backbone.

Training Data Sources

Dataset | Description | Source | License | Size
AudioSet | general audio | Link | See dataset terms | 5,700 hours

Training Procedure

As described in the paper (for the AudioSet-only EfficientNet-B0 baseline):

  • Initialization: EfficientNet-B0 pretrained on ImageNet.
  • Supervised training: on AudioSet only with a multi-label objective.
  • Augmentations: random additive noise (p=0.5, SNR in [-5, 20] dB); mixup-style within-batch mixing (p=0.5) with union of labels (see the illustrative sketch below).
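
An illustrative sketch of these two augmentations follows; it is not the paper's exact implementation, and audio, noise_clip, and targets are placeholder tensors (waveform batches and multi-hot labels).

import torch

def add_noise(audio, noise, snr_db):
    # Mix `noise` into `audio` at the requested signal-to-noise ratio (in dB).
    signal_power = audio.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise

def mixup_union(audio, targets, p=0.5):
    # Within-batch mixing; multi-hot targets are combined by label union.
    if torch.rand(()).item() > p:
        return audio, targets
    perm = torch.randperm(audio.size(0))
    lam = torch.rand(()).item()
    mixed = lam * audio + (1 - lam) * audio[perm]
    union = (targets + targets[perm]).clamp(max=1.0)
    return mixed, union

# Additive noise applied with probability 0.5, SNR drawn uniformly from [-5, 20] dB.
if torch.rand(()).item() < 0.5:
    snr_db = torch.empty(()).uniform_(-5.0, 20.0).item()
    audio = add_noise(audio, noise_clip, snr_db)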

Training Hyperparameters

Training hyperparameters are specified in train_config.yaml.

Evaluation

Testing Data, Factors & Metrics

Testing Data

When used as a backbone in the What Matters for Bioacoustic Encoding experiments, this model is evaluated on:

  • BEANS (classification and detection): https://github.com/earthspecies/beans
  • BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
  • Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
  • Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale

Metrics

  • Linear probing: accuracy / mAP
  • Retrieval: ROC AUC
  • Clustering: NMI
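
The exact evaluation protocols are defined in the paper; purely as a generic illustration, retrieval ROC AUC and clustering NMI can be computed from frozen embeddings with scikit-learn, where embeddings and labels are placeholder arrays of features and ground-truth classes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity

labels = np.asarray(labels)

# Clustering (NMI): cluster the embeddings and compare against ground truth.
cluster_ids = KMeans(n_clusters=len(np.unique(labels)), n_init=10).fit_predict(embeddings)
nmi = normalized_mutual_info_score(labels, cluster_ids)

# Retrieval (ROC AUC), pairwise formulation: cosine similarity should rank
# same-class pairs above different-class pairs.
sims = cosine_similarity(embeddings)
iu = np.triu_indices(len(labels), k=1)
same_class = (labels[:, None] == labels[None, :])[iu]
auc = roc_auc_score(same_class, sims[iu])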

Results

Aggregate results for linear probing (frozen base model) with esp-aves2-effnetb0-audioset are not yet finalized for this model card. Please refer to the latest version of the paper or future updates of this repository for a consolidated results table.

Environmental Impact

Not specified.

Technical Specifications

Model Architecture and Objective

esp-aves2-effnetb0-audioset uses an EfficientNet-B0 CNN operating on time-frequency representations (spectrograms), trained with supervised learning on AudioSet to learn general-purpose audio representations.

Key components:

  • Encoder: EfficientNet-B0
  • Feature extraction: mel-spectrogram frontend (16 kHz; exact spectrogram parameters follow the implementation in this repository)
  • Output: embeddings (dimension depends on backbone head configuration)
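
As an illustration of the time-frequency input, a mel-spectrogram frontend at 16 kHz can be sketched with torchaudio; the parameter values below are placeholders, since the exact settings used by this model are defined in the repository configuration and the avex loading pipeline applies the frontend internally.

import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,       # placeholder value
    hop_length=160,   # placeholder value
    n_mels=128,       # placeholder value
)

waveform = torch.randn(1, 16000 * 5)  # placeholder: 5 s of mono 16 kHz audio
spec = mel(waveform)                  # (batch, n_mels, frames)
spec = spec.unsqueeze(1)              # add a channel axis for an image-style CNN input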

Compute Infrastructure

Not specified.

Model Configuration

Model configuration is available in train_config.yaml.

Citation

BibTeX:

@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Model Card Contact

Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org
