TraceGen Benchmark Leaderboard
Benchmark: TraceGen Evaluation Suite
We evaluate models on five environments (EpicKitchen, Droid, Bridge, Libero, and Robomimic) using the official TraceGen metrics. For each environment we report MSE, MAE, and Endpoint MSE on the held-out test split.
Testing on the TraceGen benchmark
Use the official evaluation code provided at https://github.com/jayLEE0301/TraceGen.
Multi-GPU
# Evaluate with 4 GPUs via torchrun; adjust CUDA_VISIBLE_DEVICES and --nproc_per_node to your setup
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
    test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
Single-GPU
# Evaluate on a single GPU
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
    --config cfg/train.yaml \
    --override \
        train.batch_size=8 \
        train.lr_decoder=1.5e-4 \
        model.decoder.num_layers=6 \
        model.decoder.num_attention_heads=12 \
        model.decoder.latent_dim=768 \
        data.num_workers=4 \
        hardware.mixed_precision=true \
        logging.use_wandb=true \
        logging.log_every=2000 \
    --resume {path_to_pretrained_checkpoint}
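
The `--override` flag takes dotted `key=value` pairs that are applied on top of `cfg/train.yaml`. The actual parsing lives in `test_benchmark.py`; the snippet below is only a minimal sketch of how such dotted overrides are typically merged, assuming an OmegaConf-style nested config (an assumption, not the repository's confirmed implementation).

```python
# Minimal sketch (assumption): applying dotted key=value overrides to a YAML config.
# The real override handling is defined in test_benchmark.py and may differ.
from omegaconf import OmegaConf

base = OmegaConf.load("cfg/train.yaml")          # nested sections: train, model, data, ...
overrides = OmegaConf.from_dotlist([
    "train.batch_size=8",
    "model.decoder.latent_dim=768",
    "hardware.mixed_precision=true",
])
cfg = OmegaConf.merge(base, overrides)           # override values take precedence
print(OmegaConf.to_yaml(cfg))
```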
To reproduce the environment-specific benchmark results reported below, evaluate the corresponding environment-specific checkpoints TraceGen_{EnvName} from the TraceGen Collection; each of these checkpoints is trained only on data from its environment.
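
The loop below sketches how the per-environment checkpoints could be fetched from the Hugging Face Hub and passed to the evaluation script. The repo-id pattern `furonghuang-lab/TraceGen_{EnvName}` and the use of the downloaded directory as the `--resume` argument are assumptions; check the TraceGen Collection for the exact checkpoint names and paths.

```python
# Hypothetical reproduction loop: download each environment-specific checkpoint
# and evaluate it with the official script. Repo IDs and the --resume path are
# assumptions; consult the TraceGen Collection for the exact names.
import subprocess
from huggingface_hub import snapshot_download

ENVS = ["EpicKitchen", "Droid", "Bridge", "Libero", "Robomimic"]

for env in ENVS:
    ckpt_dir = snapshot_download(repo_id=f"furonghuang-lab/TraceGen_{env}")
    subprocess.run(
        ["python", "test_benchmark.py",
         "--config", "cfg/train.yaml",
         "--resume", ckpt_dir],  # point at the actual checkpoint file if required
        check=True,
    )
```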
Metric definition. All reported errors are computed in a normalized coordinate space: both input images and predicted traces are scaled to the range [0, 1] prior to evaluation. Accordingly, the reported MSE, MAE, and Endpoint MSE reflect absolute errors within the normalized image space.
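
For intuition, the sketch below computes the three errors on traces that are already normalized to [0, 1]. The tensor shapes and the convention that the endpoint is the last trace point are assumptions; the reported numbers come from the official evaluation script.

```python
import torch

def trace_errors(pred: torch.Tensor, target: torch.Tensor):
    """MSE, MAE, and Endpoint MSE for normalized 2D traces.

    Assumes pred and target have shape (batch, num_points, 2) with coordinates
    in [0, 1], and that "endpoint" means the last point of each trace. These
    conventions are illustrative; use the official script for reported results.
    """
    mse = torch.mean((pred - target) ** 2)
    mae = torch.mean(torch.abs(pred - target))
    endpoint_mse = torch.mean((pred[:, -1] - target[:, -1]) ** 2)
    return mse.item(), mae.item(), endpoint_mse.item()
```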
| Environment | Metric | TraceGen (×1e−2) |
|---|---|---|
| EpicKitchen | MSE | 0.445 |
| EpicKitchen | MAE | 2.721 |
| EpicKitchen | Endpoint MSE | 0.791 |
| Droid | MSE | 0.206 |
| Droid | MAE | 1.289 |
| Droid | Endpoint MSE | 0.285 |
| Bridge | MSE | 0.653 |
| Bridge | MAE | 2.419 |
| Bridge | Endpoint MSE | 0.607 |
| Libero | MSE | 0.276 |
| Libero | MAE | 1.442 |
| Libero | Endpoint MSE | 0.385 |
| Robomimic | MSE | 0.138 |
| Robomimic | MAE | 1.416 |
| Robomimic | Endpoint MSE | 0.151 |
Submitting to the Leaderboard
- Use the provided evaluation script: https://github.com/jayLEE0301/TraceGen
- Report metrics on the official test split, using the corresponding dataset from https://huggingface.co/collections/furonghuang-lab/tracegen
- For environment-specific results, evaluate the corresponding TraceGen_{EnvName} checkpoint.
- Open a PR or submit results via GitHub Issues.
