Introducing v0.5 of the AI Safety Benchmark from MLCommons Paper • 2404.12241 • Published Apr 18, 2024
Improving Text-to-Image Consistency via Automatic Prompt Optimization Paper • 2403.17804 • Published Mar 26, 2024
XNLI: Evaluating Cross-lingual Sentence Representations Paper • 1809.05053 • Published Sep 13, 2018
Adversarial NLI: A New Benchmark for Natural Language Understanding Paper • 1910.14599 • Published Oct 31, 2019
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality Paper • 2204.03162 • Published Apr 7, 2022
"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset Paper • 2205.09209 • Published May 18, 2022
Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks Paper • 2204.01906 • Published Apr 5, 2022
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition Paper • 2004.03066 • Published Apr 7, 2020
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking Paper • 2106.06052 • Published Jun 10, 2021
The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks Paper • 2310.17514 • Published Oct 26, 2023
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference Paper • 1704.05426 • Published Apr 18, 2017
Llama 2: Open Foundation and Fine-Tuned Chat Models Paper • 2307.09288 • Published Jul 18, 2023