Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction Paper • 2512.18880 • Published 28 days ago • 24
Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design Paper • 2508.17573 • Published Aug 25, 2025 • 1
Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems Paper • 2505.18139 • Published May 23, 2025
JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community Paper • 2503.21679 • Published Mar 27, 2025 • 1
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation Paper • 2503.10497 • Published Mar 13, 2025 • 2
The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents Paper • 2601.07264 • Published 7 days ago • 22
The Invisible Leash: Why RLVR May Not Escape Its Origin Paper • 2507.14843 • Published Jul 20, 2025 • 85
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Paper • 2504.15521 • Published Apr 22, 2025 • 64