---
license: apache-2.0
language:
- zh
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: question-answering
---

# From Faithfulness to Correctness: Generative Reward Models that Think Critically

[[📜 Paper](https://arxiv.org/abs/2509.25409)] [[🖥️ Code](https://github.com/Martin-qyma/TRM)] [[🤗 Hugging Face](https://huggingface.co/QiyaoMa/TRM)]

In this repository, we introduce the **Thinking-supervised Reward Model (TRM)**: a sentence-level generative reward model that equips language models with *critical thinking* abilities. TRM enables stepwise reasoning, from document faithfulness to factual correctness, for Chinese question answering (QA) tasks with supporting documents.

## Thinking-supervised Reward Model (TRM)

Given a query, an answer, and supporting documents, TRM first evaluates the faithfulness of each answer sentence against the provided evidence. Building on this faithfulness assessment, TRM then applies a step-by-step reasoning framework to judge sentence-level correctness, explicitly modeling how each reasoning step aligns with both the external sources and the internal logic of the answer. A minimal scoring sketch is given at the end of this card.

## Policy Optimization

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness while an auxiliary reward model addresses usefulness (see the reward-combination sketch at the end of this card).

## Getting Started

Please follow the instructions in [https://github.com/Martin-qyma/TRM](https://github.com/Martin-qyma/TRM) for detailed implementation.
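As a quick illustration of the sentence-level judging described above, the following is a minimal sketch of querying TRM as a generative judge with Hugging Face `transformers`. The prompt wording and the expected output format are illustrative assumptions; the exact prompt templates and parsing logic are defined in the TRM repository.

```python
# Minimal sketch: prompt TRM to reason from faithfulness to correctness.
# The prompt text below is a hypothetical stand-in for the repo's templates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QiyaoMa/TRM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

query = "..."      # the user question (Chinese QA)
documents = "..."  # concatenated supporting documents
answer = "..."     # candidate answer, judged sentence by sentence

prompt = (
    f"Query:\n{query}\n\nDocuments:\n{documents}\n\nAnswer:\n{answer}\n\n"
    "For each sentence of the answer, first judge its faithfulness to the "
    "documents, then reason step by step about its factual correctness."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```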
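For the RL setup, the sketch below shows one plausible way the two signals could be combined into a single scalar reward. Both scoring functions and the weighting factor `alpha` are hypothetical stand-ins; the actual reward shaping used for policy optimization is defined in the TRM repository and paper.

```python
# Hypothetical reward combination: TRM enforces correctness while an
# auxiliary reward model addresses usefulness.

def trm_correctness_reward(query: str, documents: str, answer: str) -> float:
    # Placeholder: in practice this parses TRM's sentence-level judgments
    # (generated as in the sketch above) into a scalar, e.g. the fraction
    # of answer sentences judged correct.
    return 1.0

def usefulness_reward(query: str, answer: str) -> float:
    # Placeholder for the auxiliary reward model scoring usefulness.
    return 1.0

def combined_reward(
    query: str, documents: str, answer: str, alpha: float = 0.5
) -> float:
    # Assumed linear interpolation between the two reward signals.
    correctness = trm_correctness_reward(query, documents, answer)
    usefulness = usefulness_reward(query, answer)
    return alpha * correctness + (1.0 - alpha) * usefulness
```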