---
license: apache-2.0
language:
- zh
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: question-answering
---

# From Faithfulness to Correctness: Generative Reward Models that Think Critically

[[📜 Paper](https://arxiv.org/abs/2509.25409)] [[🖥️ Code](https://github.com/Martin-qyma/TRM)] [[🤗 Hugging Face](https://huggingface.co/QiyaoMa/TRM)]

In this repository, we introduce the **Thinking-supervised Reward Model (TRM)**: a sentence-level generative reward model that equips language models with *critical thinking* abilities. TRM enables stepwise reasoning, from document faithfulness to factual correctness, for Chinese question answering (QA) tasks with supporting documents.

## Thinking-supervised Reward Model (TRM)

Given a query, an answer, and supporting documents, TRM first evaluates the faithfulness of each answer sentence against the provided evidence. Building on this faithfulness assessment, TRM then applies a step-by-step reasoning framework to judge sentence-level correctness, explicitly modeling how each reasoning step aligns with both the external sources and the internal logic of the answer. A minimal scoring sketch is given at the end of this card.

## Policy Optimization

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness while an auxiliary reward model addresses usefulness (see the reward-combination sketch at the end of this card).

## Getting Started

Please follow the instructions in [https://github.com/Martin-qyma/TRM](https://github.com/Martin-qyma/TRM) for detailed implementation.
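As a quick illustration of the sentence-level judging described above, the following is a minimal sketch of querying TRM as a generative judge with Hugging Face `transformers`. The prompt wording and the expected output format are illustrative assumptions; the exact prompt templates and parsing logic are defined in the TRM repository.

```python
# Minimal sketch: prompt TRM to reason from faithfulness to correctness.
# The prompt text below is a hypothetical stand-in for the repo's templates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QiyaoMa/TRM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

query = "..."      # the user question (Chinese QA)
documents = "..."  # concatenated supporting documents
answer = "..."     # candidate answer, judged sentence by sentence

prompt = (
    f"Query:\n{query}\n\nDocuments:\n{documents}\n\nAnswer:\n{answer}\n\n"
    "For each sentence of the answer, first judge its faithfulness to the "
    "documents, then reason step by step about its factual correctness."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```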
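For the RL setup, the sketch below shows one plausible way the two signals could be combined into a single scalar reward. Both scoring functions and the weighting factor `alpha` are hypothetical stand-ins; the actual reward shaping used for policy optimization is defined in the TRM repository and paper.

```python
# Hypothetical reward combination: TRM enforces correctness while an
# auxiliary reward model addresses usefulness.

def trm_correctness_reward(query: str, documents: str, answer: str) -> float:
    # Placeholder: in practice this parses TRM's sentence-level judgments
    # (generated as in the sketch above) into a scalar, e.g. the fraction
    # of answer sentences judged correct.
    return 1.0

def usefulness_reward(query: str, answer: str) -> float:
    # Placeholder for the auxiliary reward model scoring usefulness.
    return 1.0

def combined_reward(
    query: str, documents: str, answer: str, alpha: float = 0.5
) -> float:
    # Assumed linear interpolation between the two reward signals.
    correctness = trm_correctness_reward(query, documents, answer)
    usefulness = usefulness_reward(query, answer)
    return alpha * correctness + (1.0 - alpha) * usefulness
```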