---
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
datasets:
- mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# RefAlign: RL with Similarity-based Rewards

**GitHub repository**: https://github.com/mzhaoshuai/RefAlign

**Paper**: [Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data](https://huggingface.co/papers/2504.09895)

The training data is [mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3](https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3).

During Reinforcement Learning with Similarity-based Rewards, the reward is the BERTScore similarity between the policy's generation and a reference answer.
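
BERTScore scores a candidate against a reference by greedily matching token embeddings via cosine similarity and averaging the best matches into precision, recall, and F1. The sketch below illustrates that computation on toy pre-computed embeddings; it is not the actual `bert_score` package (which extracts embeddings from a model such as bart-large-mnli), just the matching step.

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching BERTScore F1 over token embeddings.

    cand_emb: (n_candidate_tokens, dim), ref_emb: (n_reference_tokens, dim).
    """
    # Normalize rows so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # pairwise cosine-similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# A generation identical to the reference gets the maximum reward of 1.0.
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
print(bertscore_f1(emb, emb))  # → 1.0
```

Because the score is symmetric-F1 over soft token matches, it rewards semantic overlap with the reference answer without requiring an exact string match.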

| Hyper-parameter | Value |
|:----------------|:------|
| Learning rate | 8e-7 |
| Batch size | 512 |
| Epochs | 1 |
| Prompt length | 400 |
| Generation length | 800 |
| Advantage clip | 0.5 |
| Sampled generations (K) | 2 |
| BERTScore model | bart-large-mnli |