---
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
datasets:
- mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
# RefAlign: RL with Similarity-based Rewards
**GitHub repository**: https://github.com/mzhaoshuai/RefAlign

**Paper**: [Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data](https://huggingface.co/papers/2504.09895)

The training data is [mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3](https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3).
During Reinforcement Learning with Similarity-based Rewards, BERTScore between a sampled generation and the reference answer serves as the reward function.
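The similarity-based reward can be sketched as follows. This is a minimal, dependency-free stand-in: it mirrors BERTScore's precision/recall/F1 structure but uses exact token overlap instead of contextual-embedding matching (the actual setup uses BERTScore with a bart-large-mnli backbone), so treat it as an illustration rather than the implementation.

```python
def similarity_reward(candidate: str, reference: str) -> float:
    """Token-overlap F1 as a toy stand-in for BERTScore.

    BERTScore computes precision/recall/F1 over greedy matches of
    contextual token embeddings; here we substitute exact token
    overlap to keep the sketch self-contained.
    """
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not cand_tokens or not ref_tokens:
        return 0.0
    overlap = len(cand_tokens & ref_tokens)
    precision = overlap / len(cand_tokens)  # matched fraction of the candidate
    recall = overlap / len(ref_tokens)      # matched fraction of the reference
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In the RL loop, each sampled generation would be scored against its reference answer with such a function, and the resulting scalar used as the reward.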
| Hyper-Parameter | Value |
|:---------------------------------------------------------|--------------------------------------------------------|
|LR|8e-7|
|Batch Size| 512 |
|Epoch| 1 |
|Prompt Length| 400 |
|Generation Length|800|
|Advantage CLIP|0.5|
|Sampled Generations (K)|2|
|BERTScore Model|bart-large-mnli|
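The "Advantage CLIP" and "Sampled Generations (K)" entries can be illustrated together: with K generations per prompt, one common choice is to baseline each reward against the group mean and clip the resulting advantage. The exact baseline and clipping scheme used by RefAlign may differ; this is only a hedged sketch of the general pattern.

```python
import numpy as np

def clipped_advantages(rewards, clip=0.5):
    """Group-baselined, clipped advantages for K sampled generations.

    Assumption (not confirmed by the model card): the baseline is the
    mean reward over the K samples for the same prompt. Advantages are
    then clipped to [-clip, clip], matching the table's 0.5 setting.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()          # subtract the group baseline
    return np.clip(adv, -clip, clip)        # bound the advantage magnitude
```

For example, with K=2 rewards `[0.0, 2.0]` the raw advantages `[-1.0, 1.0]` are clipped to `[-0.5, 0.5]`.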