---
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
datasets:
- mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# RefAlign: RL with Similarity-based Rewards

**GitHub repository**: https://github.com/mzhaoshuai/RefAlign

**Paper**: [Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data](https://huggingface.co/papers/2504.09895).

The training data is [mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3](https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3).

When performing reinforcement learning with similarity-based rewards, BERTScore serves as the reward function; see the sketch after the table below.

| Hyper-Parameter          | Value           |
|:-------------------------|:----------------|
| LR                       | 8e-7            |
| Batch Size               | 512             |
| Epochs                   | 1               |
| Prompt Length            | 400             |
| Generation Length        | 800             |
| Advantage Clip           | 0.5             |
| Sampled Generations (K)  | 2               |
| BERTScore Model          | bart-large-mnli |
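
The actual training pipeline lives in the linked GitHub repository. Below is a minimal sketch, not the repository's code, of how a BERTScore F1 reward can be computed for the K sampled generations against a reference answer, assuming the `bert-score` package and interpreting the `BERTScore Model` row as the `facebook/bart-large-mnli` checkpoint; the `num_layers` value is an assumed choice, not taken from the paper.

```python
# Minimal sketch of similarity-based rewards via BERTScore
# (not the RefAlign training code). Requires: pip install bert-score
from bert_score import score

# K = 2 sampled generations per prompt (matching the table above),
# each scored against the same reference answer.
generations = [
    "Paris is the capital of France.",
    "The capital city of France is Paris.",
]
references = ["The capital of France is Paris."] * len(generations)

# BERTScore F1 is used here as the scalar reward per generation.
# model_type follows the "BERTScore Model" row; num_layers is an assumption.
P, R, F1 = score(
    generations,
    references,
    model_type="facebook/bart-large-mnli",
    num_layers=12,
    verbose=False,
)
rewards = F1.tolist()  # one reward per sampled generation
print(rewards)
```

Because BERTScore compares each generation directly against a reference answer, no learned reward model or binary human preference data is needed, which is the central idea of similarity-based rewards in RefAlign.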