---
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
datasets:
- mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
# RefAlign: RL with Similarity-based Rewards

GitHub repository: [mzhaoshuai/RefAlign](https://github.com/mzhaoshuai/RefAlign)
The training data is [mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3](https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3).

During Reinforcement Learning with Similarity-based Rewards, the reward function is BERTScore: each sampled generation is scored by its BERTScore similarity to a reference answer.
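To make the similarity reward concrete, here is a minimal sketch of BERTScore's greedy cosine-matching over token embeddings. The toy NumPy embeddings and the function name are illustrative assumptions; in practice the score is computed from contextual embeddings of a scorer model (bart-large-mnli in this run), e.g. via the `bert-score` library.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Simplified BERTScore: greedy cosine matching of token embeddings.

    cand_emb: (m, d) candidate-token embeddings
    ref_emb:  (n, d) reference-token embeddings
    Returns the F1 of greedily matched cosine similarities, which serves
    as the scalar reward for a generation.
    """
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                        # (m, n) pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # best reference match per candidate token
    recall = sim.max(axis=0).mean()      # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy check: identical embeddings yield a perfect reward of 1.0
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
print(round(bertscore_f1(emb, emb), 6))  # → 1.0
```

The real pipeline differs only in where the embeddings come from: tokens of the generation and the reference are embedded by the scorer model before the same matching step.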
| Hyper-Parameter | Value |
|---|---|
| Learning Rate | 8e-7 |
| Batch Size | 512 |
| Epochs | 1 |
| Prompt Length | 400 |
| Generation Length | 800 |
| Advantage Clip | 0.5 |
| Sampled Generations (K) | 2 |
| BERTScore Model | bart-large-mnli |
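The table lists K = 2 sampled generations per prompt and an advantage clip of 0.5. A minimal sketch of how these two settings might interact, assuming a mean-reward baseline over the K samples (the baseline choice is not stated in this card):

```python
import numpy as np

def clipped_advantages(rewards, clip=0.5):
    """Per-prompt advantages from K sampled generations.

    rewards: (K,) similarity rewards for the K generations of one prompt.
    Subtracting the mean over the K samples is an assumed baseline; the
    clip value 0.5 matches the "Advantage Clip" row of the table.
    """
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                 # center on the per-prompt mean reward
    return np.clip(adv, -clip, clip)   # bound each advantage to [-clip, clip]

print(clipped_advantages([0.9, 0.2]))  # → [ 0.35 -0.35]
```

With K = 2 this reduces to half the reward gap between the two generations, clipped to ±0.5.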