---
base_model:
  - mistralai/Mistral-7B-Instruct-v0.2
datasets:
  - mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

Paper: *Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data*.

The model was trained on mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3.

During Reinforcement Learning with Similarity-based Rewards, the reward function is BERTScore: the similarity between a generated response and a reference answer.
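As a rough illustration of a similarity-based reward, the sketch below computes a BERTScore-style F1 over greedy token matching. It is a toy stand-in, not the repository's implementation: real BERTScore uses contextual embeddings from a model (here, bart-large-mnli), whereas this sketch substitutes one-hot "embeddings" so that identical tokens score 1 and distinct tokens score 0.

```python
import math

def embed(token):
    # Toy "embedding": one-hot on the token string. An assumption for
    # illustration; real BERTScore uses contextual model embeddings.
    return {token: 1.0}

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * w for k, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_reward(candidate, reference):
    """BERTScore-style F1: greedy-match each token to its most
    similar counterpart, then combine precision and recall."""
    cand = [embed(t) for t in candidate.split()]
    ref = [embed(t) for t in reference.split()]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Identical strings score 1.0; partial overlap scores lower.
print(similarity_reward("the cat sat", "the cat sat"))  # → 1.0
```

In training, this scalar would serve as the reward signal for each of the K sampled generations against the reference answer.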

| Hyper-Parameter | Value |
| --- | --- |
| LR | 8e-7 |
| Batch Size | 512 |
| Epoch | 1 |
| Prompt Length | 400 |
| Generation Length | 800 |
| Advantage CLIP | 0.5 |
| Sampled Generations (K) | 2 |
| BERTScore Model | bart-large-mnli |