Update README.md

datasets:
- Jennny/helpsteer2-helpfulness-preference
- nvidia/HelpSteer2
license: mit
language:
- en
pipeline_tag: text-classification
---

<a href="https://aiplans.org" target="_blank" style="margin: 2px;"> <img alt="AIPlans" src="./logos/AI-Plans.svg" style="display: inline-block; vertical-align: middle;"/> </a>

# Model Card for qwen3-0.6b-RM-hs2

This model is a fine-tuned version of [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base).
It has been trained using [TRL](https://github.com/huggingface/trl).

Intended use: research on model diffing, preference fine-tuning, and evaluation of lightweight LLM behavior changes. It was developed for use in the Model Diffing project of AI-Plans.
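
Given the model-diffing focus, one simple starting point is to compare this checkpoint's transformer weights against the base model. The sketch below is illustrative rather than part of the project's tooling; it assumes both checkpoints share the Qwen3 trunk layout and skips the reward model's scoring head, which has no counterpart in the base model.

```python
from transformers import AutoModel

# Load only the transformer trunk of each checkpoint; the RM's scoring
# head has no match in the base model and is ignored here.
base = AutoModel.from_pretrained("Qwen/Qwen3-0.6B-Base")
tuned = AutoModel.from_pretrained("sorakritt/qwen3-0.6b-RM-hs2")

tuned_sd = tuned.state_dict()
# Relative L2 change per tensor: a crude map of where fine-tuning moved the weights.
for name, p_base in base.state_dict().items():
    rel = (tuned_sd[name] - p_base).norm() / (p_base.norm() + 1e-12)
    print(f"{rel.item():.4f}  {name}")
```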
## Quick start

```python
from transformers import pipeline

# Score a response with the reward model via the text-classification pipeline.
rewarder = pipeline("text-classification", model="sorakritt/qwen3-0.6b-RM-hs2", device="cuda")

text = "The capital of France is Paris."
output = rewarder(text)[0]
print(output["score"])
```
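
Because the pipeline returns a single score per input, ranking candidate responses is straightforward. A small illustrative extension of the snippet above, with made-up candidates; note that the pipeline applies a sigmoid to the raw reward logit by default (pass `function_to_apply="none"` for the raw value), which does not affect the ranking since the sigmoid is monotonic:

```python
# Rank candidate responses by reward score (illustrative inputs).
candidates = [
    "The capital of France is Paris.",
    "I am not sure, but it could be Lyon.",
]
scores = [result["score"] for result in rewarder(candidates)]
for score, candidate in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {candidate}")
```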
## Training procedure

This model is a reward model. It was trained only on preference pairs whose chosen response has a helpfulness rating of at least 3; training took about 1 hour 20 minutes on a single A100 (40 GB).
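
A minimal sketch of that setup with TRL's `RewardTrainer` follows. The filter mirrors the rating threshold described above, but the column name `chosen_rating` and the hyperparameters are assumptions for illustration, not the exact recipe used for this checkpoint:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# Preference pairs built from HelpSteer2 helpfulness ratings.
dataset = load_dataset("Jennny/helpsteer2-helpfulness-preference", split="train")
# Keep only pairs whose chosen response is rated >= 3
# ("chosen_rating" is an assumed column name).
dataset = dataset.filter(lambda example: example["chosen_rating"] >= 3)

# A reward model is the base LM with a single-logit classification head.
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-0.6B-Base", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="qwen3-0.6b-RM-hs2"),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```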
### Framework versions