rodrigomt committed · verified
Commit 091430d · 1 Parent(s): 2d0c382

Update README.md

Files changed (1):
  1. README.md +120 -17

README.md CHANGED
@@ -12,19 +12,47 @@ tags:
  - quelmap/Lightning-4b
  - Intel/hebrew-math-tutor-v1
  - GetSoloTech/Qwen3-Code-Reasoning-4B
  ---

- # quem-v2-4b

- quem-v2-4b is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
  * [janhq/Jan-v1-2509](https://huggingface.co/janhq/Jan-v1-2509)
  * [quelmap/Lightning-4b](https://huggingface.co/quelmap/Lightning-4b)
  * [Intel/hebrew-math-tutor-v1](https://huggingface.co/Intel/hebrew-math-tutor-v1)
  * [GetSoloTech/Qwen3-Code-Reasoning-4B](https://huggingface.co/GetSoloTech/Qwen3-Code-Reasoning-4B)

- ## 🧩 Configuration

- yaml
  models:
  - model: janhq/Jan-v1-2509
  parameters:
@@ -55,26 +83,101 @@ parameters:

  device: auto
  dtype: bfloat16
- ## 💻 Usage

- python
- !pip install -qU transformers accelerate

- from transformers import AutoTokenizer
- import transformers
  import torch

- model = "rodrigomt/quem-v2-4b"
- messages = [{"role": "user", "content": "What is a large language model?"}]

- tokenizer = AutoTokenizer.from_pretrained(model)
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- pipeline = transformers.pipeline(
  "text-generation",
- model=model,
  torch_dtype=torch.float16,
  device_map="auto",
  )

- outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
- print(outputs[0]["generated_text"])

  - quelmap/Lightning-4b
  - Intel/hebrew-math-tutor-v1
  - GetSoloTech/Qwen3-Code-Reasoning-4B
+ language:
+ - en
+ - pt
+ pipeline_tag: text-generation
  ---

+ # 🤖 quem-4b v2
+
+ A 4-billion-parameter merged language model built on the **Qwen3** family. **quem-v2-4b** blends four complementary models using **LazyMergekit** with the **DARE-TIES** method to deliver a compact, versatile model for instruction following, coding assistance, and reasoning.
+
+ ## 📋 Overview
+
+ **quem-v2-4b** is a carefully balanced merge of four specialized 4B-class models. Using **DARE-TIES** with equal weights, it aims to retain strengths across general conversation (Jan), fast responses (Lightning), mathematical reasoning (Hebrew Math Tutor), and code reasoning (Qwen3 Code Reasoning).
+
+ ### ✨ Key Features
+
+ * **Balanced Merge:** Equal weights (25% each) for stability across skills.
+ * **Reasoning & Code:** Improved chain-of-thought-style reasoning and code understanding inherited from the contributor models.
+ * **Compact & Efficient:** 4B parameters for fast inference on a single consumer GPU.
+ * **Instruction-Tuned:** Works out of the box with standard chat prompts via the HF chat template.
+
+ ---
+
+ ## 🔧 Base Models

  * [janhq/Jan-v1-2509](https://huggingface.co/janhq/Jan-v1-2509)
  * [quelmap/Lightning-4b](https://huggingface.co/quelmap/Lightning-4b)
  * [Intel/hebrew-math-tutor-v1](https://huggingface.co/Intel/hebrew-math-tutor-v1)
  * [GetSoloTech/Qwen3-Code-Reasoning-4B](https://huggingface.co/GetSoloTech/Qwen3-Code-Reasoning-4B)

+ All contributions are merged on top of a Qwen3 base (see the configuration below).
+
+ ---
+
+ ## 🛠️ Merge Method & Configuration
+
+ The merge was performed using **[LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing)**, which integrates the different specializations into a single checkpoint.

+ ### Merge YAML (LazyMergekit)
+
+ ```yaml
  models:
  - model: janhq/Jan-v1-2509
  parameters:

  device: auto
  dtype: bfloat16
+ ```
+
+ ---
+
+ ## 💻 Usage (Transformers)
+
+ Install:
+
+ ```bash
+ pip install -U transformers accelerate torch
+ ```

+ Minimal chat example:

+ ```python
+ from transformers import AutoTokenizer, pipeline
  import torch

+ model_id = "rodrigomt/quem-v2-4b"

+ messages = [
+     {"role": "user", "content": "What is a large language model?"}
+ ]
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ prompt = tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+
+ pipe = pipeline(
      "text-generation",
+     model=model_id,
      torch_dtype=torch.float16,
      device_map="auto",
  )

+ out = pipe(
+     prompt,
+     max_new_tokens=256,
+     do_sample=True,
+     temperature=0.7,
+     top_k=50,
+     top_p=0.95,
+ )
+ print(out[0]["generated_text"])
+ ```
+
+ ### Prompting Tips
+
+ * Use standard **system / user / assistant** chat structure.
+ * For coding tasks, include concise requirements, the desired language, and any constraints (see the sketch after this list).
+ * For math/logic tasks, allow slightly higher `max_new_tokens` and consider a lower temperature (e.g., `temperature=0.3–0.5`) for more deterministic reasoning.
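+
+ A small sketch tying these tips together: a coding-oriented prompt with a short system message and a lower temperature. It reuses the `tokenizer` and `pipe` objects from the usage example above; the system prompt and decoding values are illustrative suggestions, not settings shipped with the model.
+
+ ```python
+ # Hypothetical coding-task prompt following the tips above.
+ code_messages = [
+     {"role": "system", "content": "You are a careful coding assistant. Reply with working, commented code."},
+     {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
+ ]
+
+ # Render the chat template, then decode with a lower temperature for more
+ # deterministic code generation.
+ code_prompt = tokenizer.apply_chat_template(
+     code_messages, tokenize=False, add_generation_prompt=True
+ )
+ out = pipe(
+     code_prompt,
+     max_new_tokens=512,   # extra room for code plus a short explanation
+     do_sample=True,
+     temperature=0.3,      # lower temperature, as suggested above
+     top_p=0.9,
+ )
+ print(out[0]["generated_text"])
+ ```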
+
+ ---
+
+ ## ⚙️ Inference Notes
+
+ * **Precision:** `bfloat16` (bf16) by default; `float16` also works well on most GPUs.
+ * **Quantization:** 4-bit/8-bit quantization via `bitsandbytes` or `auto-gptq` can reduce memory use; expect some quality trade-offs (see the sketch after this list).
+ * **Decoding:**
+
+   * General chat: `temperature=0.7`, `top_p=0.9–0.95`, `max_new_tokens=256`.
+   * Code/Math: lower temperature (`0.2–0.5`); optionally increase `max_new_tokens` to 512–1024 for step-by-step reasoning.
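+
+ If VRAM is tight, a quantized load is one option. Below is a minimal sketch assuming `bitsandbytes` is installed alongside `transformers`; the 4-bit settings shown are common illustrative defaults, not values validated for this merge.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ model_id = "rodrigomt/quem-v2-4b"
+
+ # 4-bit NF4 quantization with bf16 compute (illustrative settings).
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ ```
+
+ As noted above, expect some quality trade-off versus full bf16/fp16 weights.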
+
+ ---
+
+ ## 🧪 Evaluation
+
+ No unified public benchmark is included in this release. Early local testing indicates improved step-by-step reasoning compared to the prior 4B merge on similar hardware, but results are **highly sensitive** to decoding parameters and prompts. Community PRs with reproducible evals (Arena/AlpacaEval/HELM/OpenLLM Leaderboards/LocalAIMe) are welcome.
+
+ ---
+
+ ## 🖥️ System Requirements
+
+ **Minimum (single GPU):**
+
+ * RAM: 16 GB
+ * VRAM: 8 GB (e.g., RTX 3060 Ti / 3070 class)
+ * Storage: ~20 GB free
+ * CPU: Recent quad-core
+
+ **Recommended:**
+
+ * RAM: 32 GB
+ * VRAM: 12 GB+ (e.g., RTX 4070 / 3080 or higher)
+ * CPU: Modern multi-core
+
+ > Quantized weights can reduce VRAM but may affect quality.
+
+ ---
+
+ ## 🙌 Acknowledgments
+
+ Thanks to the authors and communities behind **Jan**, **Lightning**, **Intel Hebrew Math Tutor**, **Qwen3 Code Reasoning**, and the **LazyMergekit** toolchain.
+
+ ## 📝 License
+
+ This model is licensed under the **Apache 2.0 License**.