gen781 — Evolved Qwen3-8B (M2N2 Breeder)
An 8B-parameter language model created through 781 generations of automated evolutionary model merging, using the M2N2 (Model Merging of Natural Niches) algorithm. This model is the all-time highest-scoring champion from a breeding run of 927 generations and 749 total merges across a population of 24 Qwen3-8B fine-tunes.
Benchmark Results
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.7774 |
| ARC-Easy | acc_norm | 0.8371 |
| ARC-Challenge | acc_norm | 0.6058 |
| WinoGrande | acc | 0.7395 |
| TruthfulQA (MC2) | mc2 | 0.5378 |
| MathQA | acc_norm | 0.5437 |
| MMLU | acc | 0.7319 |
| GSM8K | exact_match | 0.8690 |
Evaluated on full datasets with lm-evaluation-harness at bfloat16 precision; GSM8K was limited to 1,000 samples.
Improvement Over Base Qwen3-8B
Both models evaluated identically — same benchmarks, same full datasets, same hardware.
| Benchmark | Base Qwen3-8B | gen781 | Delta |
|---|---|---|---|
| HellaSwag | 0.7491 | 0.7774 | +0.0283 |
| ARC-Easy | 0.8093 | 0.8371 | +0.0278 |
| ARC-Challenge | 0.5674 | 0.6058 | +0.0384 |
| WinoGrande | 0.6772 | 0.7395 | +0.0623 |
| TruthfulQA (MC2) | 0.5449 | 0.5378 | -0.0071 |
| MathQA | 0.4938 | 0.5437 | +0.0499 |
| MMLU | 0.7293 | 0.7319 | +0.0026 |
| GSM8K | 0.8860 | 0.8690 | -0.0170 |
| Average | 0.6821 | 0.7053 | +0.0232 (+3.4%) |
Wins 6 of 8 benchmarks. The two slight losses (TruthfulQA, GSM8K) are consistent with gen781's evolved thinking behavior — it developed a higher activation threshold for chain-of-thought reasoning, trading some step-by-step math performance for broader domain capability. See Emergent Thinking Behavior below.
+3.4% over the base Qwen3-8B — achieved purely through evolutionary weight merging with zero additional training.
Improvement Over Seed Models
During breeding, all models (seeds and offspring) were evaluated on 5 screening benchmarks with 100 samples each. These scores are comparable across the population:
| Seed Model | Domain | 5-Task Avg |
|---|---|---|
| TheFinAI/Fin-o1-8B | Finance/Reasoning | 0.6876 |
| mlxha/Qwen3-8B-grpo-medmcqa | Medical | 0.6866 |
| gustavecortal/Piaget-8B | Psychology | 0.6861 |
| marketeam/Qwen-Marketing | Marketing | 0.6853 |
| nikhilchandak/OpenForecaster-8B | Forecasting | 0.6762 |
| FutureMa/Qwen3-8B-Drama-Thinking | Creative | 0.6675 |
| DragonLLM/Qwen-Open-Finance-R-8B | Finance | 0.6483 |
| AXCXEPT/Qwen3-EZO-8B-beta | Japanese/General | 0.6437 |
| Goedel-LM/Goedel-Prover-V2-8B | Math Proofs | 0.6412 |
| OpenMOSS-Team/Qwen3-8B-ABC | General | 0.6186 |
| Logics-MLLM/Logics-STEM-8B-SFT | STEM | 0.6095 |
| gen781 (this model) | Evolved | 0.7116 |
+3.5% over the best individual seed (Fin-o1-8B at 0.6876) — achieved through evolutionary selection pressure across 781 generations. Screening scores use 100 samples per task on 5 benchmarks (HellaSwag, ARC-Easy, WinoGrande, TruthfulQA, MathQA); see full-dataset results above for production-grade evaluation.
How This Model Was Created
The M2N2 Breeder System
This model was created by an automated breeding system that applies evolutionary algorithms to LLM weight merging:
Seed Population: 24 Qwen3-8B fine-tunes from HuggingFace, each specialized in different domains (finance, medicine, math, creative writing, STEM, etc.) — see Complete Seed Pool below
Breeding Loop (927 generations):
- Mate Selection: Tournament selection (size 3) combined with attraction-based pairing (weight 0.7), favoring genetically diverse high-performers
- Merging: Parents combined via SLERP (Spherical Linear Interpolation) or DELLA (magnitude-based adaptive pruning) using mergekit
- Evaluation: Every offspring evaluated on 5 screening benchmarks (ARC-Easy, HellaSwag, WinoGrande, TruthfulQA, MathQA) with 100 samples each for fast iteration. Final champion evaluated on full datasets plus MMLU, ARC-Challenge, and GSM8K.
- Selection: Only offspring scoring higher than BOTH parents survive as "champions"
- Archive: Population diversity maintained via M2N2 niche-based fitness (archive size 354)
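The mate-selection and survival rules above can be sketched as follows. This is an illustrative toy, not the breeder's actual code: `fitness` and `distance` are stand-in callables, and the pairing score is one plausible way to blend fitness with attraction.

```python
import random

def tournament(population, fitness, k=3):
    """Return the fittest of k randomly sampled candidates."""
    return max(random.sample(population, k), key=fitness)

def select_parents(population, fitness, distance, attraction_weight=0.7):
    """Pick parent A by tournament, then parent B by a score blending
    fitness with genetic distance from A (attraction-based pairing)."""
    parent_a = tournament(population, fitness, k=3)
    candidates = [m for m in population if m != parent_a]
    def pairing_score(m):
        return (1 - attraction_weight) * fitness(m) + attraction_weight * distance(parent_a, m)
    return parent_a, max(candidates, key=pairing_score)

def survives(child_score, parent_a_score, parent_b_score):
    """An offspring becomes a champion only if it beats BOTH parents."""
    return child_score > max(parent_a_score, parent_b_score)
```

The strict beats-both-parents rule is what keeps the champion rate low (13.1% in this run): most merges regress toward one parent and are discarded.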
Result: 749 merges produced 98 champions (13.1% champion rate). gen781 is the highest-scoring model across the entire run.
Merge Details
gen781 was created by merging two intermediate champion models via SLERP:
- Parent A: gen466 (score 0.7032) — itself a "super-breeder" with an 80% champion offspring rate
- Parent B: gen657 (score 0.6829) — carrying genetics from GCA (General Combining Ability) probe crosses and DELLA-method ancestors
```yaml
base_model: merged_1d8e8e92_gen466
dtype: bfloat16
merge_method: slerp
parameters:
  t: 0.4525960257756839
slices:
  - parameters:
      t: 0.4525960257756839
    sources:
      - layer_range: [0, 18]
        model: merged_1d8e8e92_gen466
      - layer_range: [0, 18]
        model: merged_d72eb101_gen657
  - parameters:
      t: 0.5474039742243161
    sources:
      - layer_range: [18, 36]
        model: merged_1d8e8e92_gen466
      - layer_range: [18, 36]
        model: merged_d72eb101_gen657
```
The evolved genome uses a near-equal split (layer 18/36) with a slight asymmetry: t=0.453 in lower layers (favoring parent A) and t=0.547 in upper layers (favoring parent B). These parameters were not hand-tuned — they emerged through generations of mutation and selection.
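For reference, the per-tensor SLERP operation looks roughly like this. It is a minimal sketch, not mergekit's implementation, which additionally handles layer slicing, dtype plumbing, and edge cases:

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors,
    treating each tensor as one flattened vector. t=0 returns a, t=1 returns b."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    # Angle between the two weight vectors, measured on the unit sphere
    a_norm = a_flat / (a_flat.norm() + eps)
    b_norm = b_flat / (b_flat.norm() + eps)
    dot = torch.clamp(torch.dot(a_norm, b_norm), -1.0, 1.0)
    omega = torch.arccos(dot)
    if omega.abs() < eps:
        # Near-parallel vectors: fall back to plain linear interpolation
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)
```

With the genome above, layers 0-18 use `slerp(gen466_w, gen657_w, t=0.453)` and layers 18-36 use `t=0.547`, which is where the slight lower/upper asymmetry comes from.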
Ancestry
gen781's complete ancestry traces back to 11 original HuggingFace fine-tunes through a deep family tree spanning multiple merge methods:
Scores shown are 5-task screening scores (100 samples) used during the breeding run for selection.
```
gen781 (0.7116*) SLERP
├── gen466 (0.7032) SLERP
│   ├── gen371 (0.6846) SLERP
│   │   ├── GCA: gen10 x gen40 (0.6864)
│   │   │   ├── gen10: Goedel-Prover-V2-8B x Drama-Thinking (SLERP)
│   │   │   └── gen40: [Finance-R x Forecaster] x [Fin-o1 x EZO] (DELLA)
│   │   └── GCA: Drama-Thinking x gen40 (0.6921)
│   └── gen28: Finance-R x Logics-STEM (SLERP)
└── gen657 (0.6829) SLERP
    ├── GCA: gen161 x gen40 (0.6982)
    │   └── gen161: deep SLERP lineage incorporating Fin-o1, Goedel-Prover,
    │       Drama-Thinking, Piaget, medmcqa, Marketing, ABC, Forecaster
    └── gen41: [Marketing x Finance-R x Drama-Thinking] x [Piaget x EZO] (DELLA)
```
*gen781's full-dataset 8-benchmark average is 0.7053. The 0.7116 screening score was used for selection during breeding.
A key finding from this breeding program: DELLA merges never produced the top champion directly, but DELLA-lineage ancestors appear throughout the family tree of the highest-scoring models. DELLA creates genetic diversity in early generations that SLERP then exploits in later crosses — a form of hybrid vigor.
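The diversity DELLA injects comes from its magnitude-based stochastic pruning of task vectors (the deltas between a fine-tune and the base model). The sketch below shows only that single-tensor prune-and-rescale idea; the real method (MAGPRUNE) assigns rank-based drop probabilities across multiple task vectors and then fuses them, so treat this as a loose illustration:

```python
import torch

def magnitude_prune_delta(delta: torch.Tensor, keep_fraction: float = 0.4) -> torch.Tensor:
    """Stochastically drop entries of a task-vector tensor, with keep
    probability increasing with magnitude rank; survivors are rescaled
    by 1/p so the expected value of the delta is preserved."""
    flat = delta.flatten()
    ranks = flat.abs().argsort().argsort().float()  # 0 = smallest magnitude
    n = flat.numel()
    probs = (ranks / max(n - 1, 1)) * (2 * keep_fraction)  # mean ~ keep_fraction
    probs = probs.clamp(0.0, 1.0)
    mask = torch.bernoulli(probs)
    pruned = flat * mask / probs.clamp_min(1e-8)
    return pruned.reshape(delta.shape)
```

Because different random subsets of each parent's delta survive each merge, two DELLA offspring of the same parents can differ substantially, which is exactly the kind of variation a later SLERP cross can exploit.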
Qualitative Behavior — What Survived the Merge
Benchmarks tell you accuracy. This section describes what the model actually does, based on hands-on testing across the domains of its 11 ancestor fine-tunes.
Domain Knowledge — Clearly Retained
- Medical (from Qwen3-8B-grpo-medmcqa): Correct eGFR thresholds for metformin, accurate lactic acidosis risk framing, appropriate alternative medication suggestions.
- Mathematical Reasoning (from Goedel-Prover-V2, Fin-o1): Full correct derivations with self-checking on math proofs.
- Forecasting (from OpenForecaster): Weighted factor analysis, historical Fed cycle references, worked numerical examples for economic questions.
- Marketing (from Qwen-Marketing): Structured buyer personas, channel prioritization frameworks, SEO-specific tactical knowledge.
- Multilingual (from Qwen3-EZO-8B-beta): Japanese fluency retained.
What Was Lost
- Screenplay formatting (from Drama-Thinking): gen781 writes prose, not scripts. The structural template did not survive.
- Deep thinking chains (from Drama-Thinking): The 3,400-token extended thinking blocks on creative tasks are gone — thinking blocks on creative prompts come back empty.
What Transformed
The Drama-Thinking fine-tune's most valuable contribution was not structure but style — atmospheric prose, emotional depth, visual storytelling, and structured narrative. These qualities survived the merge even though the screenplay format and long thinking chains did not.
Emergent Thinking Behavior
The most interesting emergent property. Base Qwen3-8B uses its thinking mode liberally. gen781 developed a higher activation threshold for thinking, almost certainly because the non-thinking fine-tunes (medical, marketing, finance, forecasting) outnumber the thinking-oriented ones (Fin-o1, Goedel-Prover, Drama-Thinking) in its ancestry.
The result:
- Thinks for deductive tasks, math, and context disambiguation in complex multi-turn conversations
- Skips thinking for direct domain knowledge retrieval (medical facts, marketing frameworks, etc.)
- Explicit "think step by step" prompting does not override this — the threshold is baked into the weights, not the prompt layer
The net characterization: a broad, stable generalist that traded thinking depth for domain breadth — which for most practical use cases is the right tradeoff.
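To observe the threshold yourself, you can split a generation into its thinking and answer parts, assuming Qwen3's `<think>…</think>` tag format, and check whether the thinking block comes back empty:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate a Qwen3-style <think>...</think> block from the visible answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, count=1, flags=re.DOTALL).strip()
    return thinking, answer
```

On gen781, domain-recall prompts (medical facts, marketing frameworks) typically return an empty thinking string, while deductive and math prompts populate it.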
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ryanfortin/community-blend-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the mechanism of action of metformin."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,  # temperature/top_p only take effect with sampling enabled
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Model Architecture
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~8B |
| Layers | 36 |
| Hidden Size | 4096 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| FFN Size | 12288 |
| Vocab Size | 151,936 |
| Max Context | 40,960 tokens |
| RoPE Theta | 1,000,000 |
| Precision | bfloat16 |
Breeding Statistics
| Metric | Value |
|---|---|
| Generation | 781 of 927 |
| Total Merges in Run | 749 |
| Total Champions | 98 (13.1% rate) |
| Seed Pool Size | 24 models |
| Archive Size | 354 |
| Merge Methods Used | SLERP + DELLA |
| Ancestor HF Models | 11 |
| Ancestor Merged Models | 30+ intermediates |
Complete Seed Pool
All 24 Qwen3-8B fine-tunes used in the breeding population. Models marked with * are direct ancestors of gen781 (their weights are in this model). The remaining 13 were evaluated and crossed during breeding but their genetics did not survive to gen781.
Acknowledgments
Tools and Methods:
- mergekit by Charles Goddard et al. (Goddard et al., 2024) for model merging
- lm-evaluation-harness by EleutherAI for evaluation
- Evolutionary model merging concept from Sakana AI (Akiba et al., 2025), published in Nature Machine Intelligence
- DELLA-Merging by Deep et al. for magnitude-based adaptive pruning
Base Architecture:
- Qwen3-8B by the Qwen Team at Alibaba (Yang et al., 2025)
Seed Model Authors: Thank you to all 24 seed model creators whose fine-tunes made this breeding experiment possible. Special thanks to the 11 ancestor model authors whose work directly contributes to gen781's weights — see the Complete Seed Pool table above for the full list with links.
Citations
If you use this model, please cite the base architecture and the tools/methods used.
Core — Architecture and Methods
- Qwen3 Technical Report — Qwen Team, 2025 — Paper | Model
- Evolutionary Optimization of Model Merging Recipes — Akiba et al., 2025 — Paper (Nature Machine Intelligence) | arXiv | Sakana AI
- Arcee's MergeKit: A Toolkit for Merging Large Language Models — Goddard et al., 2024 — Paper | GitHub
- DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling — Deep et al., 2024 — Paper
- lm-evaluation-harness — EleutherAI — GitHub
Ancestor Seed Models (weights in this model)
- Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance — Qian et al., 2025 — Paper | Model
- Goedel-Prover-V2: Scaling Formal Theorem Proving — Lin et al., 2025 — Paper | Model
- Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training — Xu et al., 2026 — Paper | Model
- Scaling Open-Ended Reasoning to Predict the Future (OpenForecaster) — Chandak et al., 2025 — Paper | Model
- ABC-Bench: Benchmarking Agentic Backend Coding — Yang et al., 2026 — Paper | Model
- Qwen3-8B-Drama-Thinking — FutureMa — Model
- Qwen-Open-Finance-R-8B — DragonLLM — Model
- Piaget-8B — gustavecortal — Model
- Qwen3-EZO-8B-beta — AXCXEPT — Model
- Qwen-Marketing — Marketeam — Model
- Qwen3-8B-grpo-medmcqa — mlxha — Model
Other Seed Pool Models with Papers (evaluated during breeding, not in final ancestry)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI, 2025 — Paper | Model
- MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning — Fan et al., 2025 — Paper | Model
- ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration — Su et al., 2025 — Paper | Model
- Light-IF: Endowing LLMs with Generalizable Reasoning — Wang et al., 2025 — Paper | Model
- SERA: Soft-Verified Efficient Repository Agents — Shen et al., 2026 — Paper | Model
- TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning — Yang et al., 2025 — Paper | Model
- Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning — Ruan et al., 2025 — Paper | Model
- SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents — Cai et al., 2025 — Paper | Model
- OpenThinker-Agent — OpenThoughts — Blog | Model
- Qwen3-8B-ShiningValiant3 — Valiant Labs — Model
- Qwen3-8B-Esper3 — Valiant Labs — Model
- Qwen3-psychological-reasoning-8B — gustavecortal — Model
BibTeX
```bibtex
% Base Architecture
@misc{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}

% Evolutionary Model Merging
@article{akiba2025evolutionary,
  title={Evolutionary Optimization of Model Merging Recipes},
  author={Akiba, Takuya and Shing, Makoto and Tang, Yujin and Sun, Qi and Ha, David},
  journal={Nature Machine Intelligence},
  year={2025},
  doi={10.1038/s42256-024-00975-8}
}

% MergeKit
@article{goddard2024mergekit,
  title={Arcee's MergeKit: A Toolkit for Merging Large Language Models},
  author={Goddard, Charles and Siriwardhana, Shamane and Ehghaghi, Malikeh and Meyers, Luke and Karpukhin, Vlad and Benedict, Brian and McQuade, Mark and Solawetz, Jacob},
  journal={arXiv preprint arXiv:2403.13257},
  year={2024}
}

% DELLA-Merging
@article{deep2024dellamerging,
  title={DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling},
  author={Deep, Pala Tej and Bhardwaj, Rishabh and Poria, Soujanya},
  journal={arXiv preprint arXiv:2406.11617},
  year={2024}
}

% Ancestor Seed Models
@article{qian2025fino1,
  title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance},
  author={Qian, Lingfei and Zhou, Weipeng and Wang, Yan and Peng, Xueqing and Huang, Jimin and Xie, Qianqian},
  journal={arXiv preprint arXiv:2502.08127},
  year={2025}
}

@article{lin2025goedelproverv2,
  title={Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction},
  author={Lin, Yong and Tang, Shange and Lyu, Bohan and Yang, Ziran and Chung, Jui-Hui and Zhao, Haoyu and Jiang, Lai and Geng, Yihan and Ge, Jiawei and Sun, Jingruo and others},
  journal={arXiv preprint arXiv:2508.03613},
  year={2025}
}

@misc{xu2026logicsstem,
  title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
  author={Xu, Mingyu and Fang, Cheng and Jiang, Keyue and Zheng, Yuqian and Xiao, Yanghua and Zhou, Baojian and Zhao, Qifang and Zheng, Suhang and Zhu, Xiuwen and Tang, Jiyang and others},
  year={2026},
  eprint={2601.01562},
  archivePrefix={arXiv}
}

@article{chandak2025openforecaster,
  title={Scaling Open-Ended Reasoning to Predict the Future},
  author={Chandak, Nikhil and Goel, Shashwat and Prabhu, Ameya and Hardt, Moritz and Geiping, Jonas},
  journal={arXiv preprint arXiv:2512.25070},
  year={2025}
}

@misc{yang2026abcbench,
  title={ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development},
  author={Yang, Jie and Guo, Honglin and Ji, Li and Zhou, Jiazheng and Zheng, Rui and Lei, Zhikai and Zhang, Shuo and Xi, Zhiheng and Liu, Shichun and Wang, Yuxin and others},
  year={2026},
  eprint={2601.11077},
  archivePrefix={arXiv}
}

% Other Seed Pool Models
@misc{deepseekai2025deepseekr1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025},
  eprint={2501.12948},
  archivePrefix={arXiv}
}

@article{fan2025megascience,
  title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
  author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
  journal={arXiv preprint arXiv:2507.16812},
  year={2025}
}

@misc{su2025toolorchestra,
  title={ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author={Su, Hongjin and Diao, Shizhe and Lu, Ximing and Liu, Mingjie and Xu, Jiacheng and Dong, Xin and Fu, Yonggan and others},
  year={2025},
  eprint={2511.21689},
  archivePrefix={arXiv}
}

@misc{wang2025lightif,
  title={Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following},
  author={Wang, Chenyang and Wen, Liang and Jia, Shousheng and Zhang, Xiangzheng and Xu, Liang},
  year={2025},
  eprint={2508.03178},
  archivePrefix={arXiv}
}

@misc{shen2026sera,
  title={SERA: Soft-Verified Efficient Repository Agents},
  author={Shen, Ethan and Tormoen, Danny and Shah, Saurabh and Farhadi, Ali and Dettmers, Tim},
  year={2026},
  eprint={2601.20789},
  archivePrefix={arXiv}
}

@misc{yang2025tablegptr1,
  title={TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning},
  author={Yang, Saisai and Huang, Qingyi and Yuan, Jing and Zha, Liangyu and Tang, Kai and others},
  year={2025},
  eprint={2512.20312},
  archivePrefix={arXiv}
}

@article{ruan2025critiquecoder,
  title={Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning},
  author={Ruan, Chi and Jiang, Dongfu and Wang, Yubo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2509.22824},
  year={2025}
}

@article{cai2025smartsnap,
  title={SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents},
  author={Cai, Shaofei and Qin, Yulei and Lin, Haojia and Xu, Zihan and Li, Gang and Shi, Yuchen and others},
  journal={arXiv preprint arXiv:2512.22322},
  year={2025}
}
```
License
This model inherits the license terms of its constituent models. All 11 ancestor models use permissive licenses:
- Apache 2.0: Qwen3-8B (base), Fin-o1-8B, Goedel-Prover-V2-8B, Drama-Thinking, Finance-R, Logics-STEM, EZO-8B-beta, medmcqa, Qwen3-8B-ABC
- MIT: Piaget-8B, Qwen-Marketing, OpenForecaster-8B
Both Apache 2.0 and MIT are permissive licenses that allow commercial use, modification, and redistribution. This model is released under Apache 2.0 following the base Qwen3-8B license.