gen781 — Evolved Qwen3-8B (M2N2 Breeder)

An 8B-parameter language model created through 781 generations of automated evolutionary model merging, using the M2N2 (Model Merging of Natural Niches) algorithm. This model is the all-time highest-scoring champion from a breeding run of 927 generations and 749 total merges across a population of 24 Qwen3-8B fine-tunes.

Benchmark Results

| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.7774 |
| ARC-Easy | acc_norm | 0.8371 |
| ARC-Challenge | acc_norm | 0.6058 |
| WinoGrande | acc | 0.7395 |
| TruthfulQA (MC2) | mc2 | 0.5378 |
| MathQA | acc_norm | 0.5437 |
| MMLU | acc | 0.7319 |
| GSM8K | exact_match | 0.8690 |

Evaluated on full datasets with lm-evaluation-harness at bfloat16 precision; GSM8K was limited to 1,000 samples.

Improvement Over Base Qwen3-8B

Both models evaluated identically — same benchmarks, same full datasets, same hardware.

| Benchmark | Base Qwen3-8B | gen781 | Delta |
|---|---|---|---|
| HellaSwag | 0.7491 | 0.7774 | +0.0283 |
| ARC-Easy | 0.8093 | 0.8371 | +0.0278 |
| ARC-Challenge | 0.5674 | 0.6058 | +0.0384 |
| WinoGrande | 0.6772 | 0.7395 | +0.0623 |
| TruthfulQA (MC2) | 0.5449 | 0.5378 | -0.0071 |
| MathQA | 0.4938 | 0.5437 | +0.0499 |
| MMLU | 0.7293 | 0.7319 | +0.0026 |
| GSM8K | 0.8860 | 0.8690 | -0.0170 |
| Average | 0.6821 | 0.7053 | +0.0232 (+3.4%) |

Wins 6 of 8 benchmarks. The two slight losses (TruthfulQA, GSM8K) are consistent with gen781's evolved thinking behavior — it developed a higher activation threshold for chain-of-thought reasoning, trading some step-by-step math performance for broader domain capability. See Emergent Thinking Behavior below.

+3.4% over the base Qwen3-8B — achieved purely through evolutionary weight merging with zero additional training.
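The averages and the +3.4% figure can be rechecked directly from the per-benchmark scores in the table:

```python
base = {"HellaSwag": 0.7491, "ARC-Easy": 0.8093, "ARC-Challenge": 0.5674,
        "WinoGrande": 0.6772, "TruthfulQA (MC2)": 0.5449, "MathQA": 0.4938,
        "MMLU": 0.7293, "GSM8K": 0.8860}
gen781 = {"HellaSwag": 0.7774, "ARC-Easy": 0.8371, "ARC-Challenge": 0.6058,
          "WinoGrande": 0.7395, "TruthfulQA (MC2)": 0.5378, "MathQA": 0.5437,
          "MMLU": 0.7319, "GSM8K": 0.8690}

base_avg = sum(base.values()) / len(base)      # 0.6821
gen_avg = sum(gen781.values()) / len(gen781)   # 0.7053
rel = (gen_avg - base_avg) / base_avg          # ~0.034, i.e. +3.4% relative
```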

Improvement Over Seed Models

During breeding, all models (seeds and offspring) were evaluated on 5 screening benchmarks with 100 samples each. These scores are comparable across the population:

| Seed Model | Domain | 5-Task Avg |
|---|---|---|
| TheFinAI/Fin-o1-8B | Finance/Reasoning | 0.6876 |
| mlxha/Qwen3-8B-grpo-medmcqa | Medical | 0.6866 |
| gustavecortal/Piaget-8B | Psychology | 0.6861 |
| marketeam/Qwen-Marketing | Marketing | 0.6853 |
| nikhilchandak/OpenForecaster-8B | Forecasting | 0.6762 |
| FutureMa/Qwen3-8B-Drama-Thinking | Creative | 0.6675 |
| DragonLLM/Qwen-Open-Finance-R-8B | Finance | 0.6483 |
| AXCXEPT/Qwen3-EZO-8B-beta | Japanese/General | 0.6437 |
| Goedel-LM/Goedel-Prover-V2-8B | Math Proofs | 0.6412 |
| OpenMOSS-Team/Qwen3-8B-ABC | General | 0.6186 |
| Logics-MLLM/Logics-STEM-8B-SFT | STEM | 0.6095 |
| gen781 (this model) | Evolved | 0.7116 |

+3.5% over the best individual seed (Fin-o1-8B at 0.6876) — achieved through evolutionary selection pressure across 781 generations. Screening scores use 100 samples per task on 5 benchmarks (HellaSwag, ARC-Easy, WinoGrande, TruthfulQA, MathQA); see full-dataset results above for production-grade evaluation.
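One reason the 100-sample screening scores and the full-dataset scores differ: a small-sample accuracy estimate carries meaningful sampling noise. A quick sketch of the binomial standard error:

```python
import math

def accuracy_se(p, n):
    """Standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

per_task = accuracy_se(0.70, 100)    # ~0.046: a single 100-sample task score
five_task = per_task / math.sqrt(5)  # ~0.020: noise on the 5-task average
```

At roughly 0.70 accuracy, one screening task is only measured to about ±0.05; averaging five tasks shrinks that by roughly √5 (assuming the tasks are independent), which is why screening scores are used for selection but full datasets for the final numbers.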

How This Model Was Created

The M2N2 Breeder System

This model was created by an automated breeding system that applies evolutionary algorithms to LLM weight merging:

  1. Seed Population: 24 Qwen3-8B fine-tunes from HuggingFace, each specialized in different domains (finance, medicine, math, creative writing, STEM, etc.) — see Complete Seed Pool below

  2. Breeding Loop (927 generations):

    • Mate Selection: Tournament selection (size 3) combined with attraction-based pairing (weight 0.7), favoring genetically diverse high-performers
    • Merging: Parents combined via SLERP (Spherical Linear Interpolation) or DELLA (magnitude-based adaptive pruning) using mergekit
    • Evaluation: Every offspring evaluated on 5 screening benchmarks (ARC-Easy, HellaSwag, WinoGrande, TruthfulQA, MathQA) with 100 samples each for fast iteration. Final champion evaluated on full datasets plus MMLU, ARC-Challenge, and GSM8K.
    • Selection: Only offspring scoring higher than BOTH parents survive as "champions"
    • Archive: Population diversity maintained via M2N2 niche-based fitness (archive size 354)
  3. Result: 749 merges produced 98 champions (13.1% champion rate). gen781 is the highest-scoring model across the entire run.
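The loop above can be sketched in miniature. This toy version keeps only tournament selection and the survive-if-better-than-both-parents rule; the real system's merge-and-evaluate step (a SLERP/DELLA merge plus benchmarks) is replaced here by a placeholder score, and attraction-based pairing and the niche archive are omitted:

```python
import random

def tournament(pop, k=3, rng=random):
    """Tournament selection: best of k randomly drawn candidates."""
    return max(rng.sample(pop, k), key=lambda m: m["score"])

def breed(seeds, generations=200, rng=None):
    rng = rng or random.Random(0)
    pop = list(seeds)
    champions = []
    for gen in range(generations):
        a = tournament(pop, rng=rng)
        b = tournament(pop, rng=rng)
        # Placeholder for merging the parents and benchmarking the offspring:
        child_score = (a["score"] + b["score"]) / 2 + rng.gauss(0, 0.01)
        if child_score > max(a["score"], b["score"]):  # must beat BOTH parents
            child = {"name": f"gen{gen}", "score": child_score}
            champions.append(child)
            pop.append(child)  # champions rejoin the breeding population
    return champions
```

The strict beat-both-parents filter is why the run's champion rate (13.1%) is low: most merges regress toward the parents' mean.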

Merge Details

gen781 was created by merging two intermediate champion models via SLERP:

  • Parent A: gen466 (score 0.7032) — itself a "super-breeder" with an 80% champion offspring rate
  • Parent B: gen657 (score 0.6829) — carrying genetics from GCA (General Combining Ability) probe crosses and DELLA-method ancestors
The mergekit configuration that produced it:

```yaml
base_model: merged_1d8e8e92_gen466
dtype: bfloat16
merge_method: slerp
parameters:
  t: 0.4525960257756839
slices:
- parameters:
    t: 0.4525960257756839
  sources:
  - layer_range: [0, 18]
    model: merged_1d8e8e92_gen466
  - layer_range: [0, 18]
    model: merged_d72eb101_gen657
- parameters:
    t: 0.5474039742243161
  sources:
  - layer_range: [18, 36]
    model: merged_1d8e8e92_gen466
  - layer_range: [18, 36]
    model: merged_d72eb101_gen657
```

The evolved genome uses a near-equal split (layer 18/36) with a slight asymmetry: t=0.453 in lower layers (favoring parent A) and t=0.547 in upper layers (favoring parent B). These parameters were not hand-tuned — they emerged through generations of mutation and selection.
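SLERP interpolates along the arc between two weight tensors rather than along the straight line between them, which better preserves weight magnitudes. A minimal numpy sketch of the idea (mergekit's implementation handles per-tensor edge cases this omits):

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between flattened weight vectors a and b."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    theta = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))  # angle between them
    if theta < eps:
        return (1 - t) * a + t * b  # nearly parallel: plain linear interpolation
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
```

At t=0 this returns parent A exactly, at t=1 parent B; gen781's evolved t values sit just either side of the midpoint.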

Ancestry

gen781's complete ancestry traces back to 11 original HuggingFace fine-tunes through a deep family tree spanning multiple merge methods:

Scores shown are 5-task screening scores (100 samples) used during the breeding run for selection.

```
gen781 (0.7116*) SLERP
├── gen466 (0.7032) SLERP
│   ├── gen371 (0.6846) SLERP
│   │   ├── GCA: gen10 x gen40 (0.6864)
│   │   │   ├── gen10: Goedel-Prover-V2-8B x Drama-Thinking (SLERP)
│   │   │   └── gen40: [Finance-R x Forecaster] x [Fin-o1 x EZO] (DELLA)
│   │   └── GCA: Drama-Thinking x gen40 (0.6921)
│   └── gen28: Finance-R x Logics-STEM (SLERP)
└── gen657 (0.6829) SLERP
    ├── GCA: gen161 x gen40 (0.6982)
    │   └── gen161: deep SLERP lineage incorporating Fin-o1, Goedel-Prover,
    │       Drama-Thinking, Piaget, medmcqa, Marketing, ABC, Forecaster
    └── gen41: [Marketing x Finance-R x Drama-Thinking] x [Piaget x EZO] (DELLA)
```

*gen781's full-dataset 8-benchmark average is 0.7053. The 0.7116 screening score was used for selection during breeding.

A key finding from this breeding program: DELLA merges never produced the top champion directly, but DELLA-lineage ancestors appear throughout the family tree of the highest-scoring models. DELLA creates genetic diversity in early generations that SLERP then exploits in later crosses — a form of hybrid vigor.
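DELLA's core pruning step (MAGPRUNE, per Deep et al.) keeps higher-magnitude delta parameters with higher probability and rescales the survivors so the merge stays unbiased in expectation. A toy numpy sketch of that idea, not mergekit's implementation:

```python
import numpy as np

def della_prune(delta, keep_fraction=0.4, rng=None):
    """Magnitude-ranked stochastic pruning of a task-vector delta (toy MAGPRUNE)."""
    rng = np.random.default_rng(rng)
    flat = delta.ravel()
    n = flat.size
    # Rank parameters by magnitude: rank 0 = smallest, rank n-1 = largest.
    ranks = np.argsort(np.argsort(np.abs(flat)))
    # Keep probability grows with rank; probabilities sum to n * keep_fraction.
    p = np.clip((ranks + 1) / (ranks + 1).sum() * n * keep_fraction, 0.0, 1.0)
    mask = rng.random(n) < p
    out = np.zeros_like(flat)
    out[mask] = flat[mask] / p[mask]  # rescale so the result is unbiased in expectation
    return out.reshape(delta.shape)
```

The randomness is the point: two DELLA merges of the same parents produce different sparse offspring, which is plausibly why DELLA lineages inject the diversity that SLERP later exploits.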

Qualitative Behavior — What Survived the Merge

Benchmarks tell you accuracy. This section describes what the model actually does, based on hands-on testing across the domains of its 11 ancestor fine-tunes.

Domain Knowledge — Clearly Retained

  • Medical (from Qwen3-8B-grpo-medmcqa): Correct eGFR thresholds for metformin, accurate lactic acidosis risk framing, appropriate alternative medication suggestions.
  • Mathematical Reasoning (from Goedel-Prover-V2, Fin-o1): Full correct derivations with self-checking on math proofs.
  • Forecasting (from OpenForecaster): Weighted factor analysis, historical Fed cycle references, worked numerical examples for economic questions.
  • Marketing (from Qwen-Marketing): Structured buyer personas, channel prioritization frameworks, SEO-specific tactical knowledge.
  • Multilingual (from Qwen3-EZO-8B-beta): Japanese fluency retained.

What Was Lost

  • Screenplay formatting (from Drama-Thinking): gen781 writes prose, not scripts. The structural template did not survive.
  • Deep thinking chains (from Drama-Thinking): The 3,400-token extended thinking blocks on creative tasks are gone — thinking blocks on creative prompts come back empty.

What Transformed

The Drama-Thinking fine-tune's most valuable contribution was not structure but style — atmospheric prose, emotional depth, visual storytelling, and structured narrative. These qualities survived the merge even though the screenplay format and long thinking chains did not.

Emergent Thinking Behavior

The most interesting emergent property. Base Qwen3-8B uses its thinking mode liberally. gen781 developed a higher activation threshold for thinking, almost certainly because the non-thinking fine-tunes (medical, marketing, finance, forecasting) outnumber the thinking-oriented ones (Goedel-Prover-V2, Drama-Thinking) in its ancestry.

The result:

  • Thinks for deductive tasks, math, and context disambiguation in complex multi-turn conversations
  • Skips thinking for direct domain knowledge retrieval (medical facts, marketing frameworks, etc.)
  • Explicit "think step by step" prompting does not override this — the threshold is baked into the weights, not the prompt layer

The net characterization: a broad, stable generalist that traded thinking depth for domain breadth — which for most practical use cases is the right tradeoff.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ryanfortin/community-blend-qwen3-8b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the mechanism of action of metformin."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
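Qwen3-style models wrap their reasoning in `<think>` tags, which is also how the empty thinking blocks described in Emergent Thinking Behavior can be observed. A small helper (hypothetical, for illustration) to separate a decoded response into its two parts:

```python
import re

def split_thinking(text):
    """Split a Qwen3-style response into (thinking, answer); thinking may be empty."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()
```

On domain-retrieval prompts, gen781 typically yields an empty first element; on deductive or math prompts the thinking block is populated.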

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~8B |
| Layers | 36 |
| Hidden Size | 4096 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| FFN Size | 12288 |
| Vocab Size | 151,936 |
| Max Context | 40,960 tokens |
| RoPE Theta | 1,000,000 |
| Precision | bfloat16 |
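The ~8B figure can be sanity-checked from the table. A back-of-envelope count, assuming untied embeddings, 128-dim attention heads, SwiGLU MLPs, and ignoring the (tiny) norm parameters:

```python
hidden, layers, heads, kv_heads = 4096, 36, 32, 8
ffn, vocab, head_dim = 12288, 151936, 128

attn = hidden * heads * head_dim          # Q projection
attn += 2 * hidden * kv_heads * head_dim  # K and V projections (GQA: 8 KV heads)
attn += heads * head_dim * hidden         # output projection
mlp = 3 * hidden * ffn                    # gate, up, and down (SwiGLU)
per_layer = attn + mlp

total = layers * per_layer + 2 * vocab * hidden  # plus embeddings and LM head
print(f"{total / 1e9:.2f}B")  # prints "8.19B"
```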

Breeding Statistics

| Metric | Value |
|---|---|
| Generation | 781 of 927 |
| Total Merges in Run | 749 |
| Total Champions | 98 (13.1% rate) |
| Seed Pool Size | 24 models |
| Archive Size | 354 |
| Merge Methods Used | SLERP + DELLA |
| Ancestor HF Models | 11 |
| Ancestor Merged Models | 30+ intermediates |

Complete Seed Pool

All 24 Qwen3-8B fine-tunes used in the breeding population. Models marked with * are direct ancestors of gen781 (their weights are in this model). The remaining models were evaluated and crossed during breeding, but their genetics did not survive to gen781.

| Model | Domain | License | Ancestor? |
|---|---|---|---|
| Qwen/Qwen3-8B | Base architecture | Apache 2.0 | Base |
| TheFinAI/Fin-o1-8B | Finance/Reasoning | Apache 2.0 | * |
| Goedel-LM/Goedel-Prover-V2-8B | Math Proofs | Apache 2.0 | * |
| FutureMa/Qwen3-8B-Drama-Thinking | Creative Writing | Apache 2.0 | * |
| DragonLLM/Qwen-Open-Finance-R-8B | Finance | Apache 2.0 | * |
| Logics-MLLM/Logics-STEM-8B-SFT | STEM | Apache 2.0 | * |
| gustavecortal/Piaget-8B | Psychology | MIT | * |
| AXCXEPT/Qwen3-EZO-8B-beta | Japanese/General | Apache 2.0 | * |
| marketeam/Qwen-Marketing | Marketing | MIT | * |
| nikhilchandak/OpenForecaster-8B | Forecasting | MIT | * |
| mlxha/Qwen3-8B-grpo-medmcqa | Medical | Apache 2.0 | * |
| OpenMOSS-Team/Qwen3-8B-ABC | General/Coding | Apache 2.0 | * |
| deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | Reasoning | MIT | |
| nvidia/Nemotron-Orchestrator-8B | Agentic/Orchestration | NVIDIA License | |
| open-thoughts/OpenThinker-Agent-v1 | Agents/Tools | Apache 2.0 | |
| tablegpt/TableGPT-R1 | Data/SQL | Apache 2.0 | |
| allenai/SERA-8B | Repository Agents | Apache 2.0 | |
| TIGER-Lab/Critique-Coder-8B | Code Review | Apache 2.0 | |
| qihoo360/Light-IF-8B | Instruction Following | Apache 2.0 | |
| MegaScience/Qwen3-8B-MegaScience | Science | Apache 2.0 | |
| ValiantLabs/Qwen3-8B-ShiningValiant3 | Instruction Following | Apache 2.0 | |
| ValiantLabs/Qwen3-8B-Esper3 | Coding/Reasoning | Apache 2.0 | |
| gustavecortal/Qwen3-psychological-reasoning-8B | Psychology | MIT | |
| yolay/SmartSnap-Qwen3-8B | Agentic | Apache 2.0 | |

Acknowledgments


Seed Model Authors: Thank you to all 24 seed model creators whose fine-tunes made this breeding experiment possible. Special thanks to the 11 ancestor model authors whose work directly contributes to gen781's weights — see the Complete Seed Pool table above for the full list.

Citations

If you use this model, please cite the base architecture and the tools/methods used.

Core — Architecture and Methods

  • Qwen3 Technical Report — Qwen Team, 2025 — Paper | Model
  • Evolutionary Optimization of Model Merging Recipes — Akiba et al., 2025 — Paper (Nature Machine Intelligence) | arXiv | Sakana AI
  • Arcee's MergeKit: A Toolkit for Merging Large Language Models — Goddard et al., 2024 — Paper | GitHub
  • DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling — Deep et al., 2024 — Paper
  • lm-evaluation-harness — EleutherAI — GitHub

Ancestor Seed Models (weights in this model)

  • Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance — Qian et al., 2025 — Paper | Model
  • Goedel-Prover-V2: Scaling Formal Theorem Proving — Lin et al., 2025 — Paper | Model
  • Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training — Xu et al., 2026 — Paper | Model
  • Scaling Open-Ended Reasoning to Predict the Future (OpenForecaster) — Chandak et al., 2025 — Paper | Model
  • ABC-Bench: Benchmarking Agentic Backend Coding — Yang et al., 2026 — Paper | Model
  • Qwen3-8B-Drama-Thinking — FutureMa — Model
  • Qwen-Open-Finance-R-8B — DragonLLM — Model
  • Piaget-8B — gustavecortal — Model
  • Qwen3-EZO-8B-beta — AXCXEPT — Model
  • Qwen-Marketing — Marketeam — Model
  • Qwen3-8B-grpo-medmcqa — mlxha — Model

Other Seed Pool Models with Papers (evaluated during breeding, not in final ancestry)

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI, 2025 — Paper | Model
  • MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning — Fan et al., 2025 — Paper | Model
  • ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration — Su et al., 2025 — Paper | Model
  • Light-IF: Endowing LLMs with Generalizable Reasoning — Wang et al., 2025 — Paper | Model
  • SERA: Soft-Verified Efficient Repository Agents — Shen et al., 2026 — Paper | Model
  • TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning — Yang et al., 2025 — Paper | Model
  • Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning — Ruan et al., 2025 — Paper | Model
  • SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents — Cai et al., 2025 — Paper | Model
  • OpenThinker-Agent — OpenThoughts — Blog | Model
  • Qwen3-8B-ShiningValiant3 — Valiant Labs — Model
  • Qwen3-8B-Esper3 — Valiant Labs — Model
  • Qwen3-psychological-reasoning-8B — gustavecortal — Model
BibTeX
% Base Architecture
@misc{qwen3,
    title={Qwen3 Technical Report},
    author={Qwen Team},
    year={2025},
    eprint={2505.09388},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.09388}
}

% Evolutionary Model Merging
@article{akiba2025evolutionary,
    title={Evolutionary Optimization of Model Merging Recipes},
    author={Akiba, Takuya and Shing, Makoto and Tang, Yujin and Sun, Qi and Ha, David},
    journal={Nature Machine Intelligence},
    year={2025},
    doi={10.1038/s42256-024-00975-8}
}

% MergeKit
@article{goddard2024mergekit,
    title={Arcee's MergeKit: A Toolkit for Merging Large Language Models},
    author={Goddard, Charles and Siriwardhana, Shamane and Ehghaghi, Malikeh and Meyers, Luke and Karpukhin, Vlad and Benedict, Brian and McQuade, Mark and Solawetz, Jacob},
    journal={arXiv preprint arXiv:2403.13257},
    year={2024}
}

% DELLA-Merging
@article{deep2024dellamerging,
    title={DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling},
    author={Deep, Pala Tej and Bhardwaj, Rishabh and Poria, Soujanya},
    journal={arXiv preprint arXiv:2406.11617},
    year={2024}
}

% Ancestor Seed Models
@article{qian2025fino1,
    title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance},
    author={Qian, Lingfei and Zhou, Weipeng and Wang, Yan and Peng, Xueqing and Huang, Jimin and Xie, Qianqian},
    journal={arXiv preprint arXiv:2502.08127},
    year={2025}
}

@article{lin2025goedelproverv2,
    title={Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction},
    author={Lin, Yong and Tang, Shange and Lyu, Bohan and Yang, Ziran and Chung, Jui-Hui and Zhao, Haoyu and Jiang, Lai and Geng, Yihan and Ge, Jiawei and Sun, Jingruo and others},
    journal={arXiv preprint arXiv:2508.03613},
    year={2025}
}

@misc{xu2026logicsstem,
    title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
    author={Xu, Mingyu and Fang, Cheng and Jiang, Keyue and Zheng, Yuqian and Xiao, Yanghua and Zhou, Baojian and Zhao, Qifang and Zheng, Suhang and Zhu, Xiuwen and Tang, Jiyang and others},
    year={2026},
    eprint={2601.01562},
    archivePrefix={arXiv}
}

@article{chandak2025openforecaster,
    title={Scaling Open-Ended Reasoning to Predict the Future},
    author={Chandak, Nikhil and Goel, Shashwat and Prabhu, Ameya and Hardt, Moritz and Geiping, Jonas},
    journal={arXiv preprint arXiv:2512.25070},
    year={2025}
}

@misc{yang2026abcbench,
    title={ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development},
    author={Yang, Jie and Guo, Honglin and Ji, Li and Zhou, Jiazheng and Zheng, Rui and Lei, Zhikai and Zhang, Shuo and Xi, Zhiheng and Liu, Shichun and Wang, Yuxin and others},
    year={2026},
    eprint={2601.11077},
    archivePrefix={arXiv}
}

% Other Seed Pool Models
@misc{deepseekai2025deepseekr1,
    title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
    author={DeepSeek-AI},
    year={2025},
    eprint={2501.12948},
    archivePrefix={arXiv}
}

@article{fan2025megascience,
    title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
    author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
    journal={arXiv preprint arXiv:2507.16812},
    year={2025}
}

@misc{su2025toolorchestra,
    title={ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
    author={Su, Hongjin and Diao, Shizhe and Lu, Ximing and Liu, Mingjie and Xu, Jiacheng and Dong, Xin and Fu, Yonggan and others},
    year={2025},
    eprint={2511.21689},
    archivePrefix={arXiv}
}

@misc{wang2025lightif,
    title={Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following},
    author={Wang, Chenyang and Wen, Liang and Jia, Shousheng and Zhang, Xiangzheng and Xu, Liang},
    year={2025},
    eprint={2508.03178},
    archivePrefix={arXiv}
}

@misc{shen2026sera,
    title={SERA: Soft-Verified Efficient Repository Agents},
    author={Shen, Ethan and Tormoen, Danny and Shah, Saurabh and Farhadi, Ali and Dettmers, Tim},
    year={2026},
    eprint={2601.20789},
    archivePrefix={arXiv}
}

@misc{yang2025tablegptr1,
    title={TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning},
    author={Yang, Saisai and Huang, Qingyi and Yuan, Jing and Zha, Liangyu and Tang, Kai and others},
    year={2025},
    eprint={2512.20312},
    archivePrefix={arXiv}
}

@article{ruan2025critiquecoder,
    title={Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning},
    author={Ruan, Chi and Jiang, Dongfu and Wang, Yubo and Chen, Wenhu},
    journal={arXiv preprint arXiv:2509.22824},
    year={2025}
}

@article{cai2025smartsnap,
    title={SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents},
    author={Cai, Shaofei and Qin, Yulei and Lin, Haojia and Xu, Zihan and Li, Gang and Shi, Yuchen and others},
    journal={arXiv preprint arXiv:2512.22322},
    year={2025}
}

License

This model inherits the license terms of its constituent models. All 11 ancestor models use permissive licenses:

  • Apache 2.0: Qwen3-8B (base), Fin-o1-8B, Goedel-Prover-V2-8B, Drama-Thinking, Finance-R, Logics-STEM, EZO-8B-beta, medmcqa, Qwen3-8B-ABC
  • MIT: Piaget-8B, Qwen-Marketing, OpenForecaster-8B

Both Apache 2.0 and MIT are permissive licenses that allow commercial use, modification, and redistribution. This model is released under Apache 2.0 following the base Qwen3-8B license.

