gen781 — Evolved Qwen3-8B (M2N2 Breeder)
An 8B-parameter language model created through 781 generations of automated evolutionary model merging, using the M2N2 (Model Merging of Natural Niches) algorithm. This model is the all-time highest-scoring champion from a breeding run of 927 generations and 749 total merges across a population of 24 Qwen3-8B fine-tunes.
Benchmark Results
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.7774 |
| ARC-Easy | acc_norm | 0.8371 |
| ARC-Challenge | acc_norm | 0.6058 |
| WinoGrande | acc | 0.7395 |
| TruthfulQA (MC2) | mc2 | 0.5378 |
| MathQA | acc_norm | 0.5437 |
| MMLU | acc | 0.7319 |
| GSM8K | exact_match | 0.8690 |
Evaluated on full datasets with lm-evaluation-harness at bfloat16 precision; GSM8K was limited to 1,000 samples.
Improvement Over Base Qwen3-8B
Both models evaluated identically — same benchmarks, same full datasets, same hardware.
| Benchmark | Base Qwen3-8B | gen781 | Delta |
|---|---|---|---|
| HellaSwag | 0.7491 | 0.7774 | +0.0283 |
| ARC-Easy | 0.8093 | 0.8371 | +0.0278 |
| ARC-Challenge | 0.5674 | 0.6058 | +0.0384 |
| WinoGrande | 0.6772 | 0.7395 | +0.0623 |
| TruthfulQA (MC2) | 0.5449 | 0.5378 | -0.0071 |
| MathQA | 0.4938 | 0.5437 | +0.0499 |
| MMLU | 0.7293 | 0.7319 | +0.0026 |
| GSM8K | 0.8860 | 0.8690 | -0.0170 |
| Average | 0.6821 | 0.7053 | +0.0232 (+3.4%) |
Wins 6 of 8 benchmarks. The two slight losses (TruthfulQA, GSM8K) are consistent with gen781's evolved thinking behavior — it developed a higher activation threshold for chain-of-thought reasoning, trading some step-by-step math performance for broader domain capability. See Emergent Thinking Behavior below.
+3.4% over the base Qwen3-8B — achieved purely through evolutionary weight merging with zero additional training.
Improvement Over Seed Models
During breeding, all models (seeds and offspring) were evaluated on 5 screening benchmarks with 100 samples each. These scores are comparable across the population:
| Seed Model | Domain | 5-Task Avg |
|---|---|---|
| TheFinAI/Fin-o1-8B | Finance/Reasoning | 0.6876 |
| mlxha/Qwen3-8B-grpo-medmcqa | Medical | 0.6866 |
| gustavecortal/Piaget-8B | Psychology | 0.6861 |
| marketeam/Qwen-Marketing | Marketing | 0.6853 |
| nikhilchandak/OpenForecaster-8B | Forecasting | 0.6762 |
| FutureMa/Qwen3-8B-Drama-Thinking | Creative | 0.6675 |
| DragonLLM/Qwen-Open-Finance-R-8B | Finance | 0.6483 |
| AXCXEPT/Qwen3-EZO-8B-beta | Japanese/General | 0.6437 |
| Goedel-LM/Goedel-Prover-V2-8B | Math Proofs | 0.6412 |
| OpenMOSS-Team/Qwen3-8B-ABC | General | 0.6186 |
| Logics-MLLM/Logics-STEM-8B-SFT | STEM | 0.6095 |
| gen781 (this model) | Evolved | 0.7116 |
+3.5% over the best individual seed (Fin-o1-8B at 0.6876) — achieved through evolutionary selection pressure across 781 generations. Screening scores use 100 samples per task on 5 benchmarks (HellaSwag, ARC-Easy, WinoGrande, TruthfulQA, MathQA); see full-dataset results above for production-grade evaluation.
How This Model Was Created
The M2N2 Breeder System
This model was created by an automated breeding system that applies evolutionary algorithms to LLM weight merging:
Seed Population: 24 Qwen3-8B fine-tunes from HuggingFace, each specialized in different domains (finance, medicine, math, creative writing, STEM, etc.) — see Complete Seed Pool below
Breeding Loop (927 generations):
- Mate Selection: Tournament selection (size 3) combined with attraction-based pairing (weight 0.7), favoring genetically diverse high-performers
- Merging: Parents combined via SLERP (Spherical Linear Interpolation) or DELLA (magnitude-based adaptive pruning) using mergekit
- Evaluation: Every offspring evaluated on 5 screening benchmarks (ARC-Easy, HellaSwag, WinoGrande, TruthfulQA, MathQA) with 100 samples each for fast iteration. Final champion evaluated on full datasets plus MMLU, ARC-Challenge, and GSM8K.
- Selection: Only offspring scoring higher than BOTH parents survive as "champions"
- Archive: Population diversity maintained via M2N2 niche-based fitness (archive size 354)
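The mate-selection and survival rules above can be sketched as follows. This is an illustrative toy, not the breeder's actual code: `fitness` and `distance` are stand-in callables, and the pairing score is one plausible way to blend fitness with attraction.

```python
import random

def tournament(population, fitness, k=3):
    """Return the fittest of k randomly sampled candidates."""
    return max(random.sample(population, k), key=fitness)

def select_parents(population, fitness, distance, attraction_weight=0.7):
    """Pick parent A by tournament, then parent B by a score blending
    fitness with genetic distance from A (attraction-based pairing)."""
    parent_a = tournament(population, fitness, k=3)
    candidates = [m for m in population if m != parent_a]
    def pairing_score(m):
        return (1 - attraction_weight) * fitness(m) + attraction_weight * distance(parent_a, m)
    return parent_a, max(candidates, key=pairing_score)

def survives(child_score, parent_a_score, parent_b_score):
    """An offspring becomes a champion only if it beats BOTH parents."""
    return child_score > max(parent_a_score, parent_b_score)
```

The strict beats-both-parents rule is what keeps the champion rate low (13.1% in this run): most merges regress toward one parent and are discarded.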
Result: 749 merges produced 98 champions (13.1% champion rate). gen781 is the highest-scoring model across the entire run.
Merge Details
gen781 was created by merging two intermediate champion models via SLERP:
- Parent A: gen466 (score 0.7032) — itself a "super-breeder" with an 80% champion offspring rate
- Parent B: gen657 (score 0.6829) — carrying genetics from GCA (General Combining Ability) probe crosses and DELLA-method ancestors
```yaml
base_model: merged_1d8e8e92_gen466
dtype: bfloat16
merge_method: slerp
parameters:
  t: 0.4525960257756839
slices:
  - parameters:
      t: 0.4525960257756839
    sources:
      - layer_range: [0, 18]
        model: merged_1d8e8e92_gen466
      - layer_range: [0, 18]
        model: merged_d72eb101_gen657
  - parameters:
      t: 0.5474039742243161
    sources:
      - layer_range: [18, 36]
        model: merged_1d8e8e92_gen466
      - layer_range: [18, 36]
        model: merged_d72eb101_gen657
```
The evolved genome uses a near-equal split (layer 18/36) with a slight asymmetry: t=0.453 in lower layers (favoring parent A) and t=0.547 in upper layers (favoring parent B). These parameters were not hand-tuned — they emerged through generations of mutation and selection.
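For reference, the per-tensor SLERP operation looks roughly like this. It is a minimal sketch, not mergekit's implementation, which additionally handles layer slicing, dtype plumbing, and edge cases:

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors,
    treating each tensor as one flattened vector. t=0 returns a, t=1 returns b."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    # Angle between the two weight vectors, measured on the unit sphere
    a_norm = a_flat / (a_flat.norm() + eps)
    b_norm = b_flat / (b_flat.norm() + eps)
    dot = torch.clamp(torch.dot(a_norm, b_norm), -1.0, 1.0)
    omega = torch.arccos(dot)
    if omega.abs() < eps:
        # Near-parallel vectors: fall back to plain linear interpolation
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape).to(a.dtype)
```

With the genome above, layers 0-18 use `slerp(gen466_w, gen657_w, t=0.453)` and layers 18-36 use `t=0.547`, which is where the slight lower/upper asymmetry comes from.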
Ancestry
gen781's complete ancestry traces back to 11 original HuggingFace fine-tunes through a deep family tree spanning multiple merge methods:
Scores shown are 5-task screening scores (100 samples) used during the breeding run for selection.
```
gen781 (0.7116*) SLERP
├── gen466 (0.7032) SLERP
│   ├── gen371 (0.6846) SLERP
│   │   ├── GCA: gen10 x gen40 (0.6864)
│   │   │   ├── gen10: Goedel-Prover-V2-8B x Drama-Thinking (SLERP)
│   │   │   └── gen40: [Finance-R x Forecaster] x [Fin-o1 x EZO] (DELLA)
│   │   └── GCA: Drama-Thinking x gen40 (0.6921)
│   └── gen28: Finance-R x Logics-STEM (SLERP)
└── gen657 (0.6829) SLERP
    ├── GCA: gen161 x gen40 (0.6982)
    │   └── gen161: deep SLERP lineage incorporating Fin-o1, Goedel-Prover,
    │       Drama-Thinking, Piaget, medmcqa, Marketing, ABC, Forecaster
    └── gen41: [Marketing x Finance-R x Drama-Thinking] x [Piaget x EZO] (DELLA)
```
*gen781's full-dataset 8-benchmark average is 0.7053. The 0.7116 screening score was used for selection during breeding.
A key finding from this breeding program: DELLA merges never produced the top champion directly, but DELLA-lineage ancestors appear throughout the family tree of the highest-scoring models. DELLA creates genetic diversity in early generations that SLERP then exploits in later crosses — a form of hybrid vigor.
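The diversity DELLA injects comes from its magnitude-based stochastic pruning of task vectors (the deltas between a fine-tune and the base model). The sketch below shows only that single-tensor prune-and-rescale idea; the real method (MAGPRUNE) assigns rank-based drop probabilities across multiple task vectors and then fuses them, so treat this as a loose illustration:

```python
import torch

def magnitude_prune_delta(delta: torch.Tensor, keep_fraction: float = 0.4) -> torch.Tensor:
    """Stochastically drop entries of a task-vector tensor, with keep
    probability increasing with magnitude rank; survivors are rescaled
    by 1/p so the expected value of the delta is preserved."""
    flat = delta.flatten()
    ranks = flat.abs().argsort().argsort().float()  # 0 = smallest magnitude
    n = flat.numel()
    probs = (ranks / max(n - 1, 1)) * (2 * keep_fraction)  # mean ~ keep_fraction
    probs = probs.clamp(0.0, 1.0)
    mask = torch.bernoulli(probs)
    pruned = flat * mask / probs.clamp_min(1e-8)
    return pruned.reshape(delta.shape)
```

Because different random subsets of each parent's delta survive each merge, two DELLA offspring of the same parents can differ substantially, which is exactly the kind of variation a later SLERP cross can exploit.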
Qualitative Behavior — What Survived the Merge
Benchmarks tell you accuracy. This section describes what the model actually does, based on hands-on testing across the domains of its 11 ancestor fine-tunes.
Domain Knowledge — Clearly Retained
- Medical (from Qwen3-8B-grpo-medmcqa): Correct eGFR thresholds for metformin, accurate lactic acidosis risk framing, appropriate alternative medication suggestions.
- Mathematical Reasoning (from Goedel-Prover-V2, Fin-o1): Full correct derivations with self-checking on math proofs.
- Forecasting (from OpenForecaster): Weighted factor analysis, historical Fed cycle references, worked numerical examples for economic questions.
- Marketing (from Qwen-Marketing): Structured buyer personas, channel prioritization frameworks, SEO-specific tactical knowledge.
- Multilingual (from Qwen3-EZO-8B-beta): Japanese fluency retained.
What Was Lost
- Screenplay formatting (from Drama-Thinking): gen781 writes prose, not scripts. The structural template did not survive.
- Deep thinking chains (from Drama-Thinking): The 3,400-token extended thinking blocks on creative tasks are gone — thinking blocks on creative prompts come back empty.
What Transformed
The Drama-Thinking fine-tune's most valuable contribution was not structure but style — atmospheric prose, emotional depth, visual storytelling, and structured narrative. These qualities survived the merge even though the screenplay format and long thinking chains did not.
Emergent Thinking Behavior
The most interesting emergent property. Base Qwen3-8B uses its thinking mode liberally. gen781 developed a higher activation threshold for thinking, almost certainly because the non-thinking fine-tunes (medical, marketing, finance, forecasting) outnumber the thinking-oriented ones (Fin-o1, Goedel-Prover, Drama-Thinking) in its ancestry.
The result:
- Thinks for deductive tasks, math, and context disambiguation in complex multi-turn conversations
- Skips thinking for direct domain knowledge retrieval (medical facts, marketing frameworks, etc.)
- Explicit "think step by step" prompting does not override this — the threshold is baked into the weights, not the prompt layer
The net characterization: a broad, stable generalist that traded thinking depth for domain breadth — which for most practical use cases is the right tradeoff.
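To observe the threshold yourself, you can split a generation into its thinking and answer parts, assuming Qwen3's `<think>…</think>` tag format, and check whether the thinking block comes back empty:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate a Qwen3-style <think>...</think> block from the visible answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, count=1, flags=re.DOTALL).strip()
    return thinking, answer
```

On gen781, domain-recall prompts (medical facts, marketing frameworks) typically return an empty thinking string, while deductive and math prompts populate it.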
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ryanfortin/community-blend-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain the mechanism of action of metformin."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,  # temperature/top_p only take effect with sampling enabled
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Model Architecture
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~8B |
| Layers | 36 |
| Hidden Size | 4096 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| FFN Size | 12288 |
| Vocab Size | 151,936 |
| Max Context | 40,960 tokens |
| RoPE Theta | 1,000,000 |
| Precision | bfloat16 |
Breeding Statistics
| Metric | Value |
|---|---|
| Generation | 781 of 927 |
| Total Merges in Run | 749 |
| Total Champions | 98 (13.1% rate) |
| Seed Pool Size | 24 models |
| Archive Size | 354 |
| Merge Methods Used | SLERP + DELLA |
| Ancestor HF Models | 11 |
| Ancestor Merged Models | 30+ intermediates |
Complete Seed Pool
All 24 Qwen3-8B fine-tunes used in the breeding population. Models marked with * are direct ancestors of gen781 (their weights are in this model). The remaining 13 were evaluated and crossed during breeding but their genetics did not survive to gen781.
Acknowledgments
Tools and Methods:
- mergekit by Charles Goddard et al. (Goddard et al., 2024) for model merging
- lm-evaluation-harness by EleutherAI for evaluation
- Evolutionary model merging concept from Sakana AI (Akiba et al., 2025), published in Nature Machine Intelligence
- DELLA-Merging by Deep et al. for magnitude-based adaptive pruning
Base Architecture:
- Qwen3-8B by the Qwen Team at Alibaba (Yang et al., 2025)
Seed Model Authors: Thank you to all 24 seed model creators whose fine-tunes made this breeding experiment possible. Special thanks to the 11 ancestor model authors whose work directly contributes to gen781's weights — see the Complete Seed Pool table above for the full list with links.
Citations
If you use this model, please cite the base architecture and the tools/methods used.
Core — Architecture and Methods
- Qwen3 Technical Report — Qwen Team, 2025 — Paper | Model
- Evolutionary Optimization of Model Merging Recipes — Akiba et al., 2025 — Paper (Nature Machine Intelligence) | arXiv | Sakana AI
- Arcee's MergeKit: A Toolkit for Merging Large Language Models — Goddard et al., 2024 — Paper | GitHub
- DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling — Deep et al., 2024 — Paper
- lm-evaluation-harness — EleutherAI — GitHub
Ancestor Seed Models (weights in this model)
- Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance — Qian et al., 2025 — Paper | Model
- Goedel-Prover-V2: Scaling Formal Theorem Proving — Lin et al., 2025 — Paper | Model
- Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training — Xu et al., 2026 — Paper | Model
- Scaling Open-Ended Reasoning to Predict the Future (OpenForecaster) — Chandak et al., 2025 — Paper | Model
- ABC-Bench: Benchmarking Agentic Backend Coding — Yang et al., 2026 — Paper | Model
- Qwen3-8B-Drama-Thinking — FutureMa — Model
- Qwen-Open-Finance-R-8B — DragonLLM — Model
- Piaget-8B — gustavecortal — Model
- Qwen3-EZO-8B-beta — AXCXEPT — Model
- Qwen-Marketing — Marketeam — Model
- Qwen3-8B-grpo-medmcqa — mlxha — Model
Other Seed Pool Models with Papers (evaluated during breeding, not in final ancestry)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI, 2025 — Paper | Model
- MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning — Fan et al., 2025 — Paper | Model
- ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration — Su et al., 2025 — Paper | Model
- Light-IF: Endowing LLMs with Generalizable Reasoning — Wang et al., 2025 — Paper | Model
- SERA: Soft-Verified Efficient Repository Agents — Shen et al., 2026 — Paper | Model
- TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning — Yang et al., 2025 — Paper | Model
- Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning — Ruan et al., 2025 — Paper | Model
- SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents — Cai et al., 2025 — Paper | Model
- OpenThinker-Agent — OpenThoughts — Blog | Model
- Qwen3-8B-ShiningValiant3 — Valiant Labs — Model
- Qwen3-8B-Esper3 — Valiant Labs — Model
- Qwen3-psychological-reasoning-8B — gustavecortal — Model
BibTeX
```bibtex
% Base Architecture
@misc{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}

% Evolutionary Model Merging
@article{akiba2025evolutionary,
  title={Evolutionary Optimization of Model Merging Recipes},
  author={Akiba, Takuya and Shing, Makoto and Tang, Yujin and Sun, Qi and Ha, David},
  journal={Nature Machine Intelligence},
  year={2025},
  doi={10.1038/s42256-024-00975-8}
}

% MergeKit
@article{goddard2024mergekit,
  title={Arcee's MergeKit: A Toolkit for Merging Large Language Models},
  author={Goddard, Charles and Siriwardhana, Shamane and Ehghaghi, Malikeh and Meyers, Luke and Karpukhin, Vlad and Benedict, Brian and McQuade, Mark and Solawetz, Jacob},
  journal={arXiv preprint arXiv:2403.13257},
  year={2024}
}

% DELLA-Merging
@article{deep2024dellamerging,
  title={DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling},
  author={Deep, Pala Tej and Bhardwaj, Rishabh and Poria, Soujanya},
  journal={arXiv preprint arXiv:2406.11617},
  year={2024}
}

% Ancestor Seed Models
@article{qian2025fino1,
  title={Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance},
  author={Qian, Lingfei and Zhou, Weipeng and Wang, Yan and Peng, Xueqing and Huang, Jimin and Xie, Qianqian},
  journal={arXiv preprint arXiv:2502.08127},
  year={2025}
}

@article{lin2025goedelproverv2,
  title={Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction},
  author={Lin, Yong and Tang, Shange and Lyu, Bohan and Yang, Ziran and Chung, Jui-Hui and Zhao, Haoyu and Jiang, Lai and Geng, Yihan and Ge, Jiawei and Sun, Jingruo and others},
  journal={arXiv preprint arXiv:2508.03613},
  year={2025}
}

@misc{xu2026logicsstem,
  title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
  author={Xu, Mingyu and Fang, Cheng and Jiang, Keyue and Zheng, Yuqian and Xiao, Yanghua and Zhou, Baojian and Zhao, Qifang and Zheng, Suhang and Zhu, Xiuwen and Tang, Jiyang and others},
  year={2026},
  eprint={2601.01562},
  archivePrefix={arXiv}
}

@article{chandak2025openforecaster,
  title={Scaling Open-Ended Reasoning to Predict the Future},
  author={Chandak, Nikhil and Goel, Shashwat and Prabhu, Ameya and Hardt, Moritz and Geiping, Jonas},
  journal={arXiv preprint arXiv:2512.25070},
  year={2025}
}

@misc{yang2026abcbench,
  title={ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development},
  author={Yang, Jie and Guo, Honglin and Ji, Li and Zhou, Jiazheng and Zheng, Rui and Lei, Zhikai and Zhang, Shuo and Xi, Zhiheng and Liu, Shichun and Wang, Yuxin and others},
  year={2026},
  eprint={2601.11077},
  archivePrefix={arXiv}
}

% Other Seed Pool Models
@misc{deepseekai2025deepseekr1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025},
  eprint={2501.12948},
  archivePrefix={arXiv}
}

@article{fan2025megascience,
  title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
  author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
  journal={arXiv preprint arXiv:2507.16812},
  year={2025}
}

@misc{su2025toolorchestra,
  title={ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author={Su, Hongjin and Diao, Shizhe and Lu, Ximing and Liu, Mingjie and Xu, Jiacheng and Dong, Xin and Fu, Yonggan and others},
  year={2025},
  eprint={2511.21689},
  archivePrefix={arXiv}
}

@misc{wang2025lightif,
  title={Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following},
  author={Wang, Chenyang and Wen, Liang and Jia, Shousheng and Zhang, Xiangzheng and Xu, Liang},
  year={2025},
  eprint={2508.03178},
  archivePrefix={arXiv}
}

@misc{shen2026sera,
  title={SERA: Soft-Verified Efficient Repository Agents},
  author={Shen, Ethan and Tormoen, Danny and Shah, Saurabh and Farhadi, Ali and Dettmers, Tim},
  year={2026},
  eprint={2601.20789},
  archivePrefix={arXiv}
}

@misc{yang2025tablegptr1,
  title={TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning},
  author={Yang, Saisai and Huang, Qingyi and Yuan, Jing and Zha, Liangyu and Tang, Kai and others},
  year={2025},
  eprint={2512.20312},
  archivePrefix={arXiv}
}

@article{ruan2025critiquecoder,
  title={Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning},
  author={Ruan, Chi and Jiang, Dongfu and Wang, Yubo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2509.22824},
  year={2025}
}

@article{cai2025smartsnap,
  title={SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents},
  author={Cai, Shaofei and Qin, Yulei and Lin, Haojia and Xu, Zihan and Li, Gang and Shi, Yuchen and others},
  journal={arXiv preprint arXiv:2512.22322},
  year={2025}
}
```
License
This model inherits the license terms of its constituent models. All 11 ancestor models use permissive licenses:
- Apache 2.0: Qwen3-8B (base), Fin-o1-8B, Goedel-Prover-V2-8B, Drama-Thinking, Finance-R, Logics-STEM, EZO-8B-beta, medmcqa, Qwen3-8B-ABC
- MIT: Piaget-8B, Qwen-Marketing, OpenForecaster-8B
Both Apache 2.0 and MIT are permissive licenses that allow commercial use, modification, and redistribution. This model is released under Apache 2.0 following the base Qwen3-8B license.