https://huggingface.co/nightmedia/Qwen3-4B-Element8-Eva-Xiaolong-Heretic

#1723
by nightmedia - opened

Dear Team Radermacher,
I have two creative models; if you could quant them, that would be awesome :)

https://huggingface.co/nightmedia/Qwen3-4B-Element8-Eva-Xiaolong-Heretic

https://huggingface.co/nightmedia/Qwen3-4B-Element8-Eva-Hermes-Heretic

Thank you,
-G

It's queued!

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary pages at https://hf.tst.eu/model#Qwen3-4B-Element8-Eva-Xiaolong-Heretic-GGUF
and https://hf.tst.eu/model#Qwen3-4B-Element8-Eva-Hermes-Heretic-GGUF
for quants to appear.

still waiting for MoEMoEMoE 1T model =)

Working on something :)

just moe all your models lol

There is a movement in that direction, but they need to like each other long enough :)

Meanwhile this one is very good, the "hottest" I have gotten the 30B on its own, without weird models, so if you have the bandwidth to quant it, it would do well. In the Element series I saw at least a 10-point ARC increase with every merge, so it is possible to hit 0.6, it just needs the right combination of smarts :)

https://huggingface.co/nightmedia/Qwen3-30B-A3B-Element7-1M

I have enough bandwidth to download all of your models and MoE them if you tell me how =)

Ah yeah, the how :)

Here is the formula for Element7. I use a nuslerp at 1.6/0.4 with every merge, the proportion set according to how strong the embed is. DASD is a "beginner" while Element6 comes from SOTA level, so Element6 needs to dominate. 1.5/0.5 works fine when the models are fairly equal in "brainwave". It doesn't really matter what a model knows; if it aligns cognitively, it will merge.

With every merge, the original Qwen goes away, loses its Qwen-ness so to speak, and becomes, well, an Element :)

None of this can be done without numbers. Everybody claims their model is the best; I have the numbers to back that up :)

qwen_moe42e7_Qwen3-30B-A3B-Element7.yaml
models:
  - model: Qwen3-30B-A3B-Element6
    parameters:
      weight: 1.6
  - model: Alibaba-Apsara/DASD-30B-A3B-Thinking-Preview
    parameters:
      weight: 0.4
merge_method: nuslerp
tokenizer_source: base
dtype: bfloat16
name: Qwen3-30B-A3B-Element7
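
For reference, a config like this runs with mergekit's CLI, something like:

mergekit-yaml qwen_moe42e7_Qwen3-30B-A3B-Element7.yaml ./Qwen3-30B-A3B-Element7

(the output directory is up to you; this assumes mergekit is installed.)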

hm, what if I merge all your models with weight 1 lol?
also, what are you using to merge? I completely forgot everything since FATLLAMA-1.7T ...

Believe me, I tried. It doesn't work that way.

Every merge is a fusion of sorts. The emerging model (sic.) needs to establish its own brainstem, so to speak. That's where nuslerp helps.

The first merge is always a multislerp of a few models. Look at that as the "basement". That will be used for scaffolding.
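
A basement config looks roughly like this (assuming mergekit's multislerp method; the model names and equal weights are placeholders, not the actual Element recipe):

models:
  - model: placeholder/model-a
    parameters:
      weight: 1.0
  - model: placeholder/model-b
    parameters:
      weight: 1.0
  - model: placeholder/model-c
    parameters:
      weight: 1.0
merge_method: multislerp
dtype: bfloat16
name: Element-Basement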

From there on I only do nuslerp of 2 models each. I tried more; it doesn't work, the metrics go down because of friction.

Every step of the way I look at the ARC numbers. All the others will follow, or not; it doesn't really matter. Some models bring the other numbers down a bit, or up a bit, depending on what they contribute. For example, no funny models past stage 3, so no MiroMind, no QwenLong; those were already in the basement, and introducing them late will destabilize the way the model learned about itself. The closer you get to the top, the more you should add models that bring simple, structured information built on stable bases.
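
(For illustration only, and not necessarily the exact tooling used here: the per-step ARC numbers can be produced with an eval harness such as EleutherAI's lm-evaluation-harness, e.g. lm_eval --model hf --model_args pretrained=path/to/merged-model --tasks arc_challenge,arc_easy)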

If I want something special but it's not up to snuff, I merge it first with an Engineer model that is simple enough to want to learn new things (versus fusing in what it wants to learn directly), and with that combination leveled up, I merge it into another Element.
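
As a rough sketch of that two-step pattern (model names are placeholders, and 1.5/0.5 is just the "fairly equal" ratio from above), the intermediate level-up merge would look like:

models:
  - model: placeholder/Engineer-Model
    parameters:
      weight: 1.5
  - model: placeholder/Special-But-Rough-Model
    parameters:
      weight: 0.5
merge_method: nuslerp
tokenizer_source: base
dtype: bfloat16
name: Special-LeveledUp

The result then goes into the next Element with the same 2-model nuslerp recipe as Element7 above.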

I only use the MLX tools and mergekit.

ah, sad...
well, good luck with everything, let me know if you need more quants =)

I noticed the "AttributeError 'list'" on the 30B and removed the "extra_special_tokens" entry from tokenizer_config.json that might be causing the issue.

ok, submitted to redownload =)
