Reportedly this 'Nanbeige/Nanbeige4.1-3B' model doesn't even know 'who' it is, or who makes it!?

#19
by dakerholdings - opened

When running an 8-bit quantized version, I first asked it which LLM & version it was (which it couldn't answer, even after waffling/thinking about it for my extended limit of 2048 tokens until it got cut off), and then I asked it who makes the 'Nanbeige' LLM, but it couldn't seem to come up with that either (though it finally speculated: 'Perhaps "Nanbeige" is a typo for "Nanjing" (南京) which is a city, but that doesn't make sense for an LLM.').

So? No model is capable of knowing what version it is.

"""
⚠️ Critical Note Upfront:
After thorough review of the provided website context and my training knowledge, "Nanbeige 4.1" does not appear to be a recognized AI model, company, project, or entity referenced in the given materials. There is no mention of it in articles, frameworks, companies, datasets, or technical discussions within the context.
"""

lol

Because it is a lobotomized, LoRA-trained version of Qwen3-4B-Thinking-2507. They erased its identity. Also, I am conducting independent benchmark tests, and the results are VERY far from what they claim... Not to mention they call it "3B" when the model has 4 billion parameters (lies even in the name...). I wouldn't even be surprised if all the "successful downloads" were artificial too...

Nanbeige LLM Lab org

@Nerdsking

We would like to clarify a few points:

a) Nanbeige4 is trained completely from scratch and has no relation to Qwen. The architecture, number of layers, mlp width, and tokenizer are all different.

b) Our “3B” refers to the non-embedding parameter count (~3.1B). For comparison, Qwen3-4B has ~3.6B non-embedding parameters.
Using non-embedding parameters for naming is not unusual. For example:
Qwen3-30B → ~31B total parameters
Qwen3-0.6B → ~0.8B total parameters
Qwen3-32B → ~33B total parameters
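
The naming convention above is easy to check yourself. Here is a rough, illustrative parameter-count sketch; the shapes below are hypothetical, not Nanbeige's or Qwen's actual configs, and real models also carry norm weights and biases that this formula ignores:

```python
def count_params(vocab_size, hidden, layers, mlp_width, tied_embeddings=True):
    """Rough transformer parameter count (illustrative only; ignores
    norms, biases, and per-head attention details)."""
    embed = vocab_size * hidden              # token embedding matrix
    lm_head = 0 if tied_embeddings else vocab_size * hidden
    attn = 4 * hidden * hidden               # Q, K, V, O projections
    mlp = 3 * hidden * mlp_width             # gate, up, down (SwiGLU-style)
    total = embed + lm_head + layers * (attn + mlp)
    non_embed = total - embed - lm_head      # what "non-embedding count" names
    return total, non_embed

# Hypothetical small-model shapes, NOT any real checkpoint's:
total, non_embed = count_params(vocab_size=150_000, hidden=2048,
                                layers=36, mlp_width=8192)
print(f"total={total/1e9:.2f}B  non-embedding={non_embed/1e9:.2f}B")
```

With a large vocabulary, the embedding matrix alone can account for several hundred million parameters, which is exactly why the two naming conventions diverge the most on small models.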

c) Regarding benchmark claims — please share your evaluation datasets, inference hyperparameters, and full results. We are not aware of any publicly reproducible evaluation supporting this claim.

d) As for the suggestion that our download numbers are artificial — this is simply false.
We have never engaged in any form of artificial download activity.
Please refrain from making defamatory accusations without evidence.

We welcome constructive discussion and transparent comparisons, and are happy to run side-by-side evaluations under shared settings.

I will share a complete report on your model very soon, once all my benchmark tests are done and double-checked, in 2 or 3 days. But as of now, the results are very far from what I would expect based on what's being marketed...

the results are very far from what I would expect based on what's being marketed...

Indeed, when I benchmarked the model on AIME 2026 II, the model's results were very far from what was marketed (87.4% on AIME I vs 100% on AIME II)! 😆😆

Some points for improvement:

  • Used ~1M tokens total to solve the 15 problems.
  • Failed a publicly unavailable set of "simple-for-humans" combinatorics problems from the HK IMO TST. To be fair, not even frontier models like GPT-5.2-Pro Extended Thinking can solve these yet!
  • Failed SimpleBench (which is known to correlate with general model capabilities)

Thank you to the 南北隔 team for such an impressive model!

@Nerdsking Can you benchmark my model when it's done? lol

@Nymbo
No, I've often had models report on 'who' they are, though sometimes they do claim to be the base model they've been ~derived from (multiple have claimed 'Llama'), or the code-assist environment they're running in (though I suspect that comes from the ~template they're provided: this one does appear to have some model-related information in Chinese in its template ~conditions!?), or sometimes they aren't aware that they've been upgraded when a new version comes out...

Just grabbing a front-row seat to watch the drama.

@Nerdsking Can you benchmark my model when it's done? lol

In fact, I created a tool; it is on GitHub, so you can run it yourself and not fall for marketing:
https://github.com/nerdskingcom/gguf-humaneval-benchmark

But you may want to wait another 3 or 4 days, because I am rebuilding the script to run many other benchmarks too. And you just need pip to install it. Easy as that.

Also, some geniuses are talking about ONE MILLION TOKENS to get a benchmark done...

Fact is, anyone can make a model "loop eternally" to get things done. That is not smart; it is "brute force". And impractical.

For comparison, my model (Nerdsking Python Coder 3B i, a REAL 3B-parameter model, not a 4B marketed as 3B...) achieves 88.41% on HumanEval in MINUTES, with NO THINKING mode. This model took MORE THAN 6 HOURS (using FP16) ON THE SAME HARDWARE to answer the 164 questions of the HumanEval bench. SIMPLE QUESTIONS! Imagine it running on a 50K code file? How long would it take? A week or a month?

AND HERE IS THE RESULT, CONGRATULATIONS!!!

============================================================
HUMANEVAL BENCHMARK COMPLETE

Model : Nanbeige4.1-3B.f16.gguf
Pass@1 : 92.68% (152/164)
Total time : 22263.80s --- MORE THAN SIX HOURS!!! (on an RTX 5060 Ti 16GB)
Avg/task : 135.75s
Results : /home/x4245/bench_gguf/results_Nanbeige4.1-3B.f16_20260219_121044.json
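
For anyone double-checking that summary, the pass rate, per-task average, and "more than six hours" figure are plain arithmetic over the printed numbers:

```python
# Figures copied from the benchmark summary above.
passed, total_tasks = 152, 164
total_time_s = 22263.80

pass_at_1 = 100 * passed / total_tasks      # greedy pass@1 as a percentage
avg_per_task = total_time_s / total_tasks   # mean wall-clock seconds per task

print(f"pass@1   = {pass_at_1:.2f}%")             # 92.68%
print(f"avg/task = {avg_per_task:.2f}s")          # 135.75s
print(f"hours    = {total_time_s / 3600:.2f}")    # 6.18
```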

This 4-billion-parameter model TOOK 6 HOURS to reach 92.68% on the HumanEval benchmark, spending ENORMOUS RESOURCES and TOKENS to answer 164 SIMPLE QUESTIONS! While my model, "Nerdsking Python Coder 3B i", scores 88.41% IN MINUTES (EXACTLY 210.51s, THREE AND A HALF MINUTES), WITH NO THINKING MODE!

Sincerely, I will not even waste my precious time running other coding benchmarks on this model. For SERIOUS CODING, it is simply unusable.

Finally, about the claim of having "trained from zero": that would have cost millions. And no, it is not the case. The model answers SIMILARLY to Qwen because IT IS a Qwen-based variant. Same context length, four layers stripped, and it even uses the same icons. Shame on you.

But yes, I am forced to confess that the marketing part is really good. A bunch of people downloading a model that would take a year to finish a real program of medium complexity. LOL

:)

Nanbeige LLM Lab org

@Nerdsking

Enough with the repeated “Qwen wrapper” accusation. 😂

Nanbeige4 is trained entirely from scratch. Your “remove four layers” statement demonstrates a misunderstanding of transformer architecture.

Let’s be explicit:

• Nanbeige does not use QK-Norm. Qwen does.
• The attention Q projection and output projection matrix shapes are different.
• The MLP shapes are different.
• The tokenizer and the embedding shape are different.
• The layer count differs.

So please stop repeating unfounded accusations.
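
Claims like these are checkable from the published checkpoints alone. A minimal sketch (the file paths are illustrative) that diffs the architectural fields of two HF-style config.json files:

```python
import json

# Typical architectural fields in a Hugging Face config.json.
ARCH_KEYS = ("num_hidden_layers", "hidden_size", "intermediate_size",
             "vocab_size", "num_attention_heads", "num_key_value_heads")

def config_diff(path_a, path_b, keys=ARCH_KEYS):
    """Return the fields on which two config.json files disagree,
    as {key: (value_a, value_b)}. Empty dict means identical shapes."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical usage (paths are illustrative, not real checkpoints):
# print(config_diff("Nanbeige4.1-3B/config.json", "Qwen3-4B/config.json"))
```

Note that QK-Norm typically does not appear as a config field; in Qwen3-style checkpoints it shows up as `q_norm`/`k_norm` tensors, so a complete comparison should also list the two checkpoints' tensor names and shapes.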

As for your self-trained Python-only model — since you seem very confident — how does it perform on difficult Python coding evaluations such as LiveCodeBench or LiveCodeBench Pro? How does it perform on recent LeetCode Weekly Contest problems?

And beyond narrow Python tasks — how does it perform on broader capabilities?

• Academic QA
• Mathematical reasoning
• Creative writing
• Human preference alignment
• Tool-use ability

Before publicly attacking another model, you should be prepared to compare comprehensively — not selectively.

@Nerdsking
Interesting results! A few questions:

What benchmark settings were used for each model? (temperature, max_tokens, thinking mode on/off, etc.)

Your model achieves 88.41% in 3.5 min - but it's specifically trained for Python coding only, while Nanbeige4.1-3B is a general-purpose model. Is this a fair comparison?

Can you share the full results JSON so others can verify?

Curious to see what happens when Nanbeige is benchmarked WITHOUT thinking mode.

Finally, as a man passing by, to be fair: the Nanbeige models are not fake, and they are excellent, but you subtly implied that they inflated download numbers and over-marketed them. I think you should apologize to them.

Nanbeige LLM Lab org

Regarding the “overthinking” issue:
In this version, we intentionally pushed performance to explore the capability limits of a small model. As a result, longer reasoning traces can occur in some scenarios.
Technically, this mainly stems from the absence of explicit length regularization during the General RL stage. The optimization objective emphasized final answer quality, without sufficiently constraining reasoning verbosity. This is a known trade-off rather than an accident.
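
For readers wondering what "length-aware optimization" can look like concretely, one common recipe is to fold a token-budget penalty into the RL reward. A minimal illustrative sketch; the budget, penalty rate, and reward shape below are hypothetical, not Nanbeige's actual objective:

```python
def length_aware_reward(answer_correct: bool, n_reasoning_tokens: int,
                        budget: int = 4096, penalty_per_token: float = 1e-4):
    """Reward = answer quality minus a linear penalty on reasoning tokens
    beyond a budget. Purely illustrative; real systems tune this heavily
    and often use smoother schedules than a hard budget."""
    quality = 1.0 if answer_correct else 0.0
    overrun = max(0, n_reasoning_tokens - budget)
    return quality - penalty_per_token * overrun

print(length_aware_reward(True, 2_000))          # 1.0 (under budget, no penalty)
print(length_aware_reward(True, 14_096) < 0.01)  # True: 10k tokens over budget, heavily penalized
```

The effect is that two correct answers no longer score equally: the one that burns fewer reasoning tokens earns a strictly higher reward, which is exactly the pressure missing from a quality-only objective.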

As discussed previously in the following threads:
https://huggingface.co/Nanbeige/Nanbeige4.1-3B/discussions/15
https://huggingface.co/Nanbeige/Nanbeige4.1-3B/discussions/17
We will introduce length-aware optimization in the coming Nanbeige4.2 to address this.

Efficiency and capability should not be mutually exclusive, and improving both is an active focus for the next release.
🙏


Can we all agree that you should all follow me? Is that something we can agree on?

I used the parameters indicated on the Nanbeige page (0.6 temperature), and, for the sake of science, exactly the same for mine. But if you are curious, I developed a HumanEval benchmark that you can run yourself. Do not take my word for it.

As for it being a "fair comparison": I cannot compare (for now) on other benchmarks, but in about 3 or 4 days I will update my benchmark script. It will stop being HumanEval-specific and become multi-benchmark, so "YOU" could run ALL those benchmarks yourself. (I am not doing that because Nanbeige proved too slow and computationally inefficient. I will not delve into "why", because it is not my problem.) So I compared with the tool I had at hand.

I tried to share the JSON, but this platform does not allow it. I ask you to wait just 3 or 4 more days, until my new benchmark tool is ready, so you can run the benchmarks yourself; the script saves the model's FULL thinking when the model misses a question (so you can check "why"), or just the correct answer when it passes.

Nanbeige generated an 824KB JSON file. It wasted 16K of context, hallucinated on the second question, and missed another 11.

Here is the beginning of the JSON file:

{
  "status": "complete",
  "model": "Nanbeige4.1-3B.f16.gguf",
  "model_path": "/home/x4245/hf_models/Nanbeige4.1-3B.f16.gguf",
  "config_used": "/home/x4245/bench_gguf/bench_config.json",
  "parameters": {
    "ctx_size": 16384,
    "n_gpu_layers": 99,
    "temp": 0.6,
    "mirostat": 0,
    "cache_type_k": "f16",
    "cache_type_v": "f16",
    "server_path": "/home/x4245/llama.cpp/build/bin",
    "port": 8080,
    "host": "127.0.0.1",
    "server_url": null,
    "no_server": false
  },
  "benchmark": {
    "samples": 164,
    "tasks_completed": 164,
    "thinking": false,
    "accuracy": 92.6829268292683,
    "passed": 152,
    "total": 164,
    "total_time": 22263.804171323776,
    "avg_time_per_task": 135.75490348368157
  },
  "results": [
    {
      "task_id": "HumanEval/0",
      "passed": true,
      "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n",
      "messages": "[{\"role\": \"system\", \"content\": \"You are a Python coding assistant. Your response MUST contain exactly one fenced Python code block with the complete implementation:\\npython\\n# your implementation here\\n\\nDo not include explanations or additional code blocks outside the fence.\"}, {\"role\": \"user\", \"content\": \"from typing import List\\n\\n\\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\\n \\\"\\\"\\\" Check if in given list of numbers, are any two numbers closer to each other than\\n given threshold.\\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\\n False\\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\\n True\\n \\\"\\\"\\\"\\n\"}]",
      "full_response": "python\nfrom typing import List\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n if len(numbers) < 2:\n return False\n \n sorted_nums = sorted(numbers)\n for i in range(len(sorted_nums) - 1):\n if sorted_nums[i + 1] - sorted_nums[i] < threshold:\n return True\n return False\n",
      "generated_code": "from typing import List\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n if len(numbers) < 2:\n return False\n \n sorted_nums = sorted(numbers)\n for i in range(len(sorted_nums) - 1):\n if sorted_nums[i + 1] - sorted_nums[i] < threshold:\n return True\n return False",
      "code_source": "content",
      "error_type": null,
      "error_detail": null
    },



You are free to say whatever pleases you. But the content is too "Qwenish" to be a mere coincidence. Even the context size and icons are the same. And there's nothing bad about that. I also used Qwen as a base. Why reinvent the wheel? I just think you have to give credit where credit is due...

But if you insist on saying that you created your model FROM ZERO... ok...

As for "not understanding transformer architecture", that is a "nice argument" for the outsiders reading this. But we both know what I was talking about (I hope...).

About other benchmarks: I was going to run them, but your model is so SLOW and computationally expensive that it killed any desire of mine to do so. But as my new benchmark tool will be ready in a few days, you're welcome to do the comparison (or with other models). I won't.

I am currently focused on something much deeper than transformer-based models: a completely new way of creating AI, without the stupid autoregressive system. That is not intelligence at all; it is mere prediction. I want an abstraction-first, human-like model.

Anyway, I should not have stopped to talk about this or your model. I do not even give a damn about my own model. It is old technology, like everything else.

But I wish you good luck in your endeavour. Just make it more efficient. For coding, it is not presently useful.

Hi @leran1995 , are you the pre-training lead at Nanbeige? I’m a huge fan of Nanbeige 4 3B and had two quick questions about your training pipeline:

  • Optimizer choice: Is there a particular reason the team didn't use Muon during pre-training?
  • Data composition: Will you consider open-sourcing the pre-training corpus?

Given the model's exceptional performance, I am very curious to study the data composition. If released, I genuinely believe it would be the best open-source dataset available right now. Amazing work!

Nanbeige LLM Lab org

@xTimeCrystal

Yes, thanks for the support and interest! And regarding your questions:

Optimizer – Nanbeige4-3B started quite early, around September last year, so we didn’t use Muon back then. For upcoming versions—likely Nanbeige4.5 or Nanbeige5—we’re planning to adopt Muon.

Data composition – Open-sourcing the pre-training corpus isn’t on the roadmap right now, but we’re committed to improving the model and keeping it open-source.

Hey @leran1995 , cool points on Nanbeige4.1‑3B.

That said, @Nerdsking ’s concerns totally make sense, especially given his LLM experience.

Nanbeige4.1‑3B is clearly different from Qwen: no QK-Norm, a different MLP & attention setup, a different tokenizer & embeddings, and a different number of layers.

That doesn’t automatically mean it was trained from scratch, though. Big changes like this still require solid retraining to work well.

From my experience running large LLMs on HGX B200 NVL8 (B300 on the way), it’s reasonable for the community to ask for training logs, weight details, or reproducible benchmarks.

Healthy skepticism keeps the open-source world strong 😎

Nanbeige LLM Lab org

@tanyiades
We find it really ridiculous that people engaging in "feeling-based speculation" are demanding that those who have already listed concrete architectural differences provide further evidence. 😂😂

Anyway, to stop this ridiculous guessing, we are releasing the full pre-training curves (loss vs. steps) across all five stages of Nanbeige4-3B-Base:

• Warm-up + diversity-enriched stable stage [loss curve: warm_up_diversity_enrich]

• Knowledge-enriched stable stage [loss curve: knowledge_enriched_stable]

• 4k context decay stage [loss curve: 4k_decay]

• 32k context decay stage [loss curve: 32k_decay]

• 64k context decay stage [loss curve: 64k_decay]

If anyone believes this model is derived from Qwen, you’re welcome to modify Qwen’s architecture to match ours (projection shapes, MLP widths, remove QK-Norm, change tokenizer, embeddings, layer count) and attempt to “just fine-tune” it into coherence. We have never performed any Qwen-based continued pretraining or LoRA adaptation. You are also free to use any other method to provide evidence that our model was incrementally trained from Qwen — but I can guarantee that the conclusion will be that our weights are entirely unrelated to Qwen. We welcome scrutiny. But accusations require evidence.🙏🙏
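
Weight-lineage claims like this are mechanically testable when shapes match: a fine-tune or LoRA merge keeps its weight matrices nearly parallel to the base checkpoint, while independently trained weights are near-orthogonal. A toy sketch of that check (the random matrices stand in for real checkpoint tensors; with mismatched shapes, as reported here, the comparison is undefined by construction):

```python
import math
import random

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two same-shape weight
    matrices (lists of rows). A fine-tune of a base checkpoint scores
    near 1; independently trained weights score near 0. Mismatched
    shapes make the comparison impossible, which is itself evidence
    against direct weight reuse."""
    if len(a) != len(b) or any(len(u) != len(v) for u, v in zip(a, b)):
        raise ValueError("incomparable shapes")
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return sum(cos(u, v) for u, v in zip(a, b)) / len(a)

# Toy stand-ins for real weight tensors:
random.seed(0)
w = [[random.gauss(0, 1) for _ in range(32)] for _ in range(64)]
other = [[random.gauss(0, 1) for _ in range(32)] for _ in range(64)]

print(round(mean_cosine(w, w), 6))       # 1.0: identical weights
print(abs(mean_cosine(w, other)) < 0.2)  # True: unrelated rows are near-orthogonal
```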

Let’s stop this ridiculous guessing, move forward, and focus on building stronger models.💪💪

leran1995 changed discussion status to closed

My own creation of, and postings in, this topic are statements of fact. I think that other postings might better be moved to a different topic.

I can say, however, that when I ask 'Nanbeige4.1-3B'/8-bit "Who is 'BOSS Zhipin'?", it gladly goes on in considerable detail (over my provided limit of 2048 tokens) about what it thinks it knows about the Chinese recruitment platform of that name (including some speculation) before being cut off. I also asked it "Does 'BOSS Zhipin' produce any LLMs?", and it thrashes about considerably (though including "...Does BOSS Zhipin develop their own LLMs? Unlikely to be a primary focus..."). [ FWIW, in checking the 'BOSS Zhipin' web site and iPhone app, there doesn't appear to be an English interface available yet ]

In both answers it starts out with "Weimplify is asked:...", but when queried about it, it doesn't seem to know who or what "Weimplify" might stand for.

@Nerdsking

My model is out if you want to check it out! :D
It scores really high on eqbench

@leran1995 I think you should be offering to run 'side-by-side' results in contrast to mine, when in/on my topic of origination, not attempting to ~censor the topic itself by claiming third party mis-speculation.


My new benchmark tool will be ready in 2 days. You can follow my profile, download it from GitHub, or simply install it using pip. The current benchmark tool can be downloaded from https://github.com/nerdskingcom/gguf-humaneval-benchmark (or installed with "pip install gguf-humaneval-benchmark"), but I recommend waiting for the new tool.

It will have AIME, GPQA, HumanEval, LiveCodeBench Pro Easy and LiveCodeBench v6. As you may have noticed, my focus is coding and AI engineering. Despite this, all those tools can be excellent for analyzing a model's reasoning capabilities, whether the model is a coder or not.

About your model: the story it produced, "The Echoes of Joy and Sorrow", shows nice creativity and is fluid. But as I found in many other models, it lacks a "signature", a style. This possibly happens when many authors are used to train a model. It can be somewhat circumvented by feeding the model a long text from a specific author and asking it to write in the same style, but that carries limitations, as it is 'imitation', not a true skill. I would love to see "Shakespearean" writers and other specialized models. If I am not mistaken, someone on Hugging Face created a model based on a specific Russian writer; I guess it's worth testing.

But aside from this, congratulations on your work.


Sounds awesome!

@Nerdsking ,
@crownelius
FWIW, I don't appreciate how your off-topic, unfounded speculation appears to have caused the 'Nanbeige LLM Lab' moderator to close / censor this topic.


Don't include me in this. I started this topic.

@crownelius No, the only topic I see you having started is #24; HF doesn't support ~'subtopics' or whatever your feverish imagination may have ~hallucinated to justify your ~spamming in/of my topic.

Hey! I really like Nanbeige. Do you guys have something like an API? My PC is not strong enough, so I was wondering if I could pay to use it.


Honestly, if you asked around on Discord, somebody might message you. Lots of people already have that model hosted somewhere.
