@mrs83 on Hugging Face: "In 2017, my RNNs were babbling. Today, they are hallucinating beautifully. 10…"

mrs83

posted an update 3 days ago

In 2017, my RNNs were babbling. Today, they are hallucinating beautifully.

10 years ago, getting an LSTM to output coherent English was a struggle.
10 years later, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating.

We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge.

The ambitious goal is to build a small instruct model with RAG and tool usage capabilities (ethicalabs/Kurtis-EON1).

📊 The Benchmarks (Size: 400M)

For a model this size (trained on <10B tokens), the specialized performance is surprising:

*SciQ*: 73.8% 🦄 (This rivals billion-parameter models in pure fact retrieval).
*PIQA*: 62.3% (Solid physical intuition for a sub-1B model).

The Reality Check:

HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and 10B training tokens.

We are hitting the "Reasoning Wall", which confirms that we need to scale up to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong. The model is convinced it is in a classroom ("In this course, we explore...").

The Instruct Model is not ready yet and we are currently using curriculum learning to test model plasticity.

Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries.

🤝 Call for Collaboration: I am looking for Peer Reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let’s connect!

Training diary: ethicalabs/Kurtis-EON1

maxxafits00

3 days ago

•

edited 3 days ago

Subject: Similar journey - training on extreme budget

Hi Mrs83,

Saw your post on HF about Echo-DSRN. Impressive results for 400M
parameters on <10B tokens. I'm on a similar path with SQDE Q1
(my own AI model project).

My situation:

Fine-tuning Qwen2.5-Coder-7B for mobile development
1B tokens trained on Colab free (3.5h/day T4)
Hitting the same "Reasoning Wall" you mentioned

Question: Would you like to join me in training a base model of about 7-8 billion parameters, aiming to make it smarter and more competitive on coding and programming benchmarks?

I'm looking to collaborate on architecture and seeking advice
from someone who has succeeded on student-scale hardware.

Thanks for any pointers,
Maxxa

mrs83

2 days ago

interesting. Yes, as you noticed as well, a few billion tokens aren't enough. SmolLM2 360M was trained on 4 trillion tokens.
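
To make the data gap concrete, here is a small back-of-the-envelope sketch using only the figures quoted in this thread (~400M parameters and ~10B tokens for Echo-DSRN, 4 trillion tokens for SmolLM2 360M). The helper name `tokens_per_param` is made up for illustration, and the token counts are rounded, so treat the ratios as rough.

```python
# Tokens-per-parameter comparison using only figures quoted in this thread.
def tokens_per_param(tokens: float, params: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params

# Echo-DSRN: ~400M parameters, ~10B tokens (rounded figures from the post).
echo_ratio = tokens_per_param(10e9, 400e6)
# SmolLM2 360M: 4 trillion tokens (figure from the comment above).
smol_ratio = tokens_per_param(4e12, 360e6)

print(f"Echo-DSRN:    {echo_ratio:,.0f} tokens/param")
print(f"SmolLM2 360M: {smol_ratio:,.0f} tokens/param")
print(f"Gap: about {smol_ratio / echo_ratio:,.0f}x more tokens per parameter")
```

With these rounded inputs, SmolLM2 360M saw several hundred times more tokens per parameter than Echo-DSRN.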

but I am not sure how to explain those results on piqa and sciq:

uv run lm_eval --model hf --model_args pretrained=models/Echo-DSRN-Small-Instruct-Kurtis,trust_remote_code=True,device_map="auto" --tasks hellaswag,winogrande,piqa,sciq --output_path ./results_final

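| Tasks | Version | Filter | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | acc | ↑ | 0.2927 | ± | 0.0045 |
| | | none | acc_norm | ↑ | 0.3199 | ± | 0.0047 |
| piqa | 1 | none | acc | ↑ | 0.6230 | ± | 0.0113 |
| | | none | acc_norm | ↑ | 0.6202 | ± | 0.0113 |
| sciq | 1 | none | acc | ↑ | 0.7380 | ± | 0.0139 |
| | | none | acc_norm | ↑ | 0.6480 | ± | 0.0151 |
| winogrande | 1 | none | acc | ↑ | 0.5020 | ± | 0.0141 |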

I can share more details in this convo, but this is probably uncharted territory for a hybrid RNN with 4 attention heads.
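
One way to read the table above is to compare each accuracy with random guessing, measured in units of the reported standard error. The sketch below does that with the `acc` values copied from the table. The number of answer options per task (4 for HellaSwag and SciQ, 2 for PIQA and Winogrande) is an assumption based on the standard multiple-choice format of these benchmarks, and `z_above_chance` is a hypothetical helper, not part of lm_eval.

```python
# (accuracy, standard error) copied from the lm_eval table above.
results = {
    "hellaswag": (0.2927, 0.0045),
    "piqa": (0.6230, 0.0113),
    "sciq": (0.7380, 0.0139),
    "winogrande": (0.5020, 0.0141),
}
# Answer options per question: an assumption based on the usual task formats.
num_choices = {"hellaswag": 4, "piqa": 2, "sciq": 4, "winogrande": 2}

def z_above_chance(acc: float, stderr: float, k: int) -> float:
    """Standard errors separating the accuracy from random guessing (1/k)."""
    return (acc - 1.0 / k) / stderr

summary = {}
for task, (acc, stderr) in results.items():
    k = num_choices[task]
    chance = 1.0 / k
    z = z_above_chance(acc, stderr, k)
    summary[task] = (chance, z)
    print(f"{task:<11} acc={acc:.4f} chance={chance:.2f} z={z:5.1f}")
```

Under these assumptions, Winogrande sits within one standard error of chance, HellaSwag is only a few points above it, and SciQ and PIQA are clearly above their baselines.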

mrs83

2 days ago

•

edited 2 days ago

Now available at https://huggingface.co/spaces/ethicalabs/Echo-DSRN-Small-Next-Word-Prediction ... On the shared HF CPU resources it runs slowly, but on my MacBook M4 and AMD Strix Halo it is blazing fast. Memory footprint is low. I am now expanding to 1B using Net2Net, and today I tested an SFT run (QLoRA, 4-bit, bf16) on consumer hardware with trl, with apparently no catastrophic forgetting.
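
For readers unfamiliar with Net2Net, here is a minimal sketch of its widening operation (Net2WiderNet) on a toy two-layer ReLU MLP, in pure Python. It is not the Echo-DSRN expansion code, which is not public: all names, sizes, and the MLP itself are illustrative assumptions, and recurrent or attention layers need extra care. The idea is to copy hidden units and split their outgoing weights so the widened network computes the same function. The small symmetry-breaking noise suggested in the original method is omitted here.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, x):
    """Compute W @ x + b for W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(W1, b1, W2, b2, x):
    return linear(W2, b2, relu(linear(W1, b1, x)))

def net2wider(W1, b1, W2, new_width, rng):
    """Widen the hidden layer to new_width while preserving the function."""
    old_width = len(W1)
    assert new_width >= old_width
    # Keep the original units, then fill the extra slots with random copies.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    counts = [mapping.count(j) for j in range(old_width)]
    new_W1 = [list(W1[j]) for j in mapping]
    new_b1 = [b1[j] for j in mapping]
    # Split each outgoing weight evenly across the copies of its source unit.
    new_W2 = [[row[j] / counts[j] for j in mapping] for row in W2]
    return new_W1, new_b1, new_W2

rng = random.Random(0)
d_in, d_hidden, d_out = 3, 4, 2  # toy sizes, chosen only for illustration
W1 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(d_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.uniform(-1, 1) for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

wide_W1, wide_b1, wide_W2 = net2wider(W1, b1, W2, new_width=7, rng=rng)
before = mlp(W1, b1, W2, b2, x)
after = mlp(wide_W1, wide_b1, wide_W2, b2, x)
max_diff = max(abs(a - b) for a, b in zip(before, after))
print(f"max output difference after widening: {max_diff:.2e}")
```

The outputs before and after widening should agree up to floating-point rounding, which is what lets training continue from the smaller model's behavior instead of starting from scratch.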