NN Arch Components
STEM: Scaling Transformers with Embedding Modules
arXiv:2601.10639
arXiv:2601.00417
mHC: Manifold-Constrained Hyper-Connections
arXiv:2512.24880
VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
arXiv:2512.14531
Stronger Normalization-Free Transformers
arXiv:2512.10938
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
arXiv:2505.06708
Transformers without Normalization
arXiv:2503.10622
Forgetting Transformer: Softmax Attention with a Forget Gate
arXiv:2503.02130
arXiv:2409.19606
arXiv:2511.11238