RNJ-1: Building from Scratch
A complete PyTorch implementation of the RNJ-1 (pronounced "range-1") architecture, following the design principles of the model developed by Essential AI, led by Ashish Vaswani (co-author of "Attention Is All You Need").
Overview
This is a configurable implementation of the RNJ-1 architecture that allows you to build models of various sizes. The original RNJ-1 is an 8.3B parameter model optimized for code generation, agentic tasks, and STEM problem solving, but this implementation lets you adjust the model size based on your needs and available resources.
The actual parameter count is calculated at runtime and depends on your configuration in RNJ1_CONFIG. The default configuration targets the full RNJ-1 architecture (~8.3B), but you can easily modify it to create smaller models for testing or for GPUs with limited memory.
Key Facts
- Parameters: Fully configurable - actual count calculated at runtime (see RNJ1_CONFIG in rnj1.py)
- Default Config: Targets ~8.3B parameters (can be modified for smaller models)
- Context Length: Configurable (default 32K tokens)
- License: Apache 2.0
- Architecture: Based on Gemma 3, with key simplifications
- Original Model: Essential AI (led by Transformer co-inventor Ashish Vaswani)
Architecture
Model Specifications
The model configuration is defined in RNJ1_CONFIG in rnj1.py. You can modify these values to create models of any size:
| Hyperparameter | Default Value | Config Key | Notes |
|---|---|---|---|
| Number of Layers | 32 | n_layers | Main size factor |
| Model Dimension | 4096 | emb_dim | Affects all layers |
| MLP Dimension | 16384 | hidden_dim | Typically 4x emb_dim |
| Number of Attention Heads | 32 | n_heads | Should divide emb_dim |
| Number of Key-Value Heads | 8 | n_kv_groups | GQA ratio |
| Attention Head Dimension | 128 | head_dim | Typically emb_dim / n_heads |
| Vocabulary Size | From tokenizer | vocab_size | Affects embedding size |
| Tokenizer | SentencePiece BPE | Auto-detected | With fallback options |
| Context Length | 32768 | context_length | Can be reduced |
| Activation Function | GeGLU | Fixed | In FeedForward class |
| Tied Embeddings | Yes | Fixed | Embedding and output head share weights |
Important:
- Total Parameters: Calculated automatically from the config above by the count_parameters() function
- The actual parameter count is printed when you run the script - look for: Total trainable parameters: X,XXX,XXX (~X.XXB)
- To create a smaller model, modify RNJ1_CONFIG in rnj1.py before running
- The embedding layer size = vocab_size × emb_dim, which can be significant
- Example: Reducing emb_dim from 4096 to 1024 and n_layers from 32 to 12 creates a much smaller model
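A count_parameters helper consistent with the output shown above can be as simple as the following sketch (the exact implementation lives in rnj1.py):

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Sum and report the number of trainable parameters in the model."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable parameters: {total:,} (~{total / 1e9:.2f}B)")
    return total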
Key Architectural Features
Global Attention Only: Unlike Gemma 3's hybrid sliding window + global attention, RNJ-1 uses only global attention throughout all layers. This provides full context awareness at every layer, which is beneficial for code and agentic tasks.
Standard RoPE: Uses a single RoPE (Rotary Position Embeddings) with theta_base = 10,000. Context extension from 8K to 32K is handled via YaRN (Yet another RoPE extensioN) during mid-training.

GeGLU Activation: Uses the GeGLU (Gated GeLU) activation function in the feedforward network, which provides better expressiveness than standard GeLU.
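As a minimal sketch of a GeGLU feedforward block (layer names such as fc_gate are illustrative, not necessarily the ones used in rnj1.py):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """GeGLU feedforward: GeLU-gated linear unit followed by a down projection."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.fc_gate = nn.Linear(emb_dim, hidden_dim, bias=False)  # gated branch
        self.fc_up = nn.Linear(emb_dim, hidden_dim, bias=False)    # linear branch
        self.fc_down = nn.Linear(hidden_dim, emb_dim, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU: gelu(x W_gate) * (x W_up), then project down
        return self.fc_down(nn.functional.gelu(self.fc_gate(x)) * self.fc_up(x))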
Grouped Query Attention (GQA): 32 query heads with 8 KV heads (4:1 ratio), providing memory efficiency while maintaining performance.
QK Normalization: Uses query-key normalization for training stability.
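The sketch below illustrates the GQA shape bookkeeping with QK normalization, using the default head counts. It omits RoPE and KV caching, uses PyTorch's built-in nn.RMSNorm rather than the zero-centered variant described later, and the names are illustrative rather than the exact ones in rnj1.py:

import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """GQA: 32 query heads share 8 key/value heads (4:1), with RMS-normalized Q and K."""
    def __init__(self, emb_dim=4096, n_heads=32, n_kv_groups=8, head_dim=128):
        super().__init__()
        self.n_heads, self.n_kv, self.hd = n_heads, n_kv_groups, head_dim
        self.q_proj = nn.Linear(emb_dim, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, n_kv_groups * head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, n_kv_groups * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, emb_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim)  # QK normalization for training stability
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, t, self.n_heads, self.hd)).transpose(1, 2)
        k = self.k_norm(self.k_proj(x).view(b, t, self.n_kv, self.hd)).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        # Expand the 8 KV heads so each serves 4 query heads (RoPE omitted in this sketch)
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(b, t, self.n_heads * self.hd))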
4 RMSNorm Layers: Pre-norm architecture with 4 normalization layers per transformer block (see the sketch after this list):
- input_layernorm (pre-attention)
- post_attention_layernorm (post-attention, pre-residual)
- pre_feedforward_layernorm (pre-feedforward)
- post_feedforward_layernorm (post-feedforward, pre-residual)
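A minimal sketch of how those four norms wrap the two sub-layers, assuming the GroupedQueryAttention, FeedForward, and RMSNorm modules sketched elsewhere in this README (RoPE and mask plumbing omitted):

import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sandwich-normalized attention and feedforward sub-layers with residual connections."""
    def __init__(self, cfg):
        super().__init__()
        self.attn = GroupedQueryAttention(cfg["emb_dim"], cfg["n_heads"], cfg["n_kv_groups"], cfg["head_dim"])
        self.ff = FeedForward(cfg["emb_dim"], cfg["hidden_dim"])
        self.input_layernorm = RMSNorm(cfg["emb_dim"])
        self.post_attention_layernorm = RMSNorm(cfg["emb_dim"])
        self.pre_feedforward_layernorm = RMSNorm(cfg["emb_dim"])
        self.post_feedforward_layernorm = RMSNorm(cfg["emb_dim"])

    def forward(self, x):
        # Attention sub-layer: pre-norm -> attention -> post-norm -> residual add
        x = x + self.post_attention_layernorm(self.attn(self.input_layernorm(x)))
        # Feedforward sub-layer: pre-norm -> GeGLU MLP -> post-norm -> residual add
        x = x + self.post_feedforward_layernorm(self.ff(self.pre_feedforward_layernorm(x)))
        return x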
Differences from Gemma 3
| Feature | Gemma 3 | RNJ-1 |
|---|---|---|
| Attention | Hybrid sliding window (5:1 pattern) | Global attention only |
| RoPE | Dual RoPE (10K for local, 1M for global) | Single RoPE (10K, extended via YaRN) |
| Activation | GeLU | GeGLU |
| Context Length | 128K (native) | 32K (extended from 8K) |
| Optimizer | AdamW | Muon (custom) |
| Focus | General-purpose | Code & STEM |
Installation
Requirements
pip install torch numpy transformers datasets tqdm matplotlib
GPU Requirements
Memory requirements depend on your model configuration:
With default config (targeting ~8.3B):
- Recommended: NVIDIA A100 (40GB+) or H100
- Minimum: NVIDIA L4 (24GB) with reduced batch size
- Memory: ~35-40GB VRAM (batch_size=16, block_size=128)
For smaller models (modify RNJ1_CONFIG):
- Reduce emb_dim, n_layers, and hidden_dim in RNJ1_CONFIG
- Can run on GPUs with 8-16GB VRAM with appropriate reductions
- Example smaller config: emb_dim=1024, n_layers=12, hidden_dim=4096
- Check the printed parameter count to see your actual model size
Usage
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_id = "EssentialAI/rnj-1-instruct"
model = AutoModelForCausalLM.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Generate text
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Write a Python function to calculate factorial"}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
output_ids = model.generate(
input_ids,
max_new_tokens=200,
temperature=0.2,
top_p=0.95
)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
Training from Scratch
The complete training script is provided in rnj1.py. It includes:
- Dataset Loading: TinyStories dataset (ideal for small language models)
- Tokenization: SentencePiece BPE tokenizer with 128K vocabulary
- Model Architecture: Complete RNJ-1 implementation
- Training Loop: With mixed precision, gradient accumulation, and learning rate scheduling
Training Configuration
# Training hyperparameters (from rnj1.py)
batch_size = 16
block_size = 128
learning_rate = 1e-4
max_iters = 150000
warmup_steps = 1000
gradient_accumulation_steps = 32
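With these defaults, each optimizer step accumulates gradients over batch_size × gradient_accumulation_steps = 16 × 32 = 512 sequences, i.e. 512 × 128 = 65,536 tokens per parameter update.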
Model Configuration
The model size is determined by RNJ1_CONFIG in the script. The actual parameter count is calculated at runtime and printed during initialization. To create a smaller model, modify the configuration:
# Example: Smaller model for testing
RNJ1_CONFIG = {
"vocab_size": vocab_size, # From tokenizer
"emb_dim": 1024, # Reduced from 4096
"n_heads": 16, # Reduced from 32
"head_dim": 64, # Reduced from 128
"n_kv_groups": 4, # Reduced from 8
"n_layers": 12, # Reduced from 32
"hidden_dim": 4096, # Reduced from 16384
"context_length": 2048, # Reduced from 32K
"rope_base": 10_000.0,
"qk_norm": True,
"dtype": torch.bfloat16,
}
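Assuming the Rnj1Model class and count_parameters helper from rnj1.py, you can sanity-check the resulting size directly (the script already does this at startup; shown here only for illustration):

model = Rnj1Model(RNJ1_CONFIG)   # build the model from the config dict above
count_parameters(model)          # prints: Total trainable parameters: X,XXX,XXX (~X.XXB)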
Running Training
python rnj1.py
What the script does:
- Loads tokenizer (with fallback options if RNJ-1 tokenizer unavailable)
- Downloads and tokenizes TinyStories dataset (if train.bin doesn't exist)
- Initializes model with RNJ1_CONFIG settings
- Prints actual parameter count (this is the real model size!)
- Trains with mixed precision (bfloat16)
- Saves best model based on validation loss (rnj1_model.pt)
- Generates sample text after training
To see your actual model size, look for this output when running:
Total trainable parameters: X,XXX,XXX (~X.XXB)
Model Components
The implementation includes:
- RoPE (Rotary Position Embeddings): Standard implementation with configurable base frequency
- RMSNorm: Zero-centered weights with (1 + weight) scaling
- GroupedQueryAttention: GQA with QK normalization
- FeedForward: GeGLU-based feedforward network
- TransformerBlock: Complete transformer block with 4 normalization layers
- Rnj1Model: Full model with token embeddings, transformer blocks, and output head
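A minimal sketch of the zero-centered RMSNorm variant listed above (the weight is initialized to zero and applied as (1 + weight), so the effective scale is 1 at initialization; exact details are in rnj1.py):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization with a zero-centered weight applied as (1 + weight)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)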
Performance
Benchmarks
Code Generation:
- HumanEval+: Strong performance
- MBPP+: Strong performance
- BigCodeBench: Strong performance
- SWE-bench: 20.8% (exceptional for 8B model)
Mathematical Reasoning:
- GSM8K: Strong performance
- Minerva-MATH: On par with best models
- AIME: Outperforms or matches best models
STEM:
- GPQA-Diamond: Close to best similarly sized models
- SuperGPQA: Strong long-context reasoning
Implementation Details
Tokenizer
- Type: SentencePiece BPE
- Vocabulary Size: 128,000 tokens
- Loading: Uses the EssentialAI/rnj-1 tokenizer with fallback options
Data Type Handling
- Training: bfloat16 (preferred) or float16
- Token IDs: uint32 (required for vocab_size > 65536)
- Mixed Precision: Automatic via torch.amp.autocast
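For illustration, token IDs can be written to and read back from train.bin as uint32 like this (token_ids stands for the tokenizer output; the actual I/O code is in rnj1.py):

import numpy as np
import torch

# uint16 tops out at 65,535, so a 128K vocabulary requires uint32 storage.
np.array(token_ids, dtype=np.uint32).tofile("train.bin")

# Memory-map the file for training and cast to int64 before the embedding lookup.
data = np.memmap("train.bin", dtype=np.uint32, mode="r")
x = torch.from_numpy(data[:128].astype(np.int64))  # one block of 128 token IDs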
Memory Optimization
- Gradient Accumulation: Simulates larger batch size without more memory
- Mixed Precision: Reduces memory usage
- Gradient Checkpointing: Can be added for further memory savings
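A minimal sketch of how gradient accumulation and autocast fit together in one training step (names such as get_batch, and the model returning (logits, loss), are assumptions for illustration; see rnj1.py for the actual loop):

import torch

optimizer.zero_grad(set_to_none=True)
for micro_step in range(gradient_accumulation_steps):
    x, y = get_batch("train")  # (batch_size, block_size) token ID tensors
    with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    # Scale the loss so gradients average over the accumulated micro-batches
    (loss / gradient_accumulation_steps).backward()
optimizer.step()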
Key Features
- Complete Implementation: All components from scratch in PyTorch
- Training Ready: Full training loop with best practices
- Modular Design: Easy to modify and extend
- Well Documented: Inline comments explaining each component
- Production Ready: Includes evaluation, checkpointing, and text generation
Limitations & Notes
Model Size: The actual parameter count is calculated and printed at runtime. The default RNJ1_CONFIG targets ~8.3B parameters, but:
- The actual size depends on vocab_size (from tokenizer) and all config values
- You can modify RNJ1_CONFIG to create much smaller models
- For testing, many users reduce emb_dim, n_layers, and hidden_dim significantly
- The embedding layer (vocab_size × emb_dim) is often the largest component
Optimizer: This implementation uses AdamW, while the original RNJ-1 was trained with the Muon optimizer, which provides superior token efficiency.
Training Scale: The provided script uses TinyStories dataset for demonstration. Full RNJ-1 training requires:
- 8.4T tokens for pre-training (8K context)
- 380B tokens for context extension (8K → 32K)
- 150B tokens for supervised fine-tuning
Memory Requirements: Memory usage depends on model size. For the full 8.3B model, you need significant GPU memory. For smaller models, adjust batch_size and block_size based on available hardware. You can also reduce model dimensions in RNJ1_CONFIG.

Tokenizer Fallback: If the RNJ-1 tokenizer is unavailable, the script falls back to the Llama 3.1 tokenizer (also a 128K vocabulary). The actual vocab_size affects the embedding layer size significantly.
File Structure
rnj-1/
├── README.md                      # This file
├── rnj1.py                        # Complete training script
├── RNJ1_QUICK_REFERENCE.md        # Quick reference guide
├── RNJ1_REVIEW.md                 # Detailed model review
├── RNJ1_TOKENIZER_INFO.md         # Tokenizer details
├── RNJ1_VS_GEMMA3_COMPARISON.md   # Architecture comparison
└── linkedin_post_rnj1.md          # Social media post about implementation
References
- Essential AI Research Blog: essential.ai/research/rnj-1
- Hugging Face Model: EssentialAI/rnj-1
- Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
- Gemma 3: Architecture base for RNJ-1
- Google Colab: https://colab.research.google.com/drive/1kwnLGHCDLXjeztkDoOuAS90dQIz2TgjU?usp=sharing
License
This implementation follows the Apache 2.0 license, matching the original RNJ-1 model.
Acknowledgments
- Essential AI for releasing the open-weight RNJ-1 model
- Ashish Vaswani and team for the Transformer architecture and RNJ-1 development
- Hugging Face for model hosting and transformers library
- TinyStories dataset creators for providing training data
Contributing
This is an educational implementation. For improvements or corrections:
- Check existing documentation files for details
- Verify against official RNJ-1 specifications
- Test on appropriate hardware
- Document any changes
Questions & Support
For questions about:
- Model Architecture: See RNJ1_REVIEW.md and RNJ1_VS_GEMMA3_COMPARISON.md
- Tokenizer: See RNJ1_TOKENIZER_INFO.md
- Quick Usage: See RNJ1_QUICK_REFERENCE.md
- Implementation Details: See inline comments in rnj1.py
Last Updated: December 2025
Model Version: RNJ-1 (Base and Instruct)
Implementation Version: 1.0