RNJ-1: Building from Scratch
A complete PyTorch implementation of the RNJ-1 (pronounced "range-1") architecture, following the design principles of the model developed by Essential AI, led by Ashish Vaswani (co-author of "Attention Is All You Need").
Overview
This is a configurable implementation of the RNJ-1 architecture that allows you to build models of various sizes. The original RNJ-1 is an 8.3B parameter model optimized for code generation, agentic tasks, and STEM problem solving, but this implementation lets you adjust the model size based on your needs and available resources.
The actual parameter count is calculated at runtime and depends on your configuration in RNJ1_CONFIG. The default configuration targets the full RNJ-1 architecture (~8.3B), but you can easily modify it to create smaller models for testing or for GPUs with limited memory.
Key Facts
- Parameters: Fully configurable - actual count calculated at runtime (see RNJ1_CONFIG in rnj1.py)
- Default Config: Targets ~8.3B parameters (can be modified for smaller models)
- Context Length: Configurable (default 32K tokens)
- License: Apache 2.0
- Architecture: Based on Gemma 3, with key simplifications
- Original Model: Essential AI (led by Transformer co-inventor Ashish Vaswani)
Architecture
Model Specifications
The model configuration is defined in RNJ1_CONFIG in rnj1.py. You can modify these values to create models of any size:
| Hyperparameter | Default Value | Config Key | Notes |
|---|---|---|---|
| Number of Layers | 32 | n_layers | Main size factor |
| Model Dimension | 4096 | emb_dim | Affects all layers |
| MLP Dimension | 16384 | hidden_dim | Typically 4x emb_dim |
| Number of Attention Heads | 32 | n_heads | Should divide emb_dim |
| Number of Key-Value Heads | 8 | n_kv_groups | GQA ratio |
| Attention Head Dimension | 128 | head_dim | Typically emb_dim / n_heads |
| Vocabulary Size | From tokenizer | vocab_size | Affects embedding size |
| Tokenizer | SentencePiece BPE | Auto-detected | With fallback options |
| Context Length | 32768 | context_length | Can be reduced |
| Activation Function | GeGLU | Fixed | In FeedForward class |
| Tied Embeddings | Yes | Fixed | Embedding and output head share weights |
Important:
- Total Parameters: Calculated automatically from the config above by the count_parameters() function
- The actual parameter count is printed when you run the script - look for: Total trainable parameters: X,XXX,XXX (~X.XXB)
- To create a smaller model, modify RNJ1_CONFIG in rnj1.py before running
- The embedding layer size = vocab_size × emb_dim, which can be significant
- Example: Reducing emb_dim from 4096 to 1024 and n_layers from 32 to 12 creates a much smaller model
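A count_parameters helper consistent with the output shown above can be as simple as the following sketch (the exact implementation lives in rnj1.py):

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Sum and report the number of trainable parameters in the model."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable parameters: {total:,} (~{total / 1e9:.2f}B)")
    return total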
Key Architectural Features
Global Attention Only: Unlike Gemma 3's hybrid sliding window + global attention, RNJ-1 uses only global attention throughout all layers. This provides full context awareness at every layer, which is beneficial for code and agentic tasks.
Standard RoPE: Uses a single RoPE (Rotary Position Embeddings) with theta_base = 10,000. Context extension from 8K to 32K is handled via YaRN (Yet another RoPE extensioN) during mid-training.

GeGLU Activation: Uses the GeGLU (Gated GeLU) activation function in the feedforward network, which provides better expressiveness than standard GeLU.
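As a minimal sketch of a GeGLU feedforward block (layer names such as fc_gate are illustrative, not necessarily the ones used in rnj1.py):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """GeGLU feedforward: GeLU-gated linear unit followed by a down projection."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.fc_gate = nn.Linear(emb_dim, hidden_dim, bias=False)  # gated branch
        self.fc_up = nn.Linear(emb_dim, hidden_dim, bias=False)    # linear branch
        self.fc_down = nn.Linear(hidden_dim, emb_dim, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU: gelu(x W_gate) * (x W_up), then project down
        return self.fc_down(nn.functional.gelu(self.fc_gate(x)) * self.fc_up(x))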
Grouped Query Attention (GQA): 32 query heads with 8 KV heads (4:1 ratio), providing memory efficiency while maintaining performance.
QK Normalization: Uses query-key normalization for training stability.
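The sketch below illustrates the GQA shape bookkeeping with QK normalization, using the default head counts. It omits RoPE and KV caching, uses PyTorch's built-in nn.RMSNorm rather than the zero-centered variant described later, and the names are illustrative rather than the exact ones in rnj1.py:

import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """GQA: 32 query heads share 8 key/value heads (4:1), with RMS-normalized Q and K."""
    def __init__(self, emb_dim=4096, n_heads=32, n_kv_groups=8, head_dim=128):
        super().__init__()
        self.n_heads, self.n_kv, self.hd = n_heads, n_kv_groups, head_dim
        self.q_proj = nn.Linear(emb_dim, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, n_kv_groups * head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, n_kv_groups * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, emb_dim, bias=False)
        self.q_norm = nn.RMSNorm(head_dim)  # QK normalization for training stability
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, t, self.n_heads, self.hd)).transpose(1, 2)
        k = self.k_norm(self.k_proj(x).view(b, t, self.n_kv, self.hd)).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        # Expand the 8 KV heads so each serves 4 query heads (RoPE omitted in this sketch)
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(b, t, self.n_heads * self.hd))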
4 RMSNorm Layers: Pre-norm architecture with 4 normalization layers per transformer block (see the sketch after this list):
- input_layernorm (pre-attention)
- post_attention_layernorm (post-attention, pre-residual)
- pre_feedforward_layernorm (pre-feedforward)
- post_feedforward_layernorm (post-feedforward, pre-residual)
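A minimal sketch of how those four norms wrap the two sub-layers, assuming the GroupedQueryAttention, FeedForward, and RMSNorm modules sketched elsewhere in this README (RoPE and mask plumbing omitted):

import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sandwich-normalized attention and feedforward sub-layers with residual connections."""
    def __init__(self, cfg):
        super().__init__()
        self.attn = GroupedQueryAttention(cfg["emb_dim"], cfg["n_heads"], cfg["n_kv_groups"], cfg["head_dim"])
        self.ff = FeedForward(cfg["emb_dim"], cfg["hidden_dim"])
        self.input_layernorm = RMSNorm(cfg["emb_dim"])
        self.post_attention_layernorm = RMSNorm(cfg["emb_dim"])
        self.pre_feedforward_layernorm = RMSNorm(cfg["emb_dim"])
        self.post_feedforward_layernorm = RMSNorm(cfg["emb_dim"])

    def forward(self, x):
        # Attention sub-layer: pre-norm -> attention -> post-norm -> residual add
        x = x + self.post_attention_layernorm(self.attn(self.input_layernorm(x)))
        # Feedforward sub-layer: pre-norm -> GeGLU MLP -> post-norm -> residual add
        x = x + self.post_feedforward_layernorm(self.ff(self.pre_feedforward_layernorm(x)))
        return x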
Differences from Gemma 3
| Feature | Gemma 3 | RNJ-1 |
|---|---|---|
| Attention | Hybrid sliding window (5:1 pattern) | Global attention only |
| RoPE | Dual RoPE (10K for local, 1M for global) | Single RoPE (10K, extended via YaRN) |
| Activation | GeLU | GeGLU |
| Context Length | 128K (native) | 32K (extended from 8K) |
| Optimizer | AdamW | Muon (custom) |
| Focus | General-purpose | Code & STEM |
Installation
Requirements
pip install torch numpy transformers datasets tqdm matplotlib
GPU Requirements
Memory requirements depend on your model configuration:
With default config (targeting ~8.3B):
- Recommended: NVIDIA A100 (40GB+) or H100
- Minimum: NVIDIA L4 (24GB) with reduced batch size
- Memory: ~35-40GB VRAM (batch_size=16, block_size=128)
For smaller models (modify RNJ1_CONFIG):
- Reduce emb_dim, n_layers, and hidden_dim in RNJ1_CONFIG
- Can run on GPUs with 8-16GB VRAM with appropriate reductions
- Example smaller config: emb_dim=1024, n_layers=12, hidden_dim=4096
- Check the printed parameter count to see your actual model size
Usage
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_id = "EssentialAI/rnj-1-instruct"
model = AutoModelForCausalLM.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Generate text
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Write a Python function to calculate factorial"}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
output_ids = model.generate(
input_ids,
max_new_tokens=200,
temperature=0.2,
top_p=0.95
)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
Training from Scratch
The complete training script is provided in rnj1.py. It includes:
- Dataset Loading: TinyStories dataset (ideal for small language models)
- Tokenization: SentencePiece BPE tokenizer with 128K vocabulary
- Model Architecture: Complete RNJ-1 implementation
- Training Loop: With mixed precision, gradient accumulation, and learning rate scheduling
Training Configuration
# Training hyperparameters (from rnj1.py)
batch_size = 16
block_size = 128
learning_rate = 1e-4
max_iters = 150000
warmup_steps = 1000
gradient_accumulation_steps = 32
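With these defaults, each optimizer step accumulates gradients over batch_size × gradient_accumulation_steps = 16 × 32 = 512 sequences, i.e. 512 × 128 = 65,536 tokens per parameter update.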
Model Configuration
The model size is determined by RNJ1_CONFIG in the script. The actual parameter count is calculated at runtime and printed during initialization. To create a smaller model, modify the configuration:
# Example: Smaller model for testing
RNJ1_CONFIG = {
"vocab_size": vocab_size, # From tokenizer
"emb_dim": 1024, # Reduced from 4096
"n_heads": 16, # Reduced from 32
"head_dim": 64, # Reduced from 128
"n_kv_groups": 4, # Reduced from 8
"n_layers": 12, # Reduced from 32
"hidden_dim": 4096, # Reduced from 16384
"context_length": 2048, # Reduced from 32K
"rope_base": 10_000.0,
"qk_norm": True,
"dtype": torch.bfloat16,
}
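Assuming the Rnj1Model class and count_parameters helper from rnj1.py, you can sanity-check the resulting size directly (the script already does this at startup; shown here only for illustration):

model = Rnj1Model(RNJ1_CONFIG)   # build the model from the config dict above
count_parameters(model)          # prints: Total trainable parameters: X,XXX,XXX (~X.XXB)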
Running Training
python rnj1.py
What the script does:
- Loads tokenizer (with fallback options if RNJ-1 tokenizer unavailable)
- Downloads and tokenizes TinyStories dataset (if train.bin doesn't exist)
- Initializes model with RNJ1_CONFIG settings
- Prints actual parameter count (this is the real model size!)
- Trains with mixed precision (bfloat16)
- Saves best model based on validation loss (rnj1_model.pt)
- Generates sample text after training
To see your actual model size, look for this output when running:
Total trainable parameters: X,XXX,XXX (~X.XXB)
Model Components
The implementation includes:
- RoPE (Rotary Position Embeddings): Standard implementation with configurable base frequency
- RMSNorm: Zero-centered weights with (1 + weight) scaling
- GroupedQueryAttention: GQA with QK normalization
- FeedForward: GeGLU-based feedforward network
- TransformerBlock: Complete transformer block with 4 normalization layers
- Rnj1Model: Full model with token embeddings, transformer blocks, and output head
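A minimal sketch of the zero-centered RMSNorm variant listed above (the weight is initialized to zero and applied as (1 + weight), so the effective scale is 1 at initialization; exact details are in rnj1.py):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization with a zero-centered weight applied as (1 + weight)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)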
Performance
Benchmarks
Code Generation:
- HumanEval+: Strong performance
- MBPP+: Strong performance
- BigCodeBench: Strong performance
- SWE-bench: 20.8% (exceptional for 8B model)
Mathematical Reasoning:
- GSM8K: Strong performance
- Minerva-MATH: On par with best models
- AIME: Outperforms or matches best models
STEM:
- GPQA-Diamond: Close to best similarly sized models
- SuperGPQA: Strong long-context reasoning
Implementation Details
Tokenizer
- Type: SentencePiece BPE
- Vocabulary Size: 128,000 tokens
- Loading: Uses the EssentialAI/rnj-1 tokenizer with fallback options
Data Type Handling
- Training: bfloat16 (preferred) or float16
- Token IDs: uint32 (required for vocab_size > 65536)
- Mixed Precision: Automatic via torch.amp.autocast
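For illustration, token IDs can be written to and read back from train.bin as uint32 like this (token_ids stands for the tokenizer output; the actual I/O code is in rnj1.py):

import numpy as np
import torch

# uint16 tops out at 65,535, so a 128K vocabulary requires uint32 storage.
np.array(token_ids, dtype=np.uint32).tofile("train.bin")

# Memory-map the file for training and cast to int64 before the embedding lookup.
data = np.memmap("train.bin", dtype=np.uint32, mode="r")
x = torch.from_numpy(data[:128].astype(np.int64))  # one block of 128 token IDs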
Memory Optimization
- Gradient Accumulation: Simulates larger batch size without more memory
- Mixed Precision: Reduces memory usage
- Gradient Checkpointing: Can be added for further memory savings
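A minimal sketch of how gradient accumulation and autocast fit together in one training step (names such as get_batch, and the model returning (logits, loss), are assumptions for illustration; see rnj1.py for the actual loop):

import torch

optimizer.zero_grad(set_to_none=True)
for micro_step in range(gradient_accumulation_steps):
    x, y = get_batch("train")  # (batch_size, block_size) token ID tensors
    with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    # Scale the loss so gradients average over the accumulated micro-batches
    (loss / gradient_accumulation_steps).backward()
optimizer.step()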
Key Features
- Complete Implementation: All components from scratch in PyTorch
- Training Ready: Full training loop with best practices
- Modular Design: Easy to modify and extend
- Well Documented: Inline comments explaining each component
- Production Ready: Includes evaluation, checkpointing, and text generation
Limitations & Notes
Model Size: The actual parameter count is calculated and printed at runtime. The default RNJ1_CONFIG targets ~8.3B parameters, but:
- The actual size depends on vocab_size (from tokenizer) and all config values
- You can modify RNJ1_CONFIG to create much smaller models
- For testing, many users reduce emb_dim, n_layers, and hidden_dim significantly
- The embedding layer (vocab_size × emb_dim) is often the largest component
Optimizer: This implementation uses AdamW, while the original RNJ-1 was trained with the Muon optimizer, which provides superior token efficiency.
Training Scale: The provided script uses TinyStories dataset for demonstration. Full RNJ-1 training requires:
- 8.4T tokens for pre-training (8K context)
- 380B tokens for context extension (8K → 32K)
- 150B tokens for supervised fine-tuning
Memory Requirements: Memory usage depends on model size. For the full 8.3B model, you need significant GPU memory. For smaller models, adjust batch_size and block_size based on available hardware. You can also reduce model dimensions in RNJ1_CONFIG.

Tokenizer Fallback: If the RNJ-1 tokenizer is unavailable, the script falls back to the Llama 3.1 tokenizer (also a 128K vocabulary). The actual vocab_size affects the embedding layer size significantly.
File Structure
rnj-1/
├── README.md                      # This file
├── rnj1.py                        # Complete training script
├── RNJ1_QUICK_REFERENCE.md        # Quick reference guide
├── RNJ1_REVIEW.md                 # Detailed model review
├── RNJ1_TOKENIZER_INFO.md         # Tokenizer details
├── RNJ1_VS_GEMMA3_COMPARISON.md   # Architecture comparison
└── linkedin_post_rnj1.md          # Social media post about implementation
References
- Essential AI Research Blog: essential.ai/research/rnj-1
- Hugging Face Model: EssentialAI/rnj-1
- Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
- Gemma 3: Architecture base for RNJ-1
- Google Colab: https://colab.research.google.com/drive/1kwnLGHCDLXjeztkDoOuAS90dQIz2TgjU?usp=sharing
License
This implementation follows the Apache 2.0 license, matching the original RNJ-1 model.
Acknowledgments
- Essential AI for releasing the open-weight RNJ-1 model
- Ashish Vaswani and team for the Transformer architecture and RNJ-1 development
- Hugging Face for model hosting and transformers library
- TinyStories dataset creators for providing training data
Contributing
This is an educational implementation. For improvements or corrections:
- Check existing documentation files for details
- Verify against official RNJ-1 specifications
- Test on appropriate hardware
- Document any changes
Questions & Support
For questions about:
- Model Architecture: See RNJ1_REVIEW.md and RNJ1_VS_GEMMA3_COMPARISON.md
- Tokenizer: See RNJ1_TOKENIZER_INFO.md
- Quick Usage: See RNJ1_QUICK_REFERENCE.md
- Implementation Details: See inline comments in rnj1.py
Last Updated: December 2025
Model Version: RNJ-1 (Base and Instruct)
Implementation Version: 1.0