RNJ-1: Building from Scratch

A complete PyTorch implementation of the RNJ-1 (pronounced "range-1") architecture, following the design principles of the model developed by Essential AI, led by Ashish Vaswani (co-author of "Attention Is All You Need").

Overview

This is a configurable implementation of the RNJ-1 architecture that allows you to build models of various sizes. The original RNJ-1 is an 8.3B parameter model optimized for code generation, agentic tasks, and STEM problem solving, but this implementation lets you adjust the model size based on your needs and available resources.

The actual parameter count is calculated at runtime and depends on your configuration in RNJ1_CONFIG. The default configuration targets the full RNJ-1 architecture (~8.3B), but you can easily modify it to create smaller models for testing or for GPUs with limited memory.

Key Facts

  • Parameters: Fully configurable - actual count calculated at runtime (see RNJ1_CONFIG in rnj1.py)
  • Default Config: Targets ~8.3B parameters (can be modified for smaller models)
  • Context Length: Configurable (default 32K tokens)
  • License: Apache 2.0
  • Architecture: Based on Gemma 3, with key simplifications
  • Original Model: Essential AI (led by Transformer co-inventor Ashish Vaswani)

Architecture

Model Specifications

The model configuration is defined in RNJ1_CONFIG in rnj1.py. You can modify these values to create models of any size:

| Hyperparameter | Default Value | Config Key | Notes |
|---|---|---|---|
| Number of Layers | 32 | n_layers | Main size factor |
| Model Dimension | 4096 | emb_dim | Affects all layers |
| MLP Dimension | 16384 | hidden_dim | Typically 4x emb_dim |
| Number of Attention Heads | 32 | n_heads | Should divide emb_dim |
| Number of Key-Value Heads | 8 | n_kv_groups | GQA ratio |
| Attention Head Dimension | 128 | head_dim | Typically emb_dim / n_heads |
| Vocabulary Size | From tokenizer | vocab_size | Affects embedding size |
| Tokenizer | SentencePiece BPE | Auto-detected | With fallback options |
| Context Length | 32768 | context_length | Can be reduced |
| Activation Function | GeGLU | Fixed | In FeedForward class |
| Tied Embeddings | Yes | Fixed | Embedding and output head share weights |

Important:

  • Total Parameters: Calculated automatically from the config above by the count_parameters() function and printed when you run the script; look for: Total trainable parameters: X,XXX,XXX (~X.XXB)
  • To create a smaller model, modify RNJ1_CONFIG in rnj1.py before running
  • The embedding layer size = vocab_size × emb_dim, which can be significant
  • Example: Reducing emb_dim from 4096 to 1024 and n_layers from 32 to 12 creates a much smaller model (a rough parameter estimator is sketched after this list)
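
As a rough illustration of how the config drives model size, the sketch below estimates the parameter count from a config dictionary. It is an approximation only (it assumes tied embeddings, GQA query/key/value/output projections, and a three-matrix GeGLU feedforward, and ignores the small RMSNorm weights); the authoritative number is whatever count_parameters() prints at runtime.

# Rough parameter estimate from a config dict. Approximation only: assumes tied
# embeddings, GQA attention projections, and a GeGLU feedforward (gate/up/down);
# RMSNorm weights are negligible and ignored.
def estimate_params(cfg):
    d, h = cfg["emb_dim"], cfg["hidden_dim"]
    q_dim = cfg["n_heads"] * cfg["head_dim"]
    kv_dim = cfg["n_kv_groups"] * cfg["head_dim"]

    embedding = cfg["vocab_size"] * d                # shared with the output head (tied)
    attn = d * q_dim + 2 * d * kv_dim + q_dim * d    # Q, K, V, and output projections
    mlp = 3 * d * h                                  # gate, up, and down projections (GeGLU)
    return embedding + cfg["n_layers"] * (attn + mlp)

# With the default-sized values from the table above and a 128,000-token vocabulary:
print(estimate_params({
    "vocab_size": 128_000, "emb_dim": 4096, "hidden_dim": 16384,
    "n_heads": 32, "n_kv_groups": 8, "head_dim": 128, "n_layers": 32,
}))  # ~8.3e9, consistent with the ~8.3B target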

Key Architectural Features

  1. Global Attention Only: Unlike Gemma 3's hybrid sliding window + global attention, RNJ-1 uses only global attention throughout all layers. This provides full context awareness at every layer, which is beneficial for code and agentic tasks.

  2. Standard RoPE: Uses single RoPE (Rotary Position Embeddings) with theta_base = 10,000. Context extension from 8K to 32K is handled via YaRN (Yet another RoPE extensioN) during mid-training.

  3. GeGLU Activation: Uses GeGLU (Gated GeLU) activation function in the feedforward network, which provides better expressiveness compared to standard GeLU.

  4. Grouped Query Attention (GQA): 32 query heads with 8 KV heads (4:1 ratio), providing memory efficiency while maintaining performance.

  5. QK Normalization: Uses query-key normalization for training stability.

  6. 4 RMSNorm Layers: Pre-norm architecture with 4 normalization layers per transformer block (the norm placement is sketched in the code after this list):

    • input_layernorm (pre-attention)
    • post_attention_layernorm (post-attention, pre-residual)
    • pre_feedforward_layernorm (pre-feedforward)
    • post_feedforward_layernorm (post-feedforward, pre-residual)
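
For concreteness, here is a minimal sketch of items 3 and 6 above: a GeGLU feedforward and the four-RMSNorm pre-norm block. It is not the rnj1.py code itself; it uses torch.nn.RMSNorm (PyTorch 2.4+) for brevity, whereas the actual implementation uses a zero-centered (1 + weight) RMSNorm, and the grouped-query attention module is passed in as an opaque stand-in.

import torch
import torch.nn as nn


class GeGLUFeedForward(nn.Module):
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(emb_dim, hidden_dim, bias=False)  # gated branch
        self.up_proj = nn.Linear(emb_dim, hidden_dim, bias=False)    # linear branch
        self.down_proj = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        # GeGLU: GeLU(gate) * up, then project back to emb_dim
        return self.down_proj(nn.functional.gelu(self.gate_proj(x)) * self.up_proj(x))


class SketchTransformerBlock(nn.Module):
    """Illustrates the 4-norm ordering; `attention` stands in for the GQA module."""

    def __init__(self, emb_dim, hidden_dim, attention):
        super().__init__()
        self.attention = attention
        self.feed_forward = GeGLUFeedForward(emb_dim, hidden_dim)
        self.input_layernorm = nn.RMSNorm(emb_dim)
        self.post_attention_layernorm = nn.RMSNorm(emb_dim)
        self.pre_feedforward_layernorm = nn.RMSNorm(emb_dim)
        self.post_feedforward_layernorm = nn.RMSNorm(emb_dim)

    def forward(self, x):
        # Attention sub-block: pre-norm, attention, post-norm, then residual add
        h = x + self.post_attention_layernorm(self.attention(self.input_layernorm(x)))
        # Feedforward sub-block: pre-norm, GeGLU MLP, post-norm, then residual add
        return h + self.post_feedforward_layernorm(
            self.feed_forward(self.pre_feedforward_layernorm(h))
        )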

Differences from Gemma 3

| Feature | Gemma 3 | RNJ-1 |
|---|---|---|
| Attention | Hybrid sliding window (5:1 pattern) | Global attention only |
| RoPE | Dual RoPE (10K for local, 1M for global) | Single RoPE (10K, extended via YaRN) |
| Activation | GeLU | GeGLU |
| Context Length | 128K (native) | 32K (extended from 8K) |
| Optimizer | AdamW | Muon (custom) |
| Focus | General-purpose | Code & STEM |

Installation

Requirements

pip install torch numpy transformers datasets tqdm matplotlib

GPU Requirements

Memory requirements depend on your model configuration:

  • With default config (targeting ~8.3B):

    • Recommended: NVIDIA A100 (40GB+) or H100
    • Minimum: NVIDIA L4 (24GB) with reduced batch size
    • Memory: ~35-40GB VRAM (batch_size=16, block_size=128)
  • For smaller models (modify RNJ1_CONFIG):

    • Reduce emb_dim, n_layers, and hidden_dim in RNJ1_CONFIG
    • Can run on GPUs with 8-16GB VRAM with appropriate reductions
    • Example smaller config: emb_dim=1024, n_layers=12, hidden_dim=4096
    • Check the printed parameter count to see your actual model size

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_id = "EssentialAI/rnj-1-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # use torch_dtype=torch.bfloat16 on older transformers versions
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python function to calculate factorial"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.2,
    top_p=0.95
)

response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Training from Scratch

The complete training script is provided in rnj1.py. It includes:

  1. Dataset Loading: TinyStories dataset (ideal for small language models)
  2. Tokenization: SentencePiece BPE tokenizer with 128K vocabulary
  3. Model Architecture: Complete RNJ-1 implementation
  4. Training Loop: With mixed precision, gradient accumulation, and learning rate scheduling

Training Configuration

# Training hyperparameters (from rnj1.py)
batch_size = 16
block_size = 128
learning_rate = 1e-4
max_iters = 150000
warmup_steps = 1000
gradient_accumulation_steps = 32
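
With these defaults, each optimizer step accumulates gradients over batch_size × gradient_accumulation_steps = 16 × 32 = 512 sequences, i.e. 512 × 128 = 65,536 tokens per parameter update.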

Model Configuration

The model size is determined by RNJ1_CONFIG in the script. The actual parameter count is calculated at runtime and printed during initialization. To create a smaller model, modify the configuration:

# Example: Smaller model for testing
RNJ1_CONFIG = {
    "vocab_size": vocab_size,  # From tokenizer
    "emb_dim": 1024,           # Reduced from 4096
    "n_heads": 16,             # Reduced from 32
    "head_dim": 64,            # Reduced from 128
    "n_kv_groups": 4,          # Reduced from 8
    "n_layers": 12,            # Reduced from 32
    "hidden_dim": 4096,        # Reduced from 16384
    "context_length": 2048,    # Reduced from 32K
    "rope_base": 10_000.0,
    "qk_norm": True,
    "dtype": torch.bfloat16,
}
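
To sanity-check a modified config, instantiate the model and print the count directly. This sketch assumes the class and helper names referenced elsewhere in this README (Rnj1Model and count_parameters); adjust if rnj1.py names them differently.

# Hedged usage sketch: Rnj1Model and count_parameters are the names used in this
# README and may differ slightly in rnj1.py.
model = Rnj1Model(RNJ1_CONFIG)
n_params = count_parameters(model)
print(f"Total trainable parameters: {n_params:,} (~{n_params / 1e9:.2f}B)")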

Running Training

python rnj1.py

What the script does:

  1. Loads tokenizer (with fallback options if RNJ-1 tokenizer unavailable)
  2. Downloads and tokenizes TinyStories dataset (if train.bin doesn't exist)
  3. Initializes model with RNJ1_CONFIG settings
  4. Prints actual parameter count (this is the real model size!)
  5. Trains with mixed precision (bfloat16)
  6. Saves best model based on validation loss (rnj1_model.pt)
  7. Generates sample text after training

To see your actual model size, look for this output when running:

Total trainable parameters: X,XXX,XXX (~X.XXB)

Model Components

The implementation includes:

  • RoPE (Rotary Position Embeddings): Standard implementation with configurable base frequency (a condensed sketch follows this list)
  • RMSNorm: Zero-centered weights with (1 + weight) scaling
  • GroupedQueryAttention: GQA with QK normalization
  • FeedForward: GeGLU-based feedforward network
  • TransformerBlock: Complete transformer block with 4 normalization layers
  • Rnj1Model: Full model with token embeddings, transformer blocks, and output head
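
For reference, the sketch below condenses the two simplest components, a standard RoPE rotation helper and the (1 + weight)-style RMSNorm; the versions in rnj1.py are more complete (angle caching, dtype handling) and may differ in detail.

import torch
import torch.nn as nn


def rope_rotate(x, base=10_000.0):
    # x: (batch, n_heads, seq_len, head_dim); standard RoPE with a configurable base frequency
    _, _, seq_len, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2:]
    rotated = torch.cat((-x2, x1), dim=-1)  # "rotate half" form
    return x * torch.cat((cos, cos), dim=-1) + rotated * torch.cat((sin, sin), dim=-1)


class RMSNorm(nn.Module):
    # Zero-centered weight with (1 + weight) scaling, as described above
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return norm * (1.0 + self.weight)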

Performance

Benchmarks

Code Generation:

  • HumanEval+: Strong performance
  • MBPP+: Strong performance
  • BigCodeBench: Strong performance
  • SWE-bench: 20.8% (exceptional for 8B model)

Mathematical Reasoning:

  • GSM8K: Strong performance
  • Minerva-MATH: On par with best models
  • AIME: Outperforms or matches best models

STEM:

  • GPQA-Diamond: Close to best similarly sized models
  • SuperGPQA: Strong long-context reasoning

Implementation Details

Tokenizer

  • Type: SentencePiece BPE
  • Vocabulary Size: 128,000 tokens
  • Loading: Uses the EssentialAI/rnj-1 tokenizer, with fallback options (a loading sketch follows this list)
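
A hedged sketch of that fallback logic (the exact repo IDs and error handling in rnj1.py may differ; the Llama 3.1 repo used here is an assumption and may require accepting its license on Hugging Face):

from transformers import AutoTokenizer

def load_tokenizer():
    try:
        return AutoTokenizer.from_pretrained("EssentialAI/rnj-1")
    except Exception:
        # Fallback described in the notes below: a Llama 3.1 tokenizer with a comparable 128K vocabulary
        return AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

tokenizer = load_tokenizer()
vocab_size = len(tokenizer)  # feeds RNJ1_CONFIG["vocab_size"]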

Data Type Handling

  • Training: bfloat16 (preferred) or float16
  • Token IDs: uint32 (required for vocab_size > 65536)
  • Mixed Precision: Automatic via torch.amp.autocast

Memory Optimization

  • Gradient Accumulation: Simulates a larger batch size without using more memory (see the training-step sketch after this list)
  • Mixed Precision: Reduces memory usage
  • Gradient Checkpointing: Can be added for further memory savings
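
A minimal sketch of how mixed precision and gradient accumulation typically fit together in the training loop. The model, optimizer, and get_batch arguments are placeholders, and the assumption that the model returns (logits, loss) is illustrative; the real loop in rnj1.py also handles warmup, evaluation, and checkpointing.

import torch

def train(model, optimizer, get_batch, *, max_iters, gradient_accumulation_steps,
          dtype=torch.bfloat16):
    # get_batch("train") is assumed to return (inputs, targets) as int64 tensors on the GPU;
    # the on-disk token IDs are stored as uint32 and must be cast before the embedding lookup.
    scaler = torch.amp.GradScaler(enabled=(dtype == torch.float16))  # no-op under bfloat16
    for step in range(max_iters):
        for _ in range(gradient_accumulation_steps):
            x, y = get_batch("train")
            with torch.amp.autocast(device_type="cuda", dtype=dtype):
                logits, loss = model(x, y)                 # assumes the model returns (logits, loss)
                loss = loss / gradient_accumulation_steps  # average over micro-batches
            scaler.scale(loss).backward()                  # gradients accumulate across micro-steps
        scaler.step(optimizer)                             # one optimizer update per accumulated batch
        scaler.update()
        optimizer.zero_grad(set_to_none=True)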

Key Features

  1. Complete Implementation: All components from scratch in PyTorch
  2. Training Ready: Full training loop with best practices
  3. Modular Design: Easy to modify and extend
  4. Well Documented: Inline comments explaining each component
  5. Production Ready: Includes evaluation, checkpointing, and text generation

Limitations & Notes

  1. Model Size: The actual parameter count is calculated and printed at runtime. The default RNJ1_CONFIG targets ~8.3B parameters, but:

    • The actual size depends on vocab_size (from tokenizer) and all config values
    • You can modify RNJ1_CONFIG to create much smaller models
    • For testing, many users reduce emb_dim, n_layers, and hidden_dim significantly
    • The embedding layer (vocab_size × emb_dim) is often the largest component
  2. Optimizer: This implementation uses AdamW, but the original RNJ-1 uses Muon optimizer (custom optimizer by Essential AI). Muon provides superior token efficiency but is not publicly available.

  3. Training Scale: The provided script uses TinyStories dataset for demonstration. Full RNJ-1 training requires:

    • 8.4T tokens for pre-training (8K context)
    • 380B tokens for context extension (8K → 32K)
    • 150B tokens for supervised fine-tuning
  4. Memory Requirements: Memory usage depends on model size. For the full 8.3B model, you need significant GPU memory. For smaller models, adjust batch_size and block_size based on available hardware. You can also reduce model dimensions in RNJ1_CONFIG.

  5. Tokenizer Fallback: If the RNJ-1 tokenizer is unavailable, the script falls back to the Llama 3.1 tokenizer (also a 128K-vocab BPE tokenizer). The actual vocab_size affects the embedding layer size significantly.

File Structure

rnj-1/
├── README.md                    # This file
├── rnj1.py                      # Complete training script
├── RNJ1_QUICK_REFERENCE.md      # Quick reference guide
├── RNJ1_REVIEW.md               # Detailed model review
├── RNJ1_TOKENIZER_INFO.md       # Tokenizer details
├── RNJ1_VS_GEMMA3_COMPARISON.md # Architecture comparison
└── linkedin_post_rnj1.md        # Social media post about implementation

References

  1. Essential AI Research Blog: essential.ai/research/rnj-1
  2. Hugging Face Model: EssentialAI/rnj-1
  3. Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
  4. Gemma 3: Architecture base for RNJ-1
  5. Google Colab: https://colab.research.google.com/drive/1kwnLGHCDLXjeztkDoOuAS90dQIz2TgjU?usp=sharing

License

This implementation follows the Apache 2.0 license, matching the original RNJ-1 model.

Acknowledgments

  • Essential AI for releasing the open-weight RNJ-1 model
  • Ashish Vaswani and team for the Transformer architecture and RNJ-1 development
  • Hugging Face for model hosting and transformers library
  • TinyStories dataset creators for providing training data

Contributing

This is an educational implementation. For improvements or corrections:

  1. Check existing documentation files for details
  2. Verify against official RNJ-1 specifications
  3. Test on appropriate hardware
  4. Document any changes

Questions & Support

For questions about:

  • Model Architecture: See RNJ1_REVIEW.md and RNJ1_VS_GEMMA3_COMPARISON.md
  • Tokenizer: See RNJ1_TOKENIZER_INFO.md
  • Quick Usage: See RNJ1_QUICK_REFERENCE.md
  • Implementation Details: See inline comments in rnj1.py

Last Updated: December 2025
Model Version: RNJ-1 (Base and Instruct)
Implementation Version: 1.0
