Sefer-270M-Base (60K Checkpoint)
A 270M-parameter hybrid language model combining CFDRA (Convolutional Frequency-Domain Recurrent Architecture) layers with Transformer attention. This is an experimental architecture exploring alternatives to pure attention-based models.
Model Description
Sefer-270M is a decoder-only language model that uses a novel hybrid architecture:
- CFDRA Layers: Sequence mixing via frequency-domain convolutions with damped oscillator modes
- Transformer Attention: Standard grouped-query attention (GQA) for long-range dependencies
- Layer Ratio: 15 CFDRA layers + 5 Attention layers (3:1 ratio)
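The 3:1 interleaving above matches the layer schedule shown in the architecture diagram further down (attention at layers 3, 7, 11, 15, 19). A minimal sketch of that layout rule, assuming a simple modulo pattern; the helper name is illustrative and not the repo's API:

```python
# Hypothetical sketch of the 3:1 interleaving described above
# (3 CFDRA layers followed by 1 attention layer, repeated over 20 layers).
def layer_types(n_layers: int = 20, cfdra_ratio: int = 3) -> list[str]:
    """Return a per-layer schedule: 'cfdra' x3, then 'attention', repeated."""
    types = []
    for i in range(n_layers):
        # Every (cfdra_ratio + 1)-th layer is attention, the rest are CFDRA.
        types.append("attention" if i % (cfdra_ratio + 1) == cfdra_ratio else "cfdra")
    return types

schedule = layer_types()
print(schedule.count("cfdra"), schedule.count("attention"))  # 15 5
```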
Key Innovation: Frozen Decay Parameters
The CFDRA layers use "decay parameters" that control the half-life of each frequency mode. In this model, these parameters are frozen at their initial values, which follow a geometric distribution of half-lives, preserving a diverse range of time scales from short-term (1 token) to long-term (2,048 tokens).
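A minimal sketch of what such a geometrically spaced, frozen initialization could look like, assuming half-lives are converted to per-token exponential decay rates; the names and exact spacing rule are illustrative and may differ from the repo's initializer:

```python
# Illustrative sketch (not the repo's initializer): geometrically spaced
# half-lives from 1 to 2048 tokens, converted to per-token decay rates.
import math
import torch

n_modes = 384                                                   # M, modes per branch (see the CFDRA table below)
half_lives = torch.logspace(0.0, math.log10(2048.0), n_modes)   # 1 ... 2048 tokens, geometric spacing
decay_rates = math.log(2.0) / half_lives                        # exp(-rate * t) halves every half_life tokens

# With freeze_decay=True these rates stay fixed during training, preserving
# the full spread of short- and long-range time scales.
print(decay_rates[0].item(), decay_rates[-1].item())            # ~0.693 (1-token) ... ~3.4e-4 (2048-token)
```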
Architecture Details
| Component | Value |
|---|---|
| Parameters | ~269M |
| Hidden Size (d_model) | 768 |
| Layers | 20 (15 CFDRA + 5 Attention) |
| Attention Heads | 12 |
| KV Heads (GQA) | 4 |
| FFN Expansion | 4x |
| Vocab Size | 151,936 |
| Context Length | 2,048 |
| Tokenizer | Qwen/Qwen2.5-1.5B |
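A rough back-of-envelope on the numbers above, assuming (not stated in this card) that the LM head shares weights with the input embedding:

```python
# Back-of-envelope parameter budget from the table above.
# Assumption (not stated in this card): the LM head is tied to the input embedding.
d_model, vocab, n_layers, total = 768, 151_936, 20, 269e6

embedding = vocab * d_model                    # ~116.7M shared embedding / LM head weights
per_layer = (total - embedding) / n_layers     # remaining budget per block (mixing + FFN)
print(f"{embedding/1e6:.1f}M embedding, ~{per_layer/1e6:.1f}M per layer")
# -> 116.7M embedding, ~7.6M per layer
```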
CFDRA Configuration
| Parameter | Value |
|---|---|
| R (branches) | 48 |
| M (modes per branch) | 384 |
| Kernel Length | 2,048 |
| Chunk Size | 512 |
| Decay Constraint | Frozen (initial geometric distribution) |
Training Details
Dataset
- Dataset: FineWeb (sample-100BT)
- Training Tokens: ~14.7B tokens (60K steps × 245,760 tokens/step)
- Target: 49B tokens (200K steps) - training ongoing
Training Configuration
- Batch Size: 120 effective (20 per device × 6 gradient accumulation)
- Sequence Length: 2,048
- Learning Rate: 6e-4 (cosine schedule with 2K warmup)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Weight Decay: 0.1
- Precision: bfloat16
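For reference, a sketch of an optimizer and schedule matching these hyperparameters, using `torch.optim.AdamW` and the cosine-with-warmup helper from `transformers`; the repo's actual training loop may be set up differently, and `model` refers to the instance built in the Loading section below:

```python
# Sketch of an optimizer/scheduler setup mirroring the stated hyperparameters.
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=200_000,   # target run length (~49B tokens)
)
```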
Training Progress
| Checkpoint | Steps | Tokens | Eval Loss |
|---|---|---|---|
| This model | 60,000 | ~14.7B | 3.265 |
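The token counts follow directly from the batch configuration above:

```python
# Token arithmetic behind the table above, using the training configuration.
tokens_per_step = 120 * 2048               # effective batch size × sequence length = 245,760
total_tokens = 60_000 * tokens_per_step    # ~14.7B tokens at this checkpoint
target_tokens = 200_000 * tokens_per_step  # ~49B tokens at the planned 200K steps
print(f"{total_tokens/1e9:.2f}B now, {target_tokens/1e9:.1f}B target")
```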
Usage
Installation
This model requires the custom CFDRA implementation. Clone the repository:
```bash
git clone https://github.com/fractal-agi/tcfdra-sefer.git
cd tcfdra-sefer
pip install -r requirements.txt
```
Loading the Model
```python
import torch
from transformers import AutoTokenizer

from src.model.tcfdra_moe import TCFDRAConfig, TCFDRAModel

# Load config
config = TCFDRAConfig(
    d_model=768,
    vocab_size=151936,
    n_layers=20,
    cfdra_ratio=3,
    use_attention=True,
    R=48,
    M=384,
    kernel_len=2048,
    chunk_size=512,
    n_heads=12,
    n_kv_heads=4,
    ffn_expansion=4,
    dropout=0.0,
    freeze_decay=True,
)

# Create model and load weights
model = TCFDRAModel(config)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
```
Text Generation
```python
prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Intended Use
This is a base model (not instruction-tuned). It is designed for:
- Research on hybrid architectures (CFDRA + Transformer)
- Fine-tuning for specific downstream tasks
- Studying the behavior of frequency-domain sequence modeling
Limitations
- Not instruction-tuned: Will not follow instructions or chat naturally
- Limited training: Only 60K steps (~14.7B tokens) - still improving
- Experimental architecture: May have unexpected behaviors
- English only: Primarily trained on English web text
Model Architecture Diagram
```
Input Tokens
      ↓
[Embedding Layer]
      ↓
┌───────────────────────────────────────┐
│  Layer 0-2:   CFDRA + FFN      (×3)   │
│  Layer 3:     Attention + FFN         │
│  Layer 4-6:   CFDRA + FFN      (×3)   │
│  Layer 7:     Attention + FFN         │
│  ...                                  │
│  Layer 16-18: CFDRA + FFN      (×3)   │
│  Layer 19:    Attention + FFN         │
└───────────────────────────────────────┘
      ↓
[RMSNorm]
      ↓
[LM Head] → Logits
```
What is CFDRA?
CFDRA (Convolutional Frequency-Domain Recurrent Architecture) is a sequence modeling layer that:
- Uses damped oscillators: Each mode has a frequency and decay rate, creating a natural multi-scale representation
- Operates in frequency domain: Leverages FFT for efficient long-range convolutions
- Has inherent positional information: The decay structure provides implicit position encoding
- Enables kernel-state duality: Can operate as either a convolution (parallel) or RNN (sequential)
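A toy, self-contained sketch of the idea described above, not the repo's implementation: a single-channel kernel built from damped oscillator modes, applied as a causal convolution via FFT. The mode parameterization and the random frequencies/amplitudes are purely illustrative:

```python
# Toy illustration (not the repo's implementation): a kernel built from
# damped oscillator modes, applied as a causal convolution via FFT.
import math
import torch

def damped_oscillator_kernel(length: int, n_modes: int) -> torch.Tensor:
    t = torch.arange(length, dtype=torch.float32)
    # Geometrically spaced half-lives from 1 token up to the full kernel length.
    half_lives = torch.logspace(0.0, math.log10(length), n_modes)
    decay = math.log(2.0) / half_lives                 # per-token decay rates
    freq = torch.rand(n_modes) * math.pi               # illustrative mode frequencies
    amp = torch.randn(n_modes) / n_modes               # illustrative mode amplitudes
    # Each mode contributes a_m * exp(-lambda_m * t) * cos(omega_m * t); sum over modes.
    modes = amp[:, None] * torch.exp(-decay[:, None] * t) * torch.cos(freq[:, None] * t)
    return modes.sum(dim=0)                            # shape: (length,)

def causal_fft_conv(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal convolution of x with kernel k via FFT, O(n log n) in sequence length."""
    n = x.shape[-1] + k.shape[-1]                      # zero-pad to avoid circular wrap-around
    y = torch.fft.irfft(torch.fft.rfft(x, n) * torch.fft.rfft(k, n), n)
    return y[..., : x.shape[-1]]

x = torch.randn(2048)
y = causal_fft_conv(x, damped_oscillator_kernel(2048, 384))
print(y.shape)   # torch.Size([2048])
```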
Key Advantages
- O(n log n) complexity for sequence length n (vs O(n²) for attention)
- Built-in multi-scale modeling: Different modes capture different time scales
- No explicit positional embeddings needed: Position information emerges from decay structure
Citation
If you use this model, please cite:
```bibtex
@misc{sefer270m2025,
  title={Sefer-270M: A Hybrid CFDRA-Transformer Language Model},
  author={Fractal AGI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/fractal-agi/sefer-270m-base-60k}
}
```
License
Apache 2.0
Contact
- Organization: Fractal AGI
- Repository: https://github.com/fractal-agi/tcfdra-sefer