Sefer-270M-Base (60K Checkpoint)
A 270M-parameter hybrid language model combining CFDRA (Convolutional Frequency-Domain Recurrent Architecture) layers with Transformer attention. This is an experimental architecture exploring alternatives to pure attention-based models.
Model Description
Sefer-270M is a decoder-only language model that uses a novel hybrid architecture:
- CFDRA Layers: Sequence mixing via frequency-domain convolutions with damped oscillator modes
- Transformer Attention: Standard grouped-query attention (GQA) for long-range dependencies
- Layer Ratio: 15 CFDRA layers + 5 Attention layers (3:1 ratio)
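The 3:1 interleaving above matches the layer schedule shown in the architecture diagram further down (attention at layers 3, 7, 11, 15, 19). A minimal sketch of that layout rule, assuming a simple modulo pattern; the helper name is illustrative and not the repo's API:

```python
# Hypothetical sketch of the 3:1 interleaving described above
# (3 CFDRA layers followed by 1 attention layer, repeated over 20 layers).
def layer_types(n_layers: int = 20, cfdra_ratio: int = 3) -> list[str]:
    """Return a per-layer schedule: 'cfdra' x3, then 'attention', repeated."""
    types = []
    for i in range(n_layers):
        # Every (cfdra_ratio + 1)-th layer is attention, the rest are CFDRA.
        types.append("attention" if i % (cfdra_ratio + 1) == cfdra_ratio else "cfdra")
    return types

schedule = layer_types()
print(schedule.count("cfdra"), schedule.count("attention"))  # 15 5
```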
Key Innovation: Frozen Decay Parameters
The CFDRA layers use "decay parameters" that control the half-life of each frequency mode. In this model, these parameters are frozen at their initial values, which follow a geometric distribution of half-lives, preserving a diverse range of time scales from short-term (1 token) to long-term (2,048 tokens).
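A minimal sketch of what such a geometrically spaced, frozen initialization could look like, assuming half-lives are converted to per-token exponential decay rates; the names and exact spacing rule are illustrative and may differ from the repo's initializer:

```python
# Illustrative sketch (not the repo's initializer): geometrically spaced
# half-lives from 1 to 2048 tokens, converted to per-token decay rates.
import math
import torch

n_modes = 384                                                   # M, modes per branch (see the CFDRA table below)
half_lives = torch.logspace(0.0, math.log10(2048.0), n_modes)   # 1 ... 2048 tokens, geometric spacing
decay_rates = math.log(2.0) / half_lives                        # exp(-rate * t) halves every half_life tokens

# With freeze_decay=True these rates stay fixed during training, preserving
# the full spread of short- and long-range time scales.
print(decay_rates[0].item(), decay_rates[-1].item())            # ~0.693 (1-token) ... ~3.4e-4 (2048-token)
```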
Architecture Details
| Component | Value |
|---|---|
| Parameters | ~269M |
| Hidden Size (d_model) | 768 |
| Layers | 20 (15 CFDRA + 5 Attention) |
| Attention Heads | 12 |
| KV Heads (GQA) | 4 |
| FFN Expansion | 4x |
| Vocab Size | 151,936 |
| Context Length | 2,048 |
| Tokenizer | Qwen/Qwen2.5-1.5B |
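A rough back-of-envelope on the numbers above, assuming (not stated in this card) that the LM head shares weights with the input embedding:

```python
# Back-of-envelope parameter budget from the table above.
# Assumption (not stated in this card): the LM head is tied to the input embedding.
d_model, vocab, n_layers, total = 768, 151_936, 20, 269e6

embedding = vocab * d_model                    # ~116.7M shared embedding / LM head weights
per_layer = (total - embedding) / n_layers     # remaining budget per block (mixing + FFN)
print(f"{embedding/1e6:.1f}M embedding, ~{per_layer/1e6:.1f}M per layer")
# -> 116.7M embedding, ~7.6M per layer
```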
CFDRA Configuration
| Parameter | Value |
|---|---|
| R (branches) | 48 |
| M (modes per branch) | 384 |
| Kernel Length | 2,048 |
| Chunk Size | 512 |
| Decay Constraint | Frozen (initial geometric distribution) |
Training Details
Dataset
- Dataset: FineWeb (sample-100BT)
- Training Tokens: ~14.7B tokens (60K steps × 245,760 tokens/step)
- Target: 49B tokens (200K steps) - training ongoing
Training Configuration
- Batch Size: 120 effective (20 per device × 6 gradient accumulation)
- Sequence Length: 2,048
- Learning Rate: 6e-4 (cosine schedule with 2K warmup)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Weight Decay: 0.1
- Precision: bfloat16
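For reference, a sketch of an optimizer and schedule matching these hyperparameters, using `torch.optim.AdamW` and the cosine-with-warmup helper from `transformers`; the repo's actual training loop may be set up differently, and `model` refers to the instance built in the Loading section below:

```python
# Sketch of an optimizer/scheduler setup mirroring the stated hyperparameters.
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=200_000,   # target run length (~49B tokens)
)
```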
Training Progress
| Checkpoint | Steps | Tokens | Eval Loss |
|---|---|---|---|
| This model | 60,000 | ~14.7B | 3.265 |
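The token counts follow directly from the batch configuration above:

```python
# Token arithmetic behind the table above, using the training configuration.
tokens_per_step = 120 * 2048               # effective batch size × sequence length = 245,760
total_tokens = 60_000 * tokens_per_step    # ~14.7B tokens at this checkpoint
target_tokens = 200_000 * tokens_per_step  # ~49B tokens at the planned 200K steps
print(f"{total_tokens/1e9:.2f}B now, {target_tokens/1e9:.1f}B target")
```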
Usage
Installation
This model requires the custom CFDRA implementation. Clone the repository:
```bash
git clone https://github.com/fractal-agi/tcfdra-sefer.git
cd tcfdra-sefer
pip install -r requirements.txt
```
Loading the Model
```python
import torch
from transformers import AutoTokenizer

from src.model.tcfdra_moe import TCFDRAConfig, TCFDRAModel

# Load config
config = TCFDRAConfig(
    d_model=768,
    vocab_size=151936,
    n_layers=20,
    cfdra_ratio=3,
    use_attention=True,
    R=48,
    M=384,
    kernel_len=2048,
    chunk_size=512,
    n_heads=12,
    n_kv_heads=4,
    ffn_expansion=4,
    dropout=0.0,
    freeze_decay=True,
)

# Create model and load weights
model = TCFDRAModel(config)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
```
Text Generation
```python
prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Intended Use
This is a base model (not instruction-tuned). It is designed for:
- Research on hybrid architectures (CFDRA + Transformer)
- Fine-tuning for specific downstream tasks
- Studying the behavior of frequency-domain sequence modeling
Limitations
- Not instruction-tuned: Will not follow instructions or chat naturally
- Limited training: Only 60K steps (~14.7B tokens) - still improving
- Experimental architecture: May have unexpected behaviors
- English only: Primarily trained on English web text
Model Architecture Diagram
```
Input Tokens
      ↓
[Embedding Layer]
      ↓
┌───────────────────────────────────────┐
│  Layer 0-2:   CFDRA + FFN      (×3)   │
│  Layer 3:     Attention + FFN         │
│  Layer 4-6:   CFDRA + FFN      (×3)   │
│  Layer 7:     Attention + FFN         │
│  ...                                  │
│  Layer 16-18: CFDRA + FFN      (×3)   │
│  Layer 19:    Attention + FFN         │
└───────────────────────────────────────┘
      ↓
[RMSNorm]
      ↓
[LM Head] → Logits
```
What is CFDRA?
CFDRA (Convolutional Frequency-Domain Recurrent Architecture) is a sequence modeling layer that:
- Uses damped oscillators: Each mode has a frequency and decay rate, creating a natural multi-scale representation
- Operates in frequency domain: Leverages FFT for efficient long-range convolutions
- Has inherent positional information: The decay structure provides implicit position encoding
- Enables kernel-state duality: Can operate as either a convolution (parallel) or RNN (sequential)
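A toy, self-contained sketch of the idea described above, not the repo's implementation: a single-channel kernel built from damped oscillator modes, applied as a causal convolution via FFT. The mode parameterization and the random frequencies/amplitudes are purely illustrative:

```python
# Toy illustration (not the repo's implementation): a kernel built from
# damped oscillator modes, applied as a causal convolution via FFT.
import math
import torch

def damped_oscillator_kernel(length: int, n_modes: int) -> torch.Tensor:
    t = torch.arange(length, dtype=torch.float32)
    # Geometrically spaced half-lives from 1 token up to the full kernel length.
    half_lives = torch.logspace(0.0, math.log10(length), n_modes)
    decay = math.log(2.0) / half_lives                 # per-token decay rates
    freq = torch.rand(n_modes) * math.pi               # illustrative mode frequencies
    amp = torch.randn(n_modes) / n_modes               # illustrative mode amplitudes
    # Each mode contributes a_m * exp(-lambda_m * t) * cos(omega_m * t); sum over modes.
    modes = amp[:, None] * torch.exp(-decay[:, None] * t) * torch.cos(freq[:, None] * t)
    return modes.sum(dim=0)                            # shape: (length,)

def causal_fft_conv(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal convolution of x with kernel k via FFT, O(n log n) in sequence length."""
    n = x.shape[-1] + k.shape[-1]                      # zero-pad to avoid circular wrap-around
    y = torch.fft.irfft(torch.fft.rfft(x, n) * torch.fft.rfft(k, n), n)
    return y[..., : x.shape[-1]]

x = torch.randn(2048)
y = causal_fft_conv(x, damped_oscillator_kernel(2048, 384))
print(y.shape)   # torch.Size([2048])
```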
Key Advantages
- O(n log n) complexity for sequence length n (vs O(n²) for attention)
- Built-in multi-scale modeling: Different modes capture different time scales
- No explicit positional embeddings needed: Position information emerges from decay structure
Citation
If you use this model, please cite:
```bibtex
@misc{sefer270m2025,
  title={Sefer-270M: A Hybrid CFDRA-Transformer Language Model},
  author={Fractal AGI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/fractal-agi/sefer-270m-base-60k}
}
```
License
Apache 2.0
Contact
- Organization: Fractal AGI
- Repository: https://github.com/fractal-agi/tcfdra-sefer