Sefer-270M-Base (60K Checkpoint)

A 270M parameter hybrid language model combining CFDRA (Convolutional Frequency-Domain Recurrent Architecture) layers with Transformer attention. This is an experimental architecture exploring alternatives to pure attention-based models.

Model Description

Sefer-270M is a decoder-only language model that uses a novel hybrid architecture:

  • CFDRA Layers: Sequence mixing via frequency-domain convolutions with damped oscillator modes
  • Transformer Attention: Standard grouped-query attention (GQA) for long-range dependencies
  • Layer Ratio: 15 CFDRA layers + 5 Attention layers (3:1 ratio)

Key Innovation: Frozen Decay Parameters

The CFDRA layers use "decay parameters" that control the half-life of different frequency modes. In this model, these parameters are frozen at their initial geometric distribution, which preserves a diverse range of time scales from short-term (1 token) to long-term (2048 tokens).
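
The exact initializer lives in the repository; below is a rough, hypothetical sketch of what a geometric spread of half-lives looks like and what "frozen" means in practice. Function and variable names here are illustrative, not the repository's API.

import math
import torch

def init_decay_halflives(n_modes, min_halflife=1.0, max_halflife=2048.0):
    # Half-lives spaced geometrically between 1 and 2048 tokens.
    return torch.logspace(math.log10(min_halflife), math.log10(max_halflife), steps=n_modes)

halflives = init_decay_halflives(8)   # geometric spread: roughly 1, 3, 9, 26, 78, 232, 689, 2048 tokens
# Per-token amplitude factor: a mode's contribution halves after `halflife` tokens.
# "Frozen" means these values stay at initialization (requires_grad=False) throughout training.
decay = torch.nn.Parameter(0.5 ** (1.0 / halflives), requires_grad=False)
print(decay)  # ranges from 0.5 (1-token memory) to ~0.99966 (2048-token memory)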

Architecture Details

  • Parameters: ~269M
  • Hidden Size (d_model): 768
  • Layers: 20 (15 CFDRA + 5 Attention)
  • Attention Heads: 12
  • KV Heads (GQA): 4
  • FFN Expansion: 4×
  • Vocab Size: 151,936
  • Context Length: 2,048
  • Tokenizer: Qwen/Qwen2.5-1.5B

CFDRA Configuration

  • R (branches): 48
  • M (modes per branch): 384
  • Kernel Length: 2,048
  • Chunk Size: 512
  • Decay Constraint: Frozen (initial geometric distribution)

Training Details

Dataset

  • Dataset: FineWeb (sample-100BT)
  • Training Tokens: ~14.7B tokens (60K steps × 245,760 tokens/step; see the arithmetic check below)
  • Target: 49B tokens (200K steps); training is ongoing
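
The token counts above follow directly from the batch configuration listed in the next subsection; a quick check:

# Quick check of the token accounting quoted above.
effective_batch = 120       # sequences per optimizer step (see Training Configuration)
seq_len = 2048              # tokens per sequence
tokens_per_step = effective_batch * seq_len
print(tokens_per_step)                    # 245760
print(tokens_per_step * 60_000 / 1e9)     # ~14.75 -> "~14.7B tokens" at this checkpoint
print(tokens_per_step * 200_000 / 1e9)    # ~49.15 -> "~49B tokens" at the 200K-step target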

Training Configuration

  • Batch Size: 120 effective (20 per device × 6 gradient accumulation steps)
  • Sequence Length: 2,048
  • Learning Rate: 6e-4 (cosine schedule with 2K warmup)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95)
  • Weight Decay: 0.1
  • Precision: bfloat16
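
A minimal PyTorch sketch of the optimizer and schedule described above. The actual training script is in the repository and may differ in details; the minimum learning rate and warmup shape below are assumptions.

import math
import torch

def build_optimizer_and_scheduler(model, max_steps=200_000, warmup_steps=2_000, peak_lr=6e-4):
    # AdamW with the betas and weight decay listed above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        # Assumed: linear warmup over the first 2K steps, then cosine decay toward zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler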

Training Progress

  • Checkpoint: this model
  • Steps: 60,000
  • Tokens: ~14.7B
  • Eval Loss: 3.265

Usage

Installation

This model requires the custom CFDRA implementation. Clone the repository:

git clone https://github.com/fractal-agi/tcfdra-sefer.git
cd tcfdra-sefer
pip install -r requirements.txt

Loading the Model

import torch
from src.model.tcfdra_moe import TCFDRAConfig, TCFDRAModel

# Load config
config = TCFDRAConfig(
    d_model=768,
    vocab_size=151936,
    n_layers=20,
    cfdra_ratio=3,
    use_attention=True,
    R=48,
    M=384,
    kernel_len=2048,
    chunk_size=512,
    n_heads=12,
    n_kv_heads=4,
    ffn_expansion=4,
    dropout=0.0,
    freeze_decay=True,
)

# Create model and load weights
model = TCFDRAModel(config)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Load tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

Text Generation

prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Intended Use

This is a base model (not instruction-tuned). It is designed for:

  • Research on hybrid architectures (CFDRA + Transformer)
  • Fine-tuning for specific downstream tasks
  • Studying the behavior of frequency-domain sequence modeling

Limitations

  • Not instruction-tuned: Will not follow instructions or chat naturally
  • Limited training: Only 60K of the planned 200K steps (~14.7B tokens); training is still in progress
  • Experimental architecture: May have unexpected behaviors
  • English only: Primarily trained on English web text

Model Architecture Diagram

Input Tokens
     ↓
[Embedding Layer]
     ↓
┌─────────────────────────────────────┐
│  Layer 0-2: CFDRA + FFN (×3)        │
│  Layer 3: Attention + FFN           │
│  Layer 4-6: CFDRA + FFN (×3)        │
│  Layer 7: Attention + FFN           │
│  ...                                │
│  Layer 16-18: CFDRA + FFN (×3)      │
│  Layer 19: Attention + FFN          │
└─────────────────────────────────────┘
     ↓
[RMSNorm]
     ↓
[LM Head] → Logits
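
The interleaving in the diagram (every fourth layer is attention) can be reproduced from the `cfdra_ratio` field used in the loading example. The sketch below is illustrative and may not match the repository's exact layer-assignment code.

def layer_types(n_layers=20, cfdra_ratio=3):
    # With cfdra_ratio=3, every (cfdra_ratio + 1)-th layer is attention:
    # layers 3, 7, 11, 15, 19 are attention; the remaining 15 are CFDRA.
    return ["attention" if (i + 1) % (cfdra_ratio + 1) == 0 else "cfdra"
            for i in range(n_layers)]

pattern = layer_types()
assert pattern.count("cfdra") == 15 and pattern.count("attention") == 5
print(pattern)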

What is CFDRA?

CFDRA (Convolutional Frequency-Domain Recurrent Architecture) is a sequence modeling layer that:

  1. Uses damped oscillators: Each mode has a frequency and decay rate, creating a natural multi-scale representation
  2. Operates in the frequency domain: Leverages the FFT for efficient long-range convolutions (see the sketch after this list)
  3. Has inherent positional information: The decay structure provides implicit position encoding
  4. Enables kernel-state duality: Can operate as either a convolution (parallel) or RNN (sequential)
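
A self-contained toy sketch of points 1 and 2: build a long kernel from a few damped sinusoids and apply it as a causal convolution via the FFT. This illustrates the general mechanism only; the actual CFDRA layer (with its R branches, chunking, and learned projections) is implemented in the repository.

import math
import torch

def damped_oscillator_kernel(freqs, decays, amps, kernel_len=2048):
    # k[t] = sum_m amps[m] * exp(-decays[m] * t) * cos(freqs[m] * t)
    t = torch.arange(kernel_len, dtype=torch.float32)
    modes = amps[:, None] * torch.exp(-decays[:, None] * t) * torch.cos(freqs[:, None] * t)
    return modes.sum(dim=0)                                   # (kernel_len,)

def fft_causal_conv(x, kernel):
    # Causal convolution in O(n log n): zero-pad to 2n so the FFT's circular wrap-around is discarded.
    n = x.shape[-1]
    fft_len = 2 * n
    X = torch.fft.rfft(x, n=fft_len)
    K = torch.fft.rfft(kernel, n=fft_len)
    return torch.fft.irfft(X * K, n=fft_len)[..., :n]

# Toy usage: 4 modes whose half-lives span 1 to 2048 tokens.
freqs  = torch.tensor([0.0, 0.1, 0.5, 1.0])
decays = math.log(2.0) / torch.tensor([1.0, 16.0, 256.0, 2048.0])
amps   = torch.ones(4)
x = torch.randn(2048)
y = fft_causal_conv(x, damped_oscillator_kernel(freqs, decays, amps))
print(y.shape)  # torch.Size([2048])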

Key Advantages

  • O(n log n) complexity for sequence length n (vs. O(n²) for attention)
  • Built-in multi-scale modeling: Different modes capture different time scales
  • No explicit positional embeddings needed: Position information emerges from decay structure

Citation

If you use this model, please cite:

@misc{sefer270m2025,
  title={Sefer-270M: A Hybrid CFDRA-Transformer Language Model},
  author={Fractal AGI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/fractal-agi/sefer-270m-base-60k}
}

License

Apache 2.0

Contact
