Model Card: Stentor-Big

Stentor-Big is a direct expansion of the original Stentor-30M model developed by Kai Izumoto (StentorLabs). The architecture was scaled up from 30M to 142M parameters by increasing the hidden size, number of layers, and intermediate dimensions while preserving the pre-trained weights where possible. This approach allows the model to retain the linguistic foundations learned by its smaller counterpart while gaining additional capacity through new randomly initialized layers.
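The exact expansion procedure is not documented in this card; a common approach is to copy the smaller model's weight matrices into the corresponding slices of the larger tensors and randomly initialize the remaining capacity. A minimal sketch of that idea on toy tensors (the function name and init scale are illustrative assumptions, not the actual StentorLabs recipe):

```python
import torch

def expand_linear(small: torch.Tensor, out_dim: int, in_dim: int,
                  std: float = 0.02) -> torch.Tensor:
    """Embed a small weight matrix into a larger, mostly random one.

    The top-left block keeps the pre-trained weights; the new rows and
    columns get a small random init, as described in the card.
    """
    big = torch.randn(out_dim, in_dim) * std          # new capacity
    big[: small.shape[0], : small.shape[1]] = small   # preserved weights
    return big

# Toy example: grow a 4x4 projection to 8x8 while keeping the original block.
small = torch.arange(16.0).reshape(4, 4)
big = expand_linear(small, 8, 8)
assert big.shape == (8, 8)
assert torch.equal(big[:4, :4], small)
```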

Model Description

Stentor-Big is a compact language model with 142 million parameters, built upon the Llama architecture. It is the result of a three-stage continual pre-training process designed to combine broad linguistic competence, narrative coherence, and structured academic style. The model is intended as a strong base for further fine‑tuning or direct use in educational text generation, creative writing assistance, and prototyping of small‑scale language applications.

Developed by: stas122
Model type: Causal language model (LlamaForCausalLM)
Language: English
Parameters: 142,639,104 (142.6M)
Context length: 512 tokens
License: Apache 2.0 (the original base model Stentor‑30M is also Apache 2.0)


Model Details

| Hyperparameter | Value |
|---|---|
| Hidden size | 512 |
| Intermediate size | 2048 |
| Number of layers | 30 |
| Number of attention heads | 8 |
| Head dimension | 64 |
| Vocabulary size | 32768 |
| Max position embeddings | 512 |
| Tie word embeddings | True |
| Activation function | SiLU (Swish) |

The architecture follows the standard LLaMA design, with pre‑RMSNorm, rotary positional embeddings (RoPE), and SwiGLU activation in the MLP.
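As a sanity check, the table's values reproduce the reported parameter count almost exactly (the residual ~1.5k difference likely comes from how norm parameters are counted; this arithmetic is an estimate, not an official breakdown):

```python
# Rough parameter count from the hyperparameter table (tied embeddings counted once).
vocab, hidden, inter, layers = 32768, 512, 2048, 30

embed = vocab * hidden            # input embedding, shared with the LM head
attn = 4 * hidden * hidden        # q, k, v, o projections
mlp = 3 * hidden * inter          # SwiGLU: gate, up, and down projections
norms = 2 * hidden                # two RMSNorms per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
print(f"{total:,}")  # within ~0.001% of the reported 142,639,104
```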


Training Data

The model was trained in three distinct stages, each using a different corpus to shape its capabilities.

| Stage | Dataset | Tokens | Purpose |
|---|---|---|---|
| 1 | FineWeb (educational subset) | 279M | Establish general linguistic knowledge, grammar, and basic facts. |
| 2 | Custom curated fanfiction corpus | 1.03B | Develop narrative flow, dialogue, and literary coherence. |
| 3 | Mixed educational corpus (FineWeb‑Edu + Sciphi Textbooks) | 1.02B | Enhance academic style, structured exposition, and scientific reasoning. |

Total tokens seen during training: ~2.33 billion, roughly 81% of the Chinchilla‑optimal token budget for a 142M‑parameter model (≈2.88B tokens).

All datasets were pre‑tokenized with a context window of 512 tokens, using the same tokenizer as Stentor‑30M (vocabulary size 32768). Examples are chunked as 511 tokens plus an end‑of‑sequence token.
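A chunking scheme consistent with that description can be sketched as follows (the function name and the handling of the final partial chunk are assumptions; the card only specifies 511 tokens plus EOS per example):

```python
def chunk_tokens(ids, eos_id, chunk_len=512):
    """Split a token stream into fixed-size examples: 511 tokens + one EOS."""
    body = chunk_len - 1  # 511 content tokens per example
    chunks = []
    for i in range(0, len(ids) - body + 1, body):
        chunks.append(ids[i : i + body] + [eos_id])
    return chunks

# Toy stream of 1022 token ids -> two 512-token training examples.
examples = chunk_tokens(list(range(1022)), eos_id=2)
assert len(examples) == 2
assert all(len(c) == 512 for c in examples)
assert examples[0][-1] == 2  # every example ends with EOS
```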


Training Procedure

Stage 1 (FineWeb)

  • Learning rate: 2e‑4
  • Effective batch size: 128
  • Steps: ~9,500
  • Hardware: 1× RTX 4090

Stage 2 (Fanfiction)

  • Learning rate: 2e‑4
  • Effective batch size: 128
  • Steps: ~6,500
  • Hardware: 2× RTX 4090 (DDP)

Stage 3 (Educational Mix)

  • Learning rate: 1.5e‑4 (reduced to preserve previously learned style)
  • Effective batch size: 192
  • Steps: ~7,171
  • Hardware: 2× RTX 4090 (DDP)

Common hyperparameters: AdamW optimizer, cosine learning rate schedule with warmup (5% of total steps), gradient clipping (1.0), weight decay (0.01), BF16 mixed precision, gradient checkpointing enabled.
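Those common hyperparameters can be wired up with plain PyTorch as sketched below (the `cosine_with_warmup` helper is a hand-rolled equivalent of `transformers.get_cosine_schedule_with_warmup`; the tiny `Linear` stands in for the actual model):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(step, total_steps, warmup_frac=0.05):
    """LR multiplier: linear warmup over 5% of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)  # stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lambda s: cosine_with_warmup(s, total_steps=9500))

# Per training step, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```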


Evaluation Results

The model was evaluated on a custom master test consisting of 100 prompts across 10 categories, designed to probe factual knowledge, reasoning, creativity, and domain adaptation. Each prompt is a sentence fragment that the model continues. Temperature and max new tokens were varied per category to obtain meaningful outputs.

| Category | Score | Observations |
|---|---|---|
| Science (biology, physics) | Excellent | Accurate terminology, structured explanations (e.g., photosynthesis, black holes). |
| Philosophy & abstract thought | Excellent | Coherent, essay‑like discussions of the meaning of life, consciousness, justice. |
| Interdisciplinary topics | Excellent | Synthesizes concepts from multiple domains; good use of examples. |
| Literature & arts | Moderate | Surface‑level knowledge, sometimes generic or repetitive. |
| Geography & facts | Moderate | Often produces plausible but imprecise statements; capitals and dates are weak. |
| Programming (Python) | Moderate | Recognises libraries and basic constructs, but generated code is often broken. |
| Mathematics | Poor | Frequent errors in elementary arithmetic and geometry; repetitive patterns. |

Full Evaluation Results

The complete set of 100 test prompts, along with detailed generation outputs for each category and temperature setting, is available in the repository:

📊 Evaluation Results — full generation log with per-prompt temperature and token limits.

Representative Generations

Good example – science
Prompt: Photosynthesis is the process by which plants
Generation:

convert sunlight, water, and carbon dioxide into glucose and oxygen. This process is essential for the survival of all life on Earth.

  1. The light‑dependent reactions of photosynthesis involve the conversion of light energy into chemical energy, which is then used to power the plant's growth and development.

Good example – philosophy
Prompt: The meaning of life is a question that
Generation:

has been debated throughout history. It is the question that has been debated for centuries, with the philosophers of Aristotle and Plato contributing to the field.

III. The Socratic method in philosophy is a method of questioning the very foundations of life. It is based on the belief that life is a form of divine truth, and that individuals must confront the unknown and make meaningful choices that shape their existence.

Weak example – factual error
Prompt: The capital of France is
Generation:

a complex and intricate system, with a rich history that spans thousands of years. The empire was a powerful force in the world of politics, and it was during this time that the empire began

Weak example – mathematical mistake
Prompt: The square root of 64 is
Generation:

In conclusion, the 1st Grade Mathematics: Learning Addition and Subtraction through Fun Math Games chapter aims to provide a comprehensive and engaging approach to teaching these fundamental mathematical concepts.

Comparison with Stentor‑30M

On the same science prompt, the improvement in coherence and factual accuracy is evident:

Stentor‑30M:

Photosynthesis is the process by which plants and animals develop. It is the process of growth and development of the organism. The process of growth is the process of growth.

Stentor‑Big:

Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. This process is essential for the survival of all life on Earth.


Capabilities and Limitations

Capabilities

  • Generate fluent, well‑structured paragraphs in academic, scientific, and narrative styles.
  • Produce coherent continuations for prompts requiring explanation, analysis, or creative expansion.
  • Serve as a strong foundation for further fine‑tuning on domain‑specific tasks (e.g., question answering, summarisation, story generation).
  • Operate efficiently on consumer hardware (e.g., RTX 4090) and can be quantized for edge deployment.
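The card does not specify a quantization recipe for edge deployment; as one generic illustration, PyTorch dynamic int8 quantization of linear layers is shown below, applied to a stand-in MLP block with Stentor-Big's dimensions rather than the actual checkpoint:

```python
import torch

# Stand-in for one Stentor-Big MLP block (hidden 512, intermediate 2048).
mlp = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.SiLU(),
    torch.nn.Linear(2048, 512),
)

# Quantize all Linear layers to int8 weights; activations stay in float.
qmlp = torch.ao.quantization.quantize_dynamic(
    mlp, {torch.nn.Linear}, dtype=torch.qint8
)

out = qmlp(torch.randn(1, 512))
assert out.shape == (1, 512)
```

Dynamic quantization runs on CPU and requires no calibration data, which makes it a convenient first pass for a model this small; other schemes (e.g., 4/8-bit loading via bitsandbytes) trade setup cost for better memory savings.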

Limitations

  • Factual accuracy is not guaranteed: the model often produces plausible‑sounding but incorrect facts, especially for dates, numbers, and proper names.
  • Weak at mathematics and programming: elementary arithmetic is frequently wrong, and generated code is rarely executable.
  • Limited context window (512 tokens): not suitable for tasks requiring long‑range dependencies.
  • Not instruction‑tuned: prompts are treated as continuations; the model does not reliably follow commands or engage in dialogue.
  • May exhibit repetitive or degenerate outputs on certain prompts (e.g., recursive definitions, simple primes).
  • Potential biases inherited from training data (web‑crawled and synthetic corpora) – use with appropriate caution.

Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "stas122/Stentor-Big"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=dtype,
    attn_implementation="sdpa",  # efficient scaled-dot-product attention
).to(device)

prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.5,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Ethical Considerations and Recommended Use

Stentor‑Big is a research prototype. It is not aligned with human preferences and may generate content that is factually incorrect, biased, or otherwise unsuitable for production environments. Users should validate all outputs, especially when used in educational or decision‑making contexts. The model should not be deployed in applications where harm could arise from erroneous or misleading text.

Recommended uses:

  • Educational tool for exploring language model behaviour.
  • Creative writing aid (with human oversight).
  • Starting point for fine‑tuning on specialised corpora.
  • Lightweight experimentation in resource‑constrained settings.

Citation

If you use this model in your work, please cite it as:

@misc{stentor-big,
  author = {stas122},
  title = {Stentor-Big: A 142M Parameter Language Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/stas122/Stentor-Big}}
}

Acknowledgements

I would like to express my deepest gratitude to Kai Izumoto (StentorLabs). His work on the original Stentor-30M model not only served as the foundation for this project but also sparked my initial interest in deep learning and ultimately inspired me to embark on this entire journey of language model research. This work stands on the shoulders of his contribution to the open‑source community.

  • Hugging Face for the transformers and datasets libraries.
  • The creators of FineWeb, FineWeb‑Edu, Cosmopedia v2, and Sciphi Textbooks.
  • The open‑source community for enabling accessible NLP research.
  • DeepSeek for insightful discussions and assistance with theoretical aspects of model architecture, training strategies, and evaluation methodologies.
  • Immers.cloud for providing reliable GPU infrastructure (NVIDIA RTX 4090) that made the extensive training experiments possible.
  • MLP fan community for their creations