EasyGPT-303M (Trained on OpenWebText)
A 303M parameter GPT-2 style model trained from scratch on the OpenWebText dataset.
It reaches a validation loss of 2.887, comparable to GPT-2 Medium.
1. Model Introduction
This is a decoder-only Transformer language model trained with Andrej Karpathy's nanoGPT framework, extended with modern components such as RMSNorm, Rotary Positional Embeddings (RoPE), SwiGLU feed-forward layers, and Grouped-Query Attention (GQA). It was trained from scratch on OpenWebText, an open-source reproduction of the WebText dataset used to train OpenAI's GPT-2.
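For illustration, below is a minimal PyTorch sketch of two of these components, RMSNorm and a SwiGLU feed-forward block. The class names and hyperparameters are placeholders and may not match the actual EasyGPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias term."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the RMS of the last dimension, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```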
Key Specifications
| Attribute | Value |
|---|---|
| Parameters | 303 Million (comparable to GPT-2 Medium) |
| Architecture | GPT-2 style decoder (1024-token context window; RoPE or standard positional embeddings) |
| Dataset | OpenWebText (~17GB cleaned) |
| Tokenizer | GPT-2 BPE (via tiktoken) |
| Training Steps | 15,000 steps |
| Batch Size | ~0.5M tokens per step (Gradient Accumulation) |
| Total Tokens | ~7.3 Billion tokens |
| Final Val Loss | 2.887 (PPL 18.0) |
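The Total Tokens figure follows from the step count and the effective batch size. As a rough check, a hypothetical configuration with 1024-token sequences, a micro-batch of 12, and 40 gradient-accumulation steps reproduces the Batch Size and Total Tokens rows above (the actual micro-batch and accumulation settings may differ):

```python
# Hypothetical batch configuration; actual EasyGPT training values may differ.
block_size = 1024          # context window (tokens per sequence)
micro_batch = 12           # sequences per forward/backward pass
grad_accum_steps = 40      # accumulation steps per optimizer update
train_steps = 15_000

tokens_per_step = block_size * micro_batch * grad_accum_steps
total_tokens = tokens_per_step * train_steps

print(f"{tokens_per_step / 1e6:.2f}M tokens per step")  # ~0.49M
print(f"{total_tokens / 1e9:.2f}B tokens total")         # ~7.37B
```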
Training Details
- Hardware: Single NVIDIA RTX 3090 (24GB VRAM)
- Optimizer: AdamW
- Learning Rate: Peak 3.2e-4 with cosine decay and an 800-step warmup (see the schedule sketch below)
- Precision: BF16 (bfloat16) mixed precision
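The learning-rate schedule can be expressed as a nanoGPT-style warmup-plus-cosine function. The minimum learning rate below is an assumed value, not taken from the training configuration:

```python
import math

max_lr = 3.2e-4        # peak learning rate (from the table above)
min_lr = max_lr / 10   # floor of the cosine decay (assumed value)
warmup_steps = 800
max_steps = 15_000

def get_lr(step):
    # Linear warmup to the peak LR over the first 800 steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Hold at the minimum LR once training exceeds max_steps.
    if step > max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr in between.
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)
```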
Capabilities
As a Base Model (not instruction-tuned), it excels at:
- Text Completion: Coherent story generation and article writing.
- In-Context Learning: Can perform tasks such as sentiment analysis when given a few examples (see the prompt sketch below).
- Syntax & Structure: Produces grammatically correct English with high consistency.
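Because this is a base model, few-shot tasks are phrased as plain text completion. A hypothetical sentiment prompt might look like the following, with the model expected to continue with the next label:

```python
# Few-shot sentiment prompt: the model should complete the final line with "Positive".
prompt = (
    "Review: The plot was dull and the acting was worse.\n"
    "Sentiment: Negative\n"
    "Review: A heartfelt film with stunning cinematography.\n"
    "Sentiment: Positive\n"
    "Review: I couldn't stop smiling the whole way through.\n"
    "Sentiment:"
)
```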
2. How to Use
Since this model is based on nanoGPT and uses a custom checkpoint format (.pt), you need the original model definition to load it. See https://github.com/ssyzhang/EasyGPT for the model code.
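A minimal loading and sampling sketch is shown below. It assumes a nanoGPT-style checkpoint containing "model" and "model_args" keys and a GPT/GPTConfig model definition with a generate method, as in upstream nanoGPT; the module and key names in the EasyGPT repository may differ.

```python
import torch
import tiktoken
from model import GPT, GPTConfig  # model definition from the EasyGPT repo (assumed module/class names)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed nanoGPT-style checkpoint layout: {"model": state_dict, "model_args": {...}, ...}
ckpt = torch.load("ckpt.pt", map_location=device)
model = GPT(GPTConfig(**ckpt["model_args"]))

# Strip the "_orig_mod." prefix that torch.compile adds to state dict keys, if present.
state_dict = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model"].items()}
model.load_state_dict(state_dict)
model.to(device).eval()

# GPT-2 BPE tokenizer via tiktoken, matching the training tokenizer.
enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode("The history of the Roman Empire")], device=device)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))
```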
3. License
This project is licensed under the MIT License. See the LICENSE file for the full license text.