---
license: mit
datasets:
- shivendrra/consolidated-datasets
language:
- en
metrics:
- perplexity
tags:
- Basemodel
- text-generation
- nlp
- custom_code
- causal-lm
library_name: transformers
---

# TinyWay-1.2.0

**TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code). The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training.

> ⚡ Trained end-to-end using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets.

---

## Model Overview

| Property | Value |
| ----------------- | ------------------------------------ |
| Model type | Decoder-only Transformer (GPT-style) |
| Parameters | **~109.6M** |
| Layers | 10 |
| Hidden size | 768 |
| Attention heads | 12 |
| Context length | 256 tokens |
| Activation | GELU |
| Dropout | 0.1 |
| Precision | fp16 / bf16 |
| Weight tying | Token embedding tied with LM head |
| Position encoding | Learned absolute embeddings |

---

## Training Details

### Dataset

The model was trained on **streaming data** covering:

* 🌍 Web text
* 📚 Stories
* 💻 Code

via the HuggingFace dataset:

```
shivendrra/consolidated-datasets
```

Streaming avoids large local storage and allows continuous sampling directly from HuggingFace.

---

### Tokenization

* Tokenizer: **GPT2TokenizerFast**
* Vocabulary size: **50,257**
* Special tokens:
  * `bos_token_id = eos_token_id = pad_token_id = 50256`

---

### Training Configuration

| Setting | Value |
| --------------------- | ---------------------------- |
| Sequence length | 256 |
| Effective batch size | 64 sequences |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay + warmup) |
| Betas | (0.9, 0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Mixed precision | AMP (fp16 / bf16) |
| Gradient accumulation | Yes |
| Training steps | ~60k |
| Total tokens | ~1B (approx) |

Final training loss ≈ **3.0**, corresponding to a final perplexity of ≈ **20** (perplexity = exp(loss)). A sketch of how to measure perplexity yourself appears after the Limitations section.

---

## Usage

### Load with Transformers (Custom Code Required)

This repository uses a custom model definition (`modeling_tinyway.py`), so loading through `AutoModelForCausalLM` requires `trust_remote_code=True` (or having the file available locally in your environment).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.2.0",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

---

### Text Generation Example

```python
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Example Generations

The model demonstrates:

* ✅ Coherent sentence structure
* ✅ Narrative flow in stories
* ✅ Reasonable grammar and punctuation
* ⚠️ Occasional repetition and topic drift (expected at this scale)

This is a research-grade small LLM and is not instruction-aligned by default.

---

## Limitations

* ❌ Not instruction-tuned
* ❌ Limited reasoning depth compared to large LLMs
* ❌ Context length limited to 256 tokens
* ⚠️ May hallucinate or generate inconsistent facts
* ⚠️ Training data may contain noise from web sources

Use responsibly.
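
---

## Evaluating Perplexity (Sketch)

The reported perplexity (≈20) follows directly from the cross-entropy training loss via `perplexity = exp(loss)`. The snippet below is a minimal sketch of how you could check perplexity on your own held-out text with this checkpoint. It assumes the custom model class follows the standard Transformers causal-LM API (returning a `.loss` when `labels` are passed); the evaluation texts here are placeholders, not the corpus used during training.

```python
import math
import torch

# Assumes `model` and `tokenizer` were loaded as shown in the Usage section.
eval_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "def add(a, b):\n    return a + b",
]

model.eval()
total_loss, num_texts = 0.0, 0

with torch.no_grad():
    for text in eval_texts:
        # Truncate to the model's 256-token context window.
        enc = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=256
        ).to(model.device)
        # With labels == input_ids, a standard causal-LM head returns the mean
        # cross-entropy loss over the sequence (assumption for this custom model).
        out = model(**enc, labels=enc["input_ids"])
        total_loss += out.loss.item()
        num_texts += 1

mean_loss = total_loss / num_texts
print(f"mean loss: {mean_loss:.3f}  perplexity: {math.exp(mean_loss):.2f}")
```

Note that this averages per-text losses rather than weighting by token count, which is fine for a quick sanity check but not a strict corpus-level perplexity.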
---

## Intended Use

* Research experiments
* Educational purposes
* Model scaling studies
* Training pipeline benchmarking
* Custom fine-tuning experiments

Not recommended for production or safety-critical applications.

---

## Reproducibility

The model was trained using:

* Custom PyTorch training loop
* Streaming datasets via HuggingFace
* Mixed precision training
* Gradient accumulation
* Periodic checkpointing
* Full monitoring (loss, perplexity, gradient norm, attention entropy)

A minimal sketch of how these pieces fit together is included in the appendix at the end of this card. If you’d like the full training code or configs, feel free to reach out.

---

## License

This model follows the license of the underlying datasets and tokenizer. Please ensure compliance before commercial usage.

---

## Acknowledgements

* HuggingFace 🤗
* PyTorch
* GPT-2 tokenizer
* Open research community
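
---

## Appendix: Streaming + Mixed-Precision Training Sketch

The Reproducibility section lists the main ingredients of the pipeline: streaming HuggingFace datasets, AMP mixed precision, gradient accumulation, and gradient clipping at 1.0. The sketch below shows how those pieces typically fit together under the hyperparameters stated above. It is **not** the actual training script: the accumulation step count, the `"text"` field name on streamed records, and the use of the released checkpoint (rather than a from-scratch initialization) are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.2.0", trust_remote_code=True
).to(device)

# Stream the corpus instead of downloading it locally.
stream = load_dataset("shivendrra/consolidated-datasets", split="train", streaming=True)

# Optimizer settings from the Training Configuration table.
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 8  # assumed micro-batch accumulation factor

model.train()
optimizer.zero_grad()
for step, example in enumerate(stream):
    # Assumes each streamed record exposes a "text" field.
    enc = tokenizer(
        example["text"], return_tensors="pt", truncation=True, max_length=256
    ).to(device)

    with torch.autocast(device_type=device, dtype=amp_dtype):
        # Assumes the custom model returns a loss when labels are provided.
        loss = model(**enc, labels=enc["input_ids"]).loss / accum_steps

    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

    if step >= 100:  # short demo run; the real training ran ~60k steps
        break
```

A learning-rate scheduler (cosine decay with warmup), periodic checkpointing, and the monitoring hooks mentioned above would sit around this loop in the full pipeline.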