---
license: apache-2.0
datasets:
  - BUT-FIT/BUT-LCC
language:
  - cs
---

# Introduction

CSTinyLlama-1.2B is a Czech language model continuously pretrained from the English TinyLlama-2.5T model on 168B training tokens. The model was pretrained on the ~67B-token Large Czech Collection using a Czech tokenizer obtained with our vocabulary swap method. Training was done on the Karolina cluster.
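The vocabulary swap itself is described with our other models; as a rough, hypothetical sketch of the underlying idea (not the exact procedure used for this model), embedding rows for tokens shared by the old and new tokenizers can be copied from the pretrained model, while rows for new tokens are initialized randomly. All names below are illustrative:

```python
import torch

def init_swapped_embeddings(old_vocab: dict[str, int],
                            new_vocab: dict[str, int],
                            old_emb: torch.Tensor) -> torch.Tensor:
    """Build an embedding matrix for the new (Czech) vocabulary, reusing
    pretrained rows for tokens that also exist in the old vocabulary."""
    hidden = old_emb.shape[1]
    new_emb = torch.normal(0.0, 0.02, size=(len(new_vocab), hidden))
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]  # keep the pretrained embedding
    return new_emb
```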

# BUT Model Roster

# Loss

Below we:

- (i) demonstrate the convergence speed of the released model (`TINYLLAMA1.2B_cztokenizer64k_align1.7k_tllama1.1B_C2048_lr1e-04_150k`, at the 160k step);
- (ii) justify the contribution of our vocabulary swap method. In this run we swapped 1.7K tokens, as for our other models (see Czech-GPT-2-XL-133k), and compare the swapped model with a model trained from scratch using the same hyperparameters (`scratch_cztokenizer64k_tllama1.1B_C2048_lr1e-04_150k`).

## Train Cross-Entropy

## Test Perplexity
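As a minimal, hedged sketch of how a test perplexity of this kind can be computed with `transformers` (the evaluation pipeline behind the plots may differ; the sample text below is a stand-in for a held-out test set):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'BUT-FIT/CSTinyLlama-1.2B'
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).eval()

text = 'Praha je hlavní město České republiky.'  # "Prague is the capital of the Czech Republic."
enc = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    # With labels set, the model returns the mean token-level cross-entropy.
    loss = model(**enc, labels=enc['input_ids']).loss
print(f'perplexity: {math.exp(loss.item()):.2f}')
```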

# Training parameters

Parameters not mentioned below are the same as for TinyLlama-2.5T.

| Name | Value | Note |
|------|-------|------|
| dataset_type | Concat | Input sequences were concatenated up to `$max_seq_len` and separated by the EOS token (see the packing sketch below). |
| tokenizer_size | 64k | |
| max_seq_len | 2048 | |
| batch_size | 512 | |
| learning_rate | 1.0e-4 | |
| optimizer | LionW | |
| optimizer_betas | 0.9/0.95 | |
| optimizer_weight_decay | 0 | |
| gradient_clipping_max_norm | 1.0 | |
| attn_impl | flash2 | |
| fsdp | SHARD_GRAD_OP | Optimized for A100 40GB GPUs. |
| precision | bf16 | |
| scheduler | cosine | |
| scheduler_warmup | 100 steps | |
| scheduler_steps | 200,000 | |
| scheduler_alpha | 0.1 | The LR at the last step is 0.1 × the base LR (see the schedule sketch below). |
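The `Concat` dataset type and the cosine schedule above can be illustrated with short, hypothetical sketches. First, a minimal packing routine, assuming documents are already tokenized (the function name and types are illustrative, not the actual training code):

```python
from typing import Iterable, Iterator

def concat_and_chunk(doc_token_ids: Iterable[list[int]],
                     eos_id: int,
                     max_seq_len: int = 2048) -> Iterator[list[int]]:
    """Concatenate tokenized documents, separated by EOS, and emit
    fixed-length chunks of max_seq_len tokens (the 'Concat' idea above)."""
    buffer: list[int] = []
    for ids in doc_token_ids:
        buffer.extend(ids)
        buffer.append(eos_id)            # EOS divides concatenated documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]   # one training sequence
            buffer = buffer[max_seq_len:]
```

Second, a sketch of the warmup plus cosine schedule implied by the table, assuming the common formulation in which the decay ends at `alpha` × base LR (the exact scheduler implementation may differ):

```python
import math

def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 100,
          total: int = 200_000, alpha: float = 0.1) -> float:
    """Linear warmup, then cosine decay to alpha * base_lr at the last step."""
    if step < warmup:
        return base_lr * step / warmup
    tau = (step - warmup) / (total - warmup)
    return base_lr * (alpha + (1 - alpha) * 0.5 * (1 + math.cos(math.pi * tau)))
```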

# Usage

```python
import torch
import transformers
from transformers import pipeline

name = 'BUT-FIT/CSTinyLlama-1.2B'

# Load the config, weights, and tokenizer; trust_remote_code is required
# because the model ships custom code.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Run generation under bfloat16 autocast, matching the training precision.
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(
        pipe('Nejznámějším českým spisovatelem ',  # "The best-known Czech writer is "
             max_new_tokens=100,
             top_p=0.95,
             repetition_penalty=1.0,
             do_sample=True,
             use_cache=True))
```
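If GPU memory is a concern, the weights can also be loaded directly in bfloat16 via the standard `torch_dtype` argument of `from_pretrained` (an alternative to the autocast above; this variant is not from the original card):

```python
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # load weights in bf16 instead of fp32
    trust_remote_code=True,
)
```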