---
language:
- pl
license: gpl-3.0
tags:
- text-classification
- emotion-classification
- sentiment-analysis
- polish
- multi-label-classification
- twitter
datasets:
- yazoniak/TwitterEmo-PL-Refined
base_model: PKOBP/polish-roberta-8k
metrics:
- f1
- accuracy
pipeline_tag: text-classification
model-index:
  - name: twitter-emotion-pl-classifier
    results:
      - task:
          type: text-classification
          name: Multi-Label Emotion Classification
        dataset:
          type: yazoniak/TwitterEmo-PL-Refined
          name: TwitterEmo-PL-Refined
          split: validation
        metrics:
          - type: f1
            value: 0.8500
            name: F1 Macro
            verified: true
            args:
              average: macro
          - type: f1
            value: 0.8900
            name: F1 Micro
            verified: true
            args:
              average: micro
          - type: f1
            value: 0.8895
            name: F1 Weighted
            verified: true
            args:
              average: weighted
          - type: accuracy
            value: 0.5125
            name: Exact Match Accuracy
            verified: true
          - type: accuracy
            value: 0.8900
            name: Subset Accuracy
            verified: true
---

# Polish Twitter Emotion Classifier (RoBERTa-8k)

## Model Description

This model is a fine-tuned version of [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k) for multi-label emotion and sentiment classification in Polish. It was trained on the [TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined) dataset.

The model predicts 8 emotion and sentiment labels simultaneously:

- **Emotions**: `radość` (joy), `wstręt` (disgust), `gniew` (anger), `przeczuwanie` (anticipation)
- **Sentiment**: `pozytywny` (positive), `negatywny` (negative), `neutralny` (neutral)
- **Special**: `sarkazm` (sarcasm)

### Model Details

- **Model type**: RoBERTa (Polish)
- **Language**: Polish
- **Base model**: [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)
- **Task**: Multi-label text classification (emotion & sentiment)
- **Training data**: 35,921 Polish tweets from TwitterEmo-PL-Refined
- **License**: GPL-3.0
- **Context window**: 8,192 tokens (max; for tweet-length texts you can use a smaller tokenizer `max_length`, e.g., 256-1024)

## Intended Use

### Primary Use Cases

- **Social media monitoring**: Analyze emotions and sentiment in Polish tweets and social media posts
- **Customer feedback analysis**: Understand emotional responses in Polish customer reviews
- **Research**: Study emotion expression patterns in Polish language social media
- **Multi-label sentiment analysis**: Capture nuanced emotional states beyond binary positive/negative

### Out-of-Scope Use

- This model is specifically trained on Polish Twitter data and may not generalize well to:
  - Formal Polish text (news articles, academic writing)
  - Other languages
  - Very long documents (optimal for tweet-length texts)

## Performance

### Overall Metrics

| Metric | Score |
|--------|-------|
| **F1 Macro** | **0.8500** |
| **F1 Micro** | **0.8900** |
| **F1 Weighted** | **0.8895** |
| **Exact Match Accuracy** | **0.5125** |
| **Subset Accuracy** | **0.8900** |
| **Validation Loss** | **0.2761** |
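
To make the difference between these averages concrete, here is a minimal sketch in plain Python that computes macro F1, micro F1, and exact-match accuracy for multi-label predictions. The toy label matrices are invented for illustration and are not model outputs, and the evaluation code actually used for this model is not shown in this card.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 rows, one row per sample."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    # Exact match: a sample counts only if every label is predicted correctly.
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"f1_macro": macro, "f1_micro": micro, "exact_match": exact}


# Toy data (hypothetical): 4 samples, 3 labels.
Y_TRUE = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
METRICS = multilabel_metrics(Y_TRUE, Y_PRED)
print(METRICS)
```

Exact-match accuracy counts a sample as correct only when all of its labels match, which is why it can sit well below the F1 scores, as it does in the table above.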

### Per-Label Performance

| Label | F1 Score | Label Frequency |
|-------|----------|----------|
| **negatywny** (negative) | **0.8553** | 42.4% |
| **neutralny** (neutral) | **0.8172** | 41.0% |
| **pozytywny** (positive) | **0.7814** | 17.4% |
| **gniew** (anger) | **0.7693** | 25.8% |
| **radość** (joy) | **0.7476** | 11.9% |
| **wstręt** (disgust) | **0.7337** | 20.4% |
| **przeczuwanie** (anticipation) | **0.7220** | 21.6% |
| **sarkazm** (sarcasm) | **0.5337** | 16.0% |

## Training Details

### Training Data

The model was trained on [TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined), which contains:

- **Total samples**: 35,921 Polish tweets
- **Label distribution**:
  - `negatywny`: 15,231 samples (42.4%)
  - `neutralny`: 14,720 samples (41.0%)
  - `gniew`: 9,252 samples (25.8%)
  - `przeczuwanie`: 7,776 samples (21.6%)
  - `wstręt`: 7,337 samples (20.4%)
  - `pozytywny`: 6,248 samples (17.4%)
  - `sarkazm`: 5,756 samples (16.0%)
  - `radość`: 4,283 samples (11.9%)
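
Because the task is multi-label, the percentages above add up to more than 100%. The short sketch below recomputes the shares from the listed counts and derives the average number of labels per tweet; it uses only the numbers given in this section.

```python
TOTAL_TWEETS = 35_921

# Counts copied from the label distribution above.
LABEL_COUNTS = {
    "negatywny": 15_231,
    "neutralny": 14_720,
    "gniew": 9_252,
    "przeczuwanie": 7_776,
    "wstręt": 7_337,
    "pozytywny": 6_248,
    "sarkazm": 5_756,
    "radość": 4_283,
}

# Share of tweets carrying each label, in percent, rounded to one decimal.
SHARES = {label: round(100 * n / TOTAL_TWEETS, 1) for label, n in LABEL_COUNTS.items()}

# Total label assignments divided by the number of tweets.
AVG_LABELS_PER_TWEET = sum(LABEL_COUNTS.values()) / TOTAL_TWEETS

for label, share in SHARES.items():
    print(f"{label:<13} {share:5.1f}%")
print(f"average labels per tweet: {AVG_LABELS_PER_TWEET:.2f}")
```

On these counts a tweet carries just under two labels on average, which is worth keeping in mind when reading the exact-match accuracy.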

### Training Configuration

```text
Model: PKOBP/polish-roberta-8k
Training samples: 28,737 (80%)
Validation samples: 7,184 (20%)

Hyperparameters:
- Learning rate: 1e-5
- Batch size: 32 (train), 32 (eval)
- Epochs: 4
- Weight decay: 0.03
- Warmup ratio: 0.1
- Dropout rate: 0.2
- Max gradient norm: 1.0
- Optimizer: AdamW
- LR scheduler: Cosine with warmup
- Early stopping patience: 3
- Mixed precision: BF16

Training strategy:
- Save strategy: Every 200 steps
- Evaluation strategy: Every 200 steps
- Best model selection: F1 Macro
- Total training steps: 3,600
- Best checkpoint: 3,400
```
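
The shape of the learning-rate curve implied by "Cosine with warmup" can be sketched as below. This is an illustration under stated assumptions: linear warmup from zero over the first 10% of the 3,600 reported steps, followed by a single half-cosine decay to zero. The exact scheduler implementation used in training is not stated in this card, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 1e-5                            # "Learning rate" from the configuration
TOTAL_STEPS = 3_600                       # "Total training steps"
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)   # "Warmup ratio: 0.1" -> 360 steps


def lr_at(step, peak_lr=PEAK_LR, total=TOTAL_STEPS, warmup=WARMUP_STEPS):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 180, 360, 1_980, 3_400, 3_600):
    print(f"step {step:>5}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate is already close to zero at step 3,400, the best checkpoint reported above.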

### Training Process

Training was conducted on a single NVIDIA RTX 3090 GPU using a stratified 80/20 train-validation split, with the following progression:

![Training Progress](training_plots.png)

## Calibration

The model's probabilities and label decisions can be improved after training with **temperature scaling** and **optimized thresholds**. The calibration analysis is summarized below.

### Temperature Scaling Results

Per-label temperature scaling reduces the Expected Calibration Error (ECE) for four of the eight labels and leaves the others roughly unchanged:

| Label | Temperature | ECE Before | ECE After | Improvement |
|-------|------------|------------|-----------|-------------|
| `radość` | 1.066 | 0.0163 | 0.0166 | -1.8% |
| `wstręt` | 1.117 | 0.0211 | 0.0152 | **+27.9%** |
| `gniew` | 1.186 | 0.0308 | 0.0194 | **+37.0%** |
| `przeczuwanie` | 1.102 | 0.0228 | 0.0237 | -3.9% |
| `pozytywny` | 1.181 | 0.0280 | 0.0293 | -4.6% |
| `negatywny` | 1.437 | 0.0594 | 0.0345 | **+41.9%** |
| `neutralny` | 1.472 | 0.0696 | 0.0390 | **+44.0%** |
| `sarkazm` | 1.078 | 0.0202 | 0.0202 | 0.0% |

**Key findings:**

- `neutralny`, `negatywny`, and `gniew` benefit most from temperature scaling
- Some labels (`radość`, `przeczuwanie`, `pozytywny`) show minor degradation
- Overall, ECE drops by 28-44% for four labels (`wstręt`, `gniew`, `negatywny`, `neutralny`), while the other four change by less than 0.002 in absolute terms
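
For readers unfamiliar with ECE, the sketch below computes it for a single binary label. It assumes 10 equal-width probability bins and compares the mean predicted probability in each bin with the observed positive rate; the binning scheme behind the numbers above is not stated in this card, and the toy probabilities are invented for illustration.

```python
def expected_calibration_error(probs, targets, n_bins=10):
    """ECE for one label: sum over bins of bin_weight * |positive_rate - mean_prob|."""
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(probs, targets):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, t))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(t for _, t in members) / len(members)
        ece += len(members) / len(probs) * abs(positive_rate - mean_prob)
    return ece


# Toy example (hypothetical): the same 8 targets scored twice.
TARGETS = [1, 1, 1, 0, 0, 0, 0, 1]
OVERCONFIDENT = [0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05]
SOFTENED = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

ECE_OVERCONFIDENT = expected_calibration_error(OVERCONFIDENT, TARGETS)
ECE_SOFTENED = expected_calibration_error(SOFTENED, TARGETS)
print(ECE_OVERCONFIDENT, ECE_SOFTENED)
```

In the toy data the overconfident scores are right only 75% of the time, so pulling them toward 0.5 removes the gap. A temperature above 1 has the same softening effect on sigmoid outputs.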

### Optimized Decision Thresholds

Per-label F1-optimized thresholds (vs. default 0.5). The Improvement column is the absolute F1 difference, expressed in percentage points:

| Label | Optimal Threshold | F1 @ Optimal | F1 @ 0.5 | Improvement |
|-------|------------------|--------------|----------|-------------|
| `neutralny` | **0.330** | **0.8211** | 0.8110 | **+1.00%** |
| `sarkazm` | **0.330** | **0.5766** | 0.5256 | **+5.10%** |
| `przeczuwanie` | 0.410 | 0.7276 | 0.7187 | +0.89% |
| `gniew` | 0.440 | 0.7692 | 0.7676 | +0.16% |
| `negatywny` | 0.450 | 0.8516 | 0.8511 | +0.05% |
| `wstręt` | 0.460 | 0.7477 | 0.7464 | +0.13% |
| `pozytywny` | 0.510 | 0.7864 | 0.7859 | +0.04% |
| `radość` | 0.560 | 0.7572 | 0.7558 | +0.14% |

**Key findings:**

- `sarkazm` shows the largest improvement (+5.10%) with a lower threshold (0.33)
- `neutralny` also benefits significantly (+1.00%) from a lower threshold (0.33)
- Most labels perform optimally near the default 0.5 threshold
- Total improvement with optimized thresholds: **~0.5-1.0 percentage points of F1 Macro**
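
A per-label threshold of this kind can be found with a simple grid search, sketched below. The grid (0.05 to 0.95 in steps of 0.01), the tie-break toward 0.5, and the toy scores are assumptions made for illustration; the search procedure used to produce the table above is not described in this card.

```python
def f1_at_threshold(probs, targets, threshold):
    """F1 for one label when predicting positive if prob > threshold."""
    tp = sum(1 for p, t in zip(probs, targets) if p > threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, targets) if p > threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, targets) if p <= threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, targets, grid=None):
    """Return the grid threshold with the highest F1 (ties: closest to 0.5)."""
    grid = grid or [i / 100 for i in range(5, 96)]
    return max(grid, key=lambda th: (f1_at_threshold(probs, targets, th), -abs(th - 0.5)))


# Toy validation scores (hypothetical) for a label the model under-predicts.
PROBS = [0.90, 0.80, 0.45, 0.42, 0.33, 0.20, 0.10]
TARGETS = [1, 1, 1, 1, 0, 0, 0]

BEST = best_threshold(PROBS, TARGETS)
F1_DEFAULT = f1_at_threshold(PROBS, TARGETS, 0.5)
F1_BEST = f1_at_threshold(PROBS, TARGETS, BEST)
print(BEST, F1_DEFAULT, F1_BEST)
```

Thresholds chosen this way should be tuned on held-out data and then kept fixed, since tuning and reporting on the same split tends to overstate the gain.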

### Calibration Files

The model repository includes:

- **Model weights**: `model.safetensors` - Use with default threshold (0.5)
- **Calibration artifacts**: `calibration_artifacts.json` - Contains temperature parameters and optimal thresholds

![Reliability diagrams](calibration_reliability_diagrams.png)

**Recommendation**: For production use, apply both temperature scaling and optimized thresholds for best performance.

## Model Files

This repository contains:

- **Model weights**: `model.safetensors` - Fine-tuned RoBERTa model
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json` - Polish RoBERTa tokenizer
- **Configuration**: `config.json` - Model configuration
- **Calibration**: `calibration_artifacts.json` - Temperature scaling parameters and optimal thresholds
- **Inference scripts**:
  - `predict.py` - Basic inference (threshold: 0.5)
  - `predict_calibrated.py` - Calibrated inference (recommended)
- **Training artifacts**: `training_plots.png`, `calibration_reliability_diagrams.png`
- **Requirements**: `requirements.txt` - Python dependencies
- **License**: `LICENSE` - Full GPL-3.0 license text

### Installation

```bash
pip install -r requirements.txt
```

Or install dependencies manually:

```bash
pip install transformers torch numpy
```

## Usage

### Important: Text Preprocessing

**The model expects @mentions to be anonymized**, as they were during training. Both inference scripts automatically replace all `@username` mentions with `@anonymized_account` to match the training data distribution.
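
The anonymization step can be reproduced with a single regular expression, as in the sketch below. The pattern is the one used in the Python API example in this card; whether the bundled scripts use exactly the same expression is an assumption, and `anonymize_mentions` is a hypothetical helper name.

```python
import re

# Matches "@" followed by one or more word characters (letters, digits, underscore).
MENTION_PATTERN = re.compile(r"@\w+")


def anonymize_mentions(text):
    """Replace every @mention with the placeholder used in the training data."""
    return MENTION_PATTERN.sub("@anonymized_account", text)


EXAMPLES = [
    "@jan_kowalski To jest wspaniały dzień!",
    "@a @b hej",
    "Napisz na kontakt@example.com",  # email-like text is also matched
]
for example in EXAMPLES:
    print(anonymize_mentions(example))
```

Two properties are worth knowing: applying the function twice gives the same result as applying it once, and this simple pattern also rewrites the part of an email address after the `@`, so inputs containing emails may need separate handling.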

### Quick Start (Basic Inference)

Use the `predict.py` script for basic inference with the default threshold (0.5):

```bash
# From Hugging Face (default) - mentions are automatically anonymized
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"

# Example with mentions
python predict.py "@zgp_intervillage Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"
# Preprocessed internally: "@anonymized_account Uwielbiam czekać..."

# From local model
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./

# With custom threshold
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./ --threshold 0.3
```

**Example Output:**

```
Loading model from: yazoniak/twitter-emotion-pl-classifier

Input text: Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp

Assigned Labels:
----------------------------------------
  radość
  pozytywny
  sarkazm

All Labels (with probabilities):
----------------------------------------
✓ radość         : 0.9574
  wstręt         : 0.0566
  gniew          : 0.0516
  przeczuwanie   : 0.0347
✓ pozytywny      : 0.9782
  negatywny      : 0.0602
  neutralny      : 0.0336
✓ sarkazm        : 0.5404
```

### With Calibration

Use the `predict_calibrated.py` script for calibrated inference with temperature scaling and optimized thresholds:

```bash
# From Hugging Face with calibration (requires calibration_artifacts.json)
python predict_calibrated.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"
```

### Python API Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

def preprocess_text(text):
    """Preprocess text to match training data format."""
    # Anonymize @mentions (IMPORTANT for best performance)
    text = re.sub(r'@\w+', '@anonymized_account', text)
    return text

# Load model
model_name = "yazoniak/twitter-emotion-pl-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Get labels from model config
labels = [model.config.id2label[i] for i in range(model.config.num_labels)]

# Prepare input with preprocessing
text = "@jan_kowalski To jest wspaniały dzień!"
preprocessed_text = preprocess_text(text)  # "@anonymized_account To jest wspaniały dzień!"
inputs = tokenizer(preprocessed_text, return_tensors="pt", truncation=True, max_length=8192)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get probabilities
probabilities = torch.sigmoid(logits).squeeze().numpy()

# Apply threshold
threshold = 0.5
predictions = {
    label: float(prob) 
    for label, prob in zip(labels, probabilities) 
    if prob > threshold
}

print(predictions)
# Output: {'radość': 0.8734, 'pozytywny': 0.9156}
```

### Interpretation

The model outputs logits for each of the 8 labels. To get predictions:

1. **Without calibration**: Apply sigmoid to the logits, then threshold each probability at 0.5
1. **With calibration**:
   - Divide each label's logit by that label's temperature
   - Apply sigmoid to the scaled logits
   - Compare each probability against its per-label optimized threshold
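
The calibrated path can be sketched in plain Python as follows. The temperatures and thresholds are copied from the tables in the Calibration section; the example logits are invented, `calibrated_predict` is a hypothetical helper, and applying the thresholds to the temperature-scaled probabilities is an assumption, since this card does not say which probabilities the thresholds were tuned on. The layout of `calibration_artifacts.json` is not documented here, so the values are hard-coded rather than loaded.

```python
import math

# Per-label temperatures (from the Temperature Scaling Results table).
TEMPERATURES = {
    "radość": 1.066, "wstręt": 1.117, "gniew": 1.186, "przeczuwanie": 1.102,
    "pozytywny": 1.181, "negatywny": 1.437, "neutralny": 1.472, "sarkazm": 1.078,
}
# Per-label thresholds (from the Optimized Decision Thresholds table).
THRESHOLDS = {
    "radość": 0.560, "wstręt": 0.460, "gniew": 0.440, "przeczuwanie": 0.410,
    "pozytywny": 0.510, "negatywny": 0.450, "neutralny": 0.330, "sarkazm": 0.330,
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibrated_predict(logits_by_label):
    """Map each label to (calibrated probability, assigned flag)."""
    result = {}
    for label, logit in logits_by_label.items():
        prob = sigmoid(logit / TEMPERATURES[label])  # scale the logit, then sigmoid
        result[label] = (prob, prob > THRESHOLDS[label])
    return result


# Hypothetical logits for three labels of one tweet.
RESULT = calibrated_predict({"pozytywny": 3.0, "neutralny": -2.0, "sarkazm": -0.5})
for label, (prob, assigned) in RESULT.items():
    print(f"{label:<10} {prob:.4f} {'assigned' if assigned else '-'}")
```

With these made-up logits, `sarkazm` is not assigned at the default 0.5 threshold but is assigned under its lower optimized threshold, which is the kind of case the threshold table targets.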

## Limitations and Biases

### Known Limitations

1. **Preprocessing required**: The model expects `@mentions` to be anonymized as `@anonymized_account` (matching training data). The provided inference scripts handle this automatically, but custom implementations must include this preprocessing step for optimal performance.

1. **Sarcasm detection**: The model struggles with Polish sarcasm (F1: 0.53). Sarcasm is inherently difficult for BERT-style models to detect from text alone, without additional context.

1. **Class imbalance**: Performance varies with label frequency:

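   - The two most frequent labels (`negatywny`, `neutralny`) have the highest F1 scores
   - `sarkazm`, one of the least frequent labels, has the lowest F1; the rarest label, `radość`, still outperforms the more frequent `wstręt` and `przeczuwanie`, so frequency is not the only factor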

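1. **Twitter-specific**: The model is optimized for tweet-length texts with informal language, hashtags, and mentions. The 8,192-token context window is inherited from the base model; the training data consists of tweets, so behavior on long documents is untested.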

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{yazoniak2025twitteremotionpl,
  title={Polish Twitter Emotion Classifier (RoBERTa-8k)},
  author={yazoniak},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/yazoniak/twitter-emotion-pl-classifier}
}
```

Also cite the refined dataset and the original TwitterEmo paper:

```bibtex
@dataset{yazoniak_twitteremo_pl_refined_2025,
  title   = {TwitterEmo-PL-Refined: Polish Twitter Emotions (8 labels, refined)},
  author  = {yazoniak},
  year    = {2025},
  url     = {https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined}
}

@inproceedings{bogdanowicz2023twitteremo,
  title     = {TwitterEmo: Annotating Emotions and Sentiment in Polish Twitter},
  author    = {Bogdanowicz, S. and Cwynar, H. and Zwierzchowska, A. and Klamra, C. and Kiera{\'s}, W. and Kobyli{\'n}ski, {\L}.},
  booktitle = {Computational Science -- ICCS 2023},
  series    = {Lecture Notes in Computer Science},
  volume    = {14074},
  publisher = {Springer, Cham},
  year      = {2023},
  doi       = {10.1007/978-3-031-36021-3_20}
}
```

## Acknowledgments

- **Base model**: [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)
- **Original dataset**: [CLARIN-PL TwitterEmo](https://huggingface.co/datasets/clarin-pl/twitteremo)
- **Label cleaning**: Cleanlab library for noise detection
- **LLM assistance**: Gemini-2.5-Flash and GPT-4.1 for label review

## License

### License Terms

This model is released under the **GNU General Public License v3.0 (GPL-3.0)**, inherited from the training dataset.

**License Chain:**

- **Base Model** ([PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)): Apache-2.0
- **Training Dataset** ([TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined)): GPL-3.0
- **Original Dataset** ([clarin-pl/twitteremo](https://huggingface.co/datasets/clarin-pl/twitteremo)): GPL-3.0
- **This Fine-tuned Model**: **GPL-3.0** (inherited from training data)

### Full License Text

The complete GPL-3.0 license text is available in the [LICENSE](LICENSE) file in this repository, or at: https://www.gnu.org/licenses/gpl-3.0.html

## Model Card Contact

For questions, issues, or feedback about this model, please open an issue in the model repository or contact the author through Hugging Face.

______________________________________________________________________

**Model Version**: v1.0
**Last Updated**: 2025-10-10