	---
	language:
	- pl
	license: gpl-3.0
	tags:
	- text-classification
	- emotion-classification
	- sentiment-analysis
	- polish
	- multi-label-classification
	- twitter
	datasets:
	- yazoniak/TwitterEmo-PL-Refined
	base_model: PKOBP/polish-roberta-8k
	metrics:
	- f1
	- accuracy
	pipeline_tag: text-classification
	model-index:
	- name: twitter-emotion-pl-classifier
	  results:
	  - task:
	      type: text-classification
	      name: Multi-Label Emotion Classification
	    dataset:
	      type: yazoniak/TwitterEmo-PL-Refined
	      name: TwitterEmo-PL-Refined
	      split: validation
	    metrics:
	    - type: f1
	      value: 0.8500
	      name: F1 Macro
	      verified: true
	      args:
	        average: macro
	    - type: f1
	      value: 0.8900
	      name: F1 Micro
	      verified: true
	      args:
	        average: micro
	    - type: f1
	      value: 0.8895
	      name: F1 Weighted
	      verified: true
	      args:
	        average: weighted
	    - type: accuracy
	      value: 0.5125
	      name: Exact Match Accuracy
	      verified: true
	    - type: accuracy
	      value: 0.8900
	      name: Subset Accuracy
	      verified: true
	---

	# Polish Twitter Emotion Classifier (RoBERTa-8k)

	## Model Description

	This model is a fine-tuned version of [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k) for multi-label emotion and sentiment classification in Polish. It was trained on the [TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined) dataset.

	The model predicts 8 emotion and sentiment labels simultaneously:

	- Emotions: `radość` (joy), `wstręt` (disgust), `gniew` (anger), `przeczuwanie` (anticipation)
	- Sentiment: `pozytywny` (positive), `negatywny` (negative), `neutralny` (neutral)
	- Special: `sarkazm` (sarcasm)

	### Model Details

	- Model type: RoBERTa (Polish)
	- Language: Polish
	- Base model: [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)
	- Task: Multi-label text classification (emotion & sentiment)
	- Training data: 35,921 Polish tweets from TwitterEmo-PL-Refined
	- License: GPL-3.0
	- Context window: 8,192 tokens (max; for tweet-length texts you can use a smaller tokenizer `max_length`, e.g., 256-1024)

	## Intended Use

	### Primary Use Cases

	- Social media monitoring: Analyze emotions and sentiment in Polish tweets and social media posts
	- Customer feedback analysis: Understand emotional responses in Polish customer reviews
	- Research: Study emotion expression patterns in Polish language social media
	- Multi-label sentiment analysis: Capture nuanced emotional states beyond binary positive/negative

	### Out-of-Scope Use

	This model is trained specifically on Polish Twitter data and may not generalize well to:

	- Formal Polish text (news articles, academic writing)
	- Other languages
	- Very long documents (the model works best on tweet-length texts)

	## Performance

	### Overall Metrics

	| Metric | Score |
	|--------|-------|
	| F1 Macro | 0.8500 |
	| F1 Micro | 0.8900 |
	| F1 Weighted | 0.8895 |
	| Exact Match Accuracy | 0.5125 |
	| Subset Accuracy | 0.8900 |
	| Validation Loss | 0.2761 |
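
For readers less familiar with multi-label averaging, the sketch below shows how macro F1, micro F1, and exact-match accuracy differ on a tiny hand-made example. The label matrices are invented for illustration and are not taken from the validation set, and the helper names (`f1`, `multilabel_scores`) are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """y_true / y_pred: lists of equal-length 0/1 rows, one row per text."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(col) for col in zip(*counts)))
    # Exact match: a row counts only if every label in it is correct.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"f1_macro": macro, "f1_micro": micro, "exact_match": exact}


# Toy data: 4 texts, 3 labels (made up for this example).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

Exact match is the strictest of the three: a single wrong label out of eight makes the whole prediction count as incorrect, which is why it usually sits well below the F1 scores.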

	### Per-Label Performance

	| Label | F1 Score | Label frequency |
	|-------|----------|-----------------|
	| negatywny (negative) | 0.8553 | 42.4% |
	| neutralny (neutral) | 0.8172 | 41.0% |
	| pozytywny (positive) | 0.7814 | 17.4% |
	| gniew (anger) | 0.7693 | 25.8% |
	| radość (joy) | 0.7476 | 11.9% |
	| wstręt (disgust) | 0.7337 | 20.4% |
	| przeczuwanie (anticipation) | 0.7220 | 21.6% |
	| sarkazm (sarcasm) | 0.5337 | 16.0% |

	## Training Details

	### Training Data

	The model was trained on [TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined), which contains:

	- Total samples: 35,921 Polish tweets
	- Label distribution:
	  - `negatywny`: 15,231 samples (42.4%)
	  - `neutralny`: 14,720 samples (41.0%)
	  - `gniew`: 9,252 samples (25.8%)
	  - `przeczuwanie`: 7,776 samples (21.6%)
	  - `wstręt`: 7,337 samples (20.4%)
	  - `pozytywny`: 6,248 samples (17.4%)
	  - `sarkazm`: 5,756 samples (16.0%)
	  - `radość`: 4,283 samples (11.9%)
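
Because the task is multi-label, the per-label counts add up to more than the number of tweets. The short calculation below uses only the figures listed above to recover the percentages and the average number of labels per tweet.

```python
total_tweets = 35_921
counts = {
    "negatywny": 15_231,
    "neutralny": 14_720,
    "gniew": 9_252,
    "przeczuwanie": 7_776,
    "wstręt": 7_337,
    "pozytywny": 6_248,
    "sarkazm": 5_756,
    "radość": 4_283,
}

# Share of tweets carrying each label, in percent (one decimal place).
shares = {label: round(100 * n / total_tweets, 1) for label, n in counts.items()}

# Label assignments divided by tweets = mean number of labels per tweet.
labels_per_tweet = sum(counts.values()) / total_tweets

print(shares)
print(f"{labels_per_tweet:.2f} labels per tweet on average")
```

On average a tweet carries close to two labels, typically one sentiment label plus one emotion.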

	### Training Configuration

	```text
	Model: PKOBP/polish-roberta-8k
	Training samples: 28,737 (80%)
	Validation samples: 7,184 (20%)

	Hyperparameters:
	- Learning rate: 1e-5
	- Batch size: 32 (train), 32 (eval)
	- Epochs: 4
	- Weight decay: 0.03
	- Warmup ratio: 0.1
	- Dropout rate: 0.2
	- Max gradient norm: 1.0
	- Optimizer: AdamW
	- LR scheduler: Cosine with warmup
	- Early stopping patience: 3
	- Mixed precision: BF16

	Training strategy:
	- Save strategy: Every 200 steps
	- Evaluation strategy: Every 200 steps
	- Best model selection: F1 Macro
	- Total training steps: 3,600
	- Best checkpoint: 3,400
	```
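
The step counts above follow from the split size and batch size. The sketch below reproduces them and traces the learning rate, assuming no gradient accumulation and a single device (neither is stated above) and assuming the usual shape of a cosine schedule: linear warmup to the peak rate, then half a cosine period decaying to zero. The function `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

train_samples, batch_size, epochs = 28_737, 32, 4
peak_lr, warmup_ratio = 1e-5, 0.1

steps_per_epoch = math.ceil(train_samples / batch_size)  # 899
total_steps = steps_per_epoch * epochs                   # 3596, close to the 3,600 reported
warmup_steps = math.ceil(total_steps * warmup_ratio)     # 360


def lr_at(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, 180, warmup_steps, 2000, 3400, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the best checkpoint (step 3,400) falls in the last few percent of training, where the learning rate is already close to zero.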

	### Training Process

	Training ran on a single NVIDIA RTX 3090 GPU with a stratified 80/20 train-validation split. The plot below shows the training progression:

	![Training Progress](training_plots.png)

	## Calibration

	The model's predictions can be improved using temperature scaling and optimized thresholds. Calibration analysis shows:

	### Temperature Scaling Results

	The table shows the Expected Calibration Error (ECE) of each label before and after per-label temperature scaling (lower is better):

	| Label | Temperature | ECE Before | ECE After | Improvement |
	|-------|------------|------------|-----------|-------------|
	| `radość` | 1.066 | 0.0163 | 0.0166 | -1.8% |
	| `wstręt` | 1.117 | 0.0211 | 0.0152 | +27.9% |
	| `gniew` | 1.186 | 0.0308 | 0.0194 | +37.0% |
	| `przeczuwanie` | 1.102 | 0.0228 | 0.0237 | -3.9% |
	| `pozytywny` | 1.181 | 0.0280 | 0.0293 | -4.6% |
	| `negatywny` | 1.437 | 0.0594 | 0.0345 | +41.9% |
	| `neutralny` | 1.472 | 0.0696 | 0.0390 | +44.0% |
	| `sarkazm` | 1.078 | 0.0202 | 0.0202 | 0.0% |

	Key findings:

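	- `neutralny`, `negatywny`, and `gniew` benefit most from temperature scaling (37-44% lower ECE)
	- Three labels (`radość`, `przeczuwanie`, `pozytywny`) show a slight increase in ECE (1.8-4.6%)
	- Overall, ECE falls for four of the eight labels and is unchanged for `sarkazm`; the largest gains are on the three labels with the highest ECE before scaling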
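
To make the procedure concrete, here is a minimal sketch of temperature scaling for a single label together with a binned ECE. It assumes 10 equal-width bins and a per-label (binary) ECE computed on the predicted positive-class probability; this card does not state which ECE variant or bin count was used, and the logits and targets below are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scaled_probs(logits, temperature):
    """Temperature scaling: divide each logit by T, then apply the sigmoid."""
    return [sigmoid(z / temperature) for z in logits]


def ece(probs, targets, n_bins=10):
    """Binned ECE: weighted mean of |observed positive rate - mean probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, targets):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    error = 0.0
    for members in bins:
        if members:
            mean_prob = sum(p for p, _ in members) / len(members)
            pos_rate = sum(y for _, y in members) / len(members)
            error += len(members) / len(probs) * abs(pos_rate - mean_prob)
    return error


# Invented logits for one label; T = 1.472 is the `neutralny` value from the table.
logits = [3.2, 2.5, -2.8, -3.5, 2.9, -3.1, 1.8, -2.2]
targets = [1, 1, 0, 0, 0, 0, 1, 1]
raw = scaled_probs(logits, 1.0)
calibrated = scaled_probs(logits, 1.472)
print("ECE raw:       ", round(ece(raw, targets), 4))
print("ECE calibrated:", round(ece(calibrated, targets), 4))
```

A temperature above 1 pulls every probability toward 0.5, which helps when the model is overconfident. It never changes the ranking of examples for a label, so threshold-free metrics are unaffected.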

	### Optimized Decision Thresholds

	Per-label F1-optimized thresholds (vs. default 0.5):

	| Label | Optimal Threshold | F1 @ Optimal | F1 @ 0.5 | Improvement |
	|-------|------------------|--------------|----------|-------------|
	| `neutralny` | 0.330 | 0.8211 | 0.8110 | +1.00% |
	| `sarkazm` | 0.330 | 0.5766 | 0.5256 | +5.10% |
	| `przeczuwanie` | 0.410 | 0.7276 | 0.7187 | +0.89% |
	| `gniew` | 0.440 | 0.7692 | 0.7676 | +0.16% |
	| `negatywny` | 0.450 | 0.8516 | 0.8511 | +0.05% |
	| `wstręt` | 0.460 | 0.7477 | 0.7464 | +0.13% |
	| `pozytywny` | 0.510 | 0.7864 | 0.7859 | +0.04% |
	| `radość` | 0.560 | 0.7572 | 0.7558 | +0.14% |

	Key findings:

	- `sarkazm` shows the largest improvement (+5.10 points) with a lower threshold (0.33)
	- `neutralny` also benefits noticeably (+1.00 points) from a lower threshold (0.33)
	- The remaining six labels have optimal thresholds between 0.41 and 0.56 and gain less than 1 point each
	- Improvements are absolute differences in F1, expressed in percentage points; overall, optimized thresholds add roughly 0.5-1.0 points of F1 Macro
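
One common way to obtain such thresholds is a grid search over held-out probabilities, one label at a time. The sketch below assumes a grid from 0.05 to 0.95 in steps of 0.01 and breaks ties in favour of the threshold closest to 0.5; this card does not describe the search that was actually used, the data are invented, and `f1_at` and `best_threshold` are hypothetical helpers.

```python
def f1_at(probs, targets, threshold):
    """F1 for one label when predicting positive iff prob > threshold."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targets) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, targets):
    """Grid-search the F1-optimal threshold for a single label."""
    grid = [i / 100 for i in range(5, 96)]
    # Highest F1 wins; among ties, prefer the threshold nearest the 0.5 default.
    return max(grid, key=lambda t: (f1_at(probs, targets, t), -abs(t - 0.5)))


# Invented validation probabilities for one label: a true positive scored 0.35
# is missed at the default threshold but recovered by a lower one.
probs = [0.92, 0.71, 0.44, 0.38, 0.35, 0.20, 0.12, 0.05]
targets = [1, 1, 1, 0, 1, 0, 0, 0]

threshold = best_threshold(probs, targets)
print("best threshold:", threshold)
print("F1 at 0.5:     ", round(f1_at(probs, targets, 0.5), 4))
print("F1 at best:    ", round(f1_at(probs, targets, threshold), 4))
```

Thresholds tuned this way should be selected on data that is not used for the final evaluation, otherwise the reported gain is optimistic.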

	### Calibration Files

	The model repository includes:

	- Base model: `model.safetensors` - Use with default threshold (0.5)
	- Calibration artifacts: `calibration_artifacts.json` - Contains temperature parameters and optimal thresholds

	![Reliability diagrams](calibration_reliability_diagrams.png)

	Recommendation: For production use, apply both temperature scaling and optimized thresholds for best performance.

	## Model Files

	This repository contains:

	- Model weights: `model.safetensors` - Fine-tuned RoBERTa model
	- Tokenizer: `tokenizer.json`, `tokenizer_config.json` - Polish RoBERTa tokenizer
	- Configuration: `config.json` - Model configuration
	- Calibration: `calibration_artifacts.json` - Temperature scaling parameters and optimal thresholds
	- Inference scripts:
	  - `predict.py` - Basic inference (threshold: 0.5)
	  - `predict_calibrated.py` - Calibrated inference (recommended)
	- Training artifacts: `training_plots.png`, `calibration_reliability_diagrams.png`
	- Requirements: `requirements.txt` - Python dependencies
	- License: `LICENSE` - Full GPL-3.0 license text

	### Installation

	```bash
	pip install -r requirements.txt
	```

	Or install dependencies manually:

	```bash
	pip install transformers torch numpy
	```

	## Usage

	### Important: Text Preprocessing

	The model expects @mentions to be anonymized, as they were during training. Both inference scripts automatically replace all `@username` mentions with `@anonymized_account` to match the training data distribution.

	### Quick Start (Basic Inference)

	Use the `predict.py` script for basic inference with the default threshold (0.5):

	```bash
	# From Hugging Face (default) - mentions are automatically anonymized
	python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"

	# Example with mentions
	python predict.py "@zgp_intervillage Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"
	# Preprocessed internally: "@anonymized_account Uwielbiam czekać..."

	# From local model
	python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./

	# With custom threshold
	python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./ --threshold 0.3
	```

	Example Output:

	```
	Loading model from: yazoniak/twitter-emotion-pl-classifier

	Input text: Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp

	Assigned Labels:
	----------------------------------------
	radość
	pozytywny
	sarkazm

	All Labels (with probabilities):
	----------------------------------------
	✓ radość : 0.9574
	wstręt : 0.0566
	gniew : 0.0516
	przeczuwanie : 0.0347
	✓ pozytywny : 0.9782
	negatywny : 0.0602
	neutralny : 0.0336
	✓ sarkazm : 0.5404
	```

	### With Calibration

	Use the `predict_calibrated.py` script for calibrated inference with temperature scaling and optimized thresholds:

	```bash
	# From Hugging Face with calibration (requires calibration_artifacts.json)
	python predict_calibrated.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"
	```

	### Python API Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	import numpy as np
	import re

	def preprocess_text(text):
	    """Preprocess text to match training data format."""
	    # Anonymize @mentions (IMPORTANT for best performance)
	    text = re.sub(r'@\w+', '@anonymized_account', text)
	    return text

	# Load model
	model_name = "yazoniak/twitter-emotion-pl-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()

	# Get labels from model config
	labels = [model.config.id2label[i] for i in range(model.config.num_labels)]

	# Prepare input with preprocessing
	text = "@jan_kowalski To jest wspaniały dzień!"
	preprocessed_text = preprocess_text(text)  # "@anonymized_account To jest wspaniały dzień!"
	inputs = tokenizer(preprocessed_text, return_tensors="pt", truncation=True, max_length=8192)

	# Inference
	with torch.no_grad():
	    outputs = model(**inputs)
	    logits = outputs.logits

	# Get probabilities
	probabilities = torch.sigmoid(logits).squeeze().numpy()

	# Apply threshold
	threshold = 0.5
	predictions = {
	    label: float(prob)
	    for label, prob in zip(labels, probabilities)
	    if prob > threshold
	}

	print(predictions)
	# Output: {'radość': 0.8734, 'pozytywny': 0.9156}
	```

	### Interpretation

	The model outputs one logit for each of the 8 labels. To get predictions:

	1. Without calibration: apply a sigmoid to each logit and threshold at 0.5.
	2. With calibration:
	   - Divide each logit by its per-label temperature
	   - Apply a sigmoid to the scaled logits
	   - Compare each probability against its per-label optimized threshold
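
The calibrated path can be sketched in a few lines. The temperatures and thresholds below are copied from the tables in the Calibration section. Everything else is an assumption made for illustration: the logits are invented, the thresholds are assumed to apply to the temperature-scaled probabilities, and the values are hard-coded because the layout of `calibration_artifacts.json` is not documented here. `calibrated_labels` is a hypothetical helper, not the code in `predict_calibrated.py`.

```python
import math

# Per-label values copied from the calibration tables in this card.
TEMPERATURE = {
    "radość": 1.066, "wstręt": 1.117, "gniew": 1.186, "przeczuwanie": 1.102,
    "pozytywny": 1.181, "negatywny": 1.437, "neutralny": 1.472, "sarkazm": 1.078,
}
THRESHOLD = {
    "radość": 0.560, "wstręt": 0.460, "gniew": 0.440, "przeczuwanie": 0.410,
    "pozytywny": 0.510, "negatywny": 0.450, "neutralny": 0.330, "sarkazm": 0.330,
}


def calibrated_labels(logits):
    """Map {label: raw logit} to {label: calibrated probability} for assigned labels."""
    assigned = {}
    for label, z in logits.items():
        # Scale the logit first, then apply the sigmoid.
        prob = 1.0 / (1.0 + math.exp(-z / TEMPERATURE[label]))
        if prob > THRESHOLD[label]:
            assigned[label] = prob
    return assigned


# Invented logits; in practice they come from `model(**inputs).logits`,
# paired with label names via `model.config.id2label`.
example_logits = {
    "radość": 3.1, "wstręt": -2.8, "gniew": -2.9, "przeczuwanie": -3.3,
    "pozytywny": 3.8, "negatywny": -2.7, "neutralny": -3.4, "sarkazm": -0.4,
}
result = calibrated_labels(example_logits)
print({label: round(p, 4) for label, p in result.items()})
```

In this made-up example `sarkazm` has a raw probability just above 0.40, so the default 0.5 threshold would drop it, while the lower per-label threshold of 0.33 keeps it.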

	## Limitations and Biases

	### Known Limitations

	1. Preprocessing required: The model expects `@mentions` to be anonymized as `@anonymized_account`, matching the training data. The provided inference scripts do this automatically; custom implementations must include this step for optimal performance.

	2. Sarcasm detection: The model struggles with Polish sarcasm (F1: 0.53), which is inherently hard for BERT-style models to detect from text alone, without additional context.

	3. Class imbalance: Performance varies with label frequency:
	   - High-frequency labels (`negatywny`, `neutralny`) perform best
	   - Low-frequency labels (`radość`, `sarkazm`) show lower F1 scores

	4. Twitter-specific: The model is optimized for tweet-length texts with informal language, hashtags, and mentions. The 8,192-token context window is an upper limit, not the length of the texts the model was trained on.

	## Citation

	If you use this model in your research or applications, please cite:

	```bibtex
	@model{yazoniak2025twitteremotionpl,
	title={Polish Twitter Emotion Classifier (RoBERTa-8k)},
	author={yazoniak},
	year={2025},
	publisher={Hugging Face},
	url={https://huggingface.co/yazoniak/twitter-emotion-pl-classifier}
	}
	```

	Also cite the refined dataset and the original TwitterEmo paper:

	```bibtex
	@dataset{yazoniak_twitteremo_pl_refined_2025,
	title = {TwitterEmo-PL-Refined: Polish Twitter Emotions (8 labels, refined)},
	author = {yazoniak},
	year = {2025},
	url = {https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined}
	}

	@inproceedings{bogdanowicz2023twitteremo,
	title = {TwitterEmo: Annotating Emotions and Sentiment in Polish Twitter},
	author = {Bogdanowicz, S. and Cwynar, H. and Zwierzchowska, A. and Klamra, C. and Kiera{\'s}, W. and Kobyli{\'n}ski, {\L}.},
	booktitle = {Computational Science -- ICCS 2023},
	series = {Lecture Notes in Computer Science},
	volume = {14074},
	publisher = {Springer, Cham},
	year = {2023},
	doi = {10.1007/978-3-031-36021-3_20}
	}
	```

	## Acknowledgments

	- Base model: [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)
	- Original dataset: [CLARIN-PL TwitterEmo](https://huggingface.co/datasets/clarin-pl/twitteremo)
	- Label cleaning: Cleanlab library for noise detection
	- LLM assistance: Gemini-2.5-Flash and GPT-4.1 for label review

	## License

	### License Terms

	This model is released under the GNU General Public License v3.0 (GPL-3.0), inherited from the training dataset.

	License Chain:

	- Base Model ([PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)): Apache-2.0
	- Training Dataset ([TwitterEmo-PL-Refined](https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined)): GPL-3.0
	- Original Dataset ([clarin-pl/twitteremo](https://huggingface.co/datasets/clarin-pl/twitteremo)): GPL-3.0
	- This Fine-tuned Model: GPL-3.0 (inherited from training data)

	### Full License Text

	The complete GPL-3.0 license text is available in the [LICENSE](LICENSE) file in this repository, or at: https://www.gnu.org/licenses/gpl-3.0.html

	## Model Card Contact

	For questions, issues, or feedback about this model, please open an issue in the model repository or contact the author through Hugging Face.

	______________________________________________________________________

	Model Version: v1.0
	Last Updated: 2025-10-10