# BERT Spanish Sensationalism Classifier
Fine-tuned BERT model for detecting sensationalism in Spanish news articles.
The model performs binary text classification to determine whether a news item uses sensationalist techniques or not.
## Model Details

### Model Description
This model is a fine-tuned version of dccuchile/bert-base-spanish-wwm-cased, adapted for the task of sensationalism detection in Spanish news articles.
The classification is based on the combination of the news title and body text, allowing the model to capture both lexical and contextual cues commonly associated with sensationalist content.
- Developed by: Julen Neila
- Shared by: Julen Neila
- Model type: Transformer-based text classifier (BERT)
- Language(s): Spanish
- License: Apache 2.0
- Finetuned from model: dccuchile/bert-base-spanish-wwm-cased
### Model Sources
- Base model: dccuchile/bert-base-spanish-wwm-cased
- Framework: Hugging Face Transformers
## Uses

### Direct Use
The model can be directly used to:
- Detect sensationalism in Spanish news headlines and articles
- Support media analysis and journalism studies
- Assist in content moderation and media monitoring pipelines
### Downstream Use
The model can be integrated into:
- News aggregation systems
- Media bias and sensationalism analysis
- Academic NLP research projects
- Larger information extraction or classification pipelines
### Out-of-Scope Use
The model is not recommended for:
- Social media posts or informal text
- Non-Spanish content
- Legal, medical, or high-stakes decision-making systems
## Bias, Risks, and Limitations
- The model reflects biases present in the training data.
- It may underperform on very short texts or headlines without sufficient context.
- It may not generalize well to domains outside traditional digital journalism.
### Recommendations
Users should be aware of these limitations and avoid deploying the model in high-impact decision-making contexts without additional validation.
## How to Get Started with the Model
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="JJNeila/bert-spanish-sensationalism-oss",
    tokenizer="JJNeila/bert-spanish-sensationalism-oss",
)

classifier("Estados Unidos entrena a 25.000 militares (1.400 españoles) para defender el este de Europa")
```
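The pipeline returns a list of dicts with a `label` and a `score`. A minimal post-processing sketch, assuming the default `LABEL_0`/`LABEL_1` label names (check the model's `config.json` `id2label` mapping for the actual names):

```python
# NOTE: the label names "LABEL_0"/"LABEL_1" are an assumption; the
# model's config.json may define custom id2label names instead.
LABEL_NAMES = {"LABEL_0": "non-sensationalist", "LABEL_1": "sensationalist"}

def interpret(outputs, threshold=0.5):
    """Map raw pipeline outputs to readable classes, flagging low-confidence calls."""
    results = []
    for out in outputs:
        results.append({
            "class": LABEL_NAMES.get(out["label"], out["label"]),
            "score": out["score"],
            "confident": out["score"] >= threshold,
        })
    return results

# Example with a mocked pipeline output:
mock_output = [{"label": "LABEL_1", "score": 0.93}]
print(interpret(mock_output))
```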
## Training Details
### Training Data
The model was trained on a curated dataset of Spanish news articles annotated for the presence of sensationalism.
- **Size:** ~3,163 labeled samples
- **Labels:**
- `0` → Non-sensationalism
- `1` → Sensationalism
The input format used during training was:
*title* + *[SEP]* + *text*
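The concatenation above can be sketched as a small helper (whether `[SEP]` was inserted literally in the string or via the tokenizer's sentence-pair API is an assumption; adjust to match the actual preprocessing):

```python
def build_input(title: str, body: str, sep_token: str = "[SEP]") -> str:
    """Reproduce the training-time input format: title [SEP] body."""
    return f"{title} {sep_token} {body}"

example = build_input(
    "Estados Unidos entrena a 25.000 militares",
    "El despliegue forma parte de unas maniobras en el este de Europa.",
)
print(example)
```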
### Training Procedure
#### Preprocessing
- Removal of unlabeled samples
- Concatenation of title and article text
- Tokenization using the base BERT Spanish tokenizer
- Maximum sequence length: **512 tokens**
#### Training Hyperparameters
- **Training regime:** fp16 mixed precision
- **Optimizer:** AdamW
- **Learning rate:** 2e-5
- **Batch size:** 8
- **Epochs:** 3
- **Weight decay:** 0.01
- **Evaluation metric for model selection:** F1
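The hyperparameters above map onto a Hugging Face `TrainingArguments` configuration roughly as follows. This is a sketch, not the exact training script; the `output_dir` name is illustrative, and older Transformers versions spell `eval_strategy` as `evaluation_strategy`:

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments setup matching the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="bert-spanish-sensationalism",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,                      # mixed-precision training on the T4
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",     # model selection by F1, as stated above
)
```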
#### Speeds, Sizes, Times
- **Training time:** ~0.5 hours
- **Hardware:** NVIDIA T4 GPU
- **Final model size:** ~440 MB
### Testing Data, Factors & Metrics
#### Testing Data
A held-out validation set (20%) stratified by class labels.
#### Metrics
The following metrics were used due to class imbalance considerations:
- Accuracy
- Precision
- Recall
- F1-score
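For reference, all four metrics can be derived from the binary confusion counts. A pure-Python sketch (the actual evaluation likely used a library such as `scikit-learn` or `evaluate`):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive (sensationalism) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 4 predictions, one positive missed.
print(binary_metrics([1, 0, 1, 0], [1, 0, 0, 0]))
```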
### Results
| Metric | Value |
|-----------|-------|
| Accuracy | 0.84 |
| Precision | 0.84 |
| Recall | 0.83 |
| F1-score | 0.84 |
#### Summary
The model achieves a strong balance between precision and recall, making it effective at identifying sensationalist content without excessive false positives.
---
## Environmental Impact
- **Hardware Type:** NVIDIA T4 GPU
- **Hours used:** ~0.5 hours
- **Cloud Provider:** Google Colab
- **Compute Region:** Europe
- **Carbon Emitted:** Not explicitly measured
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** BERT-base (12 layers, ~110M parameters)
- **Objective:** Binary cross-entropy loss for text classification
#### Hardware
- NVIDIA T4 GPU (16 GB VRAM)
#### Software
- Python 3.12
- PyTorch
- Transformers
- Hugging Face Datasets
## Citation
**BibTeX:**
```bibtex
@misc{neila2026sensationalism,
  title={BERT Spanish Sensationalism Classifier},
  author={Neila, Julen},
  year={2026},
  publisher={Hugging Face}
}
```
## Model Card Authors
**Julen Neila**
## Model Card Contact
https://huggingface.co/JJNeila