---
language: en
tags:
- text-classification
- gender
- gender-prediction
- transformers
- deberta
license: mit
datasets:
- samzirbo/europarl.en-es.gendered
- czyzi0/luna-speech-dataset
- czyzi0/pwr-azon-speech-dataset
- sagteam/author_profiling
- kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
- accuracy
- f1
- precision
- recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
model-index:
- name: gender_prediction_model_from_text
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: f1
      value: 0.69
    - type: accuracy
      value: 0.69
citations:
- "@misc{fc63_gender1_2025,\n title = {Gender Prediction from Text},\n author = {Çoban, Furkan},\n year = {2025},\n howpublished = {\\url{https://doi.org/10.5281/zenodo.15619489}},\n note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"
---

# Gender Prediction from Text ✍️ → 👩‍🦰👨

This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built upon [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multi-domain dataset of both formal and informal texts, with non-English sources machine-translated into English.

📍 **Space link**: [🔗 Try it out on Hugging Face Spaces](https://huggingface.co/spaces/fc63/Gender_Prediction)
📁 **Model repo**: [🔗 View on Hugging Face Hub](https://huggingface.co/fc63/gender_prediction_model_from_text)
🧠 **Source code**: [GitHub](https://github.com/fc63/gender-classification)

---

## 📊 Model Summary

- **Base model**: `microsoft/deberta-v3-large`
- **Fine-tuned for**: binary gender classification (`female` vs. `male`)
- **Best F1 score**: `0.69` on a balanced multi-domain test set
- **Max token length**: 128
- **Evaluation metrics**:
  - F1: 0.69
  - Accuracy: 0.69
  - Precision: 0.69
  - Recall: 0.69

📂 **Evaluation**: [View the notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)

---

## 🧾 Datasets Used

| Dataset | Domain | Source Language |
|---------|--------|-----------------|
| [samzirbo/europarl.en-es.gendered](https://huggingface.co/datasets/samzirbo/europarl.en-es.gendered) | Formal speech (Parliament) | English |
| [czyzi0/luna-speech-dataset](https://huggingface.co/datasets/czyzi0/luna-speech-dataset) | Phone conversations | Polish (translated to English) |
| [czyzi0/pwr-azon-speech-dataset](https://huggingface.co/datasets/czyzi0/pwr-azon-speech-dataset) | Phone conversations | Polish (translated to English) |
| [sagteam/author_profiling](https://huggingface.co/datasets/sagteam/author_profiling) | Social posts | Russian (translated to English) |
| [kaushalgawri/nptel-en-tags-and-gender-v0](https://huggingface.co/datasets/kaushalgawri/nptel-en-tags-and-gender-v0) | Spoken transcripts | English |
| [Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm) | Blog posts | English |

All datasets were normalized, translated into English where necessary, deduplicated, and **balanced via random undersampling** so that both genders are equally represented; a sketch of the deduplication and balancing step is shown below.
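
The preprocessing script itself is not part of this card, so the following is only a minimal sketch of how the deduplication and undersampling step could look with pandas; the column names (`text`, `gender`) are assumptions, not the actual schema.

```python
import pandas as pd

def deduplicate_and_balance(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Drop duplicate texts, then randomly undersample the majority class."""
    df = df.drop_duplicates(subset="text")
    # The smallest class determines how many samples are kept per class.
    n_per_class = df["gender"].value_counts().min()
    balanced = (
        df.groupby("gender", group_keys=False)
          .apply(lambda g: g.sample(n=n_per_class, random_state=seed))
    )
    # Shuffle so the classes are interleaved rather than grouped.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```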

---

## 🛠️ Preprocessing & Training

- **Normalization**: cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- **Translation**: used `Helsinki-NLP/opus-mt-*` models for the Polish and Russian data (see the translation sketch after this list).
- **Undersampling**: random undersampling to balance female and male samples.
- **Training strategy**:
  - An LR finder was used to pick the learning rate (`2.66e-6`)
  - Fine-tuned with early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint (step 24,750) saved and evaluated
- **Second-phase fine-tuning** (see the `TrainingArguments` sketch below):
  - Performed on the full merged dataset for 2 epochs
  - Used a cosine learning-rate scheduler with warm-up steps
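
The translation code is not included in this card; a minimal sketch with one of the `Helsinki-NLP/opus-mt-*` checkpoints (`opus-mt-pl-en` for Polish; `opus-mt-ru-en` would be the analogous choice for Russian) might look like this. The batch size and truncation settings are illustrative, not the ones actually used.

```python
from transformers import pipeline

# Translate Polish source texts into English before training.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-pl-en")

def translate_batch(texts, batch_size=32):
    outputs = translator(texts, batch_size=batch_size, truncation=True)
    return [o["translation_text"] for o in outputs]

print(translate_batch(["Dzień dobry, jak się masz?"]))  # e.g. ['Good morning, how are you?']
```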
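
The training script is likewise not part of the card; a hedged sketch of `TrainingArguments` matching the setup described above might look as follows. Only the learning rate, evaluation cadence, scheduler type, and epoch count come from this card; the batch size, warm-up steps, and early-stopping patience are assumptions.

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

args = TrainingArguments(
    output_dir="gender-clf",
    learning_rate=2.66e-6,           # from the LR finder
    num_train_epochs=2,              # second-phase fine-tuning
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=250,                  # step-based evaluation every 250 steps
    save_strategy="steps",
    save_steps=250,
    lr_scheduler_type="cosine",      # cosine schedule, second phase
    warmup_steps=500,                # assumed
    per_device_train_batch_size=16,  # assumed
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-large", num_labels=2
    ),
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # patience assumed
    # train_dataset, eval_dataset, and compute_metrics omitted for brevity
)
```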

---

## 📈 Performance (on the full merged test set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | 591,027 |
| **Macro avg** | 0.69 | 0.69 | **0.69** | 1,182,054 |

**Overall accuracy**: **0.69** on 1,182,054 test samples.
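
These numbers have the shape of scikit-learn's `classification_report`; a minimal sketch of producing such a table from model predictions (the label arrays here are dummy stand-ins) is:

```python
from sklearn.metrics import classification_report

# In practice y_true/y_pred come from running the model over the merged
# test set; 0 = Female and 1 = Male, matching the usage example below.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(classification_report(y_true, y_pred, target_names=["Female", "Male"], digits=2))
```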

---

## 📦 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    gender = "Female" if pred == 0 else "Male"  # label 0 = Female, label 1 = Male
    return f"{gender} (Confidence: {confidence}%)"
```

```python
sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```

Output for this sample:

```
Female (Confidence: 84.1%)
```
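
To score many texts at once, a batched variant of the same `predict` logic (reusing the `tokenizer`, `model`, and `device` objects defined above; the batch size is arbitrary) could look like this:

```python
def predict_batch(texts, batch_size=32):
    """Batched prediction; returns a list of (gender, confidence-%) pairs."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=128).to(device)
        with torch.no_grad():
            probs = F.softmax(model(**inputs).logits, dim=1)
        for row in probs:
            pred = int(torch.argmax(row).item())
            results.append(("Female" if pred == 0 else "Male",
                            round(row[pred].item() * 100, 1)))
    return results
```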

---

## 📌 Future Work & Limitations

I do not want to leave this model at 0.69 accuracy and F1.

As far as I can tell at this point, the model is biased towards predicting emotional, psychological, and introspective texts as female, and more direct, results-oriented writing as male. Counteracting this pattern would require a large, carefully labeled dataset.

The datasets used to train this model had to come from open-source platforms, which limited the range of accessible data.

To make further progress, I would need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.

Before moving on to dataset creation, I plan to try a few more approaches with the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If none of the remaining methods work, the only step left will be building a new dataset, and that will likely be the point where I stop development, as it would be both labor-intensive and costly.

---

## 👨‍🔬 Author & License

**Author**: Furkan Çoban
**Project**: CENG-481 Gender Prediction Model
**License**: MIT