🎭 Vietnamese Sentiment Analysis

A Vietnamese sentiment analysis system built on a fine-tuned transformer model. It covers training, evaluation, an interactive demo, and a Gradio web interface with built-in memory management.

🚀 Features

🤖 Transformer-based Model: Fine-tuned Vietnamese sentiment analysis using ViSoBERT (5CD-AI/Vietnamese-Sentiment-visobert)
🌐 Interactive Web Interface: Real-time sentiment analysis via Gradio with memory optimization
📊 Comprehensive Testing: Model evaluation with confusion matrix and classification metrics
⚡ Memory Efficient: Built-in memory management, batch processing limits, and quantization support
🎯 Easy to Use: Simple command-line interface and web UI
📈 Performance Monitoring: Real-time memory usage tracking and optimization

📁 Project Structure

SentimentAnalysis/
├── README.md                          # 📚 This file
├── requirements.txt                   # 📦 Python dependencies
├── .gitignore                         # 🚫 Git ignore rules
│
├── py/                                # 🐍 Core Python modules
│   ├── __init__.py                    # Package initialization
│   ├── fine_tune_sentiment.py         # 🔧 Core fine-tuning utilities
│   ├── test_model.py                  # 🧪 Model testing and evaluation
│   ├── demo.py                        # 💻 Demo functionality
│   └── gradio_app.py                  # 🌐 Web interface (memory-optimized)
│
├── main.py                            # 🚀 Main entry point (all commands)
├── train.py                           # 🏋️ Training script
├── test.py                            # 🧪 Testing script
├── demo.py                            # 💻 Interactive demo
├── web.py                             # 🌐 Web interface launcher
│
├── vietnamese_sentiment_finetuned/    # 🤖 Trained model (auto-generated)
├── confusion_matrix.png               # 📊 Evaluation visualization (auto-generated)
├── training_history.png               # 📈 Training progress (auto-generated)
├── pdf/                               # 📄 Documentation folder
├── venv/                              # 🐍 Virtual environment
├── .git/                              # 📝 Git repository
└── .claude/                           # 🤖 Claude configuration

🛠️ Installation

Clone and Set Up Environment

cd SentimentAnalysis
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies

pip install -r requirements.txt

🎯 Usage

Quick Start Options

Option 1: Use Individual Scripts

# Train the model
python train.py

# Test the model
python test.py

# Run interactive demo
python demo.py

# Launch web interface
python web.py

Option 2: Use Main Entry Point

# Train with custom settings
python main.py train --batch-size 32 --epochs 5

# Test the model
python main.py test --model-path ./vietnamese_sentiment_finetuned

# Run interactive demo
python main.py demo

# Launch web interface with memory options
python main.py web --quantize --max-batch-size 20 --port 8080

1. Training the Model

# Basic training
python train.py

# Custom batch size and epochs
python train.py 32 5

# Using main script
python main.py train --batch-size 32 --epochs 5 --learning-rate 1e-5

2. Testing the Model

# Basic testing
python test.py

# Test with custom model path
python test.py /path/to/custom/model

# Using main script
python main.py test --model-path ./vietnamese_sentiment_finetuned

3. Interactive Demo

# Run demo
python demo.py

# Using main script
python main.py demo

4. Web Interface

# Standard usage (memory-efficient defaults)
python web.py

# High memory efficiency (quantization + small batches)
python web.py --quantize --max-batch-size 5 --max-memory 2048

# Large batch processing
python web.py --max-batch-size 20 --max-memory 8192

# Custom server configuration
python web.py --port 8080 --host 0.0.0.0 --quantize

# Using main script
python main.py web --quantize --max-batch-size 20 --port 8080

🌐 Web Interface Features

The Gradio web interface provides:

📝 Single Text Analysis

Real-time sentiment prediction
Confidence scores with visual charts
Memory usage monitoring
Example texts for quick testing

📊 Batch Analysis

Process multiple texts at once
Memory-efficient batch processing
Automatic batch size limits
Batch summary with sentiment distribution
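The batch limit and summary described above can be sketched in plain Python. This is only an illustration: `chunk_texts` and `summarize` are hypothetical helper names, and whether the app splits an oversized batch or simply rejects it is not stated here, so splitting is an assumption.

```python
from collections import Counter


def chunk_texts(texts, max_batch_size=10):
    """Yield consecutive slices of at most max_batch_size texts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(texts), max_batch_size):
        yield texts[start:start + max_batch_size]


def summarize(labels):
    """Return {label: (count, percentage)} for a list of predicted labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}


texts = [f"text {i}" for i in range(23)]
print([len(batch) for batch in chunk_texts(texts, max_batch_size=10)])  # [10, 10, 3]

summary = summarize(["Positive", "Positive", "Negative", "Neutral"])
print(summary["Positive"])  # (2, 50.0)
```

Processing one slice at a time keeps peak memory bounded by the batch size rather than by the number of submitted texts.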

🛡️ Memory Management

Automatic Cleanup: Memory cleaned after each prediction
Batch Limits: Configurable maximum texts per batch
Memory Monitoring: Real-time memory usage tracking
GPU Optimization: CUDA cache clearing when available
Quantization: Optional dynamic quantization for CPU inference (up to ~4x smaller weights in the quantized layers)
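A minimal sketch of the cleanup step listed above, assuming it amounts to a garbage-collection pass plus a CUDA cache release. `cleanup_memory` is a hypothetical name, not necessarily the function used in `py/gradio_app.py`; the `torch` import is guarded only so the snippet also runs where PyTorch is not installed.

```python
import gc

try:
    import torch  # the project depends on torch; guarded here for portability
except ImportError:
    torch = None


def cleanup_memory():
    """Run a garbage-collection pass and release cached CUDA memory if a GPU is present."""
    collected = gc.collect()  # number of unreachable objects found
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached, unused GPU blocks to the driver
    return collected


print(f"Collected {cleanup_memory()} unreachable objects")
```

Note that `torch.cuda.empty_cache()` only frees memory PyTorch has cached but is not using; tensors that are still referenced stay allocated.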

ℹ️ Model Information

Detailed model specifications
Performance metrics
Memory management settings
Usage tips and troubleshooting

🔧 Command Line Options

Individual Scripts

`train.py`

python train.py [batch_size] [epochs]

`test.py`

python test.py [model_path]

`demo.py`

python demo.py

`web.py`

python web.py [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]

Main Entry Point (`main.py`)

Training Command

python main.py train [--batch-size SIZE] [--epochs NUM] [--learning-rate RATE]

Testing Command

python main.py test [--model-path PATH]

Demo Command

python main.py demo

Web Interface Command

python main.py web [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]

Memory Management Options:

--max-batch-size: Maximum batch size for memory efficiency (default: 10)
--quantize: Enable model quantization for memory efficiency (CPU only)
--max-memory: Maximum memory usage in MB (default: 4096)
--port: Port to run the interface on (default: 7862)
--host: Host to bind the interface to (default: 127.0.0.1)
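The documented commands and options map naturally onto `argparse` subcommands. The sketch below is one possible layout, not the actual contents of `main.py`: the web defaults come from the list above, while the training defaults and the default model path are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train")
    train.add_argument("--batch-size", type=int, default=16)         # assumed default
    train.add_argument("--epochs", type=int, default=3)              # assumed default
    train.add_argument("--learning-rate", type=float, default=2e-5)  # assumed default

    test = sub.add_parser("test")
    test.add_argument("--model-path", default="./vietnamese_sentiment_finetuned")  # assumed default

    sub.add_parser("demo")

    web = sub.add_parser("web")
    web.add_argument("--max-batch-size", type=int, default=10)
    web.add_argument("--quantize", action="store_true")
    web.add_argument("--max-memory", type=int, default=4096, help="limit in MB")
    web.add_argument("--port", type=int, default=7862)
    web.add_argument("--host", default="127.0.0.1")
    return parser


args = build_parser().parse_args(["web", "--quantize", "--port", "8080"])
print(args.max_batch_size, args.quantize, args.port)  # 10 True 8080
```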

📊 Model Details

Base Model: 5CD-AI/Vietnamese-Sentiment-visobert
Dataset: uitnlp/vietnamese_students_feedback
Labels: Negative, Neutral, Positive
Language: Vietnamese
Architecture: Transformer-based sequence classification
Max Sequence Length: 512 tokens

📈 Performance Metrics

Accuracy: 85-90% (on validation set)
Processing Speed: ~100ms per text
Memory Usage: Configurable (default 4GB limit)
Batch Processing: 10 texts per batch by default (configurable with --max-batch-size; the examples in this README go up to 20)

🛡️ Memory Management

The system includes comprehensive memory management:

Automatic Features

Memory cleanup after each prediction
GPU cache clearing for CUDA
Garbage collection management
Memory monitoring before/after operations

User Controls

Configurable batch size limits
Memory limit enforcement
Manual memory cleanup button
Real-time memory usage display
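Memory monitoring and limit enforcement could look roughly like this. It assumes the limit is checked against the resident memory of the current process, read through `psutil` (listed in requirements.txt); `process_memory_mb` and `within_limit` are hypothetical names.

```python
import os

try:
    import psutil  # listed in requirements.txt
except ImportError:
    psutil = None


def process_memory_mb():
    """Resident set size of this process in MB, or None when psutil is unavailable."""
    if psutil is None:
        return None
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)


def within_limit(used_mb, max_memory_mb=4096):
    """True when usage is unknown or does not exceed the configured limit."""
    return used_mb is None or used_mb <= max_memory_mb


used = process_memory_mb()
print(f"Memory in use: {used} MB, within limit: {within_limit(used)}")
```

A caller would typically run the check before accepting a batch and trigger a cleanup, or refuse the request, when it fails.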

Optimization Options

Dynamic quantization (CPU only)
Batch processing optimization
Memory-efficient inference
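The memory saving from quantization follows from the storage size of each weight: float32 uses 4 bytes, int8 uses 1. The arithmetic below uses a made-up parameter count of 100 million purely for illustration; it is not the measured size of this model.

```python
def weight_memory_mb(num_params, bytes_per_param):
    """Approximate storage for model weights in MB."""
    return num_params * bytes_per_param / (1024 ** 2)


num_params = 100_000_000  # illustrative value, not the real parameter count
fp32_mb = weight_memory_mb(num_params, 4)  # float32: 4 bytes per weight
int8_mb = weight_memory_mb(num_params, 1)  # int8: 1 byte per weight

print(f"float32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB, ratio: {fp32_mb / int8_mb:.1f}x")
# float32: 381.5 MB, int8: 95.4 MB, ratio: 4.0x
```

In practice the overall reduction is usually smaller than 4x, because dynamic quantization is typically applied only to linear layers while embeddings and activations may remain in float32.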

🔍 Troubleshooting

Memory Issues

Enable quantization: python web.py --quantize
Reduce batch size: python web.py --max-batch-size 5
Lower memory limit: python web.py --max-memory 2048
Use manual cleanup: Click the "Memory Cleanup" button in the web interface

Model Loading Issues

Ensure model is trained: python train.py
Check model directory: ls -la vietnamese_sentiment_finetuned/
Verify dependencies: pip install -r requirements.txt

Performance Optimization

Use GPU if available (CUDA)
Enable quantization for CPU inference
Monitor memory usage in web interface
Adjust batch size based on available memory

📄 Requirements

See requirements.txt for complete dependency list:

torch>=2.0.0
transformers>=4.21.0
datasets>=2.0.0
gradio>=4.0.0
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
psutil>=5.9.0

🎯 Example Usage

Command Line Demo

from py.demo import SentimentDemo

demo = SentimentDemo()
demo.load_model()
demo.interactive_demo()

Web Interface

Train model: python train.py
Launch interface: python web.py
Open browser to http://127.0.0.1:7862
Enter Vietnamese text for analysis

Batch Processing

from py.gradio_app import SentimentGradioApp

app = SentimentGradioApp(max_batch_size=20)
app.load_model()
texts = ["Tuyệt vời!", "Bình thường", "Rất tệ"]
results, summary = app.batch_predict(texts)

Model Testing

from py.test_model import SentimentTester

tester = SentimentTester(model_path="./vietnamese_sentiment_finetuned")
tester.load_model()
sentiment, confidence = tester.predict_sentiment("Giảng viên dạy rất hay!")
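Once predictions are collected for a labelled test set, the confusion matrix and per-class metrics mentioned under Features can be computed as follows. This is a dependency-free sketch for illustration; the project lists scikit-learn in its requirements, and `py/test_model.py` may well use that instead. The sample labels are made up.

```python
LABELS = ["Negative", "Neutral", "Positive"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def per_class_metrics(matrix, labels=LABELS):
    """Return {label: (precision, recall)} derived from the confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


y_true = ["Positive", "Positive", "Negative", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Negative", "Neutral", "Negative"]

cm = confusion_matrix(y_true, y_pred)
accuracy = sum(cm[i][i] for i in range(len(LABELS))) / len(y_true)
print(cm)        # [[2, 0, 0], [0, 1, 0], [0, 1, 1]]
print(accuracy)  # 0.8
print(per_class_metrics(cm)["Neutral"])  # (0.5, 1.0)
```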

Fine-Tuning

from py.fine_tune_sentiment import SentimentFineTuner

fine_tuner = SentimentFineTuner(
    model_name="5CD-AI/Vietnamese-Sentiment-visobert",
    dataset_name="uitnlp/vietnamese_students_feedback"
)
train_result, eval_results = fine_tuner.run_fine_tuning(
    output_dir="./my_model",
    learning_rate=2e-5,
    batch_size=16,
    num_epochs=3
)

📝 Model Loading Examples

Loading the Fine-Tuned Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./vietnamese_sentiment_finetuned")
model = AutoModelForSequenceClassification.from_pretrained("./vietnamese_sentiment_finetuned")

Making Predictions

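import torch

# Assumes `tokenizer` and `model` were loaded as shown in the previous example.
def predict_sentiment(text):
    # Truncate long inputs to the model's 512-token limit.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)[0]
    predicted_class = int(torch.argmax(probabilities))

    # Label order assumed to follow the dataset (0=Negative, 1=Neutral, 2=Positive);
    # check model.config.id2label if predictions look swapped.
    sentiment_labels = ["Negative", "Neutral", "Positive"]
    return sentiment_labels[predicted_class], probabilities[predicted_class].item()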

# Example
text = "Giảng viên dạy rất hay và tâm huyết."
sentiment, confidence = predict_sentiment(text)
print(f"Sentiment: {sentiment}, Confidence: {confidence:.3f}")
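The confidence value returned above is the softmax probability of the winning class. The following dependency-free sketch shows that step in isolation; the logits are made-up numbers, not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Negative", "Neutral", "Positive"]
logits = [-1.2, 0.3, 2.5]  # illustrative values only

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(f"Sentiment: {labels[best]}, Confidence: {probs[best]:.3f}")
```

Because the probabilities always sum to 1, a confidence close to 1/3 means the model is nearly undecided between the three classes.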

📊 Dataset Information

The UIT-VSFC corpus contains over 16,000 Vietnamese student feedback sentences with:

Sentiment Classification: Positive, Neutral, Negative
Topic Classification: Various educational topics
Inter-annotator agreement: >91% for sentiment, >71% for topics
Original F1-score: ~88% for sentiment (Maximum Entropy baseline)

🔧 Hardware Requirements

Minimum: 8 GB RAM (CPU only)
Recommended: GPU with 8GB+ VRAM for faster training
Storage: ~2GB for model and datasets

📝 License

This project uses open-source components for educational and research purposes. Please check individual licenses for:

5CD-AI/Vietnamese-Sentiment-visobert
uitnlp/vietnamese_students_feedback

🤝 Contributing

Feel free to submit issues and enhancement requests!

📄 Citation

If you use this work or the dataset, please cite:

@InProceedings{8573337,
  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
  title={UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis},
  year={2018},
  volume={},
  number={},
  pages={19-24},
  doi={10.1109/KSE.2018.8573337}
}

Quick Start: python train.py && python web.py

Alternative: python main.py train && python main.py web
