🎭 Vietnamese Sentiment Analysis

A Vietnamese sentiment analysis system built on a fine-tuned transformer model. It includes scripts for training, evaluation, an interactive demo, and a Gradio web interface with built-in memory management.

🚀 Features

  • 🤖 Transformer-based Model: Fine-tuned Vietnamese sentiment classifier built on ViSoBERT
  • 🌐 Interactive Web Interface: Real-time sentiment analysis via Gradio with memory optimization
  • 📊 Comprehensive Testing: Model evaluation with confusion matrix and classification metrics
  • ⚡ Memory Efficient: Built-in memory management, batch processing limits, and quantization support
  • 🎯 Easy to Use: Simple command-line interface and web UI
  • 📈 Performance Monitoring: Real-time memory usage tracking and optimization

📁 Project Structure

SentimentAnalysis/
├── README.md                          # 📚 This file
├── requirements.txt                   # 📦 Python dependencies
├── .gitignore                         # 🚫 Git ignore rules
│
├── py/                                # 🐍 Core Python modules
│   ├── __init__.py                    # Package initialization
│   ├── fine_tune_sentiment.py         # 🔧 Core fine-tuning utilities
│   ├── test_model.py                  # 🧪 Model testing and evaluation
│   ├── demo.py                        # 💻 Demo functionality
│   └── gradio_app.py                  # 🌐 Web interface (memory-optimized)
│
├── main.py                            # 🚀 Main entry point (all commands)
├── train.py                           # 🏋️ Training script
├── test.py                            # 🧪 Testing script
├── demo.py                            # 💻 Interactive demo
├── web.py                             # 🌐 Web interface launcher
│
├── vietnamese_sentiment_finetuned/    # 🤖 Trained model (auto-generated)
├── confusion_matrix.png               # 📊 Evaluation visualization (auto-generated)
├── training_history.png               # 📈 Training progress (auto-generated)
├── pdf/                               # 📄 Documentation folder
├── venv/                              # 🐍 Virtual environment
├── .git/                              # 📁 Git repository
└── .claude/                           # 🤖 Claude configuration

🛠️ Installation

  1. Clone the Repository and Set Up the Environment
cd SentimentAnalysis
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  2. Install Dependencies
pip install -r requirements.txt

🎯 Usage

Quick Start Options

Option 1: Use Individual Scripts

# Train the model
python train.py

# Test the model
python test.py

# Run interactive demo
python demo.py

# Launch web interface
python web.py

Option 2: Use Main Entry Point

# Train with custom settings
python main.py train --batch-size 32 --epochs 5

# Test the model
python main.py test --model-path ./vietnamese_sentiment_finetuned

# Run interactive demo
python main.py demo

# Launch web interface with memory options
python main.py web --quantize --max-batch-size 20 --port 8080

1. Training the Model

# Basic training
python train.py

# Custom batch size and epochs
python train.py 32 5

# Using main script
python main.py train --batch-size 32 --epochs 5 --learning-rate 1e-5

2. Testing the Model

# Basic testing
python test.py

# Test with custom model path
python test.py /path/to/custom/model

# Using main script
python main.py test --model-path ./vietnamese_sentiment_finetuned

3. Interactive Demo

# Run demo
python demo.py

# Using main script
python main.py demo

4. Web Interface

# Standard usage (memory-efficient defaults)
python web.py

# High memory efficiency (quantization + small batches)
python web.py --quantize --max-batch-size 5 --max-memory 2048

# Large batch processing
python web.py --max-batch-size 20 --max-memory 8192

# Custom server configuration
python web.py --port 8080 --host 0.0.0.0 --quantize

# Using main script
python main.py web --quantize --max-batch-size 20 --port 8080

🌐 Web Interface Features

The Gradio web interface provides:

📝 Single Text Analysis

  • Real-time sentiment prediction
  • Confidence scores with visual charts
  • Memory usage monitoring
  • Example texts for quick testing

📊 Batch Analysis

  • Process multiple texts at once
  • Memory-efficient batch processing
  • Automatic batch size limits
  • Batch summary with sentiment distribution
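The batching behaviour listed above can be sketched in a few lines. This is a minimal illustration, not the code in py/gradio_app.py: the helper names `chunk_texts` and `summarize` are hypothetical, and the labels are supplied by hand instead of coming from the model.

```python
from collections import Counter
from typing import Dict, Iterator, List


def chunk_texts(texts: List[str], max_batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices holding at most max_batch_size texts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(texts), max_batch_size):
        yield texts[start:start + max_batch_size]


def summarize(labels: List[str]) -> Dict[str, float]:
    """Return the share of each sentiment label, in percent."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# 23 placeholder texts split with a limit of 10 per batch.
batches = list(chunk_texts([f"text {i}" for i in range(23)], max_batch_size=10))
print([len(batch) for batch in batches])

# Hand-written labels standing in for model predictions.
print(summarize(["Positive", "Positive", "Negative", "Neutral"]))
```

Capping the slice size keeps the number of texts tokenized and held in memory at once bounded, whatever the length of the input list.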

🛡️ Memory Management

  • Automatic Cleanup: Memory cleaned after each prediction
  • Batch Limits: Configurable maximum texts per batch
  • Memory Monitoring: Real-time memory usage tracking
  • GPU Optimization: CUDA cache clearing when available
  • Quantization: Optional dynamic quantization for CPU inference (up to ~4x smaller weights in the quantized layers)

ℹ️ Model Information

  • Detailed model specifications
  • Performance metrics
  • Memory management settings
  • Usage tips and troubleshooting

🔧 Command Line Options

Individual Scripts

train.py

python train.py [batch_size] [epochs]

test.py

python test.py [model_path]

demo.py

python demo.py

web.py

python web.py [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]

Main Entry Point (main.py)

Training Command

python main.py train [--batch-size SIZE] [--epochs NUM] [--learning-rate RATE]

Testing Command

python main.py test [--model-path PATH]

Demo Command

python main.py demo

Web Interface Command

python main.py web [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]

Web Interface Options:

  • --max-batch-size: Maximum batch size for memory efficiency (default: 10)
  • --quantize: Enable model quantization for memory efficiency (CPU only)
  • --max-memory: Maximum memory usage in MB (default: 4096)
  • --port: Port to run the interface on (default: 7862)
  • --host: Host to bind the interface to (default: 127.0.0.1)
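The flags and defaults listed above map onto argparse roughly as follows. This is a sketch under the assumption that web.py uses argparse; the function name `build_web_parser` is hypothetical, and only the flag names and default values come from this README.

```python
import argparse


def build_web_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented web interface flags."""
    parser = argparse.ArgumentParser(
        prog="web.py", description="Launch the Gradio sentiment analysis interface"
    )
    parser.add_argument("--max-batch-size", type=int, default=10,
                        help="maximum number of texts per batch")
    parser.add_argument("--quantize", action="store_true",
                        help="enable model quantization (CPU only)")
    parser.add_argument("--max-memory", type=int, default=4096,
                        help="maximum memory usage in MB")
    parser.add_argument("--port", type=int, default=7862,
                        help="port to run the interface on")
    parser.add_argument("--host", default="127.0.0.1",
                        help="host to bind the interface to")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_web_parser().parse_args(["--quantize", "--max-batch-size", "5"])
print(args.max_batch_size, args.quantize, args.max_memory, args.port, args.host)
```

Flags that are not passed fall back to the documented defaults, so `--max-memory`, `--port`, and `--host` keep 4096, 7862, and 127.0.0.1 here.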

📊 Model Details

  • Base Model: 5CD-AI/Vietnamese-Sentiment-visobert
  • Dataset: uitnlp/vietnamese_students_feedback
  • Labels: Negative, Neutral, Positive
  • Language: Vietnamese
  • Architecture: Transformer-based sequence classification
  • Max Sequence Length: 512 tokens

📈 Performance Metrics

  • Accuracy: 85-90% (on validation set)
  • Processing Speed: ~100ms per text
  • Memory Usage: Configurable (default 4GB limit)
  • Batch Processing: 10 texts per batch by default; configurable with --max-batch-size (the examples here go up to 20)
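For readers who want to see how an accuracy figure and a confusion matrix relate, here is a small pure-Python sketch. The label lists are invented toy data, not results from this model, and the helper names are hypothetical; the project's own evaluation lives in py/test_model.py.

```python
from typing import List

LABELS = ["Negative", "Neutral", "Positive"]


def confusion_matrix(y_true: List[int], y_pred: List[int], num_classes: int = 3) -> List[List[int]]:
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy data for illustration only.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)
for name, row in zip(LABELS, cm):
    print(f"{name:>8}: {row}")
print(f"accuracy = {accuracy(cm):.3f}")
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy number hides.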

🛡️ Memory Management

The system includes comprehensive memory management:

Automatic Features

  • Memory cleanup after each prediction
  • GPU cache clearing for CUDA
  • Garbage collection management
  • Memory monitoring before/after operations
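The automatic steps above can be sketched as follows. This is a minimal illustration, not the project's code: `cleanup_memory` and `memory_usage_mb` are hypothetical names, and both torch and psutil are imported lazily so the snippet still runs where they are not installed.

```python
import gc
from typing import Optional


def cleanup_memory() -> int:
    """Run garbage collection and clear the CUDA cache when one is available.

    Returns the number of unreachable objects found by the collector.
    """
    collected = gc.collect()
    try:
        import torch  # optional: only needed for the GPU cache step
    except ImportError:
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


def memory_usage_mb() -> Optional[float]:
    """Resident memory of this process in MB, or None if psutil is missing."""
    try:
        import psutil
    except ImportError:
        return None
    return psutil.Process().memory_info().rss / (1024 * 1024)


# Measure, clean up, and measure again, as done around each prediction.
before = memory_usage_mb()
freed_objects = cleanup_memory()
after = memory_usage_mb()
print(f"before={before} MB, after={after} MB, collected={freed_objects} objects")
```

Note that `torch.cuda.empty_cache()` releases cached blocks back to the driver; it does not free tensors that are still referenced.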

User Controls

  • Configurable batch size limits
  • Memory limit enforcement
  • Manual memory cleanup button
  • Real-time memory usage display

Optimization Options

  • Dynamic quantization (CPU only)
  • Batch processing optimization
  • Memory-efficient inference
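A rough estimate shows where the memory saving from dynamic quantization comes from. The sketch below is an assumption-laden illustration: the parameter counts are hypothetical round numbers, not measurements of this model, and `quantize_for_cpu` shows PyTorch's `quantize_dynamic` call as one common way to do this, not necessarily what the project does.

```python
BYTES_FP32 = 4  # one float32 weight
BYTES_INT8 = 1  # one int8 weight after quantization


def estimated_weight_bytes(linear_params: int, other_params: int, quantized: bool) -> int:
    """Estimate weight storage; only Linear-layer weights shrink when quantized."""
    linear_bytes = linear_params * (BYTES_INT8 if quantized else BYTES_FP32)
    return linear_bytes + other_params * BYTES_FP32


def quantize_for_cpu(model):
    """Apply PyTorch dynamic quantization to the Linear layers of a loaded model."""
    import torch  # imported lazily; only needed when this function is called

    return torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)


# Hypothetical parameter split, chosen only to make the arithmetic easy to follow.
LINEAR_PARAMS = 85_000_000
OTHER_PARAMS = 15_000_000  # e.g. embeddings, which stay in float32

before = estimated_weight_bytes(LINEAR_PARAMS, OTHER_PARAMS, quantized=False)
after = estimated_weight_bytes(LINEAR_PARAMS, OTHER_PARAMS, quantized=True)
ratio = before / after
print(f"{before / 1e6:.0f} MB -> {after / 1e6:.0f} MB ({ratio:.2f}x smaller)")
```

The 4x figure is the per-weight ratio between float32 and int8. Because layers outside the quantized set keep their float32 weights, the whole-model reduction is usually smaller than 4x.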

🔍 Troubleshooting

Memory Issues

  • Enable quantization: python web.py --quantize
  • Reduce batch size: python web.py --max-batch-size 5
  • Lower memory limit: python web.py --max-memory 2048
  • Use manual cleanup: Click "Memory Cleanup" button in web interface

Model Loading Issues

  • Ensure model is trained: python train.py
  • Check model directory: ls -la vietnamese_sentiment_finetuned/
  • Verify dependencies: pip install -r requirements.txt

Performance Optimization

  • Use GPU if available (CUDA)
  • Enable quantization for CPU inference
  • Monitor memory usage in web interface
  • Adjust batch size based on available memory

📄 Requirements

See requirements.txt for complete dependency list:

torch>=2.0.0
transformers>=4.21.0
datasets>=2.0.0
gradio>=4.0.0
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
psutil>=5.9.0

🎯 Example Usage

Command Line Demo

from py.demo import SentimentDemo

demo = SentimentDemo()
demo.load_model()
demo.interactive_demo()

Web Interface

  1. Train model: python train.py
  2. Launch interface: python web.py
  3. Open browser to http://127.0.0.1:7862
  4. Enter Vietnamese text for analysis

Batch Processing

from py.gradio_app import SentimentGradioApp

app = SentimentGradioApp(max_batch_size=20)
app.load_model()
texts = ["Tuyệt vời!", "Bình thường", "Rất tệ"]
results, summary = app.batch_predict(texts)

Model Testing

from py.test_model import SentimentTester

tester = SentimentTester(model_path="./vietnamese_sentiment_finetuned")
tester.load_model()
sentiment, confidence = tester.predict_sentiment("Giảng viên dạy rất hay!")

Fine-Tuning

from py.fine_tune_sentiment import SentimentFineTuner

fine_tuner = SentimentFineTuner(
    model_name="5CD-AI/Vietnamese-Sentiment-visobert",
    dataset_name="uitnlp/vietnamese_students_feedback"
)
train_result, eval_results = fine_tuner.run_fine_tuning(
    output_dir="./my_model",
    learning_rate=2e-5,
    batch_size=16,
    num_epochs=3
)

📝 Model Loading Examples

Loading the Fine-Tuned Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./vietnamese_sentiment_finetuned")
model = AutoModelForSequenceClassification.from_pretrained("./vietnamese_sentiment_finetuned")

Making Predictions

import torch

def predict_sentiment(text):
    """Return (label, confidence) for a single Vietnamese text."""
    # Uses the tokenizer and model loaded in the previous snippet.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)[0]
    predicted_class = int(torch.argmax(probabilities))

    # Assumed label order (0=Negative, 1=Neutral, 2=Positive); confirm with model.config.id2label.
    sentiment_labels = ["Negative", "Neutral", "Positive"]
    return sentiment_labels[predicted_class], probabilities[predicted_class].item()

# Example
text = "Giảng viên dạy rất hay và tâm huyết."
sentiment, confidence = predict_sentiment(text)
print(f"Sentiment: {sentiment}, Confidence: {confidence:.3f}")
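To make the softmax and argmax step concrete without loading a model, here is a pure-Python sketch of how three raw logits become a label and a confidence score. The logit values are invented for illustration, and `decode` is a hypothetical helper; the label order is the same assumption as in the snippet above.

```python
import math
from typing import List, Tuple

LABELS = ["Negative", "Neutral", "Positive"]  # assumed order, as above


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode(logits: List[float]) -> Tuple[str, float]:
    """Pick the most probable label and report its probability as confidence."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[best], probabilities[best]


# Made-up logits standing in for outputs.logits[0].
label, confidence = decode([-1.2, 0.3, 2.5])
print(f"Sentiment: {label}, Confidence: {confidence:.3f}")
```

The confidence is only the probability the model assigns to its top class; it is not a calibrated measure of correctness.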

📊 Dataset Information

The UIT-VSFC corpus contains over 16,000 Vietnamese student feedback sentences with:

  • Sentiment Classification: Positive, Neutral, Negative
  • Topic Classification: Various educational topics
  • Inter-annotator agreement: >91% for sentiment, >71% for topics
  • Original F1-score: ~88% for sentiment (Maximum Entropy baseline)

🔧 Hardware Requirements

  • Minimum: 8GB RAM, CPU
  • Recommended: GPU with 8GB+ VRAM for faster training
  • Storage: ~2GB for model and datasets

📝 License

This project uses open-source components for educational and research purposes. Please check individual licenses for:

  • 5CD-AI/Vietnamese-Sentiment-visobert
  • uitnlp/vietnamese_students_feedback

🤝 Contributing

Feel free to submit issues and enhancement requests!

📄 Citation

If you use this work or the dataset, please cite:

@InProceedings{8573337,
  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
  title={UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis},
  year={2018},
  volume={},
  number={},
  pages={19-24},
  doi={10.1109/KSE.2018.8573337}
}

Quick Start: python train.py && python web.py

Alternative: python main.py train && python main.py web