# 🎭 Vietnamese Sentiment Analysis

A comprehensive Vietnamese sentiment analysis system built with transformer models, featuring training, testing, demo, and web interface capabilities with advanced memory management.

## 🚀 Features

- **🤖 Transformer-based Model**: Fine-tuned Vietnamese sentiment analysis using ViSoBERT
- **🌐 Interactive Web Interface**: Real-time sentiment analysis via Gradio with memory optimization
- **📊 Comprehensive Testing**: Model evaluation with confusion matrix and classification metrics
- **⚡ Memory Efficient**: Built-in memory management, batch processing limits, and quantization support
- **🎯 Easy to Use**: Simple command-line interface and web UI
- **📈 Performance Monitoring**: Real-time memory usage tracking and optimization

## 📁 Project Structure

```
SentimentAnalysis/
├── README.md                        # 📚 This file
├── requirements.txt                 # 📦 Python dependencies
├── .gitignore                       # 🚫 Git ignore rules
│
├── py/                              # 🐍 Core Python modules
│   ├── __init__.py                  # Package initialization
│   ├── fine_tune_sentiment.py       # 🔧 Core fine-tuning utilities
│   ├── test_model.py                # 🧪 Model testing and evaluation
│   ├── demo.py                      # 💻 Demo functionality
│   └── gradio_app.py                # 🌐 Web interface (memory-optimized)
│
├── main.py                          # 🚀 Main entry point (all commands)
├── train.py                         # 🏋️ Training script
├── test.py                          # 🧪 Testing script
├── demo.py                          # 💻 Interactive demo
├── web.py                           # 🌐 Web interface launcher
│
├── vietnamese_sentiment_finetuned/  # 🤖 Trained model (auto-generated)
├── confusion_matrix.png             # 📊 Evaluation visualization (auto-generated)
├── training_history.png             # 📈 Training progress (auto-generated)
├── pdf/                             # 📄 Documentation folder
├── venv/                            # 🐍 Virtual environment
├── .git/                            # 📁 Git repository
└── .claude/                         # 🤖 Claude configuration
```

## 🛠️ Installation

1. **Clone and Setup Environment**

   ```bash
   cd SentimentAnalysis
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

## 🎯 Usage

### Quick Start Options

#### **Option 1: Use Individual Scripts**

```bash
# Train the model
python train.py

# Test the model
python test.py

# Run interactive demo
python demo.py

# Launch web interface
python web.py
```

#### **Option 2: Use Main Entry Point**

```bash
# Train with custom settings
python main.py train --batch-size 32 --epochs 5

# Test the model
python main.py test --model-path ./vietnamese_sentiment_finetuned

# Run interactive demo
python main.py demo

# Launch web interface with memory options
python main.py web --quantize --max-batch-size 20 --port 8080
```

### 1. Training the Model

```bash
# Basic training
python train.py

# Custom batch size and epochs
python train.py 32 5

# Using main script
python main.py train --batch-size 32 --epochs 5 --learning-rate 1e-5
```

### 2. Testing the Model

```bash
# Basic testing
python test.py

# Test with custom model path
python test.py /path/to/custom/model

# Using main script
python main.py test --model-path ./vietnamese_sentiment_finetuned
```

### 3. Interactive Demo

```bash
# Run demo
python demo.py

# Using main script
python main.py demo
```

### 4. Web Interface

```bash
# Standard usage (memory-efficient defaults)
python web.py

# High memory efficiency (quantization + small batches)
python web.py --quantize --max-batch-size 5 --max-memory 2048

# Large batch processing
python web.py --max-batch-size 20 --max-memory 8192

# Custom server configuration
python web.py --port 8080 --host 0.0.0.0 --quantize

# Using main script
python main.py web --quantize --max-batch-size 20 --port 8080
```

## 🌐 Web Interface Features

The Gradio web interface provides:

### 📝 Single Text Analysis

- Real-time sentiment prediction
- Confidence scores with visual charts
- Memory usage monitoring
- Example texts for quick testing

### 📊 Batch Analysis

- Process multiple texts at once
- Memory-efficient batch processing
- Automatic batch size limits
- Batch summary with sentiment distribution

### 🛡️ Memory Management

- **Automatic Cleanup**: Memory is cleaned up after each prediction
- **Batch Limits**: Configurable maximum number of texts per batch
- **Memory Monitoring**: Real-time memory usage tracking
- **GPU Optimization**: CUDA cache clearing when a GPU is available
- **Quantization**: Optional model quantization for CPU (~4x memory reduction)

### ℹ️ Model Information

- Detailed model specifications
- Performance metrics
- Memory management settings
- Usage tips and troubleshooting

## 🔧 Command Line Options

### Individual Scripts

#### `train.py`

```bash
python train.py [batch_size] [epochs]
```

#### `test.py`

```bash
python test.py [model_path]
```

#### `demo.py`

```bash
python demo.py
```

#### `web.py`

```bash
python web.py [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]
```

### Main Entry Point (`main.py`)

#### Training Command

```bash
python main.py train [--batch-size SIZE] [--epochs NUM] [--learning-rate RATE]
```

#### Testing Command

```bash
python main.py test [--model-path PATH]
```

#### Demo Command

```bash
python main.py demo
```

#### Web Interface Command

```bash
python main.py web [--max-batch-size SIZE] [--quantize] [--max-memory MB] [--port PORT] [--host HOST]
```

**Web Interface Options:**

- `--max-batch-size`: Maximum batch size for memory efficiency (default: 10)
- `--quantize`: Enable model quantization for memory efficiency (CPU only)
- `--max-memory`: Maximum memory usage in MB (default: 4096)
- `--port`: Port to run the interface on (default: 7862)
- `--host`: Host to bind the interface to (default: 127.0.0.1)

## 📊 Model Details

- **Base Model**: 5CD-AI/Vietnamese-Sentiment-visobert
- **Dataset**: uitnlp/vietnamese_students_feedback
- **Labels**: Negative, Neutral, Positive
- **Language**: Vietnamese
- **Architecture**: Transformer-based sequence classification
- **Max Sequence Length**: 512 tokens

## 📈 Performance Metrics

- **Accuracy**: 85-90% (on validation set)
- **Processing Speed**: ~100ms per text
- **Memory Usage**: Configurable (default 4GB limit)
- **Batch Processing**: Up to 20 texts (configurable)

## 🛡️ Memory Management

The system includes comprehensive memory management:

### Automatic Features

- Memory cleanup after each prediction
- GPU cache clearing for CUDA
- Garbage collection management
- Memory monitoring before/after operations

### User Controls

- Configurable batch size limits
- Memory limit enforcement
- Manual memory cleanup button
- Real-time memory usage display

### Optimization Options

- Dynamic quantization (CPU only)
- Batch processing optimization
- Memory-efficient inference

## 🔍 Troubleshooting

### Memory Issues

- Enable quantization: `python web.py --quantize`
- Reduce batch size: `python web.py --max-batch-size 5`
- Lower memory limit: `python web.py --max-memory 2048`
- Use manual cleanup: Click the "Memory Cleanup" button in the web interface

### Model Loading Issues

- Ensure the model is trained: `python train.py`
- Check the model directory: `ls -la vietnamese_sentiment_finetuned/`
- Verify dependencies: `pip install -r requirements.txt`

### Performance Optimization

- Use a GPU if available (CUDA)
- Enable quantization for CPU inference
- Monitor memory usage in the web interface
- Adjust batch size based on available memory

## 📄 Requirements

See `requirements.txt` for the complete dependency list:

```
torch>=2.0.0
transformers>=4.21.0
datasets>=2.0.0
gradio>=4.0.0
pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
psutil>=5.9.0
```

## 🎯 Example Usage

### Command Line Demo

```python
from py.demo import SentimentDemo

demo = SentimentDemo()
demo.load_model()
demo.interactive_demo()
```

### Web Interface

1. Train model: `python train.py`
2. Launch interface: `python web.py`
3. Open browser to `http://127.0.0.1:7862`
4. Enter Vietnamese text for analysis

### Batch Processing

```python
from py.gradio_app import SentimentGradioApp

app = SentimentGradioApp(max_batch_size=20)
app.load_model()

texts = ["Tuyệt vời!", "Bình thường", "Rất tệ"]
results, summary = app.batch_predict(texts)
```

### Model Testing

```python
from py.test_model import SentimentTester

tester = SentimentTester(model_path="./vietnamese_sentiment_finetuned")
tester.load_model()

sentiment, confidence = tester.predict_sentiment("Giảng viên dạy rất hay!")
```

### Fine-Tuning

```python
from py.fine_tune_sentiment import SentimentFineTuner

fine_tuner = SentimentFineTuner(
    model_name="5CD-AI/Vietnamese-Sentiment-visobert",
    dataset_name="uitnlp/vietnamese_students_feedback"
)

train_result, eval_results = fine_tuner.run_fine_tuning(
    output_dir="./my_model",
    learning_rate=2e-5,
    batch_size=16,
    num_epochs=3
)
```

## 📝 Model Loading Examples

### Loading the Fine-Tuned Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./vietnamese_sentiment_finetuned")
model = AutoModelForSequenceClassification.from_pretrained("./vietnamese_sentiment_finetuned")
```

### Making Predictions

```python
import torch

def predict_sentiment(text):
    # Tokenize the input; `tokenizer` and `model` come from the loading step above
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to class probabilities and pick the most likely class
    predictions = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

    sentiment_labels = ["Negative", "Neutral", "Positive"]
    return sentiment_labels[predicted_class], predictions[0][predicted_class].item()

# Example
text = "Giảng viên dạy rất hay và tâm huyết."
sentiment, confidence = predict_sentiment(text)
print(f"Sentiment: {sentiment}, Confidence: {confidence:.3f}")
```

## 📊 Dataset Information

The UIT-VSFC corpus contains over 16,000 Vietnamese student feedback sentences with:

- **Sentiment Classification**: Positive, Neutral, Negative
- **Topic Classification**: Various educational topics
- **Inter-annotator agreement**: >91% for sentiment, >71% for topics
- **Original F1-score**: ~88% for sentiment (Maximum Entropy baseline)

## 🔧 Hardware Requirements

- **Minimum**: 8GB RAM, CPU
- **Recommended**: GPU with 8GB+ VRAM for faster training
- **Storage**: ~2GB for model and datasets

## 📝 License

This project uses open-source components for educational and research purposes. Please check the individual licenses for:

- 5CD-AI/Vietnamese-Sentiment-visobert
- uitnlp/vietnamese_students_feedback

## 🤝 Contributing

Feel free to submit issues and enhancement requests!

## 📄 Citation

If you use this work or the dataset, please cite:

```bibtex
@InProceedings{8573337,
  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
  title={UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis},
  year={2018},
  volume={},
  number={},
  pages={19-24},
  doi={10.1109/KSE.2018.8573337}
}
```

---

**Quick Start**: `python train.py && python web.py`

**Alternative**: `python main.py train && python main.py web`
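
## 🧩 Appendix: Batch Limit and Cleanup Sketch

The Memory Management sections of this README mention batch size limits, garbage collection, and CUDA cache clearing after predictions, but do not show how these fit together. The sketch below is one possible way to combine them and is not taken from `py/gradio_app.py`. The names `chunked`, `cleanup_memory`, and `predict_in_batches` are hypothetical, and `predict_fn` stands in for whatever callable actually runs the model on a list of texts.

```python
import gc
from typing import Callable, Iterator, List, Sequence


def chunked(texts: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(texts), max_batch_size):
        yield list(texts[start:start + max_batch_size])


def cleanup_memory() -> None:
    """Run Python garbage collection and clear the CUDA cache when possible."""
    gc.collect()
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def predict_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[List[str]], list],
    max_batch_size: int = 10,  # assumption: mirrors the --max-batch-size default
) -> list:
    """Apply `predict_fn` to bounded batches, cleaning up after each one."""
    results: list = []
    for batch in chunked(texts, max_batch_size):
        results.extend(predict_fn(batch))
        cleanup_memory()
    return results


if __name__ == "__main__":
    # Hypothetical stand-in predictor; a real one would call the fine-tuned model.
    def fake_predict(batch: List[str]) -> list:
        return [("Neutral", 1.0) for _ in batch]

    sample = ["Tuyệt vời!", "Bình thường", "Rất tệ"]
    print(predict_in_batches(sample, fake_predict, max_batch_size=2))
```

Keeping the batching and cleanup logic separate from the model call means the same loop can wrap a quantized CPU model or a GPU model without changes. Whether the project's own `batch_predict` works this way is an assumption.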