---
title: Multimodal Gemma-270M Demo
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
license: mit
---

# Multimodal Gemma-270M Demo

A live demo of a multimodal vision-language model based on Google's Gemma-270M, trained using the LLaVA architecture.

## Model Info

- **Base Model**: Google Gemma-270M (270 million parameters)
- **Vision Encoder**: CLIP ViT-Large/14 @ 336px
- **Architecture**: LLaVA-style vision-language fusion
- **Training**: 7 epochs on the LLaVA-150K dataset
- **Trainable Parameters**: 18.6M of 539M total

## Features

- 🖼️ **Image Understanding**: Upload any image and ask questions about it
- 💬 **Conversational AI**: Natural-language responses about visual content
- 🎯 **Instruction Following**: Follows specific questions and prompts
- ⚙️ **Adjustable Parameters**: Control response length and creativity

## Usage

1. **Load Model**: Click "🚀 Load Model" to download and initialize the model
2. **Upload Image**: Use the image upload area to select your image
3. **Ask Questions**: Type your question in the text box
4. **Get Response**: The model analyzes the image and generates a response

## Example Questions

- "What do you see in this image?"
- "Describe the main objects in the picture."
- "What colors are prominent in this image?"
- "Are there any people in the image?"
- "What's the setting or location?"

## Technical Details

The model uses:

- **Vision Processing**: CLIP for image encoding
- **Language Generation**: Gemma-270M with LoRA fine-tuning
- **Multimodal Fusion**: A trainable projection layer
- **Quantization**: 4-bit for efficient inference

A minimal loading sketch illustrating how these pieces fit together appears at the end of this README.

## Links

- **Model Repository**: [sagar007/multimodal-gemma-270m-llava](https://huggingface.co/sagar007/multimodal-gemma-270m-llava)
- **Source Code**: [GitHub](https://github.com/sagar431/multimodal-gemma-270m)

---

*Built with [Claude Code](https://claude.ai/code)*
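
## Programmatic Inference (Sketch)

The demo's actual loading code lives in `app.py` and the model repository. The snippet below is only a minimal sketch of how the components listed under Technical Details (CLIP vision encoder, 4-bit Gemma backbone, trainable projection layer) could be wired together with standard Hugging Face APIs. The model ids `openai/clip-vit-large-patch14-336` and `google/gemma-3-270m`, the example image path, and the single-linear-layer projector are assumptions for illustration, not the repository's exact implementation.

```python
# Illustrative sketch only; not the repo's actual loading code.
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    CLIPImageProcessor,
    CLIPVisionModel,
)

# 4-bit quantization for the language backbone, as described in Technical Details.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Vision encoder: CLIP ViT-Large/14 @ 336px (model id assumed).
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Language backbone: Gemma 270M (model id assumed; check the repo for the exact base).
lm = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-270m", quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")

# Trainable projection layer mapping CLIP patch features into the LM embedding space
# (a single linear layer here; the trained projector may differ).
projector = torch.nn.Linear(vision.config.hidden_size, lm.config.hidden_size)

# Encode an image and project it. In a LLaVA-style model, these projected embeddings
# are concatenated with the text token embeddings before generation.
pixel_values = processor(images=Image.open("example.jpg"), return_tensors="pt").pixel_values
image_feats = vision(pixel_values=pixel_values).last_hidden_state  # (1, num_patches + 1, 1024)
image_embeds = projector(image_feats)                              # (1, num_patches + 1, lm hidden size)
```

In the hosted demo, none of this is required: clicking "🚀 Load Model" performs the equivalent setup, and the projected image embeddings are handled internally when you upload an image and ask a question.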