Multimodal Gemma 270M - LLaVA Architecture

This is a multimodal vision-language model based on Google's Gemma-270M, trained using the LLaVA architecture.

Model Details

  • Base Model: Google Gemma-270M (270 million parameters)
  • Vision Encoder: CLIP ViT-Large/14@336px
  • Architecture: LLaVA-style vision-language fusion
  • Training: 7 epochs on LLaVA-150K dataset
  • Trainable Parameters: 18.6M / 539M total
  • Quantization: 4-bit with LoRA fine-tuning
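
The 4-bit + LoRA setup can be approximated with standard Hugging Face tooling. A minimal sketch, assuming bitsandbytes and peft; the base-model id and LoRA hyperparameters below are assumptions, not stated in this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_lm = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-270m",  # assumed Hub id for Gemma 270M
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections (hypothetical hyperparameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
base_lm = get_peft_model(base_lm, lora_config)
base_lm.print_trainable_parameters()  # only the adapters (a few M params) train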

Usage

from src.models import MultimodalGemmaLightning
from src.utils.config import load_config, merge_configs

# Load config
model_config = load_config("configs/model_config.yaml")
training_config = load_config("configs/training_config.yaml")
data_config = load_config("configs/data_config.yaml")
config = merge_configs([model_config, training_config, data_config])

# Load the Lightning checkpoint (strict=False tolerates non-matching state-dict keys)
model = MultimodalGemmaLightning.load_from_checkpoint(
    "final_model.ckpt",
    config=config,
    strict=False
)
model.eval()
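
If the checkpoint is not already local, it can be fetched from this repo first. A minimal sketch using huggingface_hub; the filename matches the snippet above:

from huggingface_hub import hf_hub_download

# Download final_model.ckpt from this repo (assumes it sits at the repo root)
ckpt_path = hf_hub_download(
    repo_id="sagar007/multimodal-gemma-270m-llava",
    filename="final_model.ckpt",
)

Pass ckpt_path to load_from_checkpoint in place of the local path above.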

Training Details

  • Dataset: LLaVA-150K instruction tuning dataset
  • Training Time: ~12 hours on A100 GPU
  • Loss: training loss decreased from an initial ~3.3 and converged to a stable plateau
  • Vision-Language Fusion: projected image embeddings replace <image> placeholder tokens in the input sequence (see the sketch below)
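
The fusion step works by swapping placeholder-token embeddings for projected vision features. A minimal sketch of this LLaVA-style mechanism; the function and argument names are illustrative, not taken from this repo's src/ code:

import torch

def fuse_image_tokens(text_embeds, input_ids, image_embeds, image_token_id):
    # text_embeds:  (seq_len, d_model) embedded text tokens
    # input_ids:    (seq_len,) token ids, containing placeholder ids
    # image_embeds: (num_patches, d_model) CLIP features after the projector
    mask = input_ids == image_token_id
    assert mask.sum().item() == image_embeds.shape[0], "one placeholder per image patch"
    fused = text_embeds.clone()
    fused[mask] = image_embeds  # overwrite placeholder positions with image features
    return fused  # fed to the language model as inputs_embeds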

Capabilities

This model can:

  • Process images and answer questions about them
  • Describe visual content in images
  • Follow vision-language instructions
  • Generate relevant text based on image content

Note: As a small model with limited training, responses may be simple but are contextually relevant to the input images.

Demo

Try the live demo: Gradio Space
