Multimodal Gemma 270M - LLaVA Architecture

This is a multimodal vision-language model based on Google's Gemma-270M, trained using the LLaVA architecture.

Model Details

  • Base Model: Google Gemma-270M (270 million parameters)
  • Vision Encoder: CLIP ViT-Large/14@336px
  • Architecture: LLaVA-style vision-language fusion
  • Training: 7 epochs on LLaVA-150K dataset
  • Trainable Parameters: 18.6M / 539M total
  • Quantization: 4-bit with LoRA fine-tuning
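
The 4-bit + LoRA setup can be approximated with standard Hugging Face tooling. A minimal sketch, assuming bitsandbytes and peft; the base-model id and LoRA hyperparameters below are assumptions, not stated in this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_lm = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-270m",  # assumed Hub id for Gemma 270M
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections (hypothetical hyperparameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
base_lm = get_peft_model(base_lm, lora_config)
base_lm.print_trainable_parameters()  # only the adapters (a few M params) train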

Usage

from src.models import MultimodalGemmaLightning
from src.utils.config import load_config, merge_configs

# Load config
model_config = load_config("configs/model_config.yaml")
training_config = load_config("configs/training_config.yaml")
data_config = load_config("configs/data_config.yaml")
config = merge_configs([model_config, training_config, data_config])

# Load the Lightning checkpoint (strict=False tolerates non-matching state-dict keys)
model = MultimodalGemmaLightning.load_from_checkpoint(
    "final_model.ckpt",
    config=config,
    strict=False
)
model.eval()
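
If the checkpoint is not already local, it can be fetched from this repo first. A minimal sketch using huggingface_hub; the filename matches the snippet above:

from huggingface_hub import hf_hub_download

# Download final_model.ckpt from this repo (assumes it sits at the repo root)
ckpt_path = hf_hub_download(
    repo_id="sagar007/multimodal-gemma-270m-llava",
    filename="final_model.ckpt",
)

Pass ckpt_path to load_from_checkpoint in place of the local path above.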

Training Details

  • Dataset: LLaVA-150K instruction tuning dataset
  • Training Time: ~12 hours on A100 GPU
  • Loss: training loss decreased from an initial ~3.3 and converged to a stable plateau
  • Vision-Language Fusion: projected image embeddings replace <image> placeholder tokens in the input sequence (see the sketch below)
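
The fusion step works by swapping placeholder-token embeddings for projected vision features. A minimal sketch of this LLaVA-style mechanism; the function and argument names are illustrative, not taken from this repo's src/ code:

import torch

def fuse_image_tokens(text_embeds, input_ids, image_embeds, image_token_id):
    # text_embeds:  (seq_len, d_model) embedded text tokens
    # input_ids:    (seq_len,) token ids, containing placeholder ids
    # image_embeds: (num_patches, d_model) CLIP features after the projector
    mask = input_ids == image_token_id
    assert mask.sum().item() == image_embeds.shape[0], "one placeholder per image patch"
    fused = text_embeds.clone()
    fused[mask] = image_embeds  # overwrite placeholder positions with image features
    return fused  # fed to the language model as inputs_embeds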

Capabilities

This model can:

  • Process images and answer questions about them
  • Describe visual content in images
  • Follow vision-language instructions
  • Generate relevant text based on image content

Note: As a small model with limited training, responses may be simple but are contextually relevant to the input images.

Demo

Try the live demo: Gradio Space
