# Multimodal Gemma 270M - LLaVA Architecture
This is a multimodal vision-language model that pairs Google's Gemma-270M language model with a CLIP vision encoder in a LLaVA-style architecture.
## Model Details
- Base Model: Google Gemma-270M (270 million parameters)
- Vision Encoder: CLIP ViT-Large/14@336px
- Architecture: LLaVA-style vision-language fusion
- Training: 7 epochs on LLaVA-150K dataset
- Trainable Parameters: 18.6M / 539M total
- Quantization: 4-bit with LoRA fine-tuning
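
The exact fine-tuning configuration is not listed on this card. The snippet below is a minimal sketch of a 4-bit + LoRA (QLoRA-style) setup in the spirit of the description above, using Hugging Face `transformers` and `peft`; the base checkpoint id, LoRA rank/alpha, and target modules are illustrative assumptions rather than values taken from this repository.

```python
# Sketch of a 4-bit + LoRA (QLoRA-style) setup for the language backbone.
# Checkpoint id and LoRA hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Base language model (assumed checkpoint id for Gemma 270M)
lm = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-270m",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; rank/alpha are placeholders
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()  # small trainable fraction, as in the card
```

Freezing the quantized backbone and training only the LoRA adapters (plus the vision projector) is what keeps the trainable parameter count small relative to the 539M total.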
## Usage
```python
from src.models import MultimodalGemmaLightning
from src.utils.config import load_config, merge_configs

# Load config
model_config = load_config("configs/model_config.yaml")
training_config = load_config("configs/training_config.yaml")
data_config = load_config("configs/data_config.yaml")
config = merge_configs([model_config, training_config, data_config])

# Load model
model = MultimodalGemmaLightning.load_from_checkpoint(
    "final_model.ckpt",
    config=config,
    strict=False,
)
model.eval()
```
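
The checkpoint class above is project-specific, so its generation interface is not documented on this card. The sketch below shows a typical inference flow for a LLaVA-style model: preprocess the image with the CLIP processor at 336px, build a prompt containing an image placeholder, and call a generation method. The `<image>` placeholder string and the `generate_from_image` method name are assumptions; consult the repository's `src/` code for the actual API.

```python
# Hedged inference sketch: preprocessing is standard, but `generate_from_image`
# is a hypothetical method name standing in for the module's real entry point.
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

# Conventional LLaVA-style prompt with an image placeholder (assumed token string)
prompt = "<image>\nWhat is shown in this picture?"

# Hypothetical call; see the repository's src/ code for the actual interface.
answer = model.generate_from_image(pixel_values=pixel_values, prompt=prompt, max_new_tokens=64)
print(answer)
```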
## Training Details
- Dataset: LLaVA-150K instruction tuning dataset
- Training Time: ~12 hours on A100 GPU
- Loss: training loss converged from an initial ~3.3 and then stabilized
- Vision-Language Fusion: projected image embeddings replace placeholder tokens in the text sequence (see the sketch below)
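
For illustration, the sketch below shows one common way to implement this fusion: project CLIP patch features into the language model's embedding space with a small MLP, then overwrite the embeddings at the placeholder positions. The hidden sizes and the one-placeholder-token-per-patch convention are assumptions, not details taken from this repository.

```python
# Illustrative LLaVA-style fusion: assumed hidden sizes and a one-placeholder-
# token-per-image-patch convention; not the repository's actual implementation.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping CLIP ViT-L/14 features to the LM embedding width."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 640):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)

def splice_image_tokens(text_embeds, input_ids, image_embeds, image_token_id):
    """Overwrite embeddings at image-placeholder positions with projected patch embeddings.

    text_embeds:  (batch, seq_len, lm_dim) token embeddings from the language model
    input_ids:    (batch, seq_len) token ids containing the placeholder id
    image_embeds: projected patch features, one row per placeholder position
    """
    fused = text_embeds.clone()
    mask = input_ids == image_token_id
    fused[mask] = image_embeds.reshape(-1, fused.size(-1)).to(fused.dtype)
    return fused
```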
## Capabilities
This model can:
- Process images and answer questions about them
- Describe visual content in images
- Follow vision-language instructions
- Generate relevant text based on image content
Note: As a small model with limited training, responses may be simple but are contextually relevant to the input images.
## Demo
Try the live demo: Gradio Space