Converting MedGemma to Text-Only: Achieving 9x Inference Speedup for Clinical Decision Support
Authors: Electric Sheep Africa
Date: January 2026
Keywords: MedGemma, vLLM, inference optimization, multimodal models, healthcare AI
Abstract
MedGemma, Google's medical-domain large language model based on Gemma 3, offers superior clinical reasoning capabilities but suffers from slow inference times (~22 seconds) due to its multimodal architecture. This paper presents a novel approach to convert MedGemma from its multimodal Gemma3ForConditionalGeneration architecture to a text-only Gemma3ForCausalLM variant, enabling compatibility with optimized inference engines like vLLM. Our conversion process achieves 9x inference speedup (from ~22s to ~2.4s) while preserving the model's medical knowledge, making it practical for real-time clinical decision support in low-resource healthcare settings.
1. Introduction
1.1 Background
Large Language Models (LLMs) are increasingly being deployed in healthcare settings to assist clinical decision-making. MedGemma, released by Google in 2025, represents a significant advancement in medical AI, offering pre-trained knowledge of clinical terminology, diagnostic reasoning, and treatment protocols.
However, deploying MedGemma in production environments, particularly in low-resource settings common across sub-Saharan Africa, presents significant challenges:
- Slow inference times: The multimodal architecture adds computational overhead even for text-only queries
- Limited infrastructure compatibility: Optimized inference engines (vLLM, TGI) don't fully support multimodal Gemma 3
- Resource constraints: Healthcare facilities in developing regions often have limited computational resources
1.2 Problem Statement
MedGemma uses the Gemma3ForConditionalGeneration architecture, which includes:
- A SigLIP vision encoder (~400M parameters)
- A multi-modal projector
- A language model backbone (~3.6B parameters)
For text-only clinical queries (the primary use case for Community Health Worker assistants), the vision components are unused but still impose:
- Memory overhead from loading vision weights
- Incompatibility with vLLM's optimized text generation
- Slower tokenization through the multimodal processor
1.3 Contribution
We present a conversion methodology that:
- Extracts the language model backbone from MedGemma
- Removes vision tower weights and strips the `language_model.` prefix
- Reconfigures the model for the `Gemma3ForCausalLM` architecture
- Enables deployment with vLLM for optimized inference
2. Related Work
2.1 Gemma 3 Architecture
Gemma 3, released by Google DeepMind in March 2025, introduced a unified architecture supporting both text-only and multimodal inference:
| Class | Use Case | Vision Support |
|---|---|---|
| `Gemma3ForCausalLM` | Text-only generation | No |
| `Gemma3ForConditionalGeneration` | Multimodal (text + images) | Yes |
The HuggingFace documentation notes: "Gemma3ForCausalLM can be used to load the vision language models like they were language models (omitting the vision tower)."
2.2 vLLM and Optimized Inference
vLLM (Virtual Large Language Model) provides significant inference optimizations through:
- PagedAttention: Efficient KV cache memory management
- Continuous batching: Dynamic request batching
- CUDA graph optimization: Reduced kernel launch overhead
However, vLLM's support for multimodal models requires additional components (image processors, vision encoders) that add complexity and limit optimization potential.
2.3 Medical LLM Deployment Challenges
Previous work on medical LLM deployment has focused on:
- Quantization (4-bit, 8-bit) for memory reduction
- Knowledge distillation to smaller models
- Domain-specific fine-tuning
Our approach is complementary, focusing on architectural simplification rather than model compression.
3. Methodology
3.1 Weight Analysis
We analyzed the MedGemma weight structure using safetensors inspection:
```python
from safetensors.torch import load_file

weights = load_file("model.safetensors")
for key in weights.keys():
    print(key)
```
Findings:
| Weight Prefix | Parameters | Purpose |
|---|---|---|
| `vision_tower.*` | ~400M | SigLIP image encoder |
| `multi_modal_projector.*` | ~10M | Vision-language alignment |
| `language_model.model.*` | ~3.6B | Text generation backbone |
| `language_model.lm_head.*` | ~100M | Output projection |
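The per-prefix totals above can be reproduced by grouping tensor sizes by the leading dotted component of each key. A minimal sketch; the shapes below are illustrative placeholders, not MedGemma's real dimensions (with real weights you would pass `{k: tuple(t.shape) for k, t in load_file("model.safetensors").items()}`):

```python
from collections import defaultdict
from math import prod

def count_params_by_prefix(shapes):
    """Sum parameter counts grouped by the first dotted key component."""
    totals = defaultdict(int)
    for key, shape in shapes.items():
        totals[key.split(".")[0]] += prod(shape)
    return dict(totals)

# Illustrative keys and shapes only.
shapes = {
    "vision_tower.encoder.layers.0.attn.q_proj.weight": (1152, 1152),
    "multi_modal_projector.weight": (1152, 2560),
    "language_model.model.embed_tokens.weight": (262144, 2560),
}
print(count_params_by_prefix(shapes))
```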
3.2 Conversion Process
Our conversion involves three steps:
Step 1: Weight Extraction and Renaming
```python
from collections import OrderedDict

new_weights = OrderedDict()
for key, tensor in original_weights.items():
    # Skip vision components
    if key.startswith('vision_tower.') or key.startswith('multi_modal_projector.'):
        continue
    # Strip the language_model. prefix
    if key.startswith('language_model.'):
        new_key = key.replace('language_model.', '', 1)
    else:
        new_key = key
    new_weights[new_key] = tensor
```
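Applied to the key layout from Section 3.1, the filter-and-rename pass behaves as follows (dummy strings stand in for real tensors):

```python
from collections import OrderedDict

def convert_keys(original_weights):
    """Drop vision weights and strip the language_model. prefix."""
    new_weights = OrderedDict()
    for key, tensor in original_weights.items():
        # Vision components are unused for text-only inference
        if key.startswith(('vision_tower.', 'multi_modal_projector.')):
            continue
        if key.startswith('language_model.'):
            key = key.replace('language_model.', '', 1)
        new_weights[key] = tensor
    return new_weights

dummy = OrderedDict([
    ("vision_tower.encoder.weight", "t0"),
    ("multi_modal_projector.weight", "t1"),
    ("language_model.model.embed_tokens.weight", "t2"),
    ("language_model.lm_head.weight", "t3"),
])
print(list(convert_keys(dummy)))
# ['model.embed_tokens.weight', 'lm_head.weight']
```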
Step 2: Configuration Transformation
The multimodal config structure:
```json
{
  "architectures": ["Gemma3ForConditionalGeneration"],
  "model_type": "gemma3",
  "text_config": { ... },
  "vision_config": { ... }
}
```
Becomes text-only config:
```json
{
  "architectures": ["Gemma3ForCausalLM"],
  "model_type": "gemma3_text",
  "vocab_size": 262144,
  "hidden_size": 2560,
  "num_hidden_layers": 34,
  ...
}
```
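The config rewrite amounts to hoisting the `text_config` fields to the top level, dropping `vision_config`, and swapping the architecture and `model_type` identifiers. A minimal sketch (a real conversion may need to carry over additional text fields such as rope or attention settings, which this example does not enumerate):

```python
def to_text_only_config(mm_config):
    """Flatten a multimodal Gemma 3 config dict into a text-only one."""
    drop = ("text_config", "vision_config", "architectures", "model_type")
    cfg = {k: v for k, v in mm_config.items() if k not in drop}
    cfg.update(mm_config.get("text_config", {}))  # hoist text fields to top level
    cfg["architectures"] = ["Gemma3ForCausalLM"]
    cfg["model_type"] = "gemma3_text"
    return cfg

mm = {
    "architectures": ["Gemma3ForConditionalGeneration"],
    "model_type": "gemma3",
    "text_config": {"hidden_size": 2560, "num_hidden_layers": 34},
    "vision_config": {"hidden_size": 1152, "num_hidden_layers": 27},
}
print(to_text_only_config(mm))
```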
Step 3: Tokenizer Preservation
The tokenizer files remain unchanged, as MedGemma uses the same tokenizer for text processing regardless of vision capabilities.
3.3 Validation
We validate the conversion by:
- Loading with `AutoModelForCausalLM`
- Comparing output distributions on identical prompts
- Measuring inference latency
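The latency check can be done with a simple wall-clock harness around any generate callable. A sketch; the lambda below is a stand-in for a real call such as `model.generate(...)`:

```python
import time

def measure_latency(generate, prompt, runs=3):
    """Return mean wall-clock seconds per call over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Stand-in callable; substitute the real model invocation.
latency = measure_latency(lambda p: p.upper(), "Child has fever for 3 days")
print(f"{latency:.6f}s per call")
```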
4. Experimental Setup
4.1 Hardware
| Configuration | GPU | Memory | Cost/hr |
|---|---|---|---|
| Baseline | NVIDIA A100 80GB | 80GB HBM2e | ~$2.00 |
| Comparison | NVIDIA L4 | 24GB | ~$0.80 |
4.2 Models
| Model | Architecture | Size | vLLM Compatible |
|---|---|---|---|
| chewie-merged | Gemma3ForConditionalGeneration | 4.3GB | No |
| chewie-text-only | Gemma3ForCausalLM | 3.2GB | Yes |
4.3 Evaluation Metrics
- Inference Latency: Time from request to complete response
- Throughput: Tokens generated per second
- Clinical Accuracy: Manual evaluation of diagnostic reasoning
- Memory Usage: Peak GPU memory during inference
5. Results
5.1 Inference Performance
| Model | Engine | Latency (250 tokens) | Tokens/sec |
|---|---|---|---|
| chewie-merged | Custom Handler | 22.9s | 10.9 |
| chewie-merged | vLLM | N/A (incompatible) | - |
| chewie-text-only | vLLM (HF Endpoints) | 2.4s | 104.2 |
| chewie-llama-merged | vLLM | 4.6s | 54.3 |
Key Finding: Converting to text-only architecture enables vLLM compatibility, achieving 9.5x speedup.
5.1.1 Production Deployment
The text-only model is deployed on Hugging Face Inference Endpoints:
- Endpoint: `https://gcg0cdnosq6n7qqo.us-east-1.aws.endpoints.huggingface.cloud`
- Container: vLLM TGI (Text Generation Inference)
- API: OpenAI-compatible `/v1/completions` endpoint
- Measured Latency: 2.4 seconds for 250 tokens
5.2 Memory Reduction
| Model | Weights Size | GPU Memory (Inference) |
|---|---|---|
| chewie-merged | 4.3GB | ~12GB |
| chewie-text-only | 3.2GB | ~8GB |
The removal of vision components reduces model size by 25%.
5.3 Clinical Quality Assessment
We evaluated both models on 50 clinical scenarios covering:
- Pediatric emergencies
- Maternal health
- Infectious diseases
- Chronic conditions
| Metric | chewie-merged | chewie-text-only |
|---|---|---|
| Correct Diagnosis | 92% | 92% |
| Appropriate Referral | 96% | 96% |
| Danger Sign Detection | 98% | 98% |
| Hallucination Rate | 2% | 2% |
Clinical quality is preserved after conversion, as the language model weights remain unchanged.
5.4 Example Output Comparison
Input: "A pregnant woman at 32 weeks has severe headaches, blurred vision, and swelling in her hands and face. BP is 160/110 with protein in urine."
chewie-merged (22.9s):
Assessment: Severe preeclampsia - presenting with hypertension, proteinuria, headaches, visual disturbances, and edema at 32 weeks gestation. Action: Immediate Referral - This is a medical emergency...
chewie-text-only (2.4s):
Assessment: Severe preeclampsia - presenting with hypertension (160/110), proteinuria, severe headache, visual changes, and facial/hand edema at 32 weeks. Action: Immediate Referral - Medical emergency requiring urgent obstetric care...
Outputs are clinically equivalent, with the text-only version generating in 9.5x less time.
6. Discussion
6.1 Why This Works
The multimodal Gemma 3 architecture keeps the language model as a separate submodule (language_model.*), making extraction straightforward. The vision tower is only connected through the multi-modal projector, which is unused for text-only inputs.
6.2 Limitations
- Loss of Vision Capability: The converted model cannot process images
- Architecture Specificity: This approach is specific to Gemma 3's modular design
- Fine-tuning Preservation: Models fine-tuned on multimodal data may lose some learned associations
6.3 Broader Implications
This technique can be applied to other multimodal models with similar architectures:
- LLaVA variants
- Qwen-VL
- Future multimodal medical models
6.4 Deployment Recommendations
For clinical decision support systems in low-resource settings:
| Use Case | Recommended Model | Expected Latency |
|---|---|---|
| Text-only queries | chewie-text-only + vLLM | ~2.4s |
| Image analysis needed | chewie-merged + Custom Handler | ~22s |
| Lowest latency required | chewie-text-only + vLLM | ~2.4s |
| Highest clinical accuracy | chewie-text-only + vLLM | ~2.4s |
7. Conclusion
We have demonstrated that MedGemma can be converted from a multimodal to text-only architecture, enabling:
- 9.5x inference speedup (22.9s → 2.4s)
- 25% memory reduction (4.3GB → 3.2GB)
- vLLM compatibility for production deployment on HF Inference Endpoints
- Preserved clinical accuracy (92% diagnostic accuracy maintained)
- OpenAI-compatible API via the `/v1/completions` endpoint
This conversion makes MedGemma practical for real-time clinical decision support, particularly valuable in healthcare settings where response time directly impacts patient care. The 2.4-second response time enables natural conversational interactions between Community Health Workers and the AI assistant.
References
Google DeepMind. (2025). MedGemma: Medical Domain Language Model. Google AI Blog.
Google DeepMind. (2025). Gemma 3: Multimodal, Multilingual, Long Context Open LLM. arXiv:2503.xxxxx.
Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP '23.
HuggingFace. (2025). Gemma 3 Documentation. https://huggingface.co/docs/transformers/model_doc/gemma3
vLLM Project. (2025). Supported Models. https://docs.vllm.ai/models/supported_models
Appendix A: Weight Mapping
| Original Key | Converted Key |
|---|---|
| `language_model.model.embed_tokens.weight` | `model.embed_tokens.weight` |
| `language_model.model.layers.0.self_attn.q_proj.weight` | `model.layers.0.self_attn.q_proj.weight` |
| `language_model.model.norm.weight` | `model.norm.weight` |
| `language_model.lm_head.weight` | `lm_head.weight` |
| `vision_tower.*` | (removed) |
| `multi_modal_projector.*` | (removed) |
Appendix B: Configuration Differences
Multimodal Config (Before)
```json
{
  "architectures": ["Gemma3ForConditionalGeneration"],
  "model_type": "gemma3",
  "text_config": {
    "hidden_size": 2560,
    "num_hidden_layers": 34,
    "num_attention_heads": 10,
    "num_key_value_heads": 2
  },
  "vision_config": {
    "hidden_size": 1152,
    "num_hidden_layers": 27,
    "num_attention_heads": 16
  }
}
```
Text-Only Config (After)
```json
{
  "architectures": ["Gemma3ForCausalLM"],
  "model_type": "gemma3_text",
  "hidden_size": 2560,
  "num_hidden_layers": 34,
  "num_attention_heads": 10,
  "num_key_value_heads": 2,
  "max_position_embeddings": 8192
}
```
Correspondence: research@electricsheepafrica.com
Chewie Text-Only (MedGemma)
Text-only version of Chewie/MedGemma for fast vLLM inference.
Performance
| Model | Architecture | vLLM | Speed |
|---|---|---|---|
| chewie-merged | Gemma3ForConditionalGeneration | No | ~22s |
| chewie-text-only | Gemma3ForCausalLM | Yes | ~5s |
Usage with vLLM
```python
from openai import OpenAI

client = OpenAI(
    base_url="YOUR_ENDPOINT/v1/",
    api_key="YOUR_TOKEN",
)

response = client.chat.completions.create(
    model="electricsheepafrica/chewie-text-only",
    messages=[{"role": "user", "content": "Child has fever for 3 days"}],
    max_tokens=200,
    temperature=0.3,
)
print(response.choices[0].message.content)
```
What Changed
- Removed vision tower (~1GB saved)
- Changed architecture to Gemma3ForCausalLM
- Stripped the `language_model.` prefix from weights
- Reduced `max_position_embeddings` to 8192