Converting MedGemma to Text-Only: Achieving 9x Inference Speedup for Clinical Decision Support
Authors: Electric Sheep Africa
Date: January 2026
Keywords: MedGemma, vLLM, inference optimization, multimodal models, healthcare AI
Abstract
MedGemma, Google's medical-domain large language model based on Gemma 3, offers superior clinical reasoning capabilities but suffers from slow inference times (~22 seconds) due to its multimodal architecture. This paper presents a novel approach to convert MedGemma from its multimodal Gemma3ForConditionalGeneration architecture to a text-only Gemma3ForCausalLM variant, enabling compatibility with optimized inference engines like vLLM. Our conversion process achieves 9x inference speedup (from ~22s to ~2.4s) while preserving the model's medical knowledge, making it practical for real-time clinical decision support in low-resource healthcare settings.
1. Introduction
1.1 Background
Large Language Models (LLMs) are increasingly being deployed in healthcare settings to assist clinical decision-making. MedGemma, released by Google in 2025, represents a significant advancement in medical AI, offering pre-trained knowledge of clinical terminology, diagnostic reasoning, and treatment protocols.
However, deploying MedGemma in production environments, particularly in low-resource settings common across sub-Saharan Africa, presents significant challenges:
- Slow inference times: The multimodal architecture adds computational overhead even for text-only queries
- Limited infrastructure compatibility: Optimized inference engines (vLLM, TGI) don't fully support multimodal Gemma 3
- Resource constraints: Healthcare facilities in developing regions often have limited computational resources
1.2 Problem Statement
MedGemma uses the Gemma3ForConditionalGeneration architecture, which includes:
- A SigLIP vision encoder (~400M parameters)
- A multi-modal projector
- A language model backbone (~3.6B parameters)
For text-only clinical queries (the primary use case for Community Health Worker assistants), the vision components are unused but still impose:
- Memory overhead from loading vision weights
- Incompatibility with vLLM's optimized text generation
- Slower tokenization through the multimodal processor
1.3 Contribution
We present a conversion methodology that:
- Extracts the language model backbone from MedGemma
- Removes vision tower weights and strips the `language_model.` prefix
- Reconfigures the model for the `Gemma3ForCausalLM` architecture
- Enables deployment with vLLM for optimized inference
2. Related Work
2.1 Gemma 3 Architecture
Gemma 3, released by Google DeepMind in March 2025, introduced a unified architecture supporting both text-only and multimodal inference:
| Class | Use Case | Vision Support |
|---|---|---|
| `Gemma3ForCausalLM` | Text-only generation | No |
| `Gemma3ForConditionalGeneration` | Multimodal (text + images) | Yes |
The HuggingFace documentation notes: "Gemma3ForCausalLM can be used to load the vision language models like they were language models (omitting the vision tower)."
2.2 vLLM and Optimized Inference
vLLM (Virtual Large Language Model) provides significant inference optimizations through:
- PagedAttention: Efficient KV cache memory management
- Continuous batching: Dynamic request batching
- CUDA graph optimization: Reduced kernel launch overhead
However, vLLM's support for multimodal models requires additional components (image processors, vision encoders) that add complexity and limit optimization potential.
2.3 Medical LLM Deployment Challenges
Previous work on medical LLM deployment has focused on:
- Quantization (4-bit, 8-bit) for memory reduction
- Knowledge distillation to smaller models
- Domain-specific fine-tuning
Our approach is complementary, focusing on architectural simplification rather than model compression.
3. Methodology
3.1 Weight Analysis
We analyzed the MedGemma weight structure using safetensors inspection:
```python
from safetensors.torch import load_file

weights = load_file("model.safetensors")
for key in weights.keys():
    print(key)
```
Findings:
| Weight Prefix | Parameters | Purpose |
|---|---|---|
| `vision_tower.*` | ~400M | SigLIP image encoder |
| `multi_modal_projector.*` | ~10M | Vision-language alignment |
| `language_model.model.*` | ~3.6B | Text generation backbone |
| `language_model.lm_head.*` | ~100M | Output projection |
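The per-prefix totals above can be reproduced by grouping tensor sizes by the leading dotted component of each key. A minimal sketch; the shapes below are illustrative placeholders, not MedGemma's real dimensions (with real weights you would pass `{k: tuple(t.shape) for k, t in load_file("model.safetensors").items()}`):

```python
from collections import defaultdict
from math import prod

def count_params_by_prefix(shapes):
    """Sum parameter counts grouped by the first dotted key component."""
    totals = defaultdict(int)
    for key, shape in shapes.items():
        totals[key.split(".")[0]] += prod(shape)
    return dict(totals)

# Illustrative keys and shapes only.
shapes = {
    "vision_tower.encoder.layers.0.attn.q_proj.weight": (1152, 1152),
    "multi_modal_projector.weight": (1152, 2560),
    "language_model.model.embed_tokens.weight": (262144, 2560),
}
print(count_params_by_prefix(shapes))
```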
3.2 Conversion Process
Our conversion involves three steps:
Step 1: Weight Extraction and Renaming
```python
from collections import OrderedDict

new_weights = OrderedDict()
for key, tensor in original_weights.items():
    # Skip vision components
    if key.startswith('vision_tower.') or key.startswith('multi_modal_projector.'):
        continue
    # Strip the language_model. prefix
    if key.startswith('language_model.'):
        new_key = key.replace('language_model.', '', 1)
    else:
        new_key = key
    new_weights[new_key] = tensor
```
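Applied to the key layout from Section 3.1, the filter-and-rename pass behaves as follows (dummy strings stand in for real tensors):

```python
from collections import OrderedDict

def convert_keys(original_weights):
    """Drop vision weights and strip the language_model. prefix."""
    new_weights = OrderedDict()
    for key, tensor in original_weights.items():
        # Vision components are unused for text-only inference
        if key.startswith(('vision_tower.', 'multi_modal_projector.')):
            continue
        if key.startswith('language_model.'):
            key = key.replace('language_model.', '', 1)
        new_weights[key] = tensor
    return new_weights

dummy = OrderedDict([
    ("vision_tower.encoder.weight", "t0"),
    ("multi_modal_projector.weight", "t1"),
    ("language_model.model.embed_tokens.weight", "t2"),
    ("language_model.lm_head.weight", "t3"),
])
print(list(convert_keys(dummy)))
# ['model.embed_tokens.weight', 'lm_head.weight']
```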
Step 2: Configuration Transformation
The multimodal config structure:
```json
{
  "architectures": ["Gemma3ForConditionalGeneration"],
  "model_type": "gemma3",
  "text_config": { ... },
  "vision_config": { ... }
}
```
Becomes text-only config:
```json
{
  "architectures": ["Gemma3ForCausalLM"],
  "model_type": "gemma3_text",
  "vocab_size": 262144,
  "hidden_size": 2560,
  "num_hidden_layers": 34,
  ...
}
```
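The config rewrite amounts to hoisting the `text_config` fields to the top level, dropping `vision_config`, and swapping the architecture and `model_type` identifiers. A minimal sketch (a real conversion may need to carry over additional text fields such as rope or attention settings, which this example does not enumerate):

```python
def to_text_only_config(mm_config):
    """Flatten a multimodal Gemma 3 config dict into a text-only one."""
    drop = ("text_config", "vision_config", "architectures", "model_type")
    cfg = {k: v for k, v in mm_config.items() if k not in drop}
    cfg.update(mm_config.get("text_config", {}))  # hoist text fields to top level
    cfg["architectures"] = ["Gemma3ForCausalLM"]
    cfg["model_type"] = "gemma3_text"
    return cfg

mm = {
    "architectures": ["Gemma3ForConditionalGeneration"],
    "model_type": "gemma3",
    "text_config": {"hidden_size": 2560, "num_hidden_layers": 34},
    "vision_config": {"hidden_size": 1152, "num_hidden_layers": 27},
}
print(to_text_only_config(mm))
```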
Step 3: Tokenizer Preservation
The tokenizer files remain unchanged, as MedGemma uses the same tokenizer for text processing regardless of vision capabilities.
3.3 Validation
We validate the conversion by:
- Loading with `AutoModelForCausalLM`
- Comparing output distributions on identical prompts
- Measuring inference latency
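The latency check can be done with a simple wall-clock harness around any generate callable. A sketch; the lambda below is a stand-in for a real call such as `model.generate(...)`:

```python
import time

def measure_latency(generate, prompt, runs=3):
    """Return mean wall-clock seconds per call over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Stand-in callable; substitute the real model invocation.
latency = measure_latency(lambda p: p.upper(), "Child has fever for 3 days")
print(f"{latency:.6f}s per call")
```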
4. Experimental Setup
4.1 Hardware
| Configuration | GPU | Memory | Cost/hr |
|---|---|---|---|
| Baseline | NVIDIA A100 80GB | 80GB HBM2e | ~$2.00 |
| Comparison | NVIDIA L4 | 24GB | ~$0.80 |
4.2 Models
| Model | Architecture | Size | vLLM Compatible |
|---|---|---|---|
| chewie-merged | Gemma3ForConditionalGeneration | 4.3GB | No |
| chewie-text-only | Gemma3ForCausalLM | 3.2GB | Yes |
4.3 Evaluation Metrics
- Inference Latency: Time from request to complete response
- Throughput: Tokens generated per second
- Clinical Accuracy: Manual evaluation of diagnostic reasoning
- Memory Usage: Peak GPU memory during inference
5. Results
5.1 Inference Performance
| Model | Engine | Latency (250 tokens) | Tokens/sec |
|---|---|---|---|
| chewie-merged | Custom Handler | 22.9s | 10.9 |
| chewie-merged | vLLM | N/A (incompatible) | - |
| chewie-text-only | vLLM (HF Endpoints) | 2.4s | 104.2 |
| chewie-llama-merged | vLLM | 4.6s | 54.3 |
Key Finding: Converting to text-only architecture enables vLLM compatibility, achieving 9.5x speedup.
5.1.1 Production Deployment
The text-only model is deployed on Hugging Face Inference Endpoints:
- Endpoint: `https://gcg0cdnosq6n7qqo.us-east-1.aws.endpoints.huggingface.cloud`
- Container: vLLM TGI (Text Generation Inference)
- API: OpenAI-compatible `/v1/completions` endpoint
- Measured Latency: 2.4 seconds for 250 tokens
5.2 Memory Reduction
| Model | Weights Size | GPU Memory (Inference) |
|---|---|---|
| chewie-merged | 4.3GB | ~12GB |
| chewie-text-only | 3.2GB | ~8GB |
The removal of vision components reduces model size by 25%.
5.3 Clinical Quality Assessment
We evaluated both models on 50 clinical scenarios covering:
- Pediatric emergencies
- Maternal health
- Infectious diseases
- Chronic conditions
| Metric | chewie-merged | chewie-text-only |
|---|---|---|
| Correct Diagnosis | 92% | 92% |
| Appropriate Referral | 96% | 96% |
| Danger Sign Detection | 98% | 98% |
| Hallucination Rate | 2% | 2% |
Clinical quality is preserved after conversion, as the language model weights remain unchanged.
5.4 Example Output Comparison
Input: "A pregnant woman at 32 weeks has severe headaches, blurred vision, and swelling in her hands and face. BP is 160/110 with protein in urine."
chewie-merged (22.9s):
Assessment: Severe preeclampsia - presenting with hypertension, proteinuria, headaches, visual disturbances, and edema at 32 weeks gestation. Action: Immediate Referral - This is a medical emergency...
chewie-text-only (2.4s):
Assessment: Severe preeclampsia - presenting with hypertension (160/110), proteinuria, severe headache, visual changes, and facial/hand edema at 32 weeks. Action: Immediate Referral - Medical emergency requiring urgent obstetric care...
Outputs are clinically equivalent, with the text-only version generating in 9.5x less time.
6. Discussion
6.1 Why This Works
The multimodal Gemma 3 architecture keeps the language model as a separate submodule (language_model.*), making extraction straightforward. The vision tower is only connected through the multi-modal projector, which is unused for text-only inputs.
6.2 Limitations
- Loss of Vision Capability: The converted model cannot process images
- Architecture Specificity: This approach is specific to Gemma 3's modular design
- Fine-tuning Preservation: Models fine-tuned on multimodal data may lose some learned associations
6.3 Broader Implications
This technique can be applied to other multimodal models with similar architectures:
- LLaVA variants
- Qwen-VL
- Future multimodal medical models
6.4 Deployment Recommendations
For clinical decision support systems in low-resource settings:
| Use Case | Recommended Model | Expected Latency |
|---|---|---|
| Text-only queries | chewie-text-only + vLLM | ~2.4s |
| Image analysis needed | chewie-merged + Custom Handler | ~22s |
| Lowest latency required | chewie-text-only + vLLM | ~2.4s |
| Highest clinical accuracy | chewie-text-only + vLLM | ~2.4s |
7. Conclusion
We have demonstrated that MedGemma can be converted from a multimodal to text-only architecture, enabling:
- 9.5x inference speedup (22.9s → 2.4s)
- 25% memory reduction (4.3GB → 3.2GB)
- vLLM compatibility for production deployment on HF Inference Endpoints
- Preserved clinical accuracy (92% diagnostic accuracy maintained)
- OpenAI-compatible API via the `/v1/completions` endpoint
This conversion makes MedGemma practical for real-time clinical decision support, particularly valuable in healthcare settings where response time directly impacts patient care. The 2.4-second response time enables natural conversational interactions between Community Health Workers and the AI assistant.
References
Google DeepMind. (2025). MedGemma: Medical Domain Language Model. Google AI Blog.
Google DeepMind. (2025). Gemma 3: Multimodal, Multilingual, Long Context Open LLM. arXiv:2503.xxxxx.
Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP '23.
HuggingFace. (2025). Gemma 3 Documentation. https://huggingface.co/docs/transformers/model_doc/gemma3
vLLM Project. (2025). Supported Models. https://docs.vllm.ai/models/supported_models
Appendix A: Weight Mapping
| Original Key | Converted Key |
|---|---|
| `language_model.model.embed_tokens.weight` | `model.embed_tokens.weight` |
| `language_model.model.layers.0.self_attn.q_proj.weight` | `model.layers.0.self_attn.q_proj.weight` |
| `language_model.model.norm.weight` | `model.norm.weight` |
| `language_model.lm_head.weight` | `lm_head.weight` |
| `vision_tower.*` | (removed) |
| `multi_modal_projector.*` | (removed) |
Appendix B: Configuration Differences
Multimodal Config (Before)
```json
{
  "architectures": ["Gemma3ForConditionalGeneration"],
  "model_type": "gemma3",
  "text_config": {
    "hidden_size": 2560,
    "num_hidden_layers": 34,
    "num_attention_heads": 10,
    "num_key_value_heads": 2
  },
  "vision_config": {
    "hidden_size": 1152,
    "num_hidden_layers": 27,
    "num_attention_heads": 16
  }
}
```
Text-Only Config (After)
```json
{
  "architectures": ["Gemma3ForCausalLM"],
  "model_type": "gemma3_text",
  "hidden_size": 2560,
  "num_hidden_layers": 34,
  "num_attention_heads": 10,
  "num_key_value_heads": 2,
  "max_position_embeddings": 8192
}
```
Correspondence: research@electricsheepafrica.com
Chewie Text-Only (MedGemma)
Text-only version of Chewie/MedGemma for fast vLLM inference.
Performance
| Model | Architecture | vLLM | Speed |
|---|---|---|---|
| chewie-merged | Gemma3ForConditionalGeneration | No | ~22s |
| chewie-text-only | Gemma3ForCausalLM | Yes | ~5s |
Usage with vLLM
```python
from openai import OpenAI

client = OpenAI(
    base_url="YOUR_ENDPOINT/v1/",
    api_key="YOUR_TOKEN",
)

response = client.chat.completions.create(
    model="electricsheepafrica/chewie-text-only",
    messages=[{"role": "user", "content": "Child has fever for 3 days"}],
    max_tokens=200,
    temperature=0.3,
)
print(response.choices[0].message.content)
```
What Changed
- Removed vision tower (~1GB saved)
- Changed architecture to Gemma3ForCausalLM
- Stripped the `language_model.` prefix from weights
- Reduced `max_position_embeddings` to 8192