Multimodal Settings & File Rendering - Implementation Summary
✅ Completed Implementation
1. Configuration Updates (src/utils/config.py)
Added Settings:
- ✅ enable_image_input: bool = Field(default=True, ...) - Enable/disable image OCR processing
- ✅ ocr_api_url: str | None = Field(default="https://prithivmlmods-multimodal-ocr3.hf.space", ...) - OCR service URL
Location: Lines 148-156 (after enable_audio_output)
2. Multimodal Service Updates (src/services/multimodal_processing.py)
Changes:
- ✅ Added check for settings.enable_image_input before processing image files
- ✅ Image processing now respects the enable/disable setting (similar to audio input)
Location: Line 66 - Added condition: if files and settings.enable_image_input:
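The guard described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the Settings dataclass is a stand-in for the real settings object, and extract_text_from_image and process_images are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Stand-in for the project's settings object (assumed shape)."""
    enable_image_input: bool = True
    enable_audio_input: bool = True


def extract_text_from_image(path: str) -> str:
    """Hypothetical placeholder for the real OCR service call."""
    return f"[OCR text from {path}]"


def process_images(files: list[str], settings: Settings) -> list[str]:
    """Run OCR only when files were supplied and image input is enabled."""
    results: list[str] = []
    if files and settings.enable_image_input:
        for path in files:
            results.append(extract_text_from_image(path))
    return results
```

With the flag off, uploaded images are skipped rather than rejected, which mirrors how the summary describes the audio input setting.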
3. Sidebar Reorganization (src/app.py)
New Accordion: "📷 Multimodal Input"
- ✅ Added enable_image_input_checkbox - Control image OCR processing
- ✅ Added enable_audio_input_checkbox - Control audio STT processing
- ✅ Located after "Research Configuration" accordion
Updated Accordion: "Audio Output"
- ✅ Moved audio_output component into this accordion (was in main area)
- ✅ Component now appears in sidebar with other audio settings
- ✅ Visibility controlled by enable_audio_output_checkbox
Settings Organization:
- 🔬 Research Configuration (existing)
- 📷 Multimodal Input (NEW)
- Audio Output (updated - now includes audio_output component)
Location: Lines 770-850
4. Function Signature Updates (src/app.py)
Updated research_agent() function:
- ✅ Added enable_image_input: bool = True parameter
- ✅ Added enable_audio_input: bool = True parameter
- ✅ Function now accepts UI settings directly from sidebar checkboxes
Location: Lines 535-547
5. Multimodal Input Processing (src/app.py)
Updates:
- ✅ Uses function parameters (enable_image_input, enable_audio_input) instead of only config settings
- ✅ Filters files and audio based on UI settings before processing
- ✅ More responsive to user changes (no need to restart app)
Location: Lines 624-636
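One way to picture the filtering step is the sketch below. The helper name filter_multimodal_inputs and its return shape are assumptions made for illustration; only the two flag names come from the summary.

```python
from typing import Any, Optional


def filter_multimodal_inputs(
    files: list[str],
    audio: Optional[Any],
    enable_image_input: bool = True,
    enable_audio_input: bool = True,
) -> tuple[list[str], Optional[Any]]:
    """Drop inputs whose modality is switched off in the sidebar."""
    # Keep uploaded files only when image input (OCR) is enabled.
    kept_files = list(files) if enable_image_input else []
    # Keep the audio payload only when audio input (STT) is enabled.
    kept_audio = audio if enable_audio_input else None
    return kept_files, kept_audio
```

Because the flags arrive as arguments on every call, a checkbox change takes effect on the next message without restarting the app.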
6. File Rendering Improvements (src/app.py)
Enhancements:
- ✅ Added file size display in download links
- ✅ Better error handling for file size retrieval
- ✅ Improved formatting with file size information (B, KB, MB)
Location: Lines 286-300
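A size formatter along these lines would produce the B/KB/MB labels mentioned above. It is a sketch under assumptions: the function names are hypothetical, and the 1024-based thresholds and one-decimal rounding are illustrative choices rather than the values used in src/app.py.

```python
import os


def format_file_size(num_bytes: int) -> str:
    """Format a byte count as B, KB, or MB (1024-based, assumed)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024 * 1024:
        return f"{num_bytes / 1024:.1f} KB"
    return f"{num_bytes / (1024 * 1024):.1f} MB"


def describe_file_size(path: str) -> str:
    """Return a size label, or an empty string if the size cannot be read."""
    try:
        return format_file_size(os.path.getsize(path))
    except OSError:
        # Missing file or permission problem: omit the size instead of failing.
        return ""
```

Returning an empty label on failure keeps the download link usable even when the size lookup does not succeed.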
7. UI Description Updates (src/app.py)
Enhanced Description:
- ✅ Better explanation of multimodal capabilities
- ✅ Clear list of supported input types (Images, Audio, Text)
- ✅ Reference to sidebar settings for configuration
Location: Lines 907-912
Current Settings Structure
Sidebar Layout:
Authentication
- Login button
- About section
⚙️ Settings
├─ 🔬 Research Configuration
│  ├─ Orchestrator Mode
│  ├─ Graph Research Mode
│  └─ Use Graph Execution
│
├─ 📷 Multimodal Input (NEW)
│  ├─ Enable Image Input (OCR)
│  └─ Enable Audio Input (STT)
│
└─ Audio Output
   ├─ Enable Audio Output
   ├─ TTS Voice
   ├─ TTS Speech Speed
   ├─ TTS GPU Type (if Modal available)
   └─ Audio Response (moved from main area)
Key Features
Multimodal Inputs (Always Visible)
- Image Upload: Available in ChatInterface textbox (multimodal=True)
- Audio Recording: Available in ChatInterface textbox (multimodal=True)
- File Upload: Supported via MultimodalTextbox
- Visibility: Always visible - part of ChatInterface component
- Control: Can be enabled/disabled via sidebar settings
File Rendering
- Method: Markdown download links in chat content
- Format: [Download: filename (size)](filepath)
- Validation: Checks file existence before rendering
- Metadata: Files stored in message metadata for future use
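The validation and link format listed above might look like the sketch below. The function name render_download_link is hypothetical, and the size is shown in raw bytes to keep the example short; the real implementation formats it as B, KB, or MB.

```python
import os
from typing import Optional


def render_download_link(path: str) -> Optional[str]:
    """Return a markdown download link, or None if the file does not exist."""
    # Validate first so the chat never shows a link to a missing file.
    if not os.path.isfile(path):
        return None
    name = os.path.basename(path)
    size = os.path.getsize(path)
    return f"[Download: {name} ({size} B)]({path})"
```

Callers can skip a None result and leave the rest of the chat message unchanged.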
Settings Flow
- User changes settings in sidebar checkboxes
- Settings passed to research_agent() via additional_inputs
- Function uses UI settings (with config defaults as fallback)
- Multimodal processing respects enable/disable flags
- Settings persist during chat session
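The flow above can be simulated without Gradio. ChatInterface passes the current values of additional_inputs as extra positional arguments after the message and history, so the sketch below calls a simplified research_agent the same way. The body and the returned dictionary are assumptions for illustration; only the parameter names and their True defaults come from the summary.

```python
def research_agent(
    message: str,
    history: list,
    enable_image_input: bool = True,
    enable_audio_input: bool = True,
) -> dict:
    """Simplified stand-in that reports which modalities are active."""
    return {
        "message": message,
        "image_input": enable_image_input,
        "audio_input": enable_audio_input,
    }


# Simulated checkbox states, in the order the checkboxes would be listed
# in additional_inputs: image input off, audio input on.
sidebar_values = [False, True]
result = research_agent("hello", [], *sidebar_values)
```

If the function is called without the extra arguments, the parameter defaults apply, so both modalities stay enabled.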
🧪 Testing Checklist
- Verify all settings are in sidebar
- Test image upload with OCR enabled/disabled
- Test audio recording with STT enabled/disabled
- Test file rendering (markdown, PDF, images)
- Test audio output generation and display in sidebar
- Test file download links
- Verify settings work without requiring app restart
- Test on different screen sizes (responsive design)
Notes
Multimodal Inputs Visibility: The inputs are always visible because they are part of the MultimodalTextbox component when multimodal=True is set in ChatInterface. No additional visibility control is needed.
Settings Persistence: Settings are passed via function parameters, so they persist during the chat session but reset when the app restarts. For persistent settings across sessions, consider using Gradio's state management or session storage.
File Rendering: Gradio ChatInterface automatically handles markdown file links. The current implementation with file size information should work well. For more advanced file previews, consider using Gradio's File component in a custom Blocks layout.
Hidden Components: The hf_model_dropdown and hf_provider_dropdown are still hidden. Consider making them visible in a "Model Configuration" accordion if needed, or remove them if not used.
Next Steps (Optional Enhancements)
- Model Configuration Accordion: Make hf_model and hf_provider visible in sidebar
- File Previews: Add image previews for uploaded images in chat
- Settings Persistence: Implement session-based settings storage
- Advanced File Rendering: Use Gradio File component for better file handling
- Error Handling: Add better error messages for failed file operations