# gprMax RAG Database System

## Overview

This is a production-ready Retrieval-Augmented Generation (RAG) system for gprMax documentation. It provides efficient vector search over the gprMax documentation, enabling intelligent context retrieval for the chatbot.

## Architecture

### Components

1. **Document Processor**: Extracts and chunks documentation from the gprMax GitHub repository
2. **Embedding Model**: Qwen2.5-0.5B (will upgrade to Qwen3-Embedding-0.6B when available)
3. **Vector Database**: ChromaDB with persistent storage
4. **Retriever**: Search and context retrieval utilities

### Key Features

- Automatic documentation extraction from the gprMax GitHub repository
- Intelligent chunking with configurable size and overlap
- Persistent vector database using ChromaDB
- Efficient similarity search with score thresholding
- Metadata tracking for reproducibility

## Installation

The database is **automatically generated** on first startup of the application. No manual installation is required!

## Automatic Generation

When the app starts, it:

1. Checks whether the database exists at `rag-db/chroma_db/`
2. If not found, automatically runs `generate_db.py`
3. Clones the gprMax repository and processes the documentation
4. Creates the ChromaDB with default embeddings (all-MiniLM-L6-v2)
5. Is ready to use (this only happens once)

## Manual Generation (Optional)

If you need to regenerate the database manually:

```bash
cd rag-db
python generate_db.py --recreate
```

Custom settings:

```bash
python generate_db.py \
    --db-path ./custom_db \
    --temp-dir ./temp \
    --device cuda \
    --recreate
```
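The first-startup check described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the `ensure_database` helper name and the exact `generate_db.py` invocation are assumptions.

```python
# Sketch of the first-startup check: generate the vector database
# once if it does not exist yet. Paths and invocation are illustrative.
import os
import subprocess
import sys

DB_PATH = os.path.join("rag-db", "chroma_db")


def ensure_database(db_path: str = DB_PATH) -> bool:
    """Return True if generation was triggered, False if the DB already exists."""
    if os.path.isdir(db_path):
        return False  # database already generated, nothing to do
    subprocess.run(
        [sys.executable, "generate_db.py"],
        cwd="rag-db",
        check=True,  # raise if generation fails
    )
    return True
```

Because the check is a simple directory test, repeated startups skip generation entirely.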
### Use Retriever in Application

```python
from rag_db.retriever import create_retriever

# Initialize retriever
retriever = create_retriever(db_path="./rag-db/chroma_db")

# Search for relevant documents
results = retriever.search("How to create a source?", k=5)

# Get formatted context for the LLM
context = retriever.get_context("antenna patterns", k=3)

# Get relevant source files
files = retriever.get_relevant_files("boundary conditions")

# Get database statistics
stats = retriever.get_stats()
```

### Test Retriever

```bash
# Test with default query
python retriever.py

# Test with custom query
python retriever.py "How to model soil layers?"
```

## Database Schema

### Document Structure

```json
{
  "id": "unique_hash",
  "text": "document_chunk_text",
  "metadata": {
    "source": "docs/relative/path.rst",
    "file_type": ".rst",
    "chunk_index": 0,
    "char_start": 0,
    "char_end": 1000
  }
}
```

### Metadata File

The generated `metadata.json` contains:

```json
{
  "created_at": "2024-01-01T00:00:00",
  "embedding_model": "Qwen/Qwen2.5-0.5B",
  "collection_name": "gprmax_docs_v1",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "total_documents": 1234
}
```

## Configuration

### Chunking Parameters

- `CHUNK_SIZE`: 1000 characters (optimal for context windows)
- `CHUNK_OVERLAP`: 200 characters (ensures continuity between chunks)

### Embedding Model

- Current: `Qwen/Qwen2.5-0.5B` (512-dim embeddings)
- Future: `Qwen/Qwen3-Embedding-0.6B` (when available)

### Database Settings

- Storage: ChromaDB persistent client
- Collection: `gprmax_docs_v1` (versioned for updates)
- Distance metric: Cosine similarity

## Maintenance

### Regular Updates

Run monthly, or whenever the gprMax documentation is updated:

```bash
# This will pull the latest docs and update the database
python generate_db.py
```

### Database Backup

```bash
# Back up the database
cp -r chroma_db chroma_db_backup_$(date +%Y%m%d)
```

### Performance Tuning

- Adjust `CHUNK_SIZE` and `CHUNK_OVERLAP` in `generate_db.py`
- Modify batch sizes for large datasets
- Use GPU acceleration with `--device cuda`
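The interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` can be illustrated with a simplified character-based chunker. This is a sketch of the scheme, not the actual logic in `generate_db.py`; the `chunk_text` helper is hypothetical, but the `char_start`/`char_end` offsets mirror the document schema above.

```python
# Simplified character-based chunker: each chunk is CHUNK_SIZE chars,
# and consecutive chunks share CHUNK_OVERLAP chars for continuity.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks with (char_start, char_end) offsets."""
    step = size - overlap  # each chunk starts `step` chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append({"char_start": start, "char_end": end,
                       "text": text[start:end]})
        if end == len(text):
            break  # final (possibly shorter) chunk reached
    return chunks
```

With the defaults, a 2,500-character document yields chunks starting at offsets 0, 800, 1600, and 2400, so a sentence cut at one chunk boundary reappears intact in the next chunk.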
## Integration with Main App

The RAG system integrates with the main Gradio app:

1. Import the retriever in `app.py`
2. Use the retriever to augment prompts with context
3. Display source references in the UI

Example integration:

```python
# In app.py
from rag_db.retriever import create_retriever

retriever = create_retriever()

def augment_with_context(user_query):
    context = retriever.get_context(user_query, k=3)
    augmented_prompt = f"""
Context from documentation:
{context}

User question: {user_query}
"""
    return augmented_prompt
```

## Troubleshooting

### Common Issues

1. **Database not found**
   - Run `python generate_db.py` first
   - Check the `--db-path` parameter
2. **Out of memory**
   - Use smaller batch sizes
   - Use CPU instead of GPU
   - Reduce the chunk size
3. **Slow generation**
   - Use the GPU with `--device cuda`
   - Reduce repository depth with a shallow clone
   - Use a pre-generated database

### Logs

Check the generation logs for detailed information:

```bash
python generate_db.py 2>&1 | tee generation.log
```

## Future Enhancements

1. **Model Upgrade**: Migrate to Qwen3-Embedding-0.6B when available
2. **Incremental Updates**: Add documents without full regeneration
3. **Multi-modal Support**: Include images and diagrams from the docs
4. **Query Expansion**: Automatic query reformulation for better retrieval
5. **Caching Layer**: Redis cache for frequent queries
6. **Fine-tuned Embeddings**: Domain-specific embedding model for gprMax

## License

Same as the parent project.