metadata
title: PDF Layout Extractor
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
PDF Layout Extraction Tool
A tool that extracts figures, tables, annotated layouts, and markdown text from scientific PDFs, using DocLayout-YOLO for layout detection.
Features
- REST API endpoints for programmatic access
- Real-time progress tracking with visual progress bar
- Multiple processing modes: Images only, Markdown only, or Both
- Background processing - upload files and track progress via API
- Modern web UI with dark/light theme
- GPU/CPU support - automatically detects available hardware
API Endpoints
Base URL: https://saifisvibin-volaris-pdf-tool.hf.space
- POST /api/predict - Recommended: Synchronous PDF extraction (returns complete results immediately)
- POST /api/upload - Upload PDFs for async processing (returns task_id)
- GET /api/progress/<task_id> - Get processing progress (0-100%)
- GET /api/pdf-list - List all processed PDFs
- GET /api/pdf-details/<pdf_stem> - Get details for a processed PDF
- GET /api/device-info - Get GPU/CPU device information
- GET /api/docs - Interactive API documentation
- GET /output/<path> - Download processed files (PDFs, images, markdown)
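Several of the endpoints above take a path parameter. The helpers below are a minimal sketch of how a client might build those URLs; the function names are hypothetical, and percent-encoding the parameters is an assumption about what the server accepts rather than documented behavior.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"


def progress_url(task_id: str) -> str:
    """URL for GET /api/progress/<task_id> (hypothetical helper)."""
    return f"{BASE_URL}/api/progress/{quote(task_id, safe='')}"


def pdf_details_url(pdf_stem: str) -> str:
    """URL for GET /api/pdf-details/<pdf_stem> (hypothetical helper)."""
    return f"{BASE_URL}/api/pdf-details/{quote(pdf_stem, safe='')}"


def output_url(path: str) -> str:
    """URL for GET /output/<path>; '/' is kept so nested paths still work."""
    return f"{BASE_URL}/output/{quote(path)}"
```

These only build strings, so they can be passed straight to requests.get without any further setup.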
Example API Usage
Simple Synchronous Extraction (Recommended)
import requests

# Upload and get results immediately
with open('document.pdf', 'rb') as f:
    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/predict', files={'file': f})
response.raise_for_status()  # fail early on HTTP errors
result = response.json()

print(f"Text: {result['text']}")
print(f"Figures: {len(result['figures'])}")
print(f"Tables: {len(result['tables'])}")
Async Processing with Progress
import requests
import time

# Upload a PDF (async)
data = {'extraction_mode': 'both'}  # or 'images' or 'markdown'
with open('document.pdf', 'rb') as f:
    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/upload', files={'files[]': f}, data=data)
response.raise_for_status()  # fail early on HTTP errors
task_id = response.json()['task_id']

# Poll for progress
while True:
    progress = requests.get(f'https://saifisvibin-volaris-pdf-tool.hf.space/api/progress/{task_id}').json()
    print(f"Progress: {progress['progress']}% - {progress['message']}")
    if progress['status'] == 'completed':
        break
    time.sleep(0.5)

# Get results
results = progress['results']
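The polling loop above runs until the status becomes 'completed', so it never exits if a task stalls. One possible variation, sketched below, adds a timeout. The function name wait_for_completion and its parameters are hypothetical; the only API detail it relies on is the 'status' field used in the example above.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_progress: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_progress() until it reports 'completed' or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        progress = fetch_progress()
        if progress.get("status") == "completed":
            return progress
        if clock() >= deadline:
            raise TimeoutError(f"task not completed after {timeout_s} seconds")
        sleep(interval_s)
```

In practice, fetch_progress could be a lambda that wraps the requests.get(...).json() call on /api/progress/<task_id>. Passing the fetcher in as an argument keeps the loop testable without network access.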
Processing Modes
- Images Only - Extracts figures and tables with layout detection
- Markdown Only - Extracts text content as markdown
- Both - Extracts both images and markdown content
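The three modes appear to correspond to the extraction_mode form values 'images', 'markdown', and 'both' shown in the async example. The sketch below assumes that label-to-value mapping, which is inferred rather than confirmed; the helper name upload_form_data is hypothetical.

```python
# Assumed mapping from the mode names listed above to the
# 'extraction_mode' form values used in the async upload example.
EXTRACTION_MODES = {
    "Images Only": "images",
    "Markdown Only": "markdown",
    "Both": "both",
}


def upload_form_data(mode_label: str) -> dict:
    """Build the form data for POST /api/upload, rejecting unknown modes."""
    try:
        return {"extraction_mode": EXTRACTION_MODES[mode_label]}
    except KeyError:
        valid = ", ".join(EXTRACTION_MODES)
        raise ValueError(f"unknown mode {mode_label!r}; expected one of: {valid}") from None
```

Validating the mode on the client side catches a typo before the upload is sent.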
Output Structure
Each processed PDF creates a directory with:
- *_content_list.json - Metadata for extracted figures/tables
- *_layout.pdf - Annotated PDF with layout boxes
- *.md - Markdown export (if enabled)
- figures/ - Extracted figure images
- tables/ - Extracted table images
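As a rough illustration of this layout, the sketch below lists the entries a client might expect for one PDF. It assumes that the output directory is named after the PDF stem and that the * prefix is that same stem; neither is stated explicitly here, so treat the exact paths as hypothetical.

```python
from pathlib import PurePosixPath


def expected_outputs(pdf_stem: str, markdown: bool = True) -> list[str]:
    """List the output entries expected for one PDF (assumed naming scheme)."""
    root = PurePosixPath(pdf_stem)  # assumption: directory named after the stem
    paths = [
        root / f"{pdf_stem}_content_list.json",  # metadata for figures/tables
        root / f"{pdf_stem}_layout.pdf",         # annotated PDF with layout boxes
    ]
    if markdown:
        paths.append(root / f"{pdf_stem}.md")    # only present if markdown is enabled
    paths += [root / "figures", root / "tables"]
    return [str(p) for p in paths]
```

If the assumption holds, each returned path could be appended to the /output/ endpoint to download that file.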
Built With
- DocLayout-YOLO - Layout detection
- PyMuPDF - PDF processing
- Flask - Web framework