PDF-Parser

saifisvibinn

Fix API docs: ensure HTTPS, add /api/predict endpoint to README

2226eb2 4 months ago

title: PDF Layout Extractor
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false

PDF Layout Extraction Tool

A tool for extracting figures, tables, annotated layouts, and markdown text from scientific PDFs, using DocLayout-YOLO for layout detection.

Features

REST API endpoints for programmatic access
Real-time progress tracking with visual progress bar
Multiple processing modes: Images only, Markdown only, or Both
Background processing - upload files and track progress via API
Modern web UI with dark/light theme
GPU/CPU support - automatically detects available hardware

API Endpoints

Base URL: https://saifisvibin-volaris-pdf-tool.hf.space

POST /api/predict - Recommended: Synchronous PDF extraction (the response to the upload request contains the complete results)
POST /api/upload - Upload PDFs for async processing (returns task_id)
GET /api/progress/<task_id> - Get processing progress (0-100%)
GET /api/pdf-list - List all processed PDFs
GET /api/pdf-details/<pdf_stem> - Get details for a processed PDF
GET /api/device-info - Get GPU/CPU device information
GET /api/docs - Interactive API documentation
GET /output/<path> - Download processed files (PDFs, images, markdown)
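
The paths listed above can also be assembled programmatically. The following is a minimal sketch, assuming the base URL given above; the `ENDPOINTS` table, its keys, and the `endpoint_url` helper are illustrative names and not part of the service itself.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"

# Path templates copied from the endpoint list; the dictionary keys are hypothetical.
ENDPOINTS = {
    "predict": "/api/predict",
    "upload": "/api/upload",
    "progress": "/api/progress/{}",
    "pdf_list": "/api/pdf-list",
    "pdf_details": "/api/pdf-details/{}",
    "device_info": "/api/device-info",
    "docs": "/api/docs",
    "output": "/output/{}",
}


def endpoint_url(name, value=None, base_url=BASE_URL):
    """Return the full URL for a named endpoint, percent-encoding the path value."""
    template = ENDPOINTS[name]
    if "{}" in template:
        if value is None:
            raise ValueError(f"endpoint {name!r} needs a path value")
        # Keep '/' unescaped so nested output paths stay intact.
        return base_url + template.format(quote(str(value), safe="/"))
    return base_url + template
```

With the `requests` library, a call such as `requests.get(endpoint_url("device_info")).json()` would then fetch the device information.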

Example API Usage

Simple Synchronous Extraction (Recommended)

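import requests

# Upload a PDF and get the complete results in a single request
# (the with-block closes the file once the upload finishes)
with open('document.pdf', 'rb') as pdf:
    response = requests.post(
        'https://saifisvibin-volaris-pdf-tool.hf.space/api/predict',
        files={'file': pdf},
    )
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
result = response.json()

print(f"Text: {result['text']}")
print(f"Figures: {len(result['figures'])}")
print(f"Tables: {len(result['tables'])}")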

Async Processing with Progress

import requests
import time

BASE_URL = 'https://saifisvibin-volaris-pdf-tool.hf.space'

# Upload a PDF (async); the with-block closes the file once the upload finishes
with open('document.pdf', 'rb') as pdf:
    response = requests.post(
        f'{BASE_URL}/api/upload',
        files={'files[]': pdf},
        data={'extraction_mode': 'both'},  # or 'images' or 'markdown'
    )
response.raise_for_status()  # stop here on an HTTP error
task_id = response.json()['task_id']

# Poll for progress, giving up after 10 minutes so a stalled task cannot loop forever
deadline = time.monotonic() + 600
while True:
    progress = requests.get(f'{BASE_URL}/api/progress/{task_id}').json()
    print(f"Progress: {progress['progress']}% - {progress['message']}")
    if progress['status'] == 'completed':
        break
    if time.monotonic() > deadline:
        raise TimeoutError(f'Task {task_id} did not complete in time')
    time.sleep(0.5)

# Get results
results = progress['results']

Processing Modes

Images Only - Extracts figures and tables with layout detection
Markdown Only - Extracts text content as markdown
Both - Extracts both images and markdown content
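
In the async example, the mode is sent as the `extraction_mode` form field with the value 'images', 'markdown', or 'both'. The sketch below validates that value before an upload. `upload_form` and `EXTRACTION_MODES` are hypothetical names, and the pairing of field values with the mode names above is inferred from that example rather than confirmed by the service.

```python
# Field values taken from the async upload example; labels are the mode names listed above.
EXTRACTION_MODES = {
    "images": "Images Only",
    "markdown": "Markdown Only",
    "both": "Both",
}


def upload_form(mode="both"):
    """Build the form data for POST /api/upload, rejecting unknown modes."""
    key = mode.strip().lower()
    if key not in EXTRACTION_MODES:
        allowed = ", ".join(sorted(EXTRACTION_MODES))
        raise ValueError(f"unknown extraction mode {mode!r}; expected one of: {allowed}")
    return {"extraction_mode": key}
```

The returned dictionary can be passed as the `data` argument of `requests.post`, so a typo in the mode fails locally before anything is uploaded.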

Output Structure

Each processed PDF creates a directory with:

*_content_list.json - Metadata for extracted figures/tables
*_layout.pdf - Annotated PDF with layout boxes
*.md - Markdown export (if enabled)
figures/ - Extracted figure images
tables/ - Extracted table images
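
Once such a directory is available locally, its contents can be grouped by the naming patterns above. This is a rough sketch using only the standard library; `summarize_output` is a hypothetical helper, and treating `figures/` and `tables/` as optional is an assumption (for example, they might be absent when only markdown is extracted).

```python
from pathlib import Path


def summarize_output(output_dir):
    """Group the files of one processed-PDF directory by the naming patterns above."""
    root = Path(output_dir)

    def files_in(subdir):
        folder = root / subdir
        # Assumption: the folder may not exist, so return an empty list in that case.
        if not folder.is_dir():
            return []
        return sorted(p.name for p in folder.iterdir() if p.is_file())

    return {
        "content_list": sorted(p.name for p in root.glob("*_content_list.json")),
        "layout_pdf": sorted(p.name for p in root.glob("*_layout.pdf")),
        "markdown": sorted(p.name for p in root.glob("*.md")),
        "figures": files_in("figures"),
        "tables": files_in("tables"),
    }
```

The result is a plain dictionary of file names, which makes it easy to check that a run produced the expected artifacts for the chosen processing mode.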

Built With

DocLayout-YOLO - Layout detection
PyMuPDF - PDF processing
Flask - Web framework