Distributed EDA for Cell x Gene
This folder includes a metadata-aware EDA pipeline for large .h5ad files with YAML-based configuration.
All commands below assume you are working from the project directory:
cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
Quick Start
0) Check system resources (optional)
Probe your system and get recommendations for config settings:
# Using your config file
uv run python scripts/resource_probe.py --config configs/eda_optimized.yaml
# Or specify workdir manually
uv run python scripts/resource_probe.py --workdir /project/GOV108018
This outputs recommendations for workers, memory, and sharding based on your system.
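The probe boils down to reading total RAM and core count and leaving some headroom. A minimal sketch of that idea, assuming psutil is available (the actual script's heuristics may differ):
# Sketch of a system probe similar in spirit to resource_probe.py
# (illustrative only; the real script's heuristics may differ).
import os
import psutil

def probe():
    total_gib = psutil.virtual_memory().total / 2**30
    cores = os.cpu_count() or 1
    # Leave headroom: reserve ~20% of RAM and one core for the OS.
    rec_memory = int(total_gib * 0.8)
    rec_workers = max(1, cores - 1)
    print(f"RAM: {total_gib:.0f} GiB -> suggested max_memory_gib: {rec_memory}")
    print(f"Cores: {cores} -> suggested max_workers: {rec_workers}")

if __name__ == "__main__":
    probe()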
1) Configure pipeline
Use the optimized config (auto-generated for your system: 394 GB RAM, 56 cores):
cat configs/eda_optimized.yaml
Or create your own based on the template:
cp configs/eda_config_template.yaml configs/my_config.yaml
# Edit my_config.yaml with your paths and resource limits
2) Build metadata cache
Pre-scan all datasets to determine sizes and enable intelligent scheduling:
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
This creates output/cache/enhanced_metadata.parquet with:
- Dataset dimensions (n_obs × n_vars)
- File sizes
- Size categories (small/medium/large/xlarge)
- Estimated memory requirements
Cache is incremental - only new/changed files are rescanned. Use --force-rescan to rebuild.
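The cache is a plain Parquet table, so it can be inspected directly with pandas; the column names below follow the fields listed above but the exact schema is an assumption:
# Inspect the metadata cache with pandas (column names are assumed from the
# fields described above and may differ in the actual schema).
import pandas as pd

cache = pd.read_parquet("output/cache/enhanced_metadata.parquet")
print(cache[["n_obs", "n_vars", "file_size_gib"]].describe())
print(cache["size_category"].value_counts())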
Handling corrupted files: If some files fail during scanning (status='failed' or 'corrupted'), you can:
- Retry them with a more robust strategy:
uv run python scripts/retry_failed_cache.py --cache output/cache/enhanced_metadata.parquet
- Re-download corrupted files from CELLxGENE:
uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml
See Troubleshooting for details.
3) Run EDA pipeline
If cache already built, run EDA directly:
# Run EDA only (skip metadata step)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step eda
Or run all steps (cache is incremental, so safe to re-run):
# Full pipeline: metadata + EDA + merge
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml
Individual steps:
# Step 1: Build metadata (incremental - only new/changed files)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step metadata
# Step 2: Run EDA only
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step eda
# Step 3: Merge shards (if using sharding)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step merge
4) Direct script usage
For more control:
# Build metadata cache
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
# Run EDA with all workers
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml
# Override worker count
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --force-workers 32
Distributed Processing (SLURM)
For multi-node HPC clusters, use array jobs:
# Submit 4 parallel jobs
sbatch --array=0-3 scripts/run_eda_slurm.sh configs/eda_optimized.yaml 4
# After all jobs complete, merge results
uv run python scripts/merge_eda_shards.py --output-dir output/eda
Or configure sharding in YAML:
sharding:
  enabled: true
  num_shards: 4
  shard_index: 0              # Override with --shard-index on command line
  strategy: "size_balanced"   # Distribute by size for load balancing
Then run each shard:
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 0
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 1
# ... etc
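The "size_balanced" strategy amounts to greedy bin packing: take datasets from largest to smallest and always assign the next one to the currently lightest shard. A sketch of that idea (not the pipeline's actual code):
# Greedy size-balanced shard assignment (illustrative sketch).
def assign_shards(sizes, num_shards):
    loads = [0] * num_shards
    assignment = {}
    # Largest datasets first, each placed on the lightest shard so far.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        shard = loads.index(min(loads))
        assignment[name] = shard
        loads[shard] += size
    return assignment

print(assign_shards({"a.h5ad": 40, "b.h5ad": 30, "c.h5ad": 20, "d.h5ad": 10}, 2))
# {'a.h5ad': 0, 'b.h5ad': 1, 'c.h5ad': 1, 'd.h5ad': 0}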
Configuration Guide
Resource Management
The pipeline respects your resource limits and adapts processing strategy by dataset size:
resources:
  max_memory_gib: 240       # Total memory available
  max_workers: 42           # Maximum parallel workers
  min_workers: 8            # Starting worker count
  chunk_size: 12288         # Matrix chunk size
  # Performance tuning
  slowdown_threshold: 0.5   # Reduce workers if throughput drops by 50%
  min_workers_ratio: 0.25   # Keep at least 25% of max workers
  large_file_threshold_gib: 30.0   # Process files >30 GiB serially
dataset_thresholds:
  small: 2_000_000_000              # < 2B entries
  medium: 15_000_000_000            # < 15B entries
  large: 40_000_000_000             # < 40B entries (Dask safe)
  max_entries: 1_000_000_000_000    # Reject larger datasets
strategy:
  small:
    chunk_size_multiplier: 1.0      # Use full chunk size
  medium:
    chunk_size_multiplier: 1.0      # Full chunk size for efficiency
  large:
    chunk_size_multiplier: 0.9      # Slightly reduced for stability
  xlarge:
    chunk_size_multiplier: 0.2      # Very small chunks for sequential processing
Key parameters explained:
- slowdown_threshold: If processing throughput drops below this fraction of baseline (e.g., 0.5 = 50%), the pipeline automatically reduces worker count to prevent memory thrashing
- min_workers_ratio: Minimum workers to keep as a fraction of max_workers (e.g., 0.25 = always keep at least 25% of workers)
- large_file_threshold_gib: Files larger than this size (in GiB) use a separate worker pool during metadata cache building
- large_file_workers: Number of parallel workers for large files (0 = serial processing, >0 = parallel). On high-RAM systems, using 4-8 workers significantly speeds up cache building
- chunk_size_multiplier: Adjusts the chunk size based on dataset category (smaller = safer for large datasets)
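Taken together, the throughput rule works roughly as follows. This is a simplified sketch of the behavior described above, not the pipeline's actual code:
# Simplified sketch of the adaptive worker-reduction rule described above.
def adjust_workers(current, max_workers, baseline_tput, current_tput,
                   slowdown_threshold=0.5, min_workers_ratio=0.25):
    min_workers = max(1, int(max_workers * min_workers_ratio))
    if current_tput < baseline_tput * slowdown_threshold:
        # Throughput collapsed, likely memory thrashing: halve the pool,
        # but never drop below the configured minimum.
        return max(min_workers, current // 2)
    return current

print(adjust_workers(current=42, max_workers=42, baseline_tput=10.0, current_tput=4.0))
# 21 -> pool halved because throughput fell below 50% of baseline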
Dataset Slicing
Large datasets are automatically sliced to respect memory limits:
slicing:
  enabled: true
  obs_slice_size: 300000          # Default for medium/large datasets
  obs_slice_size_xlarge: 100000   # Xlarge datasets use smaller slices
  overlap: 0
  merge_strategy: "combine"       # Combine slice statistics
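Conceptually, slicing reads obs_slice_size cells at a time in backed mode and accumulates per-slice statistics. A hedged sketch of that pattern with anndata (the pipeline's actual slice merging may be more involved):
# Sketch of obs-sliced processing in backed mode (illustrative only).
import anndata as ad
import numpy as np

def sliced_nnz_per_cell(path, obs_slice_size=300_000):
    adata = ad.read_h5ad(path, backed="r")
    per_slice = []
    for start in range(0, adata.n_obs, obs_slice_size):
        # Load one memory-safe slice of cells, compute its stats, discard it.
        chunk = adata[start:start + obs_slice_size].to_memory()
        per_slice.append(np.asarray((chunk.X != 0).sum(axis=1)).ravel())
    return np.concatenate(per_slice)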
Metadata Integration
Point to CELLxGENE metadata CSVs for enhanced context:
paths:
  metadata_csvs:
    - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv
    - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv
  enhanced_metadata_cache: output/cache/enhanced_metadata.parquet
The pipeline merges this with quick-scanned dimensions for intelligent scheduling.
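The merge is essentially a join of the CELLxGENE CSVs onto the quick-scanned dimensions. A sketch with pandas (the join key "dataset_id" is an assumption about both tables):
# Sketch of merging CELLxGENE metadata with scanned dimensions
# (the join key "dataset_id" is an assumed column name).
import pandas as pd

scanned = pd.read_parquet("output/cache/enhanced_metadata.parquet")
csvs = [
    "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv",
    "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv",
]
external = pd.concat([pd.read_csv(p) for p in csvs], ignore_index=True)
enhanced = scanned.merge(external, on="dataset_id", how="left")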
Processing Strategy
The pipeline uses parallel processing with priority ordering:
- Pre-scan phase: Quick metadata scan (no matrix loading) categorizes datasets by size
- Parallel execution: All datasets process in parallel using full worker pool
- Smart ordering: Small datasets (priority 1) start first for quick wins
- Automatic slicing: Large datasets split into memory-safe chunks
- Resource-aware: Strategies adapt chunk sizes based on dataset category
This approach fully leverages all available cores throughout the entire pipeline.
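A minimal sketch of size-priority scheduling with a full worker pool (the priority values and field names are assumptions; the pipeline's scheduler is more elaborate):
# Sketch of priority-ordered parallel processing (illustrative only).
from concurrent.futures import ProcessPoolExecutor, as_completed

PRIORITY = {"small": 1, "medium": 2, "large": 3, "xlarge": 4}

def run_all(datasets, process_one, max_workers):
    # Submit small datasets first so quick results come back early;
    # the pool still keeps every core busy across categories.
    ordered = sorted(datasets, key=lambda d: PRIORITY[d["size_category"]])
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, d["path"]): d["path"] for d in ordered}
        for fut in as_completed(futures):
            status = "ok" if fut.exception() is None else "failed"
            print(futures[fut], status)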
Outputs
- Per shard summary CSV:
output/eda/eda_summary_shard_XXX_of_YYY.csv
- Per shard failure log:
output/eda/eda_failures_shard_XXX_of_YYY.json
- Per dataset JSON details:
output/eda/per_dataset/*.json
- Merged summary (after sharding):
output/eda/eda_summary_all_shards.csv
- Global max report:
output/eda/max_nonzero_gene_count_all_cells.csv
output/eda/max_nonzero_gene_count_all_cells.json
- Metadata cache:
output/cache/enhanced_metadata.parquet
Output Schema
Each dataset result includes:
- Dimensions: n_obs, n_vars, total_entries
- Sparsity: nnz, sparsity
- Cell statistics: cell_sum_*, cell_nnz_* (mean/std/min/max/quantiles)
- Matrix statistics: x_mean, x_std
- Metadata summaries: obs/var column types and top values
- Schema: Complete column names and dtypes
- Processing info: size_category, processing_mode (full/sliced), elapsed_sec
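The outputs can be used directly for ad-hoc analysis; for example, loading the merged summary and one per-dataset JSON (field names follow the schema above, the file name is hypothetical):
# Load pipeline outputs for ad-hoc analysis (field names follow the schema
# described above; the per-dataset file name is hypothetical).
import json
import pandas as pd

summary = pd.read_csv("output/eda/eda_summary_all_shards.csv")
print(summary.sort_values("n_obs", ascending=False).head())

with open("output/eda/per_dataset/example_dataset.json") as fh:
    detail = json.load(fh)
print(detail["size_category"], detail["processing_mode"], detail["elapsed_sec"])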
Visualization
Open the notebook:
uv run jupyter lab notebooks/max_nonzero_gene_report.ipynb
The notebook provides:
- Global max non-zero gene count
- Distribution of cell-level statistics
- Dataset size analysis
- Processing time comparisons
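For quick plots outside the notebook, the merged summary can be charted directly; the column name below is an assumption based on the output schema:
# Quick histogram of per-dataset max non-zero gene counts outside the notebook
# (the column name "cell_nnz_max" is assumed from the schema above).
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.read_csv("output/eda/eda_summary_all_shards.csv")
summary["cell_nnz_max"].hist(bins=50)
plt.xlabel("max non-zero genes per cell")
plt.ylabel("datasets")
plt.show()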
Troubleshooting
Corrupted or failed datasets
If the metadata cache builder reports failed or corrupted files:
Step 1: Retry with robust strategy
Some files may fail due to transient issues or need special handling:
# Using config file (recommended - uses paths and thresholds from config)
uv run python scripts/retry_failed_cache.py --config configs/eda_optimized.yaml
# Or specify paths manually
uv run python scripts/retry_failed_cache.py \
    --cache output/cache/enhanced_metadata.parquet \
    --h5ad-dirs /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
                /project/GOV108018/cell_x_gene/mus_musculus/h5ad
This script:
- Retries failed datasets with progressively safer strategies (anndata backed mode → h5py direct)
- Categorizes truly corrupted files (truncated/damaged HDF5 structure)
- Merges retry results back into the cache
- Reports final statistics (successful recoveries vs truly corrupted)
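The fallback chain described above amounts to trying progressively lower-level readers. A sketch of that pattern (not the retry script's actual code):
# Sketch of the progressive fallback: anndata backed mode first, then raw h5py.
import anndata as ad
import h5py

def probe_dims(path):
    try:
        adata = ad.read_h5ad(path, backed="r")
        return adata.n_obs, adata.n_vars, "anndata_backed"
    except Exception:
        pass
    try:
        with h5py.File(path, "r") as f:
            X = f["X"]
            # Sparse X is stored as a group with a 'shape' attribute,
            # dense X as a dataset with a .shape tuple.
            shape = tuple(X.attrs["shape"]) if isinstance(X, h5py.Group) else X.shape
            return int(shape[0]), int(shape[1]), "h5py_direct"
    except (OSError, KeyError):
        return None, None, "corrupted"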
Step 2: Re-download corrupted files
For files that are truly corrupted (status='corrupted'), re-download fresh copies from CELLxGENE:
uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml
This script:
- Identifies corrupted files from the metadata cache
- Looks up dataset IDs and download URLs from CELLxGENE metadata CSVs
- Downloads files to output/temp/ for safety
- Verifies each downloaded file is valid HDF5
- Moves verified files to replace corrupted originals
- Keeps failed downloads in temp for inspection
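The HDF5 verification step can also be reproduced by hand with h5py (a minimal check, not the script itself; the file path is hypothetical):
# Manual check that a downloaded file is structurally valid HDF5.
import h5py

path = "output/temp/example.h5ad"  # hypothetical downloaded file
if h5py.is_hdf5(path):
    with h5py.File(path, "r") as f:
        print("valid HDF5, top-level keys:", list(f.keys()))
else:
    print("not a valid HDF5 file, keep in temp for inspection")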
After re-downloading, rebuild the metadata cache to update the status:
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml --force-rescan
Typical corruption causes:
- Interrupted downloads during dataset collection
- HDF5 file not properly closed/finalized during creation
- Storage/filesystem errors
- Network transfer errors from original source
Note: Files marked as 'corrupted' have HDF5 structural issues (truncated superblock, missing data blocks) and cannot be repaired - they must be re-downloaded from the source.
Metadata cache not found
# Build it first
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
Memory errors
Reduce workers and chunk size in config:
resources:
  max_workers: 24
  chunk_size: 4096
  large_file_threshold_gib: 20.0   # Process larger files separately
  large_file_workers: 2            # Fewer workers for large files (or 0 for serial)
slicing:
  obs_slice_size: 50000
Performance degradation during processing
The pipeline automatically reduces workers if throughput drops significantly. Adjust sensitivity:
resources:
  slowdown_threshold: 0.3   # More sensitive (30% slowdown triggers reduction)
  min_workers_ratio: 0.5    # Keep more workers (50% minimum)
Dataset too large
Adjust thresholds or enable more aggressive slicing:
dataset_thresholds:
  max_entries: 50_000_000_000   # Lower limit
slicing:
  obs_slice_size: 30000         # Smaller slices