Distributed EDA for Cell x Gene
This folder includes a metadata-aware EDA pipeline for large .h5ad files with YAML-based configuration.
All commands below assume you are working from the project directory:
cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
Quick Start
0) Check system resources (optional)
Probe your system and get recommendations for config settings:
# Using your config file
uv run python scripts/resource_probe.py --config configs/eda_optimized.yaml
# Or specify workdir manually
uv run python scripts/resource_probe.py --workdir /project/GOV108018
This outputs recommendations for workers, memory, and sharding based on your system.
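The probe boils down to reading total RAM and core count and leaving some headroom. A minimal sketch of that idea, assuming psutil is available (the actual script's heuristics may differ):
# Sketch of a system probe similar in spirit to resource_probe.py
# (illustrative only; the real script's heuristics may differ).
import os
import psutil

def probe():
    total_gib = psutil.virtual_memory().total / 2**30
    cores = os.cpu_count() or 1
    # Leave headroom: reserve ~20% of RAM and one core for the OS.
    rec_memory = int(total_gib * 0.8)
    rec_workers = max(1, cores - 1)
    print(f"RAM: {total_gib:.0f} GiB -> suggested max_memory_gib: {rec_memory}")
    print(f"Cores: {cores} -> suggested max_workers: {rec_workers}")

if __name__ == "__main__":
    probe()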
1) Configure pipeline
Use the optimized config (auto-generated for your system: 394 GB RAM, 56 cores):
cat configs/eda_optimized.yaml
Or create your own based on the template:
cp configs/eda_config_template.yaml configs/my_config.yaml
# Edit my_config.yaml with your paths and resource limits
2) Build metadata cache
Pre-scan all datasets to determine sizes and enable intelligent scheduling:
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
This creates output/cache/enhanced_metadata.parquet with:
- Dataset dimensions (n_obs × n_vars)
- File sizes
- Size categories (small/medium/large/xlarge)
- Estimated memory requirements
Cache is incremental - only new/changed files are rescanned. Use --force-rescan to rebuild.
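The cache is a plain Parquet table, so it can be inspected directly with pandas; the column names below follow the fields listed above but the exact schema is an assumption:
# Inspect the metadata cache with pandas (column names are assumed from the
# fields described above and may differ in the actual schema).
import pandas as pd

cache = pd.read_parquet("output/cache/enhanced_metadata.parquet")
print(cache[["n_obs", "n_vars", "file_size_gib"]].describe())
print(cache["size_category"].value_counts())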
Handling corrupted files: If some files fail during scanning (status='failed' or 'corrupted'), you can:
- Retry them with a more robust strategy:
uv run python scripts/retry_failed_cache.py --cache output/cache/enhanced_metadata.parquet
- Re-download corrupted files from CELLxGENE:
uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml
See Troubleshooting for details.
3) Run EDA pipeline
If cache already built, run EDA directly:
# Run EDA only (skip metadata step)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step eda
Or run all steps (cache is incremental, so safe to re-run):
# Full pipeline: metadata + EDA + merge
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml
Individual steps:
# Step 1: Build metadata (incremental - only new/changed files)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step metadata
# Step 2: Run EDA only
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step eda
# Step 3: Merge shards (if using sharding)
uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step merge
4) Direct script usage
For more control:
# Build metadata cache
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
# Run EDA with all workers
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml
# Override worker count
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --force-workers 32
Distributed Processing (SLURM)
For multi-node HPC clusters, use array jobs:
# Submit 4 parallel jobs
sbatch --array=0-3 scripts/run_eda_slurm.sh configs/eda_optimized.yaml 4
# After all jobs complete, merge results
uv run python scripts/merge_eda_shards.py --output-dir output/eda
Or configure sharding in YAML:
sharding:
  enabled: true
  num_shards: 4
  shard_index: 0              # Override with --shard-index on command line
  strategy: "size_balanced"   # Distribute by size for load balancing
Then run each shard:
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 0
uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 1
# ... etc
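The "size_balanced" strategy amounts to greedy bin packing: take datasets from largest to smallest and always assign the next one to the currently lightest shard. A sketch of that idea (not the pipeline's actual code):
# Greedy size-balanced shard assignment (illustrative sketch).
def assign_shards(sizes, num_shards):
    loads = [0] * num_shards
    assignment = {}
    # Largest datasets first, each placed on the lightest shard so far.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        shard = loads.index(min(loads))
        assignment[name] = shard
        loads[shard] += size
    return assignment

print(assign_shards({"a.h5ad": 40, "b.h5ad": 30, "c.h5ad": 20, "d.h5ad": 10}, 2))
# {'a.h5ad': 0, 'b.h5ad': 1, 'c.h5ad': 1, 'd.h5ad': 0}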
Configuration Guide
Resource Management
The pipeline respects your resource limits and adapts processing strategy by dataset size:
resources:
  max_memory_gib: 240       # Total memory available
  max_workers: 42           # Maximum parallel workers
  min_workers: 8            # Starting worker count
  chunk_size: 12288         # Matrix chunk size
  # Performance tuning
  slowdown_threshold: 0.5   # Reduce workers if throughput drops by 50%
  min_workers_ratio: 0.25   # Keep at least 25% of max workers
  large_file_threshold_gib: 30.0   # Process files >30 GiB serially
dataset_thresholds:
  small: 2_000_000_000              # < 2B entries
  medium: 15_000_000_000            # < 15B entries
  large: 40_000_000_000             # < 40B entries (Dask safe)
  max_entries: 1_000_000_000_000    # Reject larger datasets
strategy:
  small:
    chunk_size_multiplier: 1.0      # Use full chunk size
  medium:
    chunk_size_multiplier: 1.0      # Full chunk size for efficiency
  large:
    chunk_size_multiplier: 0.9      # Slightly reduced for stability
  xlarge:
    chunk_size_multiplier: 0.2      # Very small chunks for sequential processing
Key parameters explained:
- slowdown_threshold: If processing throughput drops below this fraction of baseline (e.g., 0.5 = 50%), the pipeline automatically reduces worker count to prevent memory thrashing
- min_workers_ratio: Minimum workers to keep as a fraction of max_workers (e.g., 0.25 = always keep at least 25% of workers)
- large_file_threshold_gib: Files larger than this size (in GiB) use a separate worker pool during metadata cache building
- large_file_workers: Number of parallel workers for large files (0 = serial processing, >0 = parallel). On high-RAM systems, using 4-8 workers significantly speeds up cache building
- chunk_size_multiplier: Adjusts the chunk size based on dataset category (smaller = safer for large datasets)
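Taken together, the throughput rule works roughly as follows. This is a simplified sketch of the behavior described above, not the pipeline's actual code:
# Simplified sketch of the adaptive worker-reduction rule described above.
def adjust_workers(current, max_workers, baseline_tput, current_tput,
                   slowdown_threshold=0.5, min_workers_ratio=0.25):
    min_workers = max(1, int(max_workers * min_workers_ratio))
    if current_tput < baseline_tput * slowdown_threshold:
        # Throughput collapsed, likely memory thrashing: halve the pool,
        # but never drop below the configured minimum.
        return max(min_workers, current // 2)
    return current

print(adjust_workers(current=42, max_workers=42, baseline_tput=10.0, current_tput=4.0))
# 21 -> pool halved because throughput fell below 50% of baseline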
Dataset Slicing
Large datasets are automatically sliced to respect memory limits:
slicing:
  enabled: true
  obs_slice_size: 300000          # Default for medium/large datasets
  obs_slice_size_xlarge: 100000   # Xlarge datasets use smaller slices
  overlap: 0
  merge_strategy: "combine"       # Combine slice statistics
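Conceptually, slicing reads obs_slice_size cells at a time in backed mode and accumulates per-slice statistics. A hedged sketch of that pattern with anndata (the pipeline's actual slice merging may be more involved):
# Sketch of obs-sliced processing in backed mode (illustrative only).
import anndata as ad
import numpy as np

def sliced_nnz_per_cell(path, obs_slice_size=300_000):
    adata = ad.read_h5ad(path, backed="r")
    per_slice = []
    for start in range(0, adata.n_obs, obs_slice_size):
        # Load one memory-safe slice of cells, compute its stats, discard it.
        chunk = adata[start:start + obs_slice_size].to_memory()
        per_slice.append(np.asarray((chunk.X != 0).sum(axis=1)).ravel())
    return np.concatenate(per_slice)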
Metadata Integration
Point to CELLxGENE metadata CSVs for enhanced context:
paths:
  metadata_csvs:
    - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv
    - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv
  enhanced_metadata_cache: output/cache/enhanced_metadata.parquet
The pipeline merges this with quick-scanned dimensions for intelligent scheduling.
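The merge is essentially a join of the CELLxGENE CSVs onto the quick-scanned dimensions. A sketch with pandas (the join key "dataset_id" is an assumption about both tables):
# Sketch of merging CELLxGENE metadata with scanned dimensions
# (the join key "dataset_id" is an assumed column name).
import pandas as pd

scanned = pd.read_parquet("output/cache/enhanced_metadata.parquet")
csvs = [
    "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv",
    "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv",
]
external = pd.concat([pd.read_csv(p) for p in csvs], ignore_index=True)
enhanced = scanned.merge(external, on="dataset_id", how="left")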
Processing Strategy
The pipeline uses parallel processing with priority ordering:
- Pre-scan phase: Quick metadata scan (no matrix loading) categorizes datasets by size
- Parallel execution: All datasets process in parallel using full worker pool
- Smart ordering: Small datasets (priority 1) start first for quick wins
- Automatic slicing: Large datasets split into memory-safe chunks
- Resource-aware: Strategies adapt chunk sizes based on dataset category
This approach fully leverages all available cores throughout the entire pipeline.
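A minimal sketch of size-priority scheduling with a full worker pool (the priority values and field names are assumptions; the pipeline's scheduler is more elaborate):
# Sketch of priority-ordered parallel processing (illustrative only).
from concurrent.futures import ProcessPoolExecutor, as_completed

PRIORITY = {"small": 1, "medium": 2, "large": 3, "xlarge": 4}

def run_all(datasets, process_one, max_workers):
    # Submit small datasets first so quick results come back early;
    # the pool still keeps every core busy across categories.
    ordered = sorted(datasets, key=lambda d: PRIORITY[d["size_category"]])
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, d["path"]): d["path"] for d in ordered}
        for fut in as_completed(futures):
            status = "ok" if fut.exception() is None else "failed"
            print(futures[fut], status)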
Outputs
- Per shard summary CSV:
output/eda/eda_summary_shard_XXX_of_YYY.csv
- Per shard failure log:
output/eda/eda_failures_shard_XXX_of_YYY.json
- Per dataset JSON details:
output/eda/per_dataset/*.json
- Merged summary (after sharding):
output/eda/eda_summary_all_shards.csv
- Global max report:
output/eda/max_nonzero_gene_count_all_cells.csv
output/eda/max_nonzero_gene_count_all_cells.json
- Metadata cache:
output/cache/enhanced_metadata.parquet
Output Schema
Each dataset result includes:
- Dimensions: n_obs, n_vars, total_entries
- Sparsity: nnz, sparsity
- Cell statistics: cell_sum_*, cell_nnz_* (mean/std/min/max/quantiles)
- Matrix statistics: x_mean, x_std
- Metadata summaries: obs/var column types and top values
- Schema: Complete column names and dtypes
- Processing info: size_category, processing_mode (full/sliced), elapsed_sec
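The outputs can be used directly for ad-hoc analysis; for example, loading the merged summary and one per-dataset JSON (field names follow the schema above, the file name is hypothetical):
# Load pipeline outputs for ad-hoc analysis (field names follow the schema
# described above; the per-dataset file name is hypothetical).
import json
import pandas as pd

summary = pd.read_csv("output/eda/eda_summary_all_shards.csv")
print(summary.sort_values("n_obs", ascending=False).head())

with open("output/eda/per_dataset/example_dataset.json") as fh:
    detail = json.load(fh)
print(detail["size_category"], detail["processing_mode"], detail["elapsed_sec"])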
Visualization
Open the notebook:
uv run jupyter lab notebooks/max_nonzero_gene_report.ipynb
The notebook provides:
- Global max non-zero gene count
- Distribution of cell-level statistics
- Dataset size analysis
- Processing time comparisons
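For quick plots outside the notebook, the merged summary can be charted directly; the column name below is an assumption based on the output schema:
# Quick histogram of per-dataset max non-zero gene counts outside the notebook
# (the column name "cell_nnz_max" is assumed from the schema above).
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.read_csv("output/eda/eda_summary_all_shards.csv")
summary["cell_nnz_max"].hist(bins=50)
plt.xlabel("max non-zero genes per cell")
plt.ylabel("datasets")
plt.show()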
Troubleshooting
Corrupted or failed datasets
If the metadata cache builder reports failed or corrupted files:
Step 1: Retry with robust strategy
Some files may fail due to transient issues or need special handling:
# Using config file (recommended - uses paths and thresholds from config)
uv run python scripts/retry_failed_cache.py --config configs/eda_optimized.yaml
# Or specify paths manually
uv run python scripts/retry_failed_cache.py \
    --cache output/cache/enhanced_metadata.parquet \
    --h5ad-dirs /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
                /project/GOV108018/cell_x_gene/mus_musculus/h5ad
This script:
- Retries failed datasets with progressively safer strategies (anndata backed mode → h5py direct)
- Categorizes truly corrupted files (truncated/damaged HDF5 structure)
- Merges retry results back into the cache
- Reports final statistics (successful recoveries vs truly corrupted)
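The fallback chain described above amounts to trying progressively lower-level readers. A sketch of that pattern (not the retry script's actual code):
# Sketch of the progressive fallback: anndata backed mode first, then raw h5py.
import anndata as ad
import h5py

def probe_dims(path):
    try:
        adata = ad.read_h5ad(path, backed="r")
        return adata.n_obs, adata.n_vars, "anndata_backed"
    except Exception:
        pass
    try:
        with h5py.File(path, "r") as f:
            X = f["X"]
            # Sparse X is stored as a group with a 'shape' attribute,
            # dense X as a dataset with a .shape tuple.
            shape = tuple(X.attrs["shape"]) if isinstance(X, h5py.Group) else X.shape
            return int(shape[0]), int(shape[1]), "h5py_direct"
    except (OSError, KeyError):
        return None, None, "corrupted"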
Step 2: Re-download corrupted files
For files that are truly corrupted (status='corrupted'), re-download fresh copies from CELLxGENE:
uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml
This script:
- Identifies corrupted files from the metadata cache
- Looks up dataset IDs and download URLs from CELLxGENE metadata CSVs
- Downloads files to output/temp/ for safety
- Verifies each downloaded file is valid HDF5
- Moves verified files to replace corrupted originals
- Keeps failed downloads in temp for inspection
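The HDF5 verification step can also be reproduced by hand with h5py (a minimal check, not the script itself; the file path is hypothetical):
# Manual check that a downloaded file is structurally valid HDF5.
import h5py

path = "output/temp/example.h5ad"  # hypothetical downloaded file
if h5py.is_hdf5(path):
    with h5py.File(path, "r") as f:
        print("valid HDF5, top-level keys:", list(f.keys()))
else:
    print("not a valid HDF5 file, keep in temp for inspection")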
After re-downloading, rebuild the metadata cache to update the status:
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml --force-rescan
Typical corruption causes:
- Interrupted downloads during dataset collection
- HDF5 file not properly closed/finalized during creation
- Storage/filesystem errors
- Network transfer errors from original source
Note: Files marked as 'corrupted' have HDF5 structural issues (truncated superblock, missing data blocks) and cannot be repaired - they must be re-downloaded from the source.
Metadata cache not found
# Build it first
uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
Memory errors
Reduce workers and chunk size in config:
resources:
  max_workers: 24
  chunk_size: 4096
  large_file_threshold_gib: 20.0   # Process larger files separately
  large_file_workers: 2            # Fewer workers for large files (or 0 for serial)
slicing:
  obs_slice_size: 50000
Performance degradation during processing
The pipeline automatically reduces workers if throughput drops significantly. Adjust sensitivity:
resources:
  slowdown_threshold: 0.3   # More sensitive (30% slowdown triggers reduction)
  min_workers_ratio: 0.5    # Keep more workers (50% minimum)
Dataset too large
Adjust thresholds or enable more aggressive slicing:
dataset_thresholds:
  max_entries: 50_000_000_000   # Lower limit
slicing:
  obs_slice_size: 30000         # Smaller slices