Comparison Workflow
Last Updated: January 27, 2025
The Comparison Workflow integrates your data with the CELLxGENE Census reference atlas.
Overview
The Comparison Workflow enhances your dataset by:
- Reference Selection - Filter CELLxGENE Census by tissue, disease, etc.
- Similarity Matching - Find reference cells similar to your data
- Data Integration - Combine your cells with selected reference cells
- Joint Analysis - Perform dimensionality reduction on the combined dataset
- Visualization - Generate UMAP with query and reference cells labeled
- Classification - Predict cell types using reference annotations
Requirements
- Human or mouse data only - Other organisms not supported
- scVI or TranscriptFormer model - PCA not compatible with comparison
- Gene overlap - Sufficient genes must match CELLxGENE Census
- Compatible format - Standard H5AD requirements apply
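The organism and gene-overlap requirements above can be checked before submitting a dataset. The following is a minimal sketch, not the workflow's own validation: the function names are illustrative, and `min_overlap=0.5` is a hypothetical threshold, since the documentation does not state how much overlap counts as "sufficient".

```python
SUPPORTED_ORGANISMS = {"human", "mouse"}


def gene_overlap_fraction(query_genes, reference_genes):
    """Return the fraction of query genes that also appear in the reference."""
    query = set(query_genes)
    if not query:
        return 0.0
    return len(query & set(reference_genes)) / len(query)


def check_dataset(organism, query_genes, reference_genes, min_overlap=0.5):
    """Collect human-readable problems; an empty list means the checks passed.

    min_overlap is a hypothetical threshold, not a documented value.
    """
    problems = []
    if organism.strip().lower() not in SUPPORTED_ORGANISMS:
        problems.append(f"unsupported organism: {organism}")
    overlap = gene_overlap_fraction(query_genes, reference_genes)
    if overlap < min_overlap:
        problems.append(f"gene overlap too low: {overlap:.0%} < {min_overlap:.0%}")
    return problems
```

For example, `check_dataset("human", my_genes, census_genes)` returns an empty list when both checks pass, where `my_genes` and `census_genes` are lists of gene identifiers in the same naming scheme.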
CELLxGENE Census Integration
What is CELLxGENE Census?
The CELLxGENE Census is a comprehensive reference atlas containing:
- Millions of cells from healthy and diseased tissues
- Standardized annotations with consistent cell type labels
- Quality controlled data with uniform processing
- Pre-trained models for embedding generation
- Diverse tissues across human and mouse organisms
Reference Data Selection
The workflow automatically:
- Filters by organism (human/mouse) based on your data
- Applies tissue filters if specified (brain, blood, lung, etc.)
- Samples proportionally to maintain cell type diversity
- Caps the total number of reference cells to fit computational constraints (250k to 1M cells)
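Proportional sampling can be illustrated with a short sketch. This is one plausible reading of "samples proportionally", not the workflow's actual implementation: each cell type receives a quota proportional to its frequency, and at least one cell per type is kept so that rare types survive. The function name and the `seed` parameter are illustrative.

```python
import random
from collections import defaultdict


def sample_proportionally(cell_types, max_cells, seed=0):
    """Pick about max_cells indices, keeping each cell type's share roughly intact.

    cell_types: list of labels, one per reference cell.
    Returns a sorted list of selected indices. Because rare types are rounded
    up to one cell, the result can exceed max_cells by at most the number of
    distinct cell types.
    """
    total = len(cell_types)
    if total <= max_cells:
        return list(range(total))

    # Group cell indices by their cell type label.
    by_type = defaultdict(list)
    for index, label in enumerate(cell_types):
        by_type[label].append(index)

    rng = random.Random(seed)
    selected = []
    for indices in by_type.values():
        # Quota proportional to frequency, with a floor of one cell per type.
        quota = max(1, int(max_cells * len(indices) / total))
        selected.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(selected)
```

With 80 T cells, 15 B cells, and 5 NK cells and `max_cells=20`, the quotas come out to 16, 3, and 1 cells respectively.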
Computational Steps
1. Data Preparation
Your H5AD File
↓
- Load and validate dataset structure
- Extract organism from metadata
- Generate model embeddings (scVI/TranscriptFormer)
- Prepare for Census integration
2. Reference Data Query
CELLxGENE Census
↓
- Query based on organism (human/mouse)
- Apply tissue/condition filters if specified
- Sample cells proportionally by cell type
- Download reference metadata and embeddings
3. Similarity Search (Optional)
Query Embeddings + Reference Embeddings
↓
- Build nearest neighbor index (HNSW algorithm)
- Find k=30 nearest neighbors for each query cell
- Select unique reference cells (up to 1M total)
- Filter reference data to similar cells only
Similarity Search Options:
- Enabled: Uses nearest neighbor search to find most similar reference cells
- Disabled: Random sampling of all matching reference cells
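The selection logic of the similarity search can be sketched in a few lines. Note the substitution: the workflow uses an approximate HNSW index, whereas this sketch uses exact brute-force search so that it stays dependency-free; the two agree on small inputs but brute force does not scale to Census-sized data. Euclidean distance and the way the cap is applied are assumptions, since the documentation specifies neither.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_similar_reference(query_emb, ref_emb, k=30, max_cells=1_000_000):
    """Return sorted indices of reference cells that are among the k nearest
    neighbors of at least one query cell, capped at max_cells."""
    selected = set()
    for q in query_emb:
        # Exact search: rank every reference cell by distance to this query cell.
        ranked = sorted(range(len(ref_emb)), key=lambda i: euclidean(q, ref_emb[i]))
        # A set keeps each reference cell once, even if many query cells hit it.
        selected.update(ranked[:k])
    # Assumption: the cap simply keeps the lowest indices; how the workflow
    # decides which cells to keep when over the limit is not documented.
    return sorted(selected)[:max_cells]
```

Because neighbors are shared between query cells, the number of unique reference cells is usually far smaller than `k` times the number of query cells.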
4. Data Integration
Query Data + Reference Data
↓
- Align gene spaces between datasets
- Handle missing genes with outer join
- Combine observation metadata
- Label cells as "My data" vs "Reference"
Similarity Search Options
Enabled
- Uses nearest neighbor search to find most similar reference cells
- Includes up to 1M reference cells most similar to your data
- More computationally intensive processing
Disabled
- Random sampling of all matching reference cells
- Includes up to 250k reference cells from filtered Census data
- Faster processing
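Whichever option selects the reference cells, step 4 (Data Integration) then aligns the two gene spaces with an outer join. The sketch below shows what that means on plain Python lists: the combined table covers the union of genes, and values for genes absent from one dataset are filled in. Filling with `0.0` is an assumption, and the function name is illustrative.

```python
def outer_join_genes(query_genes, query_rows, ref_genes, ref_rows, fill=0.0):
    """Align two expression tables on the union of their genes.

    query_rows / ref_rows: lists of per-cell value lists, ordered like the
    matching gene list. Genes missing from one dataset are filled with `fill`.
    Returns (genes, rows, origin), where origin labels each row's source.
    """
    # Union of genes: query order first, then reference-only genes.
    genes = list(dict.fromkeys(list(query_genes) + list(ref_genes)))

    def realign(source_genes, rows):
        position = {gene: i for i, gene in enumerate(source_genes)}
        return [
            [row[position[g]] if g in position else fill for g in genes]
            for row in rows
        ]

    rows = realign(query_genes, query_rows) + realign(ref_genes, ref_rows)
    origin = ["My data"] * len(query_rows) + ["Reference"] * len(ref_rows)
    return genes, rows, origin
```

On AnnData objects the same operation is commonly expressed with `anndata.concat(..., join="outer")`; whether the workflow does it that way is not stated here.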
Configuration Options
Reference Filters
Available tissue filters include:
- General tissue: Brain, blood, lung, heart, liver, kidney, etc.
- All tissues: No tissue filtering (full organism reference)
- Multiple selection: Choose multiple tissues simultaneously
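For readers who query the Census directly, a tissue selection like the one above maps naturally onto an obs filter expression. The helper below is a hypothetical sketch that builds such a string on the Census `tissue_general` column; it is not the workflow's internal code.

```python
def build_value_filter(tissues=None):
    """Build a Census-style obs filter string from selected general tissues.

    An empty or missing selection means "All tissues", so no tissue clause
    is needed and None is returned.
    """
    if not tissues:
        return None
    # Lowercase and de-duplicate while preserving the selection order.
    names = list(dict.fromkeys(t.strip().lower() for t in tissues))
    if len(names) == 1:
        return f"tissue_general == '{names[0]}'"
    quoted = ", ".join(f"'{name}'" for name in names)
    return f"tissue_general in [{quoted}]"
```

A string produced this way could be passed as the `value_filter` argument of `cellxgene_census.get_obs`, assuming the selected names match the tissue labels used in the Census.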
Workflow Parameters
- Model selection: scVI or TranscriptFormer required
- Similarity search: Enable/disable nearest neighbor matching
- Tissue filters: Select relevant tissues from dropdown
- TranscriptFormer variant: Choose appropriate model variant
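These parameters can be pictured as a small configuration object. The class and field names below are hypothetical, chosen only to mirror the options listed above. The reference cell limits (1M with similarity search, 250k without) come from the Similarity Search Options section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUPPORTED_MODELS = ("scVI", "TranscriptFormer")


@dataclass
class ComparisonConfig:
    """Hypothetical container for the comparison workflow parameters."""

    model: str
    similarity_search: bool
    tissues: List[str] = field(default_factory=list)  # empty means all tissues
    transcriptformer_variant: Optional[str] = None

    def __post_init__(self):
        # PCA embeddings are not compatible with the comparison workflow.
        if self.model not in SUPPORTED_MODELS:
            raise ValueError(f"model must be one of {SUPPORTED_MODELS}, got {self.model!r}")

    @property
    def max_reference_cells(self) -> int:
        # Up to 1M cells with similarity search, up to 250k with random sampling.
        return 1_000_000 if self.similarity_search else 250_000
```

For example, `ComparisonConfig(model="scVI", similarity_search=False, tissues=["brain"])` describes a run that randomly samples up to 250k brain reference cells.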
Expected Results
Processing Time
- Data preparation: 5-10 minutes
- Reference download: 10-20 minutes
- Integration: 15-30 minutes
- Total time: 30-60 minutes depending on dataset size
Visualization Output
Your completed comparison provides:
Interactive CELLxGENE Explorer
- Combined dataset: Your cells + reference cells in same embedding space
- Cell origin labels: Distinguish "My data" from "Reference" cells
- Cell type annotations: Reference cell type labels for context
- Gene expression: Compare expression patterns between datasets
- Neighborhood analysis: See which reference cells are most similar