Co-Embed Workflow
Last Updated: January 15, 2026
The Co-Embed Workflow integrates multiple datasets into a unified embedding space, enabling comparative analysis across different data sources.
Overview
The Co-Embed Workflow combines two datasets through a complete integration pipeline:
- Data Combination - Outer-join datasets with automatic source tracking
- Quality Control - Apply filtering to combined dataset
- Normalization - Standardize gene expression across datasets
- Feature Selection - Identify highly variable genes in combined data
- Dimensionality Reduction - Generate integrated embeddings (PCA/scVI/TranscriptFormer)
- Batch Correction - Apply integration methods (Harmony for PCA, de novo training for scVI)
- Label Transfer - Transfer cell type labels from reference to target (optional)
- Visualization - Create UMAP coordinates for 2D plotting
- Export - Convert results to CellxGene Explorer format
Requirements
- Two H5AD files - Exactly two datasets required for co-embedding
- Compatible gene IDs - Datasets must share compatible gene identifier format
- Reference dataset - For classification, the reference must contain a cell_type column
- Model compatibility - All models (PCA, scVI, TranscriptFormer) support co-embed
Computational Steps
1. Data Loading and Combination
Dataset 1 (Target) + Dataset 2 (Reference)
↓
- Load and validate both H5AD files
- Perform outer-join on gene space
- Create dataset_source column to track cell origin
- Label datasets as "dataset_1" and "dataset_2"
- Align metadata structures
Note: The second file uploaded becomes the reference dataset (dataset_2) for label transfer operations.
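The outer-join and source tracking described above can be sketched with plain NumPy. This is a minimal illustration, not the workflow's actual code: the function name `outer_join` is hypothetical, and filling genes that are absent from one dataset with zeros is an assumption.

```python
import numpy as np

def outer_join(counts_1, genes_1, counts_2, genes_2):
    """Combine two cells-by-genes matrices on the union of their genes."""
    # Union of gene IDs from both datasets, in sorted order.
    all_genes = sorted(set(genes_1) | set(genes_2))
    column = {gene: i for i, gene in enumerate(all_genes)}

    blocks = []
    for counts, genes in ((counts_1, genes_1), (counts_2, genes_2)):
        # Assumption: genes missing from a dataset are filled with zeros.
        block = np.zeros((counts.shape[0], len(all_genes)), dtype=counts.dtype)
        block[:, [column[g] for g in genes]] = counts
        blocks.append(block)

    combined = np.vstack(blocks)
    # Track cell origin, using the same labels as the workflow.
    dataset_source = np.array(
        ["dataset_1"] * counts_1.shape[0] + ["dataset_2"] * counts_2.shape[0]
    )
    return combined, all_genes, dataset_source

target = np.array([[1, 2], [3, 4]])               # genes A, B
reference = np.array([[5, 6], [7, 8], [9, 10]])   # genes B, C
combined, genes, dataset_source = outer_join(
    target, ["A", "B"], reference, ["B", "C"]
)
```

With AnnData objects, a comparable result would likely come from `anndata.concat` with `join="outer"`, `label="dataset_source"`, and `keys=["dataset_1", "dataset_2"]`; whether the workflow uses that call is not stated here.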
2. Quality Control (Automatic)
Combined Raw Data
↓
- Filter cells: minimum 200 genes per cell
- Filter genes: minimum 3 cells per gene
- Remove high-mitochondrial cells (greater than 20% MT genes)
- Calculate quality metrics per dataset source
Quality control is applied to the combined dataset, maintaining cell origin tracking throughout.
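As a rough sketch of these filters, the function below applies the three thresholds listed above to a dense cells-by-genes matrix. The function name `qc_filter`, the `MT-` gene-name prefix for mitochondrial genes, and the exact order of operations are assumptions made for illustration.

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=200, min_cells=3,
              max_mt_fraction=0.20, mt_prefix="MT-"):
    """Return boolean masks for the cells and genes that pass QC."""
    counts = np.asarray(counts)
    gene_names = np.asarray(gene_names)

    # Filter cells: at least `min_genes` detected genes per cell.
    keep_cells = (counts > 0).sum(axis=1) >= min_genes

    # Filter genes: detected in at least `min_cells` of the remaining cells.
    keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= min_cells

    # Remove cells whose mitochondrial fraction exceeds the threshold.
    # Assumption: mitochondrial genes are identified by the "MT-" prefix.
    is_mt = np.char.startswith(gene_names, mt_prefix)
    totals = counts.sum(axis=1)
    mt_fraction = np.divide(
        counts[:, is_mt].sum(axis=1), totals,
        out=np.zeros(len(totals)), where=totals > 0,
    )
    keep_cells = keep_cells & (mt_fraction <= max_mt_fraction)
    return keep_cells, keep_genes

genes = ["MT-CO1", "GENE1", "GENE2", "GENE3"]
counts = np.array([
    [0, 3, 2, 1],   # passes all filters
    [1, 4, 5, 0],   # 10% mitochondrial, passes
    [8, 1, 1, 0],   # 80% mitochondrial, removed
    [0, 1, 0, 0],   # only one detected gene, removed
])
# Small thresholds so the toy matrix exercises every filter.
keep_cells, keep_genes = qc_filter(counts, genes, min_genes=2, min_cells=2)
```

The defaults mirror the documented thresholds (200 genes per cell, 3 cells per gene, 20% mitochondrial reads); the toy call lowers them only so that a four-cell example is meaningful.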
3. Normalization and Preprocessing
Combined Filtered Data
↓
- Detect data type (raw counts vs normalized)
- Normalize to 10,000 counts per cell (if raw)
- Log-transform (log1p)
- Store raw counts in layers['counts']
- Preserve dataset_source tracking
4. Feature Selection
Combined Normalized Data
↓
- Identify highly variable genes across both datasets
- Apply batch-aware selection using dataset_source
- Use Seurat v3 method for feature selection
- Subset data to highly variable genes (top 3,000)
Model Options
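Every model option below starts from the output of steps 3 and 4. As a rough illustration of step 3, the sketch below normalizes raw counts to 10,000 per cell, applies log1p, and keeps the raw counts in a `counts` layer. The integer-based raw-count check and the decision to pass already-normalized input through unchanged are assumptions; the function names are hypothetical. Seurat v3 feature selection is not reproduced here.

```python
import numpy as np

def looks_like_raw_counts(x):
    # Heuristic (assumption): raw counts are non-negative whole numbers.
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= 0) and np.allclose(x, np.round(x)))

def normalize_log1p(x, target_sum=10_000):
    """Normalize each cell to `target_sum` counts, then log1p-transform."""
    x = np.asarray(x, dtype=float)
    layers = {}
    if not looks_like_raw_counts(x):
        # Assumption: already-normalized input is returned unchanged.
        return x, layers

    layers["counts"] = x.copy()          # preserve raw counts
    totals = x.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0            # avoid dividing by zero for empty cells
    return np.log1p(x / totals * target_sum), layers

raw = np.array([[1, 1, 2], [0, 5, 5]])
logged, layers = normalize_log1p(raw)
# A second pass detects non-integer values and leaves the data untouched.
again, layers_again = normalize_log1p(logged)
```

In a Scanpy-based pipeline the equivalent calls would typically be `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`, though this page does not say which library the workflow uses.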
PCA (Principal Component Analysis)
Features
- Linear method: Captures major sources of variation across datasets
- Harmony integration: Automatic batch correction for dataset integration
- Batch column: Uses dataset_source, automatically created from the outer-join
- Speed: Fastest processing time for multi-dataset analysis
- Output: Integrated embeddings in X_pca after Harmony correction
Processing Steps
- PCA Step: Performs principal component analysis on combined dataset
  - Uses dataset_source as batch column
  - Generates initial PCA embeddings
- Harmony Integration: Applies Harmony batch correction
  - Corrects for batch effects between datasets
  - Outputs integrated embeddings in X_pca
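The PCA step on its own can be sketched with a singular value decomposition, as below. This covers only the initial embedding; the Harmony correction is not reproduced. The function name `pca_embed` and the default of 50 components are assumptions, since the number of components used by the workflow is not stated.

```python
import numpy as np

def pca_embed(x, n_components=50):
    """Project cells onto the top principal components via SVD."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)                  # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, s.size)
    embedding = u[:, :k] * s[:k]                   # cell coordinates in PC space
    explained = s[:k] ** 2 / (s ** 2).sum()        # variance ratio per component
    return embedding, explained

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))                       # 20 cells, 5 genes
embedding, explained = pca_embed(x, n_components=3)
```

In Scanpy, Harmony is exposed as `sc.external.pp.harmony_integrate(adata, key="dataset_source")`, which by default writes to `X_pca_harmony`; storing the corrected result under `X_pca`, as described above, would require setting `adjusted_basis="X_pca"`. Whether the workflow calls this wrapper is an assumption.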
scVI (Single-cell Variational Inference)
Features
- De novo training: When classification enabled, trains new model on combined data
- Pre-trained inference: Standard inference when classification disabled
- Batch correction: Built-in handling of batch effects via dataset_source
- Training epochs: 400 epochs for de novo training to ensure proper integration
- Classification compatible: Supports label transfer from reference dataset
Processing Steps
With Classification Enabled:
- De Novo Training: Trains scVI model on combined datasets
  - Uses dataset_source as batch column
  - Runs for 400 epochs
  - Produces better integration quality for label transfer
Without Classification:
- Pre-trained Inference: Uses standard scVI inference
- Leverages Census pre-trained models
- Faster processing time
TranscriptFormer
An LLM adapted for cross-species single-cell transcriptomics. Learn more about TranscriptFormer
- Multi-dataset support: Handles combined datasets in unified embedding space
- Batch awareness: Integrates datasets through model architecture
Label Transfer (Classification)
When classification is enabled, the workflow transfers cell type labels from the reference dataset to the target dataset.
Requirements
- Reference dataset: Must contain a cell_type column in metadata
- Two datasets: Exactly two files required
- Model compatibility: Works with all models (PCA, scVI, TranscriptFormer)
Process
Combined Dataset with Integrated Embeddings
↓
1. Identify reference dataset as dataset_2 (second file uploaded)
2. Extract cell_type labels from reference metadata
3. Generate integrated embeddings using selected model
4. Apply KNN classifier in embedding space
5. Transfer labels from reference to target cells
6. Calculate confidence scores for predictions
Embedding Selection
The workflow automatically selects the appropriate embedding based on your model:
- PCA: Uses X_pca (Harmony-corrected embeddings)
- scVI: Uses scvi embeddings
- TranscriptFormer: Uses transcriptformer embeddings
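Steps 4 to 6 of the process, applied to whichever embedding is selected, can be sketched as a k-nearest-neighbor vote. This is an illustration under stated assumptions: the function name `knn_transfer`, the value of k, the Euclidean metric, and the definition of confidence as the winning vote fraction are not taken from the workflow.

```python
from collections import Counter

import numpy as np

# Embedding key per model, as listed above.
EMBEDDING_KEYS = {
    "PCA": "X_pca",
    "scVI": "scvi",
    "TranscriptFormer": "transcriptformer",
}

def knn_transfer(embedding, dataset_source, ref_labels, k=15):
    """Predict target labels from the k nearest reference cells."""
    embedding = np.asarray(embedding, dtype=float)
    dataset_source = np.asarray(dataset_source)
    ref_labels = np.asarray(ref_labels)

    ref = embedding[dataset_source == "dataset_2"]      # reference cells
    target = embedding[dataset_source == "dataset_1"]   # cells to annotate
    k = min(k, len(ref))

    # Pairwise Euclidean distances, shape (n_target, n_reference).
    dists = np.linalg.norm(target[:, None, :] - ref[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]

    predictions, confidences = [], []
    for idx in nearest:
        label, votes = Counter(ref_labels[idx].tolist()).most_common(1)[0]
        predictions.append(label)
        # Assumption: confidence is the fraction of neighbors that agree.
        confidences.append(votes / k)
    return predictions, confidences

embedding = np.array([
    [0.2, 0.2], [10.5, 10.5],                  # two target cells
    [0, 0], [0, 1], [1, 0],                    # reference T cells
    [10, 10], [10, 11], [11, 10],              # reference B cells
])
source = ["dataset_1"] * 2 + ["dataset_2"] * 6
labels = ["T cell"] * 3 + ["B cell"] * 3
predictions, confidences = knn_transfer(embedding, source, labels, k=3)
```

In practice `embedding` would be read from the `.obsm` entry named by `EMBEDDING_KEYS` for the chosen model, and `ref_labels` from the reference cells' cell_type column.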
Classification Output
- Predicted cell type labels for each cell in target dataset
- Confidence scores indicating prediction reliability
- Reference labels preserved for reference dataset cells
- Integration quality metrics showing how well datasets integrated
Note: Currently requires cell_type column in reference data. Future versions will support dynamic column selection.
Use Cases
Experimental Condition Comparison
Compare cells from different experimental conditions (e.g., treated vs control) in a unified embedding space.
Time Series Integration
Integrate data from multiple time points to track cellular changes over time.
Cross-Study Integration
Combine datasets from different studies or labs while maintaining batch correction.
Reference-Guided Annotation
Transfer well-annotated cell type labels from a reference dataset to an unannotated query dataset.
Expected Results
Processing Time
- PCA with Harmony: 5-10 minutes for typical datasets
- scVI (pre-trained): 5-10 minutes depending on dataset size
- scVI (de novo training): 15-30 minutes for combined datasets, uses GPU acceleration
- TranscriptFormer: ~1-2 hours, requires GPU
Output Files
Your completed co-embed workflow provides:
Processed H5AD File
- Combined data: Both datasets integrated in single file
- Dataset tracking: dataset_source column identifies cell origin
- Quality metrics: Cell and gene filtering statistics per dataset
- Integrated embeddings: Model-specific embeddings in .obsm
- UMAP coordinates: 2D visualization coordinates for combined data
- Transferred labels: Cell type predictions for target dataset (if enabled)
- Reference labels: Original cell type labels from reference dataset
CellxGene Explorer Link
- Interactive visualization: Explore combined datasets in web browser
- Dataset origin: Color by dataset_source to distinguish datasets
- Gene expression: View expression patterns across both datasets
- Embeddings: Navigate integrated PCA/scVI/TranscriptFormer-derived UMAPs
- Cell metadata: Color by dataset source, transferred cell types, quality metrics, etc.
- Gene search: Find and highlight specific genes across datasets
- Comparative analysis: Compare cell type distributions between datasets