Co-Embed Workflow

Last Updated: January 15, 2026

The Co-Embed Workflow integrates multiple datasets into a unified embedding space, enabling comparative analysis across different data sources.

Overview

The Co-Embed Workflow combines two datasets through a complete integration pipeline:

Data Combination - Outer-join datasets with automatic source tracking
Quality Control - Apply cell and gene filtering to the combined dataset
Normalization - Standardize gene expression across datasets
Feature Selection - Identify highly variable genes in combined data
Dimensionality Reduction - Generate integrated embeddings (PCA/scVI/TranscriptFormer)
Batch Correction - Apply integration methods (Harmony for PCA, de novo training for scVI)
Label Transfer - Transfer cell type labels from reference to target (optional)
Visualization - Create UMAP coordinates for 2D plotting
Export - Convert results to CellxGene Explorer format

Requirements

Two H5AD files - Exactly two datasets required for co-embedding
Compatible gene IDs - Both datasets must use the same gene identifier format (for example, both Ensembl IDs or both gene symbols)
Reference dataset - For classification, reference must contain cell_type column
Model compatibility - All models (PCA, scVI, TranscriptFormer) support co-embed
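
The requirements above can be checked before a run is submitted. The following is a minimal sketch, not the workflow's own validation code: the function name validate_inputs, the dictionary layout, and the use of "at least one shared gene identifier" as a proxy for compatible gene IDs are all assumptions made for illustration.

```python
def validate_inputs(datasets, classification_enabled=False):
    """Return a list of problems; an empty list means the inputs look usable.

    Each dataset is a dict with 'genes' (list of gene IDs) and
    'obs_columns' (list of metadata column names). This layout is hypothetical.
    """
    if len(datasets) != 2:
        return ["exactly two datasets are required"]

    target, reference = datasets  # the second file is treated as the reference
    errors = []
    # Assumed proxy for "compatible gene IDs": the gene sets must overlap.
    if not set(target["genes"]) & set(reference["genes"]):
        errors.append("datasets share no gene identifiers")
    # Label transfer needs cell_type in the reference metadata.
    if classification_enabled and "cell_type" not in reference["obs_columns"]:
        errors.append("reference dataset (dataset_2) has no cell_type column")
    return errors


target = {"genes": ["ENSG1", "ENSG2"], "obs_columns": ["sample"]}
reference = {"genes": ["ENSG2", "ENSG3"], "obs_columns": ["sample"]}
problems = validate_inputs([target, reference], classification_enabled=True)
```

In this example the reference lacks a cell_type column, so problems contains one message; with classification disabled the same inputs pass.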

Computational Steps

1. Data Loading and Combination

Dataset 1 (Target) + Dataset 2 (Reference)
↓
- Load and validate both H5AD files
- Perform outer-join on gene space
- Create dataset_source column to track cell origin
- Label datasets as "dataset_1" and "dataset_2"
- Align metadata structures

Note: The second file uploaded becomes the reference dataset (dataset_2) for label transfer operations.
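
To make the outer join concrete, here is a small pure-Python sketch of combining two datasets on the union of their genes while recording where each cell came from. The dictionary layout and the function name outer_join are hypothetical, and filling genes that a dataset does not measure with 0 is an assumption; the workflow's actual fill value is not stated here.

```python
def outer_join(dataset_1, dataset_2):
    """Combine two datasets on the union of their genes.

    Each dataset is {"genes": [...], "cells": [[counts per gene], ...]}.
    """
    # Union of genes, keeping dataset_1's order first, then genes new in dataset_2.
    genes = list(dict.fromkeys(dataset_1["genes"] + dataset_2["genes"]))

    combined_cells = []
    dataset_source = []
    for label, ds in (("dataset_1", dataset_1), ("dataset_2", dataset_2)):
        column = {gene: j for j, gene in enumerate(ds["genes"])}
        for row in ds["cells"]:
            # Assumption: genes absent from this dataset are filled with 0.
            combined_cells.append([row[column[g]] if g in column else 0 for g in genes])
            dataset_source.append(label)

    return {"genes": genes, "cells": combined_cells, "dataset_source": dataset_source}


target = {"genes": ["A", "B"], "cells": [[1, 2], [3, 4]]}
reference = {"genes": ["B", "C"], "cells": [[5, 6]]}
combined = outer_join(target, reference)
```

The combined gene space is A, B, C, and dataset_source has one entry per cell, so every later step can still tell target cells from reference cells.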

2. Quality Control (Automatic)

Combined Raw Data
↓
- Filter cells: minimum 200 genes per cell
- Filter genes: minimum 3 cells per gene
- Remove high-mitochondrial cells (more than 20% of counts from MT genes)
- Calculate quality metrics per dataset source

Quality control is applied to the combined dataset, maintaining cell origin tracking throughout.
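
The three filters can be sketched as follows. This is an illustration under stated assumptions rather than the workflow's implementation: the function name quality_control is hypothetical, mitochondrial genes are assumed to be identifiable by an "MT-" symbol prefix, the MT threshold is assumed to be a fraction of counts, and the filters are applied in the order listed above.

```python
def quality_control(cells, genes, min_genes=200, min_cells=3, max_mt_fraction=0.20):
    """Return (filtered matrix, kept gene names, kept cell indices)."""
    # Step 1: keep cells with at least min_genes detected genes.
    kept = [i for i, row in enumerate(cells) if sum(c > 0 for c in row) >= min_genes]

    # Step 2: keep genes detected in at least min_cells of the remaining cells.
    cols = [j for j in range(len(genes))
            if sum(cells[i][j] > 0 for i in kept) >= min_cells]

    # Step 3: drop cells whose mitochondrial count fraction exceeds the threshold.
    # Assumption: mitochondrial genes carry an "MT-" prefix.
    mt_cols = [j for j in cols if genes[j].upper().startswith("MT-")]
    final = []
    for i in kept:
        total = sum(cells[i][j] for j in cols)
        mt = sum(cells[i][j] for j in mt_cols)
        if total > 0 and mt / total <= max_mt_fraction:
            final.append(i)

    matrix = [[cells[i][j] for j in cols] for i in final]
    return matrix, [genes[j] for j in cols], final


# Toy data with small thresholds so every filter removes something.
genes = ["MT-CO1", "G1", "G2", "G3", "G4"]
cells = [
    [0, 5, 3, 2, 1],  # passes all filters
    [9, 1, 0, 0, 0],  # 90% mitochondrial counts
    [1, 4, 4, 1, 0],  # passes all filters
    [0, 0, 0, 7, 0],  # only one detected gene
]
matrix, kept_genes, kept_cells = quality_control(
    cells, genes, min_genes=2, min_cells=2, max_mt_fraction=0.20
)
```

Because the function returns the indices of surviving cells, the dataset_source label of each cell can be carried through filtering by indexing with kept_cells.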

3. Normalization and Preprocessing

Combined Filtered Data
↓
- Detect data type (raw counts vs normalized)
- Normalize to 10,000 counts per cell (if raw)
- Log-transform (log1p)
- Store raw counts in layers['counts']
- Preserve dataset_source tracking
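
As a rough illustration of the normalization step, the sketch below scales each cell to 10,000 total counts and applies log1p, keeping a copy of the raw counts first. The helper names are hypothetical, and the raw-count check (all values are non-negative integers) is an assumed heuristic; how the workflow detects the data type is not described here. Only the raw-count branch is shown.

```python
import math


def looks_like_raw_counts(cells):
    # Assumed heuristic: raw counts are non-negative whole numbers.
    return all(v >= 0 and float(v).is_integer() for row in cells for v in row)


def normalize_log1p(cells, target_sum=10_000):
    """Scale each cell to target_sum total counts, then apply log(1 + x)."""
    normalized = []
    for row in cells:
        total = sum(row)
        scale = target_sum / total if total else 0.0
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized


raw = [[1, 3], [0, 5]]
layers = {}
if looks_like_raw_counts(raw):
    layers["counts"] = [list(row) for row in raw]  # keep raw counts before transforming
    normalized = normalize_log1p(raw)
```

Undoing the log transform on any normalized cell (math.expm1) gives values that sum to 10,000, which is a quick sanity check on the scaling.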

4. Feature Selection

Combined Normalized Data
↓
- Identify highly variable genes across both datasets
- Apply batch-aware selection using dataset_source
- Use Seurat v3 method for feature selection
- Subset data to highly variable genes (top 3,000)
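
The idea behind batch-aware selection is that a gene should be variable within each dataset, not merely different between them. The sketch below illustrates that idea with a deliberately simplified stand-in: it ranks genes by plain variance inside each dataset_source group and orders them by median rank. It is not the Seurat v3 method, which uses a variance-stabilizing transformation, and the function name is hypothetical.

```python
from statistics import median, pvariance


def batch_aware_top_genes(cells, genes, dataset_source, n_top=3000):
    """Pick genes that rank as highly variable within each batch (simplified)."""
    ranks = {gene: [] for gene in genes}
    for batch in sorted(set(dataset_source)):
        rows = [row for row, src in zip(cells, dataset_source) if src == batch]
        # Variance of each gene computed within this batch only.
        variances = [pvariance([row[j] for row in rows]) for j in range(len(genes))]
        order = sorted(range(len(genes)), key=lambda j: -variances[j])
        for rank, j in enumerate(order):
            ranks[genes[j]].append(rank)
    # Lower median rank means more consistently variable across batches.
    return sorted(genes, key=lambda g: median(ranks[g]))[:n_top]


genes = ["G1", "G2", "G3"]
cells = [[1, 0, 5], [1, 9, 5], [2, 0, 5], [2, 8, 5]]
dataset_source = ["dataset_1", "dataset_1", "dataset_2", "dataset_2"]
top = batch_aware_top_genes(cells, genes, dataset_source, n_top=1)
```

Here G1 differs between the two datasets but is constant within each, so it is not selected; G2 varies within both datasets and ranks first.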

Model Options

PCA (Principal Component Analysis)

Features

Linear method: Captures major sources of variation across datasets
Harmony integration: Automatic batch correction for dataset integration
Batch column: Uses the dataset_source column created automatically during the outer join
Speed: Fastest processing time for multi-dataset analysis
Output: Integrated embeddings in X_pca after Harmony correction

Processing Steps

PCA Step: Performs principal component analysis on combined dataset
- Uses dataset_source as batch column
- Generates initial PCA embeddings
Harmony Integration: Applies Harmony batch correction
- Corrects for batch effects between datasets
- Outputs integrated embeddings in X_pca

scVI (Single-cell Variational Inference)

Features

De novo training: When classification enabled, trains new model on combined data
Pre-trained inference: Standard inference when classification disabled
Batch correction: Built-in handling of batch effects via dataset_source
Training epochs: 400 epochs for de novo training to ensure proper integration
Classification compatible: Supports label transfer from reference dataset

Processing Steps

With Classification Enabled:

De Novo Training: Trains scVI model on combined datasets
- Uses dataset_source as batch column
- Runs for 400 epochs
- Produces better integration quality for label transfer than pre-trained inference

Without Classification:

Pre-trained Inference: Uses standard scVI inference
- Leverages Census pre-trained models
- Faster processing time
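
The choice between the two scVI paths depends only on whether classification is enabled. A minimal sketch of that decision follows; the function name scvi_plan and the dictionary keys are hypothetical, and only the batch column and epoch count come from the description above.

```python
def scvi_plan(classification_enabled):
    """Describe which scVI path runs for a given classification setting."""
    if classification_enabled:
        # De novo training on the combined data, batch-corrected by dataset_source.
        return {
            "mode": "de_novo_training",
            "batch_key": "dataset_source",
            "max_epochs": 400,
        }
    # Otherwise run inference with a pre-trained model; no training epochs apply.
    return {"mode": "pretrained_inference", "max_epochs": None}


with_labels = scvi_plan(classification_enabled=True)
without_labels = scvi_plan(classification_enabled=False)
```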

TranscriptFormer

A generative foundation model for cross-species single-cell transcriptomics.

Multi-dataset support: Handles combined datasets in unified embedding space
Batch awareness: Integrates datasets through model architecture

Label Transfer (Classification)

When classification is enabled, the workflow transfers cell type labels from the reference dataset to the target dataset.

Requirements

Reference dataset: Must contain cell_type column in metadata
Two datasets: Exactly two files required
Model compatibility: Works with all models (PCA, scVI, TranscriptFormer)

Process

Combined Dataset with Integrated Embeddings
↓
1. Identify reference dataset as dataset_2 (second file uploaded)
2. Extract cell_type labels from reference metadata
3. Generate integrated embeddings using selected model
4. Apply KNN classifier in embedding space
5. Transfer labels from reference to target cells
6. Calculate confidence scores for predictions
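
Steps 4 to 6 can be illustrated with a small KNN classifier. This is a sketch under assumptions, not the workflow's classifier: the value of k, the Euclidean distance, and the definition of confidence as the fraction of neighbors that agree with the winning label are all chosen here for illustration.

```python
import math
from collections import Counter


def knn_label_transfer(ref_embeddings, ref_labels, target_embeddings, k=3):
    """Return one (predicted label, confidence) pair per target cell."""
    predictions = []
    for query in target_embeddings:
        # Indices of the k reference cells closest to this target cell.
        nearest = sorted(
            range(len(ref_embeddings)),
            key=lambda i: math.dist(query, ref_embeddings[i]),
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        label, count = votes.most_common(1)[0]
        # Assumed confidence: share of neighbors voting for the winning label.
        predictions.append((label, count / len(nearest)))
    return predictions


# Reference cells (dataset_2) with known cell_type labels.
ref_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]
# Target cells (dataset_1) in the same integrated embedding space.
target_embeddings = [(0.05, 0.05), (4.9, 5.0)]

predictions = knn_label_transfer(ref_embeddings, ref_labels, target_embeddings, k=3)
```

The first target cell sits inside the T cell cluster and gets a unanimous vote. The second is nearest to the two B cells, but its third neighbor is a T cell, so its confidence is lower. This is why integration quality matters: the classifier is only as good as the shared embedding it searches.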

Embedding Selection

The workflow automatically selects the appropriate embedding based on your model:

PCA: Uses X_pca (Harmony-corrected embeddings)
scVI: Uses scvi embeddings
TranscriptFormer: Uses transcriptformer embeddings
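
The mapping above amounts to a simple lookup. In the sketch below, X_pca is the key named in this document; the exact keys "scvi" and "transcriptformer" are assumed from the wording above, and the function name is hypothetical.

```python
# Model name -> embedding key used for label transfer.
# "scvi" and "transcriptformer" are assumed key names.
EMBEDDING_KEYS = {
    "PCA": "X_pca",  # Harmony-corrected embeddings
    "scVI": "scvi",
    "TranscriptFormer": "transcriptformer",
}


def embedding_key_for(model):
    """Return the embedding key for a model, or raise ValueError if unknown."""
    try:
        return EMBEDDING_KEYS[model]
    except KeyError:
        supported = ", ".join(EMBEDDING_KEYS)
        raise ValueError(f"unsupported model {model!r}; expected one of: {supported}") from None


key = embedding_key_for("PCA")
```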

Classification Output

Predicted cell type labels for each cell in target dataset
Confidence scores indicating prediction reliability
Reference labels preserved for reference dataset cells
Integration quality metrics showing how well datasets integrated

Note: Label transfer currently requires a cell_type column in the reference data. Future versions will support dynamic column selection.

Use Cases

Experimental Condition Comparison

Compare cells from different experimental conditions (e.g., treated vs control) in a unified embedding space.

Time Series Integration

Integrate data from multiple time points to track cellular changes over time.

Cross-Study Integration

Combine datasets from different studies or labs while maintaining batch correction.

Reference-Guided Annotation

Transfer well-annotated cell type labels from a reference dataset to an unannotated query dataset.

Expected Results

Processing Time

PCA with Harmony: 5-10 minutes for typical datasets
scVI (pre-trained): 5-10 minutes depending on dataset size
scVI (de novo training): 15-30 minutes for combined datasets, uses GPU acceleration
TranscriptFormer: ~1-2 hours, requires GPU

Output Files

Your completed co-embed workflow provides:

Processed H5AD File

Combined data: Both datasets integrated in single file
Dataset tracking: dataset_source column identifies cell origin
Quality metrics: Cell and gene filtering statistics per dataset
Integrated embeddings: Model-specific embeddings in .obsm
UMAP coordinates: 2D visualization coordinates for combined data
Transferred labels: Cell type predictions for target dataset (if enabled)
Reference labels: Original cell type labels from reference dataset

CellxGene Explorer Link

Interactive visualization: Explore combined datasets in web browser
Dataset origin: Color by dataset_source to distinguish datasets
Gene expression: View expression patterns across both datasets
Embeddings: Navigate integrated PCA/scVI/TranscriptFormer-derived UMAPs
Cell metadata: Color by dataset source, transferred cell types, quality metrics, etc.
Gene search: Find and highlight specific genes across datasets
Comparative analysis: Compare cell type distributions between datasets