Co-Embed Workflow

Last Updated: January 15, 2026

The Co-Embed Workflow integrates multiple datasets into a unified embedding space, enabling comparative analysis across different data sources.

Overview

The Co-Embed Workflow combines two datasets through a complete integration pipeline:

  1. Data Combination - Outer-join datasets with automatic source tracking
  2. Quality Control - Apply filtering to combined dataset
  3. Normalization - Standardize gene expression across datasets
  4. Feature Selection - Identify highly variable genes in combined data
  5. Dimensionality Reduction - Generate integrated embeddings (PCA/scVI/TranscriptFormer)
  6. Batch Correction - Apply integration methods (Harmony for PCA, de novo training for scVI)
  7. Label Transfer - Transfer cell type labels from reference to target (optional)
  8. Visualization - Create UMAP coordinates for 2D plotting
  9. Export - Convert results to CellxGene Explorer format

Requirements

  • Two H5AD files - Exactly two datasets required for co-embedding
  • Compatible gene IDs - Both datasets must use a compatible gene identifier format
  • Reference dataset - For classification, the reference dataset must contain a cell_type column
  • Model compatibility - All models (PCA, scVI, TranscriptFormer) support co-embed

Computational Steps

1. Data Loading and Combination

Dataset 1 (Target) + Dataset 2 (Reference)
- Load and validate both H5AD files
- Perform outer-join on gene space
- Create dataset_source column to track cell origin
- Label datasets as "dataset_1" and "dataset_2"
- Align metadata structures

Note: The second file uploaded becomes the reference dataset (dataset_2) for label transfer operations.
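
The combination step can be pictured with a small, dependency-free sketch. The function name `outer_join_datasets` and the dict-of-dicts layout are hypothetical and chosen only for illustration; filling genes that are absent from one dataset with zero is also an assumption, since the fill value is not stated here.

```python
def outer_join_datasets(target, reference):
    """Combine two {cell_id: {gene: count}} mappings on the union of genes.

    Hypothetical helper for illustration only. Returns (genes, rows), where
    each row records its origin in a "dataset_source" field.
    """
    datasets = {"dataset_1": target, "dataset_2": reference}

    # Outer join: keep every gene seen in either dataset.
    genes = sorted({gene
                    for cells in datasets.values()
                    for counts in cells.values()
                    for gene in counts})

    rows = []
    for source, cells in datasets.items():
        for cell_id, counts in cells.items():
            rows.append({
                # Suffix the source so cell IDs stay unique after merging.
                "cell_id": f"{cell_id}-{source}",
                "dataset_source": source,
                # Assumption: genes missing from a dataset are filled with 0.
                "counts": [counts.get(gene, 0) for gene in genes],
            })
    return genes, rows


# Toy data: one cell per dataset, partially overlapping gene sets.
target = {"c1": {"CD3E": 5, "MS4A1": 0}}
reference = {"c1": {"CD3E": 2, "NKG7": 7}}
genes, rows = outer_join_datasets(target, reference)
```

In AnnData-based pipelines a similar result is commonly obtained with `anndata.concat(..., join="outer", label="dataset_source", keys=["dataset_1", "dataset_2"])`; whether this workflow uses that call internally is not stated.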

2. Quality Control (Automatic)

Combined Raw Data
- Filter cells: minimum 200 genes per cell
- Filter genes: minimum 3 cells per gene
- Remove high-mitochondrial cells (more than 20% of counts from mitochondrial genes)
- Calculate quality metrics per dataset source

Quality control is applied to the combined dataset, maintaining cell origin tracking throughout.
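
A minimal sketch of these filters on a plain list-of-rows representation follows. The default thresholds are the ones listed above; the function name, the "MT-" prefix test for mitochondrial genes, and the order of operations (cells first, then genes) are assumptions made for illustration.

```python
def quality_control(genes, rows, min_genes=200, min_cells=3, max_mt_frac=0.20):
    """Filter cells, then genes, and count surviving cells per dataset source.

    Each row is a dict with "dataset_source" and a "counts" list aligned to
    `genes`. Hypothetical helper, not the workflow's own code.
    """
    # Assumption: mitochondrial genes are recognised by an "MT-" prefix.
    mt_idx = [i for i, g in enumerate(genes) if g.upper().startswith("MT-")]

    kept_rows = []
    for row in rows:
        counts = row["counts"]
        total = sum(counts)
        n_genes = sum(1 for c in counts if c > 0)
        mt_frac = sum(counts[i] for i in mt_idx) / total if total else 0.0
        if n_genes >= min_genes and mt_frac <= max_mt_frac:
            kept_rows.append(row)

    # Keep genes detected in at least `min_cells` of the remaining cells.
    keep_gene = [
        sum(1 for row in kept_rows if row["counts"][j] > 0) >= min_cells
        for j in range(len(genes))
    ]
    kept_genes = [g for g, keep in zip(genes, keep_gene) if keep]
    filtered = [
        {**row, "counts": [c for c, keep in zip(row["counts"], keep_gene) if keep]}
        for row in kept_rows
    ]

    # Origin tracking survives filtering, so metrics can be reported per source.
    cells_per_source = {}
    for row in filtered:
        source = row["dataset_source"]
        cells_per_source[source] = cells_per_source.get(source, 0) + 1
    return kept_genes, filtered, cells_per_source


# Toy data with lowered thresholds so the effect is visible on four cells.
genes = ["A", "B", "MT-CO1"]
rows = [
    {"dataset_source": "dataset_1", "counts": [5, 3, 1]},  # passes
    {"dataset_source": "dataset_1", "counts": [1, 0, 9]},  # 90% mitochondrial
    {"dataset_source": "dataset_2", "counts": [4, 0, 0]},  # only one gene detected
    {"dataset_source": "dataset_2", "counts": [2, 6, 0]},  # passes
]
kept_genes, filtered, cells_per_source = quality_control(
    genes, rows, min_genes=2, min_cells=2
)
```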

3. Normalization and Preprocessing

Combined Filtered Data
- Detect data type (raw counts vs normalized)
- Normalize to 10,000 counts per cell (if raw)
- Log-transform (log1p)
- Store raw counts in layers['counts']
- Preserve dataset_source tracking
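
The preprocessing logic can be sketched as below. The raw-count detection heuristic (all values are non-negative whole numbers) is an assumption, as is leaving already-normalized input untouched; the helper names are hypothetical.

```python
import math


def looks_like_raw_counts(matrix):
    """Heuristic (an assumption): raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def normalize_log1p(matrix, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log1p."""
    normalized = []
    for row in matrix:
        total = sum(row)
        scale = target_sum / total if total else 0.0  # empty cells stay at zero
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized


def preprocess(matrix):
    """Return (X, layers); raw counts are copied to layers["counts"] first."""
    layers = {}
    if looks_like_raw_counts(matrix):
        layers["counts"] = [list(row) for row in matrix]
        return normalize_log1p(matrix), layers
    # Assumption: input that is already normalized is passed through unchanged.
    return matrix, layers


x, layers = preprocess([[1, 3], [0, 0]])
```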

4. Feature Selection

Combined Normalized Data
- Identify highly variable genes across both datasets
- Apply batch-aware selection using dataset_source
- Use Seurat v3 method for feature selection
- Subset data to highly variable genes (top 3,000)
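
To illustrate what "batch-aware" selection means, the sketch below ranks genes by variance within each dataset and keeps the genes with the best combined rank. This is a deliberately simplified stand-in and not the Seurat v3 method, which fits a mean-variance trend before ranking; the function name is hypothetical.

```python
from statistics import pvariance


def batch_aware_top_genes(genes, matrix, batches, n_top=3000):
    """Pick genes that are consistently variable within every batch.

    Simplified illustration: plain per-batch variance ranks are summed,
    whereas the Seurat v3 method uses standardized variances.
    """
    rank_sum = [0] * len(genes)
    for batch in sorted(set(batches)):
        rows = [row for row, b in zip(matrix, batches) if b == batch]
        variances = [pvariance([row[j] for row in rows]) for j in range(len(genes))]
        # Rank 0 is the most variable gene within this batch.
        order = sorted(range(len(genes)), key=lambda j: -variances[j])
        for rank, j in enumerate(order):
            rank_sum[j] += rank

    best = sorted(range(len(genes)), key=lambda j: rank_sum[j])[:n_top]
    return [genes[j] for j in sorted(best)]  # preserve original gene order


genes = ["G1", "G2", "G3"]
matrix = [[0, 5, 1], [9, 5, 3], [1, 5, 2], [8, 5, 1]]
batches = ["dataset_1", "dataset_1", "dataset_2", "dataset_2"]
selected = batch_aware_top_genes(genes, matrix, batches, n_top=2)
```

With Scanpy, the settings listed above would typically map to `sc.pp.highly_variable_genes` with `flavor="seurat_v3"`, `n_top_genes=3000`, and `batch_key="dataset_source"`; the exact call used by the workflow is not stated here.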

Model Options

PCA (Principal Component Analysis)

Features

  • Linear method: Captures major sources of variation across datasets
  • Harmony integration: Automatic batch correction for dataset integration
  • Batch column: Uses dataset_source automatically created from outer-join
  • Speed: Fastest processing time for multi-dataset analysis
  • Output: Integrated embeddings in X_pca after Harmony correction

Processing Steps

  1. PCA Step: Performs principal component analysis on combined dataset
    • Uses dataset_source as batch column
    • Generates initial PCA embeddings
  2. Harmony Integration: Applies Harmony batch correction
    • Corrects for batch effects between datasets
    • Outputs integrated embeddings in X_pca

scVI (Single-cell Variational Inference)

Features

  • De novo training: When classification is enabled, trains a new model on the combined data
  • Pre-trained inference: Runs standard inference when classification is disabled
  • Batch correction: Built-in handling of batch effects via dataset_source
  • Training epochs: 400 epochs for de novo training to ensure proper integration
  • Classification compatible: Supports label transfer from reference dataset

Processing Steps

With Classification Enabled:

  1. De Novo Training: Trains scVI model on combined datasets
    • Uses dataset_source as batch column
    • Runs for 400 epochs
    • Produces better integration quality for label transfer

Without Classification:

  1. Pre-trained Inference: Uses standard scVI inference
    • Leverages Census pre-trained models
    • Faster processing time

TranscriptFormer

A generative foundation model for cross-species single-cell transcriptomics.

  • Multi-dataset support: Handles combined datasets in unified embedding space
  • Batch awareness: Integrates datasets through model architecture

Label Transfer (Classification)

When classification is enabled, the workflow transfers cell type labels from the reference dataset to the target dataset.

Requirements

  • Reference dataset: Must contain cell_type column in metadata
  • Two datasets: Exactly two files required
  • Model compatibility: Works with all models (PCA, scVI, TranscriptFormer)

Process

Combined Dataset with Integrated Embeddings
1. Identify reference dataset as dataset_2 (second file uploaded)
2. Extract cell_type labels from reference metadata
3. Generate integrated embeddings using selected model
4. Apply KNN classifier in embedding space
5. Transfer labels from reference to target cells
6. Calculate confidence scores for predictions
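
Steps 4 to 6 can be sketched as a majority-vote KNN classifier. The value of k, the Euclidean metric, and defining confidence as the fraction of neighbors that agree with the winning label are all assumptions; this page does not specify them, and the function name is hypothetical.

```python
import math
from collections import Counter


def knn_label_transfer(ref_embeddings, ref_labels, target_embeddings, k=15):
    """Return a (label, confidence) pair for every target cell.

    Each target cell takes the majority label of its k nearest reference
    cells; confidence is the share of those neighbors voting for that label.
    """
    k = min(k, len(ref_embeddings))
    predictions = []
    for query in target_embeddings:
        # Indices of the k reference cells closest to this target cell.
        nearest = sorted(
            range(len(ref_embeddings)),
            key=lambda i: math.dist(query, ref_embeddings[i]),
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        label, count = votes.most_common(1)[0]
        predictions.append((label, count / k))
    return predictions


# Toy 2D embedding: three reference T cells near the origin, two B cells far away.
ref_embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]
target_embeddings = [(0.2, 0.1), (9.5, 10.2)]
predictions = knn_label_transfer(ref_embeddings, ref_labels, target_embeddings, k=3)
```

In this toy case the first target cell sits among the T cells, so all three neighbors agree. The second has two B-cell neighbors and one distant T-cell neighbor, which lowers its confidence.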

Embedding Selection

The workflow automatically selects the appropriate embedding based on your model:

  • PCA: Uses X_pca (Harmony-corrected embeddings)
  • scVI: Uses scvi embeddings
  • TranscriptFormer: Uses transcriptformer embeddings
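
The selection rule amounts to a lookup from model name to embedding key. In the sketch below only `X_pca` is named explicitly on this page as an output key; the `scvi` and `transcriptformer` key names are assumptions taken from the wording above and may differ from what the workflow actually writes.

```python
# Assumption: key names follow the identifiers used on this page.
EMBEDDING_KEYS = {
    "PCA": "X_pca",
    "scVI": "scvi",
    "TranscriptFormer": "transcriptformer",
}


def select_embedding(obsm, model):
    """Return (key, embedding) for the chosen model from an .obsm-like mapping."""
    try:
        key = EMBEDDING_KEYS[model]
    except KeyError:
        raise ValueError(f"Unsupported model: {model!r}") from None
    if key not in obsm:
        raise KeyError(f"Embedding {key!r} not found; available: {sorted(obsm)}")
    return key, obsm[key]


obsm = {"X_pca": [[0.1, 0.2]], "X_umap": [[1.0, 2.0]]}
key, embedding = select_embedding(obsm, "PCA")
```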

Classification Output

  • Predicted cell type labels for each cell in target dataset
  • Confidence scores indicating prediction reliability
  • Reference labels preserved for reference dataset cells
  • Integration quality metrics showing how well datasets integrated

Note: Label transfer currently requires a cell_type column in the reference data. Future versions will support dynamic column selection.

Use Cases

Experimental Condition Comparison

Compare cells from different experimental conditions (e.g., treated vs control) in a unified embedding space.

Time Series Integration

Integrate data from multiple time points to track cellular changes over time.

Cross-Study Integration

Combine datasets from different studies or labs while maintaining batch correction.

Reference-Guided Annotation

Transfer well-annotated cell type labels from a reference dataset to an unannotated query dataset.

Expected Results

Processing Time

  • PCA with Harmony: 5-10 minutes for typical datasets
  • scVI (pre-trained): 5-10 minutes depending on dataset size
  • scVI (de novo training): 15-30 minutes for combined datasets, uses GPU acceleration
  • TranscriptFormer: ~1-2 hours, requires GPU

Output Files

Your completed co-embed workflow provides:

Processed H5AD File

  • Combined data: Both datasets integrated in single file
  • Dataset tracking: dataset_source column identifies cell origin
  • Quality metrics: Cell and gene filtering statistics per dataset
  • Integrated embeddings: Model-specific embeddings in .obsm
  • UMAP coordinates: 2D visualization coordinates for combined data
  • Transferred labels: Cell type predictions for target dataset (if enabled)
  • Reference labels: Original cell type labels from reference dataset

CellxGene Explorer Visualization

  • Interactive visualization: Explore combined datasets in web browser
  • Dataset origin: Color by dataset_source to distinguish datasets
  • Gene expression: View expression patterns across both datasets
  • Embeddings: Navigate integrated PCA/scVI/TranscriptFormer-derived UMAPs
  • Cell metadata: Color by dataset source, transferred cell types, quality metrics, etc.
  • Gene search: Find and highlight specific genes across datasets
  • Comparative analysis: Compare cell type distributions between datasets
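
The same comparison of cell type distributions can be made offline from the exported cell metadata. The sketch below assumes each cell is available as a dict with `dataset_source` and a label column; using `cell_type` as the label column name for transferred labels is an assumption, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def cell_type_fractions(obs_rows, label_key="cell_type"):
    """Return {dataset_source: {label: fraction of that dataset's cells}}."""
    counts = defaultdict(Counter)
    for row in obs_rows:
        counts[row["dataset_source"]][row[label_key]] += 1
    return {
        source: {label: n / sum(labels.values()) for label, n in labels.items()}
        for source, labels in counts.items()
    }


obs_rows = [
    {"dataset_source": "dataset_1", "cell_type": "T cell"},
    {"dataset_source": "dataset_1", "cell_type": "T cell"},
    {"dataset_source": "dataset_1", "cell_type": "B cell"},
    {"dataset_source": "dataset_2", "cell_type": "T cell"},
    {"dataset_source": "dataset_2", "cell_type": "B cell"},
]
fractions = cell_type_fractions(obs_rows)
```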