Workflows

Last Updated: January 15, 2026

Overview

The AI Workspace offers three workflow types for analyzing single-cell RNA sequencing data. Each serves a different analytical need and has its own requirements:

Standard Workflow: Analyze a single dataset independently
Co-Embed Workflow: Integrate multiple datasets into a unified embedding space
Comparison Workflow: Compare your data against the CELLxGENE Census reference atlas

Standard Workflow

Analyze your dataset independently without reference to external data.

Purpose: Single-dataset analysis and exploration
Data Requirements: H5AD files with ENSEMBL gene IDs
Available Models: PCA, scVI, TranscriptFormer
Organism Support: Varies by model (see Data Requirements)

Co-Embed Workflow

Integrate multiple datasets into a unified embedding space for comparative analysis.

Purpose: Combine and analyze multiple datasets together
Data Requirements: Two H5AD files with compatible gene IDs
Available Models: PCA, scVI, TranscriptFormer
Integration: Automatic batch correction via Harmony (PCA) or de novo training (scVI)
Classification: De novo label transfer from reference dataset to target dataset

Comparison Workflow

Compare your data against the CELLxGENE Census reference atlas.

Purpose: Contextualize cells against comprehensive reference data
Data Requirements: H5AD files with raw integer counts; human or mouse data only
Available Models: scVI, TranscriptFormer only
Gene Requirements: ENSEMBL IDs (ENSG or ENSMUSG prefixes)
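
The gene ID requirement above can be checked before upload. The following is a minimal sketch, not the Workspace's own validation: the function name and the exact ID pattern (prefix plus 11 digits, with an optional version suffix) are assumptions for illustration. Matching the full pattern rather than only the prefix avoids accepting IDs from other species that merely begin with the same letters.

```python
import re

# Assumption: IDs follow the usual Ensembl layout of prefix + 11 digits,
# optionally followed by a ".<version>" suffix.
_PATTERNS = {
    "human": re.compile(r"^ENSG\d{11}(\.\d+)?$"),
    "mouse": re.compile(r"^ENSMUSG\d{11}(\.\d+)?$"),
}

def classify_gene_ids(gene_ids):
    """Count gene IDs per organism; anything unmatched is counted as 'other'."""
    counts = {"human": 0, "mouse": 0, "other": 0}
    for gene_id in gene_ids:
        for organism, pattern in _PATTERNS.items():
            if pattern.match(gene_id):
                counts[organism] += 1
                break
        else:
            # No pattern matched: gene symbol, other species, or malformed ID.
            counts["other"] += 1
    return counts

# Illustrative IDs only: one human-style, one mouse-style, a gene symbol,
# and an ID that starts with "ENSG" but does not fit the human pattern.
ids = ["ENSG00000141510", "ENSMUSG00000059552", "TP53", "ENSGALG00000010120"]
counts = classify_gene_ids(ids)
print(counts)
```

A dataset whose IDs fall mostly under "other" would likely need its gene IDs converted before it can be used with the Comparison workflow.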

Learn more about Comparison Workflow →

Co-Embed Workflow

The Co-Embed workflow integrates multiple datasets into a unified embedding space, enabling comparative analysis across different data sources.

Co-Embed runs only when you explicitly select the "co-embed" workflow type.

Processing Steps

The workflow combines datasets through an outer-join operation that creates a dataset_source column to track the origin of each cell. The integration approach varies by model:

PCA with Co-Embed

PCA Step: Performs principal component analysis on the combined dataset
- Uses dataset_source as the batch column (automatically created from outer-join)
Harmony Integration: Applies Harmony batch correction to integrate the datasets
- Corrects for batch effects between different data sources
- Uses dataset_source as the batch column
- Outputs integrated embeddings in X_pca
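
Both PCA and Harmony operate on the outer-joined dataset and its dataset_source column. The following is a minimal, dependency-free sketch of such a join, not the Workspace's implementation: the function name, the in-memory layout (gene list plus rows of counts), and filling missing genes with 0 are all assumptions made for illustration.

```python
def outer_join(datasets):
    """Combine {name: (genes, rows)} datasets over the union of their genes.

    Returns (all_genes, combined_rows, dataset_source), where dataset_source
    has one entry per cell naming the dataset that cell came from.
    """
    # Union of genes across datasets, keeping first-seen order.
    all_genes = list(
        dict.fromkeys(g for genes, _ in datasets.values() for g in genes)
    )

    combined_rows, dataset_source = [], []
    for name, (genes, rows) in datasets.items():
        column_of = {gene: i for i, gene in enumerate(genes)}
        for row in rows:
            # Assumption: genes missing from a dataset are filled with 0.
            combined_rows.append(
                [row[column_of[g]] if g in column_of else 0 for g in all_genes]
            )
            dataset_source.append(name)
    return all_genes, combined_rows, dataset_source

# Two toy datasets that share only one gene.
datasets = {
    "dataset_1": (["ENSG00000141510", "ENSG00000012048"], [[3, 0], [1, 2]]),
    "dataset_2": (["ENSG00000012048", "ENSG00000139618"], [[5, 4]]),
}
genes, rows, source = outer_join(datasets)
print(genes)
print(rows)
print(source)
```

Because every cell keeps its origin in dataset_source, later steps can use that one column as the batch key without any extra bookkeeping.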

scVI with Co-Embed

De Novo Training: When classification is also enabled, the workflow uses de novo scVI training instead of pre-trained inference
- Trains a new model on the combined datasets
- Uses dataset_source as the batch column
- Trains for 400 epochs
- Aims for better integration quality for label transfer than pre-trained inference provides
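
The choice between de novo training and pre-trained inference can be sketched as a small decision function. This is an illustration only: the function name and the dictionary keys are hypothetical, and the fallback to pre-trained inference when classification is disabled is an assumption inferred from the description above.

```python
def plan_scvi_run(workflow, classification_enabled):
    """Return a hypothetical run plan for the scVI model."""
    if workflow == "co-embed" and classification_enabled:
        # De novo training on the combined datasets, as described above.
        return {
            "mode": "de_novo",
            "batch_key": "dataset_source",
            "max_epochs": 400,
        }
    # Assumption: every other case uses pre-trained inference.
    return {"mode": "pretrained_inference"}

print(plan_scvi_run("co-embed", classification_enabled=True))
print(plan_scvi_run("standard", classification_enabled=True))
```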

Use Cases

Compare cells from different experimental conditions
Integrate data from multiple time points
Combine datasets from different studies or labs
Transfer labels from a reference dataset to a query dataset

Cell Type Classification

Automated cell type annotation assigns cell type labels to your data using either a pre-trained Census model or de novo label transfer from a reference dataset.

Classification Methods

The system uses one of two approaches, depending on your workflow configuration:

Standard Classification (Pre-trained Census Model)

Available for Standard and Comparison workflows when using scVI or TranscriptFormer models.

Method: Uses pre-trained models from the CELLxGENE Census
Requirements:
- scVI or TranscriptFormer models only (PCA not supported)
- Human or mouse data
Process:
1. Generates embeddings using your selected model
2. Applies a classifier pre-trained on Census reference data
3. Transfers cell type labels with confidence scores
Embedding Keys:
- scVI: Uses scvi embeddings
- TranscriptFormer: Uses transcriptformer embeddings

Co-Embed Classification (De Novo Label Transfer)

Available when combining multiple datasets in a Co-Embed workflow.

Method: Transfers labels from a reference dataset to a target dataset
Requirements:
- Two uploaded files
- Reference dataset must contain a cell_type column in its metadata
- Works with PCA, scVI, and TranscriptFormer models
Process:
1. Combines datasets via outer-join (creates dataset_source column)
2. Generates integrated embeddings using your selected model
3. Treats dataset_2 (the second file uploaded) as the reference dataset
4. Uses a KNN classifier to transfer labels from reference cells to target cells
5. Applies labels based on nearest neighbors in embedding space
Embedding Keys (automatically selected based on model):
- PCA: Uses X_pca (Harmony-corrected embeddings)
- scVI: Uses scvi embeddings
- TranscriptFormer: Uses transcriptformer embeddings
Note: Currently requires cell_type column in reference data (future versions will support dynamic column selection)
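
The KNN label transfer step can be illustrated with a small, dependency-free sketch. It is not the Workspace's classifier: the function name, the value of k, the Euclidean distance, and the use of the neighbor vote fraction as a confidence score are all assumptions for illustration.

```python
import math
from collections import Counter

def knn_transfer(ref_embeddings, ref_labels, target_embeddings, k=3):
    """Label each target cell by majority vote among its k nearest reference cells.

    Returns a list of (label, confidence) pairs, where confidence is the
    fraction of the k neighbors that voted for the winning label.
    """
    predictions = []
    for target in target_embeddings:
        # Indices of the k reference cells closest to this target cell.
        nearest = sorted(
            range(len(ref_embeddings)),
            key=lambda i: math.dist(target, ref_embeddings[i]),
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        label, count = votes.most_common(1)[0]
        predictions.append((label, count / len(nearest)))
    return predictions

# Toy 2-D embeddings: reference cells (dataset_2) carry cell_type labels.
ref_embeddings = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]
target_embeddings = [(0.2, 0.1), (5.1, 5.4)]

predictions = knn_transfer(ref_embeddings, ref_labels, target_embeddings)
print(predictions)
```

In this toy example the second target cell has two B cell neighbors and one T cell neighbor, so it is labeled "B cell" with a vote fraction of 2/3. A lower fraction signals that the cell sits between reference populations in embedding space.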

Classification Output

Both methods produce:

Predicted cell type labels for each cell in your dataset
Confidence scores indicating prediction reliability
Integration quality metrics (for co-embed workflows)

When to Use Each Method

Pre-trained Census: Best for single-dataset analysis when you want to leverage comprehensive reference annotations
De Novo Transfer: Best when you have a specific reference dataset with known cell types that you want to transfer to your query data

Workflow Selection

Feature	Standard	Co-Embed	Comparison
Number of Datasets	1	2	1
Organism Support	Model-dependent	Model-dependent	Human/mouse only
Available Models	PCA, scVI, TranscriptFormer	PCA, scVI, TranscriptFormer	scVI, TranscriptFormer
Data Requirements	Raw or processed (model-dependent)	Raw or processed (model-dependent)	Raw integer counts only
Cell Type Classification	Pre-trained Census (scVI/TranscriptFormer, human/mouse)	De Novo Transfer (all models, requires reference with cell_type column)	Pre-trained Census (scVI/TranscriptFormer, human/mouse)
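
The table above can be encoded as a simple pre-flight check. The sketch below is illustrative only: the rule dictionary and function name are hypothetical, and organism support marked "Model-dependent" is left unchecked because the per-model rules are not listed here.

```python
ALL_MODELS = {"PCA", "scVI", "TranscriptFormer"}

# Hypothetical encoding of the table; None means "model-dependent, not checked here".
WORKFLOW_RULES = {
    "standard": {"n_datasets": 1, "models": ALL_MODELS, "organisms": None},
    "co-embed": {"n_datasets": 2, "models": ALL_MODELS, "organisms": None},
    "comparison": {
        "n_datasets": 1,
        "models": {"scVI", "TranscriptFormer"},
        "organisms": {"human", "mouse"},
    },
}

def check_request(workflow, n_datasets, model, organism):
    """Return a list of problems; an empty list means the request fits the table."""
    rules = WORKFLOW_RULES[workflow]
    problems = []
    if n_datasets != rules["n_datasets"]:
        problems.append(
            f"{workflow} expects {rules['n_datasets']} dataset(s), got {n_datasets}"
        )
    if model not in rules["models"]:
        problems.append(f"{model} is not available for {workflow}")
    if rules["organisms"] is not None and organism not in rules["organisms"]:
        problems.append(f"{workflow} supports human/mouse only, got {organism}")
    return problems

print(check_request("comparison", 1, "PCA", "zebrafish"))
print(check_request("co-embed", 2, "PCA", "zebrafish"))
```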