Data Requirements

Last Updated: November 7, 2025

Overview

The AI Workspace accepts single-cell RNA sequencing data in H5AD format (AnnData objects). This page outlines the specific requirements for preparing your data.

File Format Requirements

H5AD Format

Required: Files must be in H5AD format (.h5ad extension)
Size Limit: Maximum 10GB per file
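
A quick local check before uploading can catch both problems early. The following is a minimal sketch, not the platform's own validation: the function names are hypothetical, and it assumes "10GB" means 10 * 1024**3 bytes, which may differ from how the service counts.

```python
from pathlib import Path

# Assumption: "10GB" is read as 10 * 1024**3 bytes; the service may count differently.
MAX_UPLOAD_BYTES = 10 * 1024**3


def upload_problems(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    if Path(filename).suffix != ".h5ad":
        problems.append("file must have the .h5ad extension")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 10GB size limit")
    return problems


def check_path(path: str) -> list[str]:
    """Convenience wrapper (hypothetical) that reads the name and size from disk."""
    p = Path(path)
    return upload_problems(p.name, p.stat().st_size)
```

For example, check_path("my_dataset.h5ad") returns an empty list when the file passes both checks.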

Data Structure

Your H5AD file must contain:

Essential Components

X matrix: Gene expression counts (cells × genes)
- Can be raw counts or normalized values
- Sparse matrices supported
- Raw integer counts are required for scVI, TranscriptFormer, and the Comparison workflow
obs (cell metadata): Cell-level annotations
- Index: Unique cell identifiers
- Optional: batch information, cell types, experimental conditions
var (gene metadata): Gene-level annotations
- Index: Gene identifiers (ENSEMBL IDs required for organism detection)
- ENSEMBL IDs can be in var_names or any var column
- System automatically detects organism from ENSEMBL gene ID prefixes

Example Structure

adata.X              # Expression matrix (n_cells × n_genes)
adata.obs            # Cell metadata DataFrame
adata.var            # Gene metadata DataFrame with ENSEMBL IDs
# Organism automatically detected from ENSEMBL gene ID prefixes

Organism Support

Standard Workflow

PCA: Supports any organism
scVI: Human and mouse only
TranscriptFormer: Supports the following organisms, identified by ENSEMBL gene ID prefix:
- Human (homo_sapiens) - ENSG prefix
- Mouse (mus_musculus) - ENSMUSG prefix
- Frog (xenopus_tropicalis) - ENSXETG prefix
- African clawed frog (xenopus_laevis) - ENSXLAG prefix
- Zebrafish (danio_rerio) - ENSDARG prefix
- Mouse lemur (microcebus_murinus) - ENSMICG prefix
- Pig (sus_scrofa) - ENSSSCG prefix
- Cynomolgus monkey (macaca_fascicularis) - ENSMFAG prefix
- Rhesus monkey (macaca_mulatta) - ENSMMUG prefix
- Platypus (ornithorhynchus_anatinus) - ENSOANG prefix
- Opossum (monodelphis_domestica) - ENSMODG prefix
- Gorilla (gorilla_gorilla) - ENSGGOG prefix
- Chimpanzee (pan_troglodytes) - ENSPTRG prefix
- Marmoset (callithrix_jacchus) - ENSCJAG prefix
- Chicken (gallus_gallus) - ENSGALG prefix
- Rabbit (oryctolagus_cuniculus) - ENSOCUG prefix
- Rat (rattus_norvegicus) - ENSRNOG prefix
- Naked mole rat (heterocephalus_glaber) - ENSHGLG prefix
- Sea lamprey (petromyzon_marinus) - ENSPMAG prefix
- Freshwater sponge (spongilla_lacustris) - ENSLPGG prefix

Comparison Workflow

scVI: Human and mouse only (Census data limitation)
TranscriptFormer: Human and mouse only (Census data limitation)

Gene Identifier Requirements

ENSEMBL Gene IDs

Location: Can be in var_names (index) or any column in var DataFrame
Format: Standard ENSEMBL format (e.g., ENSG00000000003, ENSMUSG00000000001)
Detection: System automatically detects organism from ENSEMBL prefixes
Requirement: At least 75% of genes must have valid ENSEMBL IDs for organism detection
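
To see whether a gene list is likely to pass organism detection, the rule above can be approximated locally. This is a sketch under stated assumptions, not the system's actual detector: it assumes a valid ID is a listed prefix followed by 11 digits (optionally with a ".version" suffix), and that the 75% threshold applies to the share of all genes matching the most common prefix. The function name and regular expression are illustrative.

```python
import re
from collections import Counter

# Prefixes taken from the Organism Support list on this page.
PREFIX_TO_ORGANISM = {
    "ENSG": "homo_sapiens", "ENSMUSG": "mus_musculus",
    "ENSXETG": "xenopus_tropicalis", "ENSXLAG": "xenopus_laevis",
    "ENSDARG": "danio_rerio", "ENSMICG": "microcebus_murinus",
    "ENSSSCG": "sus_scrofa", "ENSMFAG": "macaca_fascicularis",
    "ENSMMUG": "macaca_mulatta", "ENSOANG": "ornithorhynchus_anatinus",
    "ENSMODG": "monodelphis_domestica", "ENSGGOG": "gorilla_gorilla",
    "ENSPTRG": "pan_troglodytes", "ENSCJAG": "callithrix_jacchus",
    "ENSGALG": "gallus_gallus", "ENSOCUG": "oryctolagus_cuniculus",
    "ENSRNOG": "rattus_norvegicus", "ENSHGLG": "heterocephalus_glaber",
    "ENSPMAG": "petromyzon_marinus", "ENSLPGG": "spongilla_lacustris",
}

# Assumption: prefix + 11 digits, with an optional ".version" suffix.
_ENSEMBL_ID = re.compile(r"^(ENS[A-Z]*G)\d{11}(\.\d+)?$")


def detect_organism(gene_ids, min_fraction=0.75):
    """Return the organism name, or None if too few IDs share a known prefix."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        return None
    prefixes = Counter()
    for gene_id in gene_ids:
        match = _ENSEMBL_ID.match(str(gene_id))
        if match and match.group(1) in PREFIX_TO_ORGANISM:
            prefixes[match.group(1)] += 1
    if not prefixes:
        return None
    prefix, count = prefixes.most_common(1)[0]
    if count / len(gene_ids) < min_fraction:
        return None
    return PREFIX_TO_ORGANISM[prefix]
```

With an AnnData object, the same function could be applied to adata.var_names or to the values of whichever var column holds the ENSEMBL IDs.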

Gene Filtering

Minimum genes per cell: 200 (applied during preprocessing)
Minimum cells per gene: 3 (applied during preprocessing)
Mitochondrial gene filtering: Cells with greater than 20% mitochondrial expression are removed
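
The effect of these three thresholds can be illustrated on a small in-memory matrix. The sketch below is an assumption-laden illustration rather than the platform's preprocessing code: it filters cells first and genes second, expects the caller to supply which columns are mitochondrial, and uses a hypothetical function name. The actual order of operations and the way mitochondrial genes are identified are not stated on this page.

```python
def qc_filter(counts, mito_genes, min_genes=200, min_cells=3,
              max_mito_fraction=0.20):
    """Apply the documented QC thresholds to a dense count matrix.

    counts: list of rows, one per cell, each a list of per-gene counts.
    mito_genes: set of column indices treated as mitochondrial (caller-supplied).
    Returns (kept_cell_indices, kept_gene_indices).
    """
    kept_cells = []
    for i, row in enumerate(counts):
        total = sum(row)
        n_detected = sum(1 for value in row if value > 0)
        mito_total = sum(row[j] for j in mito_genes)
        mito_fraction = mito_total / total if total else 0.0
        # Keep cells with enough detected genes and at most 20% mito expression.
        if n_detected >= min_genes and mito_fraction <= max_mito_fraction:
            kept_cells.append(i)

    n_genes = len(counts[0]) if counts else 0
    # Assumption: genes are counted only across the cells that survived above.
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes
```

Lower thresholds make the behavior easy to inspect on toy data, for example qc_filter(matrix, {2}, min_genes=2, min_cells=2).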

Data Quality Guidelines

Cell Count Limits

Overall maximum (any model): 2 million cells
TranscriptFormer: Maximum 100,000 cells
PCA and scVI: No additional limits beyond the 2 million cell maximum

Data Type Requirements

PCA: Accepts both raw counts and processed data
scVI and TranscriptFormer: Require raw integer count data
Comparison workflow: Requires raw integer count data (any model)
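
Because count matrices are often stored as floating-point values, checking the dtype alone is not enough to tell raw counts from normalized data. A minimal sketch of a value-based check follows; the function name is hypothetical and the platform's own test may be stricter or different.

```python
def looks_like_raw_counts(values):
    """True if every value is a non-negative whole number (e.g. 0, 3, 12.0)."""
    return all(v >= 0 and float(v).is_integer() for v in values)
```

For a sparse X matrix, passing only the stored non-zero entries (for example the .data array of a SciPy sparse matrix) avoids scanning the implicit zeros.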

Model-Specific Requirements

PCA Analysis

Data type: Raw counts or processed data accepted
Organisms: Any organism supported
Cell limit: Up to 2 million cells

scVI Embeddings

Data type: Raw integer counts required
Organisms: Human and mouse only
Cell limit: Up to 2 million cells

TranscriptFormer

Data type: Raw integer counts required
Organisms: All supported organisms with ENSEMBL gene IDs in the Standard workflow; human and mouse only in the Comparison workflow
Cell limit: Maximum 100,000 cells
Model variants: Sapiens, Exemplar, or Metazoa

Workflow-Specific Requirements

Standard Workflow

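Data type: Depends on model (PCA accepts raw counts or processed data; scVI and TranscriptFormer require raw integer counts)
Organisms: Depends on model (PCA: any organism; scVI: human and mouse; TranscriptFormer: organisms listed under Organism Support)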
Available models: PCA, scVI, TranscriptFormer

Comparison Workflow

Data type: Raw integer counts required
Organisms: Human and mouse only (Census data limitation)
Available models: scVI, TranscriptFormer only
Gene requirement: ENSEMBL IDs starting with ENSG or ENSMUSG
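
The model and workflow rules on this page can be combined into a single lookup. The sketch below only restates the documented limits; the function name, argument names, and the choice to apply the 100,000-cell TranscriptFormer limit in both workflows are assumptions, and it is not the platform's compatibility check.

```python
HUMAN_MOUSE = {"homo_sapiens", "mus_musculus"}
TRANSCRIPTFORMER_ORGANISMS = HUMAN_MOUSE | {
    "xenopus_tropicalis", "xenopus_laevis", "danio_rerio",
    "microcebus_murinus", "sus_scrofa", "macaca_fascicularis",
    "macaca_mulatta", "ornithorhynchus_anatinus", "monodelphis_domestica",
    "gorilla_gorilla", "pan_troglodytes", "callithrix_jacchus",
    "gallus_gallus", "oryctolagus_cuniculus", "rattus_norvegicus",
    "heterocephalus_glaber", "petromyzon_marinus", "spongilla_lacustris",
}
MAX_CELLS = 2_000_000
MAX_CELLS_TRANSCRIPTFORMER = 100_000


def available_models(workflow, organism, n_cells, raw_integer_counts):
    """List the models this page says are usable for a dataset."""
    if workflow not in ("standard", "comparison"):
        raise ValueError(f"unknown workflow: {workflow!r}")
    if n_cells > MAX_CELLS:
        return []
    models = []
    if workflow == "standard":
        # PCA accepts any organism and either raw or processed data.
        models.append("PCA")
    if raw_integer_counts:
        if organism in HUMAN_MOUSE:
            models.append("scVI")
        # The Comparison workflow restricts TranscriptFormer to human and mouse.
        allowed = (TRANSCRIPTFORMER_ORGANISMS if workflow == "standard"
                   else HUMAN_MOUSE)
        if organism in allowed and n_cells <= MAX_CELLS_TRANSCRIPTFORMER:
            models.append("TranscriptFormer")
    return models
```

For instance, a 50,000-cell zebrafish dataset with raw counts would be offered PCA and TranscriptFormer in the Standard workflow and nothing in the Comparison workflow.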

File Upload Process

Upload Methods

Drag and drop: Drag H5AD files to the upload area
Multiple files: Upload multiple datasets simultaneously

Upload Validation

The system automatically validates:

File format and structure
Organism compatibility
Gene identifier format

Post-Upload Processing

After upload, files undergo registration, which includes data validation and a model and workflow compatibility check.

File Management

File Retention

User files: Automatically deleted after 6 months
Demo files: Permanently available for testing
Results: Workflows and outputs deleted after 6 months

Troubleshooting Common Issues

Validation Errors

No ENSEMBL IDs: ENSEMBL gene IDs must be in var_names or any var column
Organism detection failed: Ensure at least 75% of genes have valid ENSEMBL IDs
Incompatible organism: Check organism support for your chosen model/workflow

Model Compatibility

scVI unavailable: Check organism and Census model availability
TranscriptFormer variant restricted: Some variants only support specific organisms
Classification disabled: Requires scVI/TranscriptFormer with human/mouse data