Data Requirements
Last Updated: November 7, 2025
Overview
The AI Workspace accepts single-cell RNA sequencing data in H5AD format (AnnData objects). This page outlines the specific requirements for preparing your data.
File Format Requirements
H5AD Format
- Required: Files must be in H5AD format (.h5ad extension)
- Size Limit: Maximum 10 GB per file
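The two file-level rules above can be checked locally before uploading. The following is a minimal sketch, not the service's own validator: the function name `check_h5ad_file` is hypothetical, and it assumes "10 GB" means 10 × 1024³ bytes (the service may count decimal gigabytes instead).

```python
from pathlib import Path

# Assumption: the 10 GB limit is interpreted as 10 * 1024**3 bytes.
MAX_FILE_BYTES = 10 * 1024**3


def check_h5ad_file(path, max_bytes=MAX_FILE_BYTES):
    """Return a list of problems; an empty list means both checks passed."""
    p = Path(path)
    problems = []
    # Rule 1: the file must carry the .h5ad extension.
    if p.suffix.lower() != ".h5ad":
        problems.append(f"expected a .h5ad extension, got {p.suffix!r}")
    # Rule 2: the file must exist and stay within the size limit.
    if not p.is_file():
        problems.append("file does not exist")
    elif p.stat().st_size > max_bytes:
        problems.append(f"file is larger than {max_bytes} bytes")
    return problems
```

This only inspects the name and size on disk; it does not open the file or confirm that its contents are a valid AnnData object.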
Data Structure
Your H5AD file must contain:
Essential Components
- X matrix: Gene expression counts (cells × genes)
  - Can be raw counts or normalized values
  - Sparse matrices supported
  - Integer counts required for some workflows
- obs (cell metadata): Cell-level annotations
  - Index: Unique cell identifiers
  - Optional: batch information, cell types, experimental conditions
- var (gene metadata): Gene-level annotations
  - Index: Gene identifiers (ENSEMBL IDs required for organism detection)
  - ENSEMBL IDs can be in var_names or any var column
  - System automatically detects organism from ENSEMBL gene ID prefixes
Example Structure
adata.X # Expression matrix (n_cells × n_genes)
adata.obs # Cell metadata DataFrame
adata.var # Gene metadata DataFrame with ENSEMBL IDs
# Organism automatically detected from ENSEMBL gene ID prefixes
Organism Support
Standard Workflow
- PCA: Supports any organism
- scVI: Human and mouse only
- TranscriptFormer: All supported organisms with ENSEMBL gene IDs:
  - Human (homo_sapiens) - ENSG prefix
  - Mouse (mus_musculus) - ENSMUSG prefix
  - Frog (xenopus_tropicalis) - ENSXETG prefix
  - African clawed frog (xenopus_laevis) - ENSXLAG prefix
  - Zebrafish (danio_rerio) - ENSDARG prefix
  - Mouse lemur (microcebus_murinus) - ENSMICG prefix
  - Pig (sus_scrofa) - ENSSSCG prefix
  - Cynomolgus monkey (macaca_fascicularis) - ENSMFAG prefix
  - Rhesus monkey (macaca_mulatta) - ENSMMUG prefix
  - Platypus (ornithorhynchus_anatinus) - ENSOANG prefix
  - Opossum (monodelphis_domestica) - ENSMODG prefix
  - Gorilla (gorilla_gorilla) - ENSGGOG prefix
  - Chimpanzee (pan_troglodytes) - ENSPTRG prefix
  - Marmoset (callithrix_jacchus) - ENSCJAG prefix
  - Chicken (gallus_gallus) - ENSGALG prefix
  - Rabbit (oryctolagus_cuniculus) - ENSOCUG prefix
  - Rat (rattus_norvegicus) - ENSRNOG prefix
  - Naked mole rat (heterocephalus_glaber) - ENSHGLG prefix
  - Sea lamprey (petromyzon_marinus) - ENSPMAG prefix
  - Freshwater sponge (spongilla_lacustris) - ENSLPGG prefix
Comparison Workflow
- scVI: Human and mouse only (Census data limitation)
- TranscriptFormer: Human and mouse only (Census data limitation)
Gene Identifier Requirements
ENSEMBL Gene IDs
- Location: Can be in var_names (index) or any column in the var DataFrame
- Format: Standard ENSEMBL format (e.g., ENSG00000000003, ENSMUSG00000000001)
- Detection: System automatically detects organism from ENSEMBL prefixes
- Requirement: At least 75% of genes must have valid ENSEMBL IDs for organism detection
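To make the prefix-based detection and the 75% rule concrete, here is a minimal sketch of how such a check could work. It is an illustration under stated assumptions, not the system's actual detection code: `detect_organism`, `PREFIX_TO_ORGANISM`, and the regular expression are hypothetical, the table holds only a few of the prefixes listed on this page, and tolerating a trailing `.version` suffix is an assumption.

```python
import re
from collections import Counter

# A few prefixes copied from the organism list on this page (not the full set).
PREFIX_TO_ORGANISM = {
    "ENSG": "homo_sapiens",
    "ENSMUSG": "mus_musculus",
    "ENSDARG": "danio_rerio",
    "ENSGALG": "gallus_gallus",
    "ENSGGOG": "gorilla_gorilla",
    "ENSRNOG": "rattus_norvegicus",
}

# Letters-only prefix ending in "G", then 11 digits, then an optional
# ".version" suffix (accepting the suffix is an assumption of this sketch).
ENSEMBL_GENE_ID = re.compile(r"^(ENS[A-Z]*G)\d{11}(?:\.\d+)?$")


def detect_organism(gene_ids, min_fraction=0.75):
    """Return the organism covering at least min_fraction of gene_ids, else None."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        return None
    counts = Counter()
    for gene_id in gene_ids:
        match = ENSEMBL_GENE_ID.match(str(gene_id))
        if match and match.group(1) in PREFIX_TO_ORGANISM:
            counts[match.group(1)] += 1
    if not counts:
        return None
    # Pick the most frequent known prefix and apply the threshold to it.
    prefix, n_valid = counts.most_common(1)[0]
    if n_valid / len(gene_ids) < min_fraction:
        return None
    return PREFIX_TO_ORGANISM[prefix]
```

The sketch matches the whole letter prefix rather than calling `startswith("ENSG")`, because the human prefix ENSG is also the leading text of ENSGALG (chicken) and ENSGGOG (gorilla). To try it on your own data, pass the values of var_names or of the var column that holds the ENSEMBL IDs.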
Gene Filtering
- Minimum genes per cell: 200 (applied during preprocessing)
- Minimum cells per gene: 3 (applied during preprocessing)
- Mitochondrial gene filtering: Cells with greater than 20% mitochondrial expression removed
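The three thresholds above can be sketched on a small dense matrix. This is a plain-Python illustration, not the preprocessing code the workspace runs. Several details are assumptions: the function name `filter_cells_and_genes` is hypothetical, cells are filtered before genes, gene counts are taken over the remaining cells, and the caller supplies the mitochondrial flags.

```python
def filter_cells_and_genes(counts, is_mito, min_genes=200, min_cells=3,
                           max_mito_fraction=0.20):
    """Apply the three filters to a dense cells x genes matrix.

    counts: list of rows (one per cell), each a list of per-gene counts.
    is_mito: one boolean per gene, True for mitochondrial genes.
    Returns (kept_cell_indices, kept_gene_indices).
    """
    kept_cells = []
    for i, row in enumerate(counts):
        n_genes = sum(1 for v in row if v > 0)           # genes detected in this cell
        total = sum(row)
        mito = sum(v for v, m in zip(row, is_mito) if m)
        mito_fraction = mito / total if total else 0.0
        # Cells with MORE than 20% mitochondrial expression are removed,
        # so a cell at exactly 20% is kept.
        if n_genes >= min_genes and mito_fraction <= max_mito_fraction:
            kept_cells.append(i)
    # Assumption: genes are counted over the cells that survived the cell filters.
    kept_genes = [
        j for j in range(len(is_mito))
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes
```

The defaults mirror the values listed above (200 genes per cell, 3 cells per gene, 20% mitochondrial expression). Passing smaller thresholds makes it easy to experiment on a toy matrix.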
Data Quality Guidelines
Cell Count Limits
- Maximum for all models: 2 million cells
- TranscriptFormer: Maximum 100,000 cells
- PCA and scVI: No additional limits beyond the 2M maximum
Data Type Requirements
- PCA: Accepts both raw counts and processed data
- scVI and TranscriptFormer: Require raw integer count data
- Comparison workflow: Requires raw integer count data (any model)
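A quick way to tell whether a matrix still holds raw integer counts is to confirm that every value is finite, non-negative, and integer-valued. The sketch below shows one such check; the function name `looks_like_raw_counts` is hypothetical, and the workspace may apply a different or stricter test.

```python
import math


def looks_like_raw_counts(values):
    """True if every value is finite, non-negative, and integer-valued (3 or 3.0)."""
    for v in values:
        f = float(v)
        if not math.isfinite(f):        # rejects NaN and +/- infinity
            return False
        if f < 0 or f != int(f):        # rejects negatives and fractional values
            return False
    return True
```

The function takes a flat iterable of numbers. For a SciPy sparse X, the stored nonzero values are available as `X.data`, which is usually enough to inspect. Log-normalized or scaled data fails the check because its values are fractional or negative.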
Model-Specific Requirements
PCA Analysis
- Data type: Raw counts or processed data accepted
- Organisms: Any organism supported
- Cell limit: Up to 2 million cells
scVI Embeddings
- Data type: Raw integer counts required
- Organisms: Human and mouse only
- Cell limit: Up to 2 million cells
TranscriptFormer
- Data type: Raw integer counts required
- Organisms: All supported organisms with ENSEMBL gene IDs in the Standard workflow; human and mouse only in the Comparison workflow
- Cell limit: Maximum 100,000 cells
- Model variants: Sapiens, Exemplar, or Metazoa
Workflow-Specific Requirements
Standard Workflow
- Data type: Raw counts or processed data (depends on model)
- Organisms: Any organism (depends on model)
- Available models: PCA, scVI, TranscriptFormer
Comparison Workflow
- Data type: Raw integer counts required
- Organisms: Human and mouse only (Census data limitation)
- Available models: scVI, TranscriptFormer only
- Gene requirement: ENSEMBL IDs starting with ENSG or ENSMUSG
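The model and workflow rules on this page can be combined into a single lookup, sketched below. This is an illustrative reading of the documented limits, not the workspace's compatibility logic. The function name `available_models` and the constant names are hypothetical, the default TranscriptFormer organism set lists only a few of the supported species, and applying the 100,000-cell TranscriptFormer limit in the Comparison workflow is an assumption.

```python
MAX_CELLS = 2_000_000
TRANSCRIPTFORMER_MAX_CELLS = 100_000
CENSUS_ORGANISMS = frozenset({"homo_sapiens", "mus_musculus"})
# Subset of the TranscriptFormer organism list, for illustration only.
TRANSCRIPTFORMER_ORGANISMS = frozenset(
    {"homo_sapiens", "mus_musculus", "danio_rerio", "gallus_gallus"}
)


def available_models(workflow, organism, n_cells, raw_integer_counts,
                     transcriptformer_organisms=TRANSCRIPTFORMER_ORGANISMS):
    """Return the models usable for a dataset, in the order PCA, scVI, TranscriptFormer."""
    if workflow not in ("standard", "comparison"):
        raise ValueError(f"unknown workflow: {workflow!r}")
    if n_cells > MAX_CELLS:
        return []
    models = []
    if workflow == "standard":
        # PCA accepts any organism and both raw and processed data.
        models.append("PCA")
        tf_ok = organism in transcriptformer_organisms
    else:
        # Comparison workflow: Census-backed models only, human and mouse only.
        tf_ok = organism in CENSUS_ORGANISMS
    if raw_integer_counts:
        if organism in CENSUS_ORGANISMS:
            models.append("scVI")
        if tf_ok and n_cells <= TRANSCRIPTFORMER_MAX_CELLS:
            models.append("TranscriptFormer")
    return models
```

For example, a 50,000-cell zebrafish dataset with raw counts would get PCA and TranscriptFormer in the Standard workflow and nothing in the Comparison workflow.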
File Upload Process
Upload Methods
- Drag and drop: Drag H5AD files to upload area
- Multiple files: Upload multiple datasets simultaneously
Upload Validation
The system automatically validates:
- File format and structure
- Organism compatibility
- Gene identifier format
Post-Upload Processing
After upload, files undergo registration, which includes data validation and model/workflow compatibility checks.
File Management
File Retention
- User files: Automatically deleted after 6 months
- Demo files: Permanently available for testing
- Results: Workflows and outputs deleted after 6 months
Troubleshooting Common Issues
Validation Errors
- No ENSEMBL IDs: ENSEMBL gene IDs must be in var_names or any var column
- Organism detection failed: Ensure at least 75% of genes have valid ENSEMBL IDs
- Incompatible organism: Check organism support for your chosen model/workflow
Model Compatibility
- scVI unavailable: Check organism and Census model availability
- TranscriptFormer variant restricted: Some variants only support specific organisms
- Classification disabled: Requires scVI/TranscriptFormer with human/mouse data