Data Requirements

Last Updated: November 7, 2025

Overview

The AI Workspace accepts single-cell RNA sequencing data in H5AD format (AnnData objects). This page outlines the specific requirements for preparing your data.

File Format Requirements

H5AD Format

  • Required: Files must be in H5AD format (.h5ad extension)
  • Size Limit: Maximum 10GB per file
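
Both file-level rules can be checked locally before uploading. The sketch below is illustrative only: the helper name check_h5ad_file is hypothetical, "10GB" is assumed to mean decimal gigabytes, and the extension comparison is assumed to be case-sensitive. The platform's own validation may differ on both points.

```python
import os

# Assumption: "10GB" is read as 10 * 1000^3 bytes; the platform may use 1024^3.
MAX_BYTES = 10 * 1000**3


def check_h5ad_file(path, size_bytes=None):
    """Return a list of problems; an empty list means both checks pass.

    size_bytes can be supplied directly (useful for testing); otherwise
    the size is read from disk.
    """
    problems = []
    # Assumption: only the exact lowercase ".h5ad" extension is accepted.
    if os.path.splitext(path)[1] != ".h5ad":
        problems.append("file must have the .h5ad extension")
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds the 10GB limit")
    return problems
```

Calling check_h5ad_file("my_data.h5ad") on a local file returns an empty list when neither rule is violated.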

Data Structure

Your H5AD file must contain:

Essential Components

  • X matrix: Gene expression counts (cells × genes)

    • Can be raw counts or normalized values
    • Sparse matrices supported
    • Raw integer counts required for scVI, TranscriptFormer, and the Comparison workflow
  • obs (cell metadata): Cell-level annotations

    • Index: Unique cell identifiers
    • Optional: batch information, cell types, experimental conditions
  • var (gene metadata): Gene-level annotations

    • Index: Gene identifiers (ENSEMBL IDs required for organism detection)
    • ENSEMBL IDs can be in var_names or any var column
    • System automatically detects organism from ENSEMBL gene ID prefixes

Example Structure

adata.X              # Expression matrix (n_cells × n_genes)
adata.obs            # Cell metadata DataFrame
adata.var            # Gene metadata DataFrame with ENSEMBL IDs
# Organism automatically detected from ENSEMBL gene ID prefixes

Organism Support

Standard Workflow

  • PCA: Supports any organism
  • scVI: Human and mouse only
  • TranscriptFormer: All supported organisms with ENSEMBL gene IDs:
    • Human (homo_sapiens) - ENSG prefix
    • Mouse (mus_musculus) - ENSMUSG prefix
    • Frog (xenopus_tropicalis) - ENSXETG prefix
    • African clawed frog (xenopus_laevis) - ENSXLAG prefix
    • Zebrafish (danio_rerio) - ENSDARG prefix
    • Mouse lemur (microcebus_murinus) - ENSMICG prefix
    • Pig (sus_scrofa) - ENSSSCG prefix
    • Cynomolgus monkey (macaca_fascicularis) - ENSMFAG prefix
    • Rhesus monkey (macaca_mulatta) - ENSMMUG prefix
    • Platypus (ornithorhynchus_anatinus) - ENSOANG prefix
    • Opossum (monodelphis_domestica) - ENSMODG prefix
    • Gorilla (gorilla_gorilla) - ENSGGOG prefix
    • Chimpanzee (pan_troglodytes) - ENSPTRG prefix
    • Marmoset (callithrix_jacchus) - ENSCJAG prefix
    • Chicken (gallus_gallus) - ENSGALG prefix
    • Rabbit (oryctolagus_cuniculus) - ENSOCUG prefix
    • Rat (rattus_norvegicus) - ENSRNOG prefix
    • Naked mole rat (heterocephalus_glaber) - ENSHGLG prefix
    • Sea lamprey (petromyzon_marinus) - ENSPMAG prefix
    • Freshwater sponge (spongilla_lacustris) - ENSLPGG prefix

Comparison Workflow

  • scVI: Human and mouse only (Census data limitation)
  • TranscriptFormer: Human and mouse only (Census data limitation)
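
The organism rules for both workflows can be condensed into a small lookup. This is a sketch of the rules as listed on this page, not the platform's implementation. The function name available_models and the workflow labels "standard" and "comparison" are hypothetical.

```python
HUMAN_MOUSE = {"homo_sapiens", "mus_musculus"}

# Organisms listed for TranscriptFormer in the Standard workflow.
TRANSCRIPTFORMER_ORGANISMS = {
    "homo_sapiens", "mus_musculus", "xenopus_tropicalis", "xenopus_laevis",
    "danio_rerio", "microcebus_murinus", "sus_scrofa", "macaca_fascicularis",
    "macaca_mulatta", "ornithorhynchus_anatinus", "monodelphis_domestica",
    "gorilla_gorilla", "pan_troglodytes", "callithrix_jacchus",
    "gallus_gallus", "oryctolagus_cuniculus", "rattus_norvegicus",
    "heterocephalus_glaber", "petromyzon_marinus", "spongilla_lacustris",
}


def available_models(organism, workflow):
    """Return the models this page lists for an organism and workflow."""
    if workflow == "standard":
        models = ["PCA"]  # PCA supports any organism
        if organism in HUMAN_MOUSE:
            models.append("scVI")
        if organism in TRANSCRIPTFORMER_ORGANISMS:
            models.append("TranscriptFormer")
        return models
    if workflow == "comparison":
        # Both models are limited to human and mouse in this workflow.
        return ["scVI", "TranscriptFormer"] if organism in HUMAN_MOUSE else []
    raise ValueError(f"unknown workflow: {workflow!r}")
```

For example, available_models("danio_rerio", "standard") yields PCA and TranscriptFormer, while the same organism yields no models in the Comparison workflow.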

Gene Identifier Requirements

ENSEMBL Gene IDs

  • Location: Can be in var_names (index) or any column in var DataFrame
  • Format: Standard ENSEMBL format (e.g., ENSG00000000003, ENSMUSG00000000001)
  • Detection: System automatically detects organism from ENSEMBL prefixes
  • Requirement: At least 75% of genes must have valid ENSEMBL IDs for organism detection
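
To estimate in advance whether organism detection is likely to succeed, the prefix rule and the 75% threshold can be approximated as below. This is a minimal sketch under stated assumptions: a "valid" ID is taken to be a known prefix followed by digits, and the organism is taken to be the most common prefix among valid IDs. The actual detection logic is not documented here and may differ. The function name detect_organism is hypothetical.

```python
import re
from collections import Counter

PREFIX_TO_ORGANISM = {
    "ENSG": "homo_sapiens", "ENSMUSG": "mus_musculus",
    "ENSXETG": "xenopus_tropicalis", "ENSXLAG": "xenopus_laevis",
    "ENSDARG": "danio_rerio", "ENSMICG": "microcebus_murinus",
    "ENSSSCG": "sus_scrofa", "ENSMFAG": "macaca_fascicularis",
    "ENSMMUG": "macaca_mulatta", "ENSOANG": "ornithorhynchus_anatinus",
    "ENSMODG": "monodelphis_domestica", "ENSGGOG": "gorilla_gorilla",
    "ENSPTRG": "pan_troglodytes", "ENSCJAG": "callithrix_jacchus",
    "ENSGALG": "gallus_gallus", "ENSOCUG": "oryctolagus_cuniculus",
    "ENSRNOG": "rattus_norvegicus", "ENSHGLG": "heterocephalus_glaber",
    "ENSPMAG": "petromyzon_marinus", "ENSLPGG": "spongilla_lacustris",
}

# Capture the letter prefix that sits directly before the numeric part,
# so "ENSGALG..." is not mistaken for the human "ENSG" prefix.
_ID_PATTERN = re.compile(r"(ENS[A-Z]*G)\d+")


def detect_organism(gene_ids, min_fraction=0.75):
    """Return the most common organism, or None if too few IDs are valid."""
    gene_ids = list(gene_ids)
    counts = Counter()
    for gene_id in gene_ids:
        match = _ID_PATTERN.match(str(gene_id))
        if match and match.group(1) in PREFIX_TO_ORGANISM:
            counts[PREFIX_TO_ORGANISM[match.group(1)]] += 1
    if not gene_ids or sum(counts.values()) / len(gene_ids) < min_fraction:
        return None
    return counts.most_common(1)[0][0]
```

With an AnnData object in memory, the IDs would typically come from adata.var_names or from whichever var column holds the ENSEMBL IDs.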

Gene Filtering

  • Minimum genes per cell: 200 (applied during preprocessing)
  • Minimum cells per gene: 3 (applied during preprocessing)
  • Mitochondrial gene filtering: Cells with more than 20% mitochondrial expression are removed
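
To get a rough idea of how many cells and genes would survive these thresholds, the three filters can be sketched in plain Python. Several details are assumptions, since this page does not specify them: cell filters are applied before the gene filter, a gene counts as detected in a cell when its count is above zero, mitochondrial expression is the mitochondrial share of a cell's total counts, and the caller supplies the column indices of mitochondrial genes. The function name qc_filter is hypothetical.

```python
def qc_filter(counts, mito_gene_idx, min_genes=200, min_cells=3,
              max_mito_fraction=0.20):
    """Apply the three thresholds to a dense cells x genes list of rows.

    Returns (kept_cell_indices, kept_gene_indices).
    """
    mito = set(mito_gene_idx)
    kept_cells = []
    for i, row in enumerate(counts):
        n_genes = sum(1 for value in row if value > 0)
        total = sum(row)
        mito_total = sum(row[j] for j in mito)
        mito_fraction = mito_total / total if total else 0.0
        # Cells above the mitochondrial threshold are removed.
        if n_genes >= min_genes and mito_fraction <= max_mito_fraction:
            kept_cells.append(i)

    n_gene_columns = len(counts[0]) if counts else 0
    kept_genes = [
        j for j in range(n_gene_columns)
        # Assumption: gene detection is counted over surviving cells only.
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes
```

The defaults mirror the thresholds listed above. For real datasets, a sparse-aware library is a better fit than nested lists; this version only makes the rules explicit.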

Data Quality Guidelines

Cell Count Limits

  • Maximum for all models: 2 million cells
  • TranscriptFormer: Maximum 100,000 cells
  • PCA and scVI: No additional limits beyond the 2M maximum
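
As a quick illustration of how the limits interact, the sketch below lists which models remain available for a given cell count. The dictionary and function names are hypothetical.

```python
# Per-model cell limits as stated on this page.
MAX_CELLS = {
    "PCA": 2_000_000,
    "scVI": 2_000_000,
    "TranscriptFormer": 100_000,
}


def models_within_cell_limit(n_cells):
    """Return the models whose cell limit is not exceeded by n_cells."""
    return [model for model, limit in MAX_CELLS.items() if n_cells <= limit]
```

A dataset with 150,000 cells, for instance, fits PCA and scVI but exceeds the TranscriptFormer limit.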

Data Type Requirements

  • PCA: Accepts both raw counts and processed data
  • scVI and TranscriptFormer: Require raw integer count data
  • Comparison workflow: Requires raw integer count data regardless of model
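
A simple heuristic can indicate whether a matrix still holds raw counts. The sketch below treats values as raw counts when every one is a non-negative whole number. This is an assumption about what "raw integer count data" means in practice, not the platform's documented check. The function name looks_like_raw_counts is hypothetical.

```python
def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number.

    Values stored as floats (e.g. 3.0) still pass, since count matrices
    are often saved with a floating-point dtype.
    """
    return all(v >= 0 and float(v).is_integer() for v in values)
```

For a sparse adata.X, the stored non-zero values (the matrix's data attribute) could be passed in. For a dense matrix, the flattened array could be passed instead. Normalized or log-transformed data will generally fail this check.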

Model-Specific Requirements

PCA Analysis

  • Data type: Raw counts or processed data accepted
  • Organisms: Any organism supported
  • Cell limit: Up to 2 million cells

scVI Embeddings

  • Data type: Raw integer counts required
  • Organisms: Human and mouse only
  • Cell limit: Up to 2 million cells

TranscriptFormer

  • Data type: Raw integer counts required
  • Organisms: All supported organisms with ENSEMBL gene IDs in the Standard workflow; human and mouse only in the Comparison workflow
  • Cell limit: Maximum 100,000 cells
  • Model variants: Sapiens, Exemplar, or Metazoa

Workflow-Specific Requirements

Standard Workflow

  • Data type: Raw counts or processed data (depends on model)
  • Organisms: Any organism (depends on model)
  • Available models: PCA, scVI, TranscriptFormer

Comparison Workflow

  • Data type: Raw integer counts required
  • Organisms: Human and mouse only (Census data limitation)
  • Available models: scVI, TranscriptFormer only
  • Gene requirement: ENSEMBL IDs starting with ENSG or ENSMUSG

File Upload Process

Upload Methods

  1. Drag and drop: Drag H5AD files to the upload area
  2. Multiple files: Upload multiple datasets simultaneously

Upload Validation

The system automatically validates:

  • File format and structure
  • Organism compatibility
  • Gene identifier format

Post-Upload Processing

After upload, files undergo registration, which includes data validation and a model/workflow compatibility check.

File Management

File Retention

  • User files: Automatically deleted after 6 months
  • Demo files: Permanently available for testing
  • Results: Workflows and outputs deleted after 6 months

Troubleshooting Common Issues

Validation Errors

  • No ENSEMBL IDs: ENSEMBL gene IDs must be in var_names or any var column
  • Organism detection failed: Ensure at least 75% of genes have valid ENSEMBL IDs
  • Incompatible organism: Check organism support for your chosen model/workflow

Model Compatibility

  • scVI unavailable: Check organism and Census model availability
  • TranscriptFormer variant restricted: Some variants only support specific organisms
  • Classification disabled: Requires scVI/TranscriptFormer with human/mouse data