A Guide to RNA-Seq Normalization Methods

Why Normalization Matters

Raw RNA-seq counts reflect technical factors as well as expression: sequencing depth, gene length, and library composition. Without normalization, a difference in counts between two genes or two samples cannot be attributed to biology.

Common Normalization Methods

RPKM/FPKM (Reads/Fragments Per Kilobase Million)

Corrects for gene length and sequencing depth

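Problem: The sum of RPKM/FPKM values differs from sample to sample, so the same value can represent a different share of the transcriptome in each sample, which makes between-sample comparisons unreliable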
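
The RPKM arithmetic and the differing-sum problem can be sketched in a few lines of Python. This is a minimal illustration rather than any package's implementation; the function name `rpkm` and the toy counts and gene lengths are made up for the example.

```python
def rpkm(counts, lengths_bp):
    """RPKM for one sample: count / (gene length in kb) / (total reads in millions)."""
    total_millions = sum(counts) / 1e6
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]

# Hypothetical toy data: three genes, two samples with the same depth (1,000 reads)
lengths_bp = [1000, 2000, 4000]
sample_a = [100, 200, 700]
sample_b = [700, 200, 100]

rpkm_a = rpkm(sample_a, lengths_bp)
rpkm_b = rpkm(sample_b, lengths_bp)

# The per-sample totals are not equal, even though the depths are
print(sum(rpkm_a), sum(rpkm_b))
```

With these toy numbers the two totals come out near 375,000 and 825,000, even though both samples contain exactly 1,000 reads.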

TPM (Transcripts Per Million)

Counts are divided by gene length first, then scaled so that each sample sums to 1 million

Advantage: Every sample has the same total, so a TPM value represents the same proportion of the sample wherever it appears

Use case: Comparing relative expression across samples, keeping in mind that TPM does not correct for composition bias
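
The order of operations for TPM can be sketched as follows. This is a minimal illustration under the same assumptions as any toy example: the function name `tpm` and the counts and gene lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """TPM for one sample: length-normalize first, then scale the rates to sum to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]  # reads per kb
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

# Hypothetical toy data: three genes, two samples
lengths_bp = [1000, 2000, 4000]
sample_a = [100, 200, 700]
sample_b = [700, 200, 100]

tpm_a = tpm(sample_a, lengths_bp)
tpm_b = tpm(sample_b, lengths_bp)

# Both totals equal one million, up to floating-point rounding
print(sum(tpm_a), sum(tpm_b))
```

Because the scaling happens after the length correction, the total is fixed by construction for every sample.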

TMM (Trimmed Mean of M-values)

Used by edgeR for between-sample normalization

Assumes most genes are NOT differentially expressed

Calculates scaling factors to account for composition bias
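
The idea behind the scaling factor can be sketched in plain Python. This is a deliberately simplified version, not edgeR's implementation: it takes an unweighted trimmed mean of the M-values, whereas edgeR also applies precision weights, chooses the reference sample itself, and rescales the factors across samples. The function name `tmm_factor`, the trimming fractions used as defaults, and the toy counts are assumptions for illustration.

```python
import math

def tmm_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Simplified (unweighted) TMM scaling factor of `sample` relative to `ref`."""
    n_s, n_r = sum(sample), sum(ref)
    stats = []
    for y_s, y_r in zip(sample, ref):
        if y_s == 0 or y_r == 0:
            continue  # log ratios are undefined for zero counts
        p_s, p_r = y_s / n_s, y_r / n_r
        m = math.log2(p_s / p_r)        # M-value: log fold change in proportion
        a = 0.5 * math.log2(p_s * p_r)  # A-value: average log abundance
        stats.append((m, a))

    def kept(values, trim):
        """Indices left after dropping a fraction `trim` of genes from each tail."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        k = int(len(values) * trim)
        return set(order[k:len(values) - k])

    keep = kept([m for m, _ in stats], trim_m) & kept([a for _, a in stats], trim_a)
    mean_m = sum(stats[i][0] for i in keep) / len(keep)
    return 2 ** mean_m

# Hypothetical toy data: ten genes, one of which is strongly up in `sample`
ref = [100] * 10
sample = [100] * 9 + [1000]

factor = tmm_factor(sample, ref)
effective_size = sum(sample) * factor
print(factor, effective_size)
```

Here the single inflated gene pushes the sample's library size to 1,900, but the factor of about 0.53 brings the effective library size back to roughly 1,000, matching the reference, so the nine unchanged genes are no longer made to look down-regulated.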

DESeq2 Size Factors

Rests on the same assumption as TMM: most genes are not differentially expressed

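Uses the median-of-ratios method: each sample's size factor is the median, across genes, of its count divided by that gene's geometric mean across all samples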

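Robust to outliers and lowly expressed genes: the median is insensitive to a few extreme ratios, and genes with a zero count in any sample drop out of the calculation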
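
The median-of-ratios calculation can be sketched in plain Python. This is a minimal illustration of the method, not DESeq2's own code; the function name `size_factors` and the toy counts are hypothetical.

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors; `samples` is a list of per-sample count lists."""
    n_genes = len(samples[0])
    # Log geometric mean of each gene across samples; None if any count is zero
    log_geo_means = []
    for g in range(n_genes):
        values = [s[g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = []
    for s in samples:
        # Log ratio of each usable gene to its geometric mean
        log_ratios = [math.log(s[g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical toy data: sample_b is sequenced twice as deeply as sample_a,
# gene 4 is an outlier, and gene 5 has a zero count in sample_a
sample_a = [100, 200, 300, 400, 0]
sample_b = [200, 400, 600, 4000, 50]

factors = size_factors([sample_a, sample_b])
print(factors)
```

The two factors come out near 0.71 and 1.41, a two-fold ratio that matches the depth difference in the three unchanged genes. The outlier gene does not shift the median, and the gene with a zero count is skipped.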

When to Use What

| Method | Within-sample | Between-sample | DE analysis |
|--------|---------------|----------------|--------|
| RPKM/FPKM | ✓ | ✗ | ✗ |
| TPM | ✓ | ✓ | ✗ |
| TMM | ✗ | ✓ | ✓ |
| DESeq2 | ✗ | ✓ | ✓ |

Key Takeaways

Never use RPKM/FPKM for DE analysis

Use TPM for visualization and cross-sample comparison

Use raw counts with TMM (edgeR) or DESeq2 size factors for differential expression; both tools expect unnormalized counts as input

Always document your normalization choice