Why Normalization Matters
Raw RNA-seq counts are affected by technical factors: sequencing depth, gene length, and library composition. Proper normalization is essential for meaningful comparisons.
Common Normalization Methods
RPKM/FPKM (Reads/Fragments Per Kilobase Million)
- Corrects for gene length and sequencing depth
- Problem: The sum differs across samples, making between-sample comparisons problematic

TPM (Transcripts Per Million)
- Gene length normalized first, then scaled to 1 million
- Advantage: Consistent sum across samples
- Use case: Comparing expression across samples

TMM (Trimmed Mean of M-values)
- Used by edgeR for between-sample normalization
- Assumes most genes are NOT differentially expressed
- Calculates scaling factors to account for composition bias

DESeq2 Size Factors
- Similar philosophy to TMM
- Uses median of ratios method
- Robust to outliers and lowly expressed genes

When to Use What
| Method | Within-sample | Between-sample | For DE |
|--------|---------------|----------------|--------|
| RPKM/FPKM | ✓ | ✗ | ✗ |
| TPM | ✓ | ✓ | ✗ |
| TMM | ✗ | ✓ | ✓ |
| DESeq2 | ✗ | ✓ | ✓ |
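
The difference between the RPKM/FPKM and TPM rows can be illustrated with a minimal NumPy sketch. The count matrix and gene lengths below are made-up toy values, and the function names `rpkm` and `tpm` are hypothetical helpers written for this example, not part of any library.

```python
import numpy as np

# Hypothetical toy data: 3 genes (rows) x 2 samples (columns) of raw read counts.
counts = np.array([[100.0, 200.0],
                   [300.0, 300.0],
                   [600.0, 100.0]])
# Made-up gene lengths in kilobases, one per gene.
lengths_kb = np.array([1.0, 2.0, 4.0])

def rpkm(counts, lengths_kb):
    """Scale by sequencing depth first, then by gene length."""
    per_million = counts.sum(axis=0) / 1e6        # library size in millions
    return counts / per_million / lengths_kb[:, None]

def tpm(counts, lengths_kb):
    """Scale by gene length first, then rescale each sample to 1 million."""
    rate = counts / lengths_kb[:, None]           # reads per kilobase
    return rate / rate.sum(axis=0) * 1e6

rpkm_values = rpkm(counts, lengths_kb)
tpm_values = tpm(counts, lengths_kb)

# Column sums: RPKM totals vary by sample, TPM totals are 1e6 by construction.
print("RPKM column sums:", rpkm_values.sum(axis=0))
print("TPM column sums: ", tpm_values.sum(axis=0))
```

Because each TPM column is divided by its own total before scaling, every sample sums to one million, whereas the RPKM column totals depend on the mix of gene lengths and counts in each sample.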
Key Takeaways
- Never use RPKM/FPKM for DE analysis
- Use TPM for visualization and cross-sample comparison
- Use raw counts + TMM/DESeq2 for differential expression
- Always document your normalization choice
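
The median-of-ratios idea behind DESeq2 size factors can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not DESeq2's actual implementation (DESeq2 is an R package): the function name `median_of_ratios_size_factors` is hypothetical, the counts are toy values, and genes with a zero count in any sample are simply excluded from the reference.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Simplified median-of-ratios sketch for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: use only genes with a nonzero count in every sample.
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples acts as a pseudo-reference.
    log_geo_mean = log_counts.mean(axis=1)
    # Per-sample median of the log ratios to that reference.
    log_ratios = log_counts - log_geo_mean[:, None]
    return np.exp(np.median(log_ratios, axis=0))

# Toy data: sample 2 has twice the counts of sample 1 for the first three genes;
# the last gene has a zero and is excluded from the reference.
toy_counts = np.array([[10, 20],
                       [20, 40],
                       [30, 60],
                       [0, 5]])

size_factors = median_of_ratios_size_factors(toy_counts)
normalized = toy_counts / size_factors  # divide each sample by its size factor
print("size factors:", size_factors)
```

In this toy case the second sample's size factor is twice the first, so dividing by the size factors puts the first three genes on the same scale in both samples. Using the median rather than the mean is what keeps a handful of extreme genes from dominating the scaling.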