RNA sequencing (RNA-Seq) is a powerful technique to assess an organism’s whole transcriptome at unprecedented levels of sensitivity and generate a snapshot of the presence and quantity of RNAs under specific conditions at a given point in time. As such, RNA-Seq provides global insights into the expression of genes within a biological sample across various orders of magnitude and allows researchers to determine relative and absolute expression changes in response to stimuli, during development, or across disease states.
Artificial spike-in controls are indispensable tools for RNA-Seq experiments (Chen et al., 2015; Hardwick et al., 2017). Spike-ins are typically synthetic nucleic acid sequences without complementarity to the organism of interest. They are added to biological samples at known concentrations and with known composition prior to sample processing for the sequencing experiment. Spike-ins serve as internal quantitative and qualitative standards, enabling researchers to evaluate the performance of the entire Next-Generation Sequencing (NGS) workflow. In addition, spike-in controls provide a ground truth independent of the biological sample, allow the assessment of technical biases, and enable normalization and absolute quantification.
Which spike-in controls to choose for your RNA-Seq project?
Spike-ins enable comprehensive assessment of the complete experimental performance, specifically evaluating dynamic range, sensitivity, reproducibility, isoform detection, and quantification accuracy. Further, these controls serve as internal standards, facilitating RNA quantification, data normalization, and technical variability assessment between samples. This robust internal control capability enables quality assurance in large-scale experiments, ensuring overall data consistency.
Despite their crucial role in assessing the reliability of large datasets, spike-in controls are underused in NGS experiments. Only a small number of published datasets contain exogenous controls. Such controls make it possible to assess data quality on a subset of well-defined transcripts whose composition, concentration, and sequence are known, and to assess the effect of different data analysis tools on the final results.
Spike-in RNA controls exist in various flavors. The two most widely used spike-in controls for bulk RNA-Seq are ERCC and SIRV spike-ins, and dedicated spike-ins are available for small RNA sequencing.
1. ERCC (External RNA Controls Consortium) Spike-ins:
A set of 92 synthetic, polyadenylated RNA transcripts of varying lengths (250–2,000 nt), GC content, and concentration without sequence-similarity to any known natural transcripts. ERCCs are linear controls with one transcript only (Jiang et al., 2011).
2. SIRVs (Spike-In RNA Variant Controls):
Various sets of artificial polyadenylated spike-in control transcripts designed to mimic complex eukaryotic transcript structures. SIRVs comprise artificial gene loci with multiple transcript isoforms (alternative splicing, start/end sites, overlapping genes) and antisense transcripts. SIRVs enable control of isoform detection. Several SIRV sets additionally contain ERCCs for concentration and dynamic range assessments (Paul et al., 2016).
3. Small RNA Spike-ins:
Sets of small oligos in the size range of microRNAs (typically ~21 nt in length) with varying concentration and randomized ends used as controls for small RNA sequencing experiments (Lutzmayer et al., 2017). Small RNA spike-in controls require sRNA library preps and are not compatible with standard mRNA-Seq or total RNA-Seq protocols.
Table 1 | Spike-in RNA Controls for Gene Expression, Quantification, and Isoform Detection
| Spike-in Type | Designed For | Isoform Complexity | Quantification | Normalization | Structure | Usage |
| ERCC | Absolute quantification, QC | ✗ No | ✔ Yes | ✔ Yes | Linear | Widely used, but limited to concentration |
| SIRVs – Different Sets available | Isoform or absolute quantification (depending on set), assembly, QC | ✔ Yes | ✔ Yes | ✔ Yes | Isoforms | Best for splicing studies and quantification |
Applications of spike-in controls
Spike-ins are protocol- and platform-agnostic, which enables their use with all suitable library preparation techniques independent of the method or vendor. They can be used across organisms and sequencing platforms, delivering valuable comparisons and quality metrics for short- and long-read sequencing, and they serve as a reference point for data analysis and for the evaluation of algorithms and tools.
1. Absolute quantification
The most common use of artificial spike-in controls is absolute quantification. RNA-Seq typically provides relative expression levels, expressed in normalized units such as Transcripts Per Million (TPM) or Fragments Per Kilobase of transcript per Million mapped reads (FPKM). However, each normalization method has limitations and depends on the composition of the sample itself. For example, a higher percentage of ribosomal RNA reads in one sample may skew the distribution of TPM-normalized gene or transcript expression (Zhao et al., 2021). In addition, accurate cross-sample or cross-study comparisons of expression data require a calibration method. Spike-ins resolve this issue: because they are added at known molar concentrations, they allow standard calibration curves to be generated that relate read counts directly to the number of RNA molecules in the sample, enabling absolute transcript abundance measurement (Jiang et al., 2011).
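The calibration idea can be illustrated with a minimal Python sketch. It assumes a simple least-squares fit on log10-transformed values. The function names and all numbers are hypothetical, and real data would usually call for length-normalized abundances and a check of the fit quality.

```python
import math

def fit_loglog_calibration(known_molecules, observed_counts):
    """Least-squares fit of log10(count) = slope * log10(molecules) + intercept."""
    pairs = [(math.log10(m), math.log10(c))
             for m, c in zip(known_molecules, observed_counts)
             if m > 0 and c > 0]  # undetected spike-ins cannot be log-transformed
    if len(pairs) < 2:
        raise ValueError("need at least two detected spike-ins")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_molecules(count, slope, intercept):
    """Invert the calibration curve to estimate input molecules from a count."""
    return 10 ** ((math.log10(count) - intercept) / slope)

# Hypothetical spike-in data: molecules added per sample and reads observed.
spike_molecules = [1e2, 1e3, 1e4, 1e5, 1e6]
spike_counts = [5, 50, 500, 5000, 50000]

slope, intercept = fit_loglog_calibration(spike_molecules, spike_counts)

# Estimate the absolute abundance of an endogenous transcript with 2,500 reads.
estimate = estimate_molecules(2500, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, molecules={estimate:.0f}")
```

A slope close to 1 on the log-log scale suggests that read counts scale proportionally with input amount across the tested range.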
2. Normalization across samples and across studies
Conventional normalization methods depend on the sample itself and assume that the majority of genes are unchanged between conditions. This assumption breaks down when transcription shifts globally, for example upon treatments that promote transcriptional arrest, in circadian rhythm-dependent transcriptional programs, or with cell-size changes in single-cell approaches (Risso et al., 2014). Spike-ins provide an external reference that does not change with the biology of the sample, allowing for accurate scaling even when total RNA content varies drastically. At a constant spike-in amount, the percentage of reads observed for the artificial controls thereby serves as an indicator of changes in the general RNA composition between samples or treatments.
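As a rough illustration, the spike-in read fraction and a simple per-sample scaling factor could be computed as in the following Python sketch. It assumes that the same spike-in amount was added to each sample relative to the same input unit (for example, cell number or tissue mass). Sample names, read numbers, and function names are hypothetical.

```python
def spike_in_fraction(spike_reads, total_reads):
    """Share of all mapped reads that come from the spike-in controls."""
    return spike_reads / total_reads

def spike_in_scaling_factors(spike_reads_by_sample):
    """Factor per sample that brings its spike-in read total to the cross-sample mean."""
    reference = sum(spike_reads_by_sample.values()) / len(spike_reads_by_sample)
    return {sample: reference / reads
            for sample, reads in spike_reads_by_sample.items()}

# Hypothetical read numbers for two samples sequenced to the same depth.
spike_reads = {"control": 100_000, "treated": 200_000}
total_reads = {"control": 10_000_000, "treated": 10_000_000}

fractions = {s: spike_in_fraction(spike_reads[s], total_reads[s]) for s in spike_reads}
factors = spike_in_scaling_factors(spike_reads)

# A gene with identical raw counts in both samples.
gene_counts = {"control": 1_000, "treated": 1_000}
normalized = {s: gene_counts[s] * factors[s] for s in gene_counts}

print(fractions)   # spike-in share doubles in the treated sample
print(normalized)  # scaled counts reveal a lower level in the treated sample
```

In this example, the doubled spike-in fraction in the treated sample indicates a global reduction in endogenous RNA. The gene with identical raw counts therefore turns out to be two-fold lower after spike-in scaling, a change that a composition-dependent normalization would miss.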
3. Performance evaluation and quality control
Spike-ins allow for an empirical evaluation of various technical metrics and complete RNA-Seq workflows within and across experiments and at different sites. Measuring the dynamic range verifies transcript detection across several orders of magnitude. The lower limit of spike-in detection provides a measure of assay sensitivity. Spike-ins also help identify error sources and biases across the complete wet-lab and data analysis pipelines (for more information, visit SIRVs (Spike-in RNA Variant Control Mixes)).
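A minimal Python sketch of how dynamic range and sensitivity might be derived from spike-in data is shown below. The detection threshold, the concentrations, and the function name are illustrative assumptions rather than recommended values.

```python
import math

def detection_metrics(known_concentrations, observed_counts, min_count=1):
    """Summarize sensitivity and dynamic range from spike-in detection.

    A spike-in counts as detected if it has at least `min_count` reads.
    """
    detected = [conc for conc, count in zip(known_concentrations, observed_counts)
                if count >= min_count]
    if not detected:
        return None
    lowest, highest = min(detected), max(detected)
    return {
        "lower_limit": lowest,                              # assay sensitivity
        "dynamic_range_log10": math.log10(highest / lowest),  # orders of magnitude
        "detected_fraction": len(detected) / len(known_concentrations),
    }

# Hypothetical spike-in concentrations (arbitrary units) and read counts.
concentrations = [0.01, 0.1, 1, 10, 100, 1000]
counts = [0, 0, 3, 40, 390, 4100]

metrics = detection_metrics(concentrations, counts)
print(metrics)
```

Comparing these metrics between libraries, sites, or protocol versions gives a simple indication of whether sensitivity or dynamic range has changed.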
4. Transcript assembly calibration
Spike-in controls mimicking isoform complexity can be used to assess transcript assemblies or isoform capture and detection. The known isoform composition makes it possible to evaluate how well a workflow detects and identifies isoforms within a sample by providing a ground truth that researchers can use to calibrate their protocols, sequencing procedures, and algorithms (Schon et al., 2022).
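One simple way to score an assembly against such a ground truth is to compute precision and recall over the known isoforms, as in the Python sketch below. It assumes that each isoform is represented by its intron chain (a tuple of intron start and end coordinates). All coordinates are hypothetical and do not correspond to real spike-in annotations.

```python
def assembly_precision_recall(known_isoforms, assembled_isoforms):
    """Compare assembled transcripts with the known spike-in isoforms.

    Each isoform is a tuple of (intron_start, intron_end) pairs. Matching on
    intron chains ignores differences in transcript start and end sites and
    cannot distinguish single-exon transcripts, which have no introns.
    """
    known = set(known_isoforms)
    assembled = set(assembled_isoforms)
    true_positives = len(known & assembled)
    precision = true_positives / len(assembled) if assembled else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Hypothetical ground truth: four isoforms of one artificial locus.
known_chains = [
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((100, 250),),
    ((500, 600), (700, 800)),
]

# Hypothetical assembler output: two correct isoforms and one false assembly.
assembled_chains = [
    ((100, 200), (300, 400)),
    ((100, 250),),
    ((100, 220), (300, 400)),
]

precision, recall = assembly_precision_recall(known_chains, assembled_chains)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision points to false assemblies, whereas low recall points to missed isoforms. Tracking both while tuning parameters helps calibrate an assembly workflow.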
Practical examples for the use of spike-in controls in scientific studies
Spike-in normalization is particularly useful when assessing conditions that may globally alter transcription levels, e.g., when assessing compounds acting on polymerases or when transcription levels follow a circadian rhythm (Jaeger et al., 2020; Laosuntisuk et al., 2024). Addition of artificial spike-in RNA provides a reference to normalize and assess gene expression independent of total RNA abundance, identifying significant differential expression even at overall reduced or elevated RNA levels.
For example, the plant transcriptome undergoes significant changes over the course of 24 hours, regulated through the circadian clock and immediate light-signaling pathways, and more than 30% of transcripts can display diurnal oscillations. These fluctuations do not just shift total RNA yield (which is elevated during the day to support high metabolic demand and photosynthesis), but represent profound qualitative changes in the steady-state levels of specific mRNA populations. This temporal variance also introduces significant challenges for RNA-Seq studies: inconsistent sampling times can lead to the misidentification of stimulus-induced expression, and variance in overall composition can mask genuine changes in differential expression. In a recent study, Laosuntisuk et al. addressed this challenge through spike-in normalization and compared it with conventional methods to confidently identify expression changes upon cold stress in sorghum.
Abstract
RNA-Sequencing is widely used to investigate changes in gene expression at the transcription level in plants. Most plant RNA-Seq analysis pipelines base the normalization approaches on the assumption that total transcript levels do not vary between samples. However, this assumption has not been demonstrated. In fact, many common experimental treatments and genetic alterations affect transcription efficiency or RNA stability, resulting in unequal transcript abundance. The addition of synthetic RNA controls is a simple correction that controls for variation in total mRNA levels. However, adding spike-ins appropriately is challenging with complex plant tissue, and carefully considering how they are added is essential to their successful use. We demonstrate that adding external RNA spike-ins as a normalization control produces differences in RNA-Seq analysis compared to traditional normalization methods, even between two times of day in untreated plants. We illustrate the use of RNA spike-ins with 3′ RNA-Seq and present a normalization pipeline that accounts for differences in total transcriptional levels. We evaluate the effect of normalization methods on identifying differentially expressed genes in the context of identifying the effect of the time of day on gene expression and response to chilling stress in sorghum.
Similarly, conditions that lead to transcriptional shifts in mammals can also lead to confounding observations in RNA-Seq studies and can mask specific regulatory changes. The study by Jaeger et al. identified global effects of Mediator-dependent RNA polymerase II kinetics and an unexpected, CDK9-dependent compensatory feedback loop with the aid of spike-in controls.
Abstract
The Mediator complex directs signals from DNA-binding transcription factors to RNA polymerase II (Pol II). Despite this pivotal position, mechanistic understanding of Mediator in human cells remains incomplete. Here we quantified Mediator-controlled Pol II kinetics by coupling rapid subunit degradation with orthogonal experimental readouts. In agreement with a model of condensate-driven transcription initiation, large clusters of hypophosphorylated Pol II rapidly disassembled upon Mediator degradation. This was accompanied by a selective and pronounced disruption of cell-type-specifying transcriptional circuits, whose constituent genes featured exceptionally high rates of Pol II turnover. Notably, the transcriptional output of most other genes was largely unaffected by acute Mediator ablation. Maintenance of transcriptional activity at these genes was linked to an unexpected CDK9-dependent compensatory feedback loop that elevated Pol II pause release rates across the genome. Collectively, our work positions human Mediator as a globally acting coactivator that selectively safeguards the functionality of cell-type-specifying transcriptional networks.
Sophisticated artificial spike-in controls such as SIRVs mimic the natural complexity of transcripts (including alternatively spliced isoforms, exon skipping, intron retention, etc.) and can be used to identify endogenous transcript variations in different organisms. Calibration against the known SIRV isoforms makes it possible to benchmark transcript assembly algorithms, validate their accuracy, and assess potentially false or missing assemblies (Schon et al., 2022).
For more insights into the use of complex spike-in controls for transcriptome assembly, read our paper talk with Michael Schon – Bridging gaps in transcript assembly.
Abstract
We developed Bookend, a package for transcript assembly that incorporates data from different RNA-seq techniques, with a focus on identifying and utilizing RNA 5′ and 3′ ends. We demonstrate that correct identification of transcript start and end sites is essential for precise full-length transcript assembly. Utilization of end-labeled reads present in full-length single-cell RNA-seq datasets dramatically improves the precision of transcript assembly in single cells. Finally, we show that hybrid assembly across short-read, long-read, and end-capture RNA-seq datasets from Arabidopsis thaliana, as well as meta-assembly of RNA-seq from single mouse embryonic stem cells, can produce reference-quality end-to-end transcript annotations.
In these studies, SIRVs provided a significant benefit for the analysis of RNA-Seq data, uncovering stimuli-specific gene expression changes, normalizing for global transcriptional changes, and providing a known set of isoforms to evaluate detection accuracy.
How to select the right SIRV spike-in set for your project?
| Application | Set 1 | Set 2 | Set 3 | Set 4 |
| Workflow validation | ✔ | — | — | — |
| Isoform analysis (discovery and quantification) | ✔ | ✔ | ✔ | ✔ |
| Transcript concentration and dynamic range | * | — | ✔ | ✔ |
| Length >2.5 kb, long read sequencing | — | — | — | ✔ |
| Length < 30 nt, small RNA sequencing | — | — | — | — |
| Number of spike-in transcripts | 69 isoforms in each Mix | 69 isoforms | 69 isoforms, 92 ERCCs | 69 isoforms, 92 ERCCs, 15 long SIRVs |
SIRV-Set 1 includes three different mixes termed E0, E1, and E2 and is recommended for workflow and pipeline validation. The three mixes contain isoforms in different concentrations for fold-change measurements and allow isoform detection as well as calibration of wet-lab experiments and data analysis pipelines.
SIRV-Set 3 is the all-rounder, ideally suited for any sample and any experiment. It contains ERCCs for concentration-range measurements and SIRVs for isoform resolution, allowing absolute quantification within a sample and normalization across samples.
SIRV-Set 4 contains ERCCs, isoforms and additionally long SIRVs covering transcript lengths of 4 kb, 6 kb, 8 kb, 10 kb, and 12 kb and is therefore ideally suited for long-read sequencing applications.
For small RNA-Seq studies, small RNA spike-ins are recommended.
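For workflow validation with mixes of differing isoform concentrations, observed fold changes can be compared with the expected ones. The Python sketch below assumes depth-normalized counts and uses invented isoform names and ratios, not the actual composition of any SIRV mix. The pseudocount is likewise an illustrative choice.

```python
import math

def log2_fold_change_errors(expected_ratio, counts_a, counts_b, pseudocount=1.0):
    """Observed minus expected log2 fold change (mix B over mix A) per isoform.

    Counts are assumed to be normalized for sequencing depth already.
    """
    errors = {}
    for isoform, ratio in expected_ratio.items():
        observed = math.log2((counts_b[isoform] + pseudocount)
                             / (counts_a[isoform] + pseudocount))
        errors[isoform] = observed - math.log2(ratio)
    return errors

# Hypothetical expected concentration ratios between two mixes (B / A).
expected = {"iso1": 2.0, "iso2": 0.5, "iso3": 1.0}

# Hypothetical depth-normalized counts measured for each mix.
mix_a = {"iso1": 99, "iso2": 199, "iso3": 99}
mix_b = {"iso1": 199, "iso2": 99, "iso3": 199}

errors = log2_fold_change_errors(expected, mix_a, mix_b)
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)
print(errors, mean_abs_error)
```

Here the third isoform appears two-fold changed although no change is expected, which would flag a quantification problem for that isoform in the tested workflow.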
Summary and key takeaways
References
Chen, K., Hu, Z., Xia, Z. et al. (2015) The Overlooked Fact: Fundamental Need for Spike-In Control for Virtually All Genome-Wide Analyses. Mol Cell Biol. 36:662-667. DOI: 10.1128/MCB.00970-14. PMID: 26711261; PMCID: PMC4760223.
Hardwick, S.A., Deveson, I.W., and Mercer, T.R. (2017) Reference standards for next-generation sequencing. Nat Rev Genet. 18:473-484. DOI: 10.1038/nrg.2017.44. PMID: 28626224.
Jiang, L., Schlesinger, F., Davis, C. A., Zhang, Y., Li, R., Salit, M., Gingeras, T. R., and Oliver, B. (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Res 21: 1543-1551. DOI: 10.1101/gr.121095.111
Jaeger, M.G., Schwalb, B., Mackowiak, S.D. et al. (2020) Selective Mediator dependence of cell-type-specifying transcription. Nat Genet 52: 719–727. DOI: 10.1038/s41588-020-0635-0
Laosuntisuk, K., Vennapusa, A., Somayanda, I. M., et al. (2024) A normalization method that controls for total RNA abundance affects the identification of differentially expressed genes, revealing bias toward morning-expressed responses. Plant J 118: 1241-1257. DOI: 10.1111/tpj.16654
Lutzmayer, S., Enugutti, B., and Nodine, M.D. (2017) Novel small RNA spike-in oligonucleotides enable absolute normalization of small RNA-Seq data. Sci Rep. 7:5913. DOI: 10.1038/s41598-017-06174-3. PMID: 28724941; PMCID: PMC5517642.
Paul, L., Kubala, P., Horner, G., Ante, M., Holländer, I., Seitz, A., and Reda, T. (2016) SIRVs: Spike-In RNA Variants as External Isoform Controls in RNA-Sequencing. bioRxiv. DOI: 10.1101/080747
Risso, D., Ngai, J., Speed, T., and Dudoit, S. (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32: 896–902. DOI: 10.1038/nbt.2931
Schon, M.A., Lutzmayer, S., Hofmann, F., and Nodine, M.D. (2022) Bookend: precise transcript reconstruction with end-guided assembly. Genome Biol. 23:143. DOI: 10.1186/s13059-022-02700-3. Erratum in: Genome Biol. 2022 23:157. DOI: 10.1186/s13059-022-02725-8. PMID: 35768836; PMCID: PMC9245221.
Zhao, Y., Li, MC., Konaté, M.M. et al. (2021) TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository. J Transl Med 19, 269. DOI: 10.1186/s12967-021-02936-w
Written by Dr. Yvonne Göpel
