Modular Design

Transcriptome Complexity in a Nutshell

From their conception in 2013, the Spike-In RNA Variants (SIRVs) were designed to develop into a series of modules that mimic transcriptome complexity in a condensed manner, each module probing a specific component (Figure 1). Deviations of the results from the expected values can be related to specific flaws in a sample, experiment, or entire pipeline. Transcription and splicing variants were in highest demand and were therefore realized first. Since the SIRV isoforms are complementary to the ERCCs, these two modules are available as a combination in SIRV-Set 3 and together deliver the required information about dynamic range, lower detection limit, input-output linearity, and performance in transcript variant detection and quantification. The long SIRV module, containing 15 transcripts with lengths between 4 kb and 12 kb, was added to the portfolio in 2020 to assess the length aspect of transcriptome complexity. This module is available in combination with the SIRV isoforms and the ERCCs in SIRV-Set 4.


Figure 1 | SIRV modules. The SIRV isoforms, single-isoform transcripts (ERCCs), and long SIRVs are established synthetic RNA molecules that mimic three aspects of transcriptome complexity: isoforms, abundance, and transcript length. The SIRVome is the corresponding artificial reference genome.

If you are interested in participating in the design of further external RNA control modules, please contact us at info@lexogen.com.

The SIRVome

The SIRVs are annotated within a SIRV genome, the SIRVome, which currently carries the 7 SIRV isoform gene loci, 92 ERCC genes, and 15 long SIRV genes (Figure 2).


Figure 2 | SIRVome and SIRV transcripts. SIRV isoform gene loci, ERCC genes, and long SIRV genes are lined up on a SIRVome. Pullout: the Compact Coverage Visualization (CCV) shows intron regions common to all transcripts as short standardized gaps irrespective of the original sequence length, which provides a more comparable overview of the actual SIRV isoform transcripts.

Some data analysis methods require, or benefit from, comprehensive gene definitions. These are provided for the SIRV isoforms, the ERCCs, and the long SIRVs in the annotation files, with a further 1 kb of sequence defined upstream of the first exon and downstream of the last exon. These random sequences were created in a similar way to the intron sequences: they mirror the G/C content of the exon sequences and do not match nucleotide database entries (search window 27 bp).
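The idea of a random flanking sequence that mirrors the G/C content of the exons can be sketched as follows. This is only an illustration under simple assumptions, not the procedure actually used for the SIRVome: the function names and the example exon sequence are made up, each base is drawn independently, and the check against nucleotide databases with a 27 bp search window is not reproduced here.

```python
import random

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def random_flank(exon_seq, length=1000, seed=None):
    """Draw a random sequence whose expected G/C content matches exon_seq.

    Hypothetical helper: bases are sampled independently, so the realized
    G/C content only approximates the target for short flanks.
    """
    rng = random.Random(seed)
    gc = gc_fraction(exon_seq)
    # Weights in the order A, C, G, T: G and C share the G/C fraction equally.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

exons = "ATGGCGTACGTTAGCCGATAGGCTA"  # made-up exon sequence for illustration
flank = random_flank(exons, length=1000, seed=1)
print(len(flank), round(gc_fraction(exons), 2), round(gc_fraction(flank), 2))
```

A real design would additionally screen each candidate flank against sequence databases and redraw on any match.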

Isoform Module

The SIRV isoforms are a set of 69 artificial transcript variants. These were derived from 7 human model genes, and their annotated transcripts were complemented by additional isoforms to comprehensively reflect variations of alternative splicing, alternative transcription start- and end-sites, overlapping genes, and antisense transcripts. The 7 synthetic gene loci contain between 6 and 18 transcript variants each, which corresponds to an average of 9.9 physically realized alternative isoforms per locus, and more when accounting for the additional provisions in the annotations against which a pipeline can be tested. The SIRV isoform module is available in the form of three mixes, with the molar ratios of transcripts in mixes E0, E1, and E2 spanning 0, 1, and 2 orders of magnitude, respectively.

Condensed Transcription and Isoform Complexity

The structures of seven human model genes (KLK5, LDHD, LGALS17A, DAPK3, HAUS5, USF2, and TESK2) were used as scaffolds for the design of genes SIRV1 to SIRV7. The ENCODE-annotated transcripts as well as additional variants were edited to comprehensively present transcription variations such as different start- and end-site usage, alternative splicing, overlapping genes, and antisense transcription.

For the sake of simplicity, we will refer to all these transcriptional variants as isoforms, although in a strict sense this would refer only to the splice variants. The term “SIRV isoforms” therefore covers alternative splicing as well as differential promoter and terminator usage, antisense transcription and overlapping genes (Figure 3).


Figure 3 | SIRV isoform design overview. The aim of the SIRV isoform design was to mimic human model genes so that, taken together, they represent all main aspects of alternative splicing and transcription in numerous repeats and variations. The transcript isoforms are shown aligned to a “master gene” (top line), and hence there can be no intron retention event. Therefore, the opposite is described here as “exon splitting”. The sequences themselves have no significant similarities to any known database entries but match eukaryotic gene features in terms of their makeup and exon-intron structure. A5SS and A3SS, alternative 5′/3′ splice sites; MXE, mutually exclusive exons.

Figure 4 illustrates in one example how the human gene KLK5 served as a blueprint for the design of the gene SIRV1. In addition to the 8 realized SIRV1 transcripts, 4 more were designed, which only exist in the over-annotation file provided together with the correct annotation. Conversely, 3 transcripts of the existing SIRV1 set are not present in the insufficient annotation file. This way, algorithms that detect and quantify transcript isoforms can be challenged for their robustness towards real-life scenarios, in which the transcripts in a sample do not align with the available annotation.


Figure 4 | Design path and exon-intron structures of the SIRV1 gene. The SIRV1 gene was derived from the human KLK5 gene, with transcripts added to the Ensembl-annotated ones to achieve a comprehensive transcriptome complexity. Transcripts in blue are part of SIRV isoform mixes, transcripts in green are only part of an over-annotation. (i) Refers to transcripts that are omitted in an incomplete annotation. The polyadenylated 3’ end is marked in red, indicating sense and antisense orientations.

Together, the 7 SIRV isoform genes comprehensively model transcription and alternative splicing variations in a condensed and redundant manner (Table 1).

Table 1 | Summary of splice and transcription variations per SIRV isoform gene.

Gene   Alternative 1st exon  Start site variation  Alternative 5′ splice site  Alternative 3′ splice site  Exon skipping  Exon splitting  End site variation  Alternative last exon
SIRV1  5                     4                     5                           2                           2              3               4                   1
SIRV2  1                     3                     3                           2                           0              3               2                   2
SIRV3  1                     5                     5                           4                           5              4               7                   4
SIRV4  4                     2                     2                           4                           2              1               5                   3
SIRV5  3                     9                     6                           8                           5              17              7                   7
SIRV6  9                     10                    7                           26                          27             28              13                  3
SIRV7  2                     5                     1                           1                           31             1               4                   3


The occurrences of the different events are counted for each transcript in reference to a hypothetical master transcript of maximal length containing all exon sequences from all transcript variants of a given gene. Therefore, in a formal sense no intron retention can occur, but this event is defined as exon splitting caused by the introduction of an intron sequence (illustrated in Figure 4).
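The master-transcript idea can be sketched in a few lines. The example below is a simplified reading of the counting scheme, not the procedure used to produce Table 1: it builds the master exons as the union of all exon intervals of a gene and counts an exon splitting event whenever an intron of a transcript lies entirely inside one master exon. The function names, the coordinates, and the decision to merge directly abutting exons are assumptions made for illustration.

```python
def merge_exons(transcripts):
    """Union of all exon intervals (1-based, inclusive) across transcripts."""
    intervals = sorted(e for exons in transcripts.values() for e in exons)
    merged = []
    for start, end in intervals:
        # Overlapping or directly abutting exons are fused into one master exon.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def count_exon_splits(exons, master):
    """Count introns of one transcript that fall entirely inside a master exon."""
    exons = sorted(exons)
    splits = 0
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron_start, intron_end = left_end + 1, right_start - 1
        if any(m_start <= intron_start and intron_end <= m_end
               for m_start, m_end in master):
            splits += 1
    return splits

# Made-up toy gene with two transcripts.
toy_gene = {
    "tx_a": [(1, 100), (201, 300)],
    "tx_b": [(1, 40), (61, 100), (201, 300)],
}
master = merge_exons(toy_gene)
splits = {name: count_exon_splits(exons, master) for name, exons in toy_gene.items()}
print(master, splits)
```

In this toy gene, tx_b carries an intron inside the first master exon and is therefore counted with one exon splitting event, whereas the intron shared by both transcripts is not counted.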

The transcripts of a SIRV isoform gene are assigned to 1 of 4 SubMixes to enable preset ratios in mixes E0, E1, and E2 (see SIRV-Set 1 in SIRV Sets for more information).

Figure 5 illustrates how the known input amounts of transcripts of a SIRV isoform gene allow for precise modeling of ideal coverage expectations. RNA-Seq introduces biases to the read distribution, and hence the experimental coverage will deviate from the expected one.


Figure 5 | Comparison of the expected and the measured coverages for the SIRV3 locus in the equimolar Mix E0. Top: individual transcripts of SIRV3, with transcripts on the plus strand in blue and transcripts on the minus strand in red. Color code indicates the SubMix allocation. Bottom: the expected SIRV3 coverage is shown as superposition of individual transcript coverages, in which the terminal sites have been modelled by a transient error function. The measured coverages after read mapping by TopHat2 are shown in grey. The measured coverages and the numbers of splice junction reads were normalized to obtain identical areas under the curves and identical sums of all junctions for the expected and measured data. The measured splice junction reads are shown by the numbers before the brackets, while the expected values are shown inside the brackets. The CoD (Coefficient of Deviation) values are given for the plus and minus strand in the respective colors.
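The construction of an expected coverage from known input amounts can be illustrated with a minimal sketch. It assumes a toy locus with made-up coordinates and relative amounts, treats every exon position as uniformly covered, and leaves out both the error-function model of the transcript ends and the CoD calculation. The function names are hypothetical.

```python
def expected_coverage(transcripts, locus_length):
    """Superpose per-base coverage of all transcripts, weighted by molar amount.

    `transcripts` is a list of (exons, amount) pairs; exons are 1-based,
    inclusive (start, end) intervals on the locus.
    """
    coverage = [0.0] * locus_length
    for exons, amount in transcripts:
        for start, end in exons:
            for pos in range(start - 1, end):
                coverage[pos] += amount
    return coverage

def scale_to_same_area(measured, expected):
    """Rescale measured coverage so that both curves have the same area."""
    factor = sum(expected) / sum(measured)
    return [value * factor for value in measured]

# Made-up equimolar toy locus of 10 nt with two transcripts.
toy_transcripts = [
    ([(1, 4), (8, 10)], 1.0),
    ([(3, 6)], 1.0),
]
expected = expected_coverage(toy_transcripts, locus_length=10)
measured = [2, 2, 5, 3, 2, 2, 0, 1, 2, 3]  # invented read depths
scaled = scale_to_same_area(measured, expected)
print(expected)
print(scaled)
```

After scaling, position-wise differences between the two curves reflect how reads are distributed along the locus rather than differences in sequencing depth.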

Native Gene Features but Unique Sequences

The SIRV isoform transcripts range in length from 191 to 2528 nt (mean 1134 nt; median 813 nt), which includes a 30 nt long poly(A)-tail. The GC-content varies between 29.5 and 51.2 % (mean 43.0 %; median 43.6 %). The exon sequences were created from a pool of database-derived genomes (gene fragments from viruses and bacteriophage capsid proteins and glycoproteins) and modified by inverting the sequence to lose identity while maintaining a naturally occurring order in the sequences.

Of the splice junctions, 96.9 % conform to the canonical GT-AG exon-intron junction rule, with few exceptions harboring the less frequently occurring variations GC-AG (1.7 %) and AT-AC (0.6 %). Two non-canonical splice sites, CT-AG and CT-AC, account for 0.4 % each. Intron sequences that do not align with exons of another isoform were drawn from random sequences whereby the GC content was balanced to comply with the adjacent exonic sequences. The sequence exclusivity was verified by BLAST searches of the exon sequences against the entire NCBI database, including the ERCCs, on the nucleotide and protein level. The artificial SIRV isoform sequences are suitable for non-interfering qualitative and quantitative assessments in the context of known genomic systems and are complementary to the ERCC sequences (NIST SRM 2374). The SIRV isoform sequences were deposited at the NCBI’s GenBank (accession numbers KX147759 to -65 for SIRV1 to SIRV7) and can be downloaded in the Downloads section.
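Junction classes such as GT-AG or GC-AG can be read directly from the first and last two nucleotides of each intron. The following sketch shows this for plus-strand introns on a made-up toy sequence; the function name and coordinates are assumptions, and introns on the minus strand would first need to be reverse-complemented.

```python
from collections import Counter

def junction_type(genome, intron_start, intron_end):
    """Return the donor-acceptor dinucleotide pair of a plus-strand intron.

    Coordinates are 1-based and inclusive, as in GTF annotation files.
    """
    donor = genome[intron_start - 1:intron_start + 1]
    acceptor = genome[intron_end - 2:intron_end]
    return f"{donor}-{acceptor}".upper()

# Toy locus: exon, GT...AG intron, exon, GC...AG intron, exon.
genome = "AAA" + "GTCCCCAG" + "TTT" + "GCAAAAAG" + "CC"
introns = [(4, 11), (15, 22)]
counts = Counter(junction_type(genome, start, end) for start, end in introns)
print(counts)
```

Applied to all introns of an annotation, such a tally yields the kind of percentages quoted above.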

Certain data analysis approaches require, or benefit from, comprehensive gene definitions. These are provided for the SIRV isoforms in the annotation files, with a further 1 kb of sequence defined upstream of the first exon and downstream of the last exon. These random sequences were created in a similar way to the intron sequences: they mirror the G/C content of the exon sequences and do not match nucleotide database entries (search window 27 bp).

Correct, insufficient and over-annotations of SIRV isoforms

The a priori knowledge of SIRV transcript sequences and concentrations makes it possible to assess the isoform-specific performance of an RNA-Seq experiment. In addition to the correct annotation of the SIRV isoforms, one insufficient and one over-annotation are supplied to enable the testing of NGS data evaluation algorithms for their robustness towards “real-life”, imperfect annotations (see also Figure 4). More annotations can be added to emulate situations of evolving reference annotations, which accumulate transcripts discovered in samples of different origin.
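The relationship between the three annotation types can be expressed as simple set operations on transcript identifiers. The sketch below mirrors the SIRV1 numbers given earlier (8 realized transcripts, 4 additional ones in the over-annotation, 3 omitted from the insufficient annotation). The identifiers are placeholders, not the real SIRV1 transcript names, and the choice of which three transcripts are omitted is arbitrary.

```python
def compare_annotations(correct, other):
    """List transcripts missing from and extra in `other` relative to `correct`."""
    correct, other = set(correct), set(other)
    return {
        "missing": sorted(correct - other),
        "extra": sorted(other - correct),
    }

# Placeholder identifiers; the real SIRV1 transcript names differ.
correct = {f"tx{i:02d}" for i in range(1, 9)}            # 8 realized transcripts
over = correct | {f"tx{i:02d}" for i in range(9, 13)}    # 4 additional, not in the mixes
insufficient = correct - {"tx02", "tx05", "tx07"}        # 3 omitted (arbitrary choice)

print(compare_annotations(correct, over))
print(compare_annotations(correct, insufficient))
```

In practice, the identifier sets would be read from the transcript_id attributes of the respective annotation files, and the same comparison could be applied to the transcripts reported by an analysis pipeline.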

ERCC Module

The ERCC RNA Spike-In Controls provide a set of 92 artificial transcripts with non-overlapping sequences. These were developed by the External RNA Controls Consortium (ERCC), a group of academic, private, and public organizations hosted by the National Institute of Standards and Technology (NIST), to enable the standardized assessment of gene expression platforms such as quantitative RT-PCR, microarrays, and NGS technologies (External RNA Controls Consortium 2005; Baker et al. 2005). Due to their unique sequence identities (Figure 6), the ERCC controls are well suited for measuring technical parameters irrespective of isoforms.


Figure 6 | Single-isoform nature of ERCCs. ERCC transcripts follow the 1 gene, 1 exon, 1 transcript layout, providing each ERCC transcript with a unique sequence identity. Genes (exons) are shown in white, derived transcripts in blue, and the poly(A) tail is indicated in red to specify transcript 5’-3’ orientation. Note that while there are 92 ERCC transcripts in the mix, the RNAs are numbered non-consecutively up to 171.

The ERCCs were used in exemplary studies by the FDA Sequencing Quality Control (SEQC) Consortium and the Association of Biomolecular Resource Facilities (ABRF) (Li et al. 2014a; Li et al. 2014b; SEQC/MAQC-III Consortium 2014; Xu et al. 2014). Comparisons of the assigned and evaluated reads with known concentrations allow for the assessment of dynamic range, dose response, lower limit of detection and efficiency, as well as fold-change response of RNA sequencing pipelines, within the complexity boundaries of monoexonic, non-overlapping RNA sequences.
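A dose-response assessment of this kind usually comes down to comparing measured values with known input concentrations on a logarithmic scale. The following sketch fits a straight line to log2-transformed values by ordinary least squares and reports the lowest input concentration that was still detected. It is one common way to summarize such data, not a prescribed evaluation procedure; the function names and all numbers are invented for illustration.

```python
import math

def loglog_fit(known, observed):
    """Least-squares slope and intercept of log2(observed) versus log2(known).

    Controls with zero observed signal are excluded from the fit.
    """
    pairs = [(math.log2(k), math.log2(o)) for k, o in zip(known, observed) if o > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def lowest_detected(known, observed):
    """Smallest known input concentration with a non-zero observed signal."""
    return min(k for k, o in zip(known, observed) if o > 0)

# Invented input concentrations and read counts for six controls.
known = [0.5, 1, 2, 4, 8, 16]
observed = [0, 3, 6, 12, 24, 48]
slope, intercept = loglog_fit(known, observed)
print(round(slope, 3), round(intercept, 3), lowest_detected(known, observed))
```

A slope close to 1 indicates a proportional input-output relationship over the fitted range, while the lowest detected concentration gives a rough empirical detection limit for the experiment.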

The RNAs are transcribed from a plasmid DNA library of ERCC sequences, available as a standard reference material from NIST (SRM 2374) (National Institute of Standards and Technology). The complete library comprises 96 unique sequences, and 92 of these were mixed in the form of transcripts assigned to four subpools with 23 ERCC controls each. Within each subpool, ERCC abundances span a 2^20 (approximately 10^6) dynamic range. Similar to the SIRV isoforms, the ERCC transcripts contain a triphosphate guanosine at their 5’ end and a poly(A) tail at their 3’ end, which in the case of the ERCCs is 20-26 nt long (SIRVs: 30 nt). ERCCs on their own are available from Thermo Fisher Scientific as ERCC RNA Spike-In Mix (Cat. No. 4456740) and ERCC ExFold RNA Spike-In Mixes (Cat. No. 4456739). The 92 ERCC sequences are available in the Downloads section.

Long SIRVs Module

The introduction of long read sequencing platforms like Pacific Biosciences™ and Oxford Nanopore Technologies™ has significantly increased the available read length, now easily exceeding the average transcript length. The ERCC and SIRV isoform modules are optimized for assessing RNA abundance and isoform complexity aspects. However, the average ERCC length is 909 nt (max. 2036 nt), the average SIRV isoform length is 1134 nt (max. 2528 nt), and thus spike-in RNA transcripts of both modules are below the average reported length for eukaryotic protein-coding mRNAs (e.g. 3.5 kb for the human transcriptome (Piovesan et al., 2019)).

Lexogen has therefore developed “long SIRVs”, a module that contains three different transcripts for each of the five length categories 4 kb, 6 kb, 8 kb, 10 kb, and 12 kb (Figure 7). These RNAs cover the length of the majority of cellular transcripts. The sequence of each of these 15 RNAs is unique and does not overlap with any other spike-in or endogenous transcripts (similar to the ERCC module). Therefore, the equimolar long SIRVs are optimal tools to evaluate the transcript length aspect in RNA-Seq workflows. While designed in particular to assess long-read platforms, long SIRVs reveal length dependencies also in short read workflows.
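Because the long SIRVs are equimolar, any systematic trend of recovery with transcript length points to a length bias in the workflow. The sketch below groups molecule-level counts by length category and expresses each category relative to the overall mean. The counts are invented, the function name is hypothetical, and the identifiers between SIRV4001 and SIRV12003 are assumed to follow the pattern suggested by these two names.

```python
from collections import defaultdict

def recovery_by_length(counts, lengths_kb):
    """Mean count per length category, relative to the mean over all transcripts."""
    overall = sum(counts.values()) / len(counts)
    groups = defaultdict(list)
    for name, count in counts.items():
        groups[lengths_kb[name]].append(count)
    return {kb: (sum(vals) / len(vals)) / overall for kb, vals in sorted(groups.items())}

# Assumed naming pattern: three transcripts per length category.
lengths_kb = {f"SIRV{kb}{i:03d}": kb for kb in (4, 6, 8, 10, 12) for i in (1, 2, 3)}

# Invented molecule counts showing a decline with transcript length.
per_category = {4: 100, 6: 90, 8: 80, 10: 70, 12: 60}
counts = {name: per_category[kb] for name, kb in lengths_kb.items()}

recovery = recovery_by_length(counts, lengths_kb)
print(recovery)
```

With an unbiased workflow, all categories would be close to 1. For short-read data, read counts scale with transcript length, so a length-normalized measure would need to be used as input instead of raw read counts.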


Figure 7 | Long SIRVs SIRV4001-SIRV12003. Long SIRVs follow the 1 gene, 1 exon, 1 transcript layout with three genes / transcripts corresponding to each of the five different length categories: 4 kb, 6 kb, 8 kb, 10 kb, and 12 kb. Genes (exons) are shown in white, derived transcripts in blue, and the poly(A) tail is marked in red to indicate transcript 5’-3’ orientation.

Sequence and annotation

Certain data analysis approaches require, or benefit from, comprehensive gene definitions. These are provided for the long SIRVs in the annotation files, with a further 1 kb of sequence defined upstream of the first exon and downstream of the last exon. These random sequences were created by mirroring the G/C content of the exon sequences and do not match nucleotide database entries (search window 27 bp). The long SIRV sequences can be downloaded in the Downloads section.