Modular Design

Transcriptome Complexity in a Nutshell

From their conception in 2013, the Spike-In RNA Variants (SIRVs) were designed to develop into a series of modules that mimic transcriptome complexity in a condensed manner, with each module probing a specific component. Deviations of the results from the expected values can be related to specific flaws in a sample, an experiment, or an entire pipeline. Transcription and splicing variants were in highest demand and were therefore realized first. Since the SIRV isoforms are complementary to the ERCCs, these two modules are available in one SIRV set and together deliver the required information about dynamic range, lower detection limit, input-output linearity, and performance in transcript variant detection and quantification.

Additional modules can be envisioned to broaden the scope of the SIRVs and to establish a continuous and increasingly comprehensive referencing method for RNA sequencing experiments (Figure 1).

Figure 1 ǀ SIRV modules. The SIRV isoforms and the ERCC single-isoform transcripts are established synthetic RNA molecules that mimic two major aspects of transcriptome complexity, isoforms and concentrations. The SIRVome is the corresponding artificial reference genome. Additional modules are envisioned and will be complementary to existing ones, and numbers refer to the realized number of transcripts per module (Paul et al., 2016).

If you are interested in participating in the design of further external RNA control modules, please contact us at info@lexogen.com, and also consider to

The SIRVome

The SIRVs are annotated within a SIRV genome, the SIRVome, which currently carries the 7 SIRV isoform genes and 92 ERCC genes (Figure 2).

Figure 2 ǀ SIRVome and SIRV transcripts. The SIRV isoform and ERCC genes are lined up on a SIRVome. Inset: the Compact Coverage Visualization (CCV) shows intron regions common to all transcripts as short standardized gaps irrespective of the original sequence length, which provides a more comparable overview of the actual SIRV transcripts.
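The coordinate compression behind such a compact view can be sketched as follows. This is only an illustration under stated assumptions: exons are given as half-open (start, end) intervals, and the function names and the gap length of 10 are hypothetical, not taken from the actual CCV implementation.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def compact_coordinates(transcripts, gap=10):
    """Build a mapping from genomic to compacted positions.

    Regions covered by no exon of any transcript (introns common to all
    transcripts) shrink to a fixed `gap` (hypothetical default); exonic
    regions keep their original length.
    """
    all_exons = [exon for exons in transcripts.values() for exon in exons]
    offsets = []
    cursor = 0
    for start, end in merge_intervals(all_exons):
        offsets.append((start, end, cursor))
        cursor += (end - start) + gap

    def to_compact(pos):
        for start, end, offset in offsets:
            if start <= pos <= end:
                return offset + (pos - start)
        raise ValueError(f"position {pos} lies in a common intron")

    return to_compact


# Hypothetical two-transcript gene with a shared 850 nt intron.
example = {
    "tx_a": [(0, 100), (1000, 1100)],
    "tx_b": [(50, 150), (1000, 1100)],
}
to_compact = compact_coordinates(example)
print(to_compact(1000))  # 160: the 850 nt common intron became a 10 nt gap
```

Introns that overlap an exon of another isoform are left at full length, because only regions without any exon are compressed.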

Some data analysis methods require, or benefit from, comprehensive gene definitions. These are provided for the SIRV isoforms in the annotation files, with a further 1 kb of sequence defined upstream of the first exon and downstream of the last exon. These random sequences were created in the same way as the intron sequences: they mirror the G/C content of the exon sequences and do not match nucleotide database entries (search window 27 bp).
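The idea of GC-matched random flanks can be illustrated with a short sketch. It assumes that bases are drawn independently and that the database check is replaced by a comparison against a single local reference string; all function names are hypothetical, and the actual design procedure is not described in more detail here.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def random_sequence(length, gc, rng):
    """Draw bases independently so that P(G or C) equals `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def shares_kmer(candidate, reference, k=27):
    """True if any k-mer of `candidate` occurs in `reference`.

    Stand-in for a database search with a 27 bp window; a real check
    would query a nucleotide database instead of one string.
    """
    kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return any(candidate[i:i + k] in kmers
               for i in range(len(candidate) - k + 1))


rng = random.Random(0)
exon = "ATGGCGTACCTGAAGGCTTTCGATCCGGAAT"  # hypothetical exon sequence
flank = random_sequence(1000, gc_content(exon), rng)
print(round(gc_content(exon), 3), round(gc_content(flank), 3))
```

A candidate flank that shares a 27-mer with the reference would simply be discarded and redrawn.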

Isoform Module

The SIRV isoforms are a set of 69 artificial transcript variants. These were derived from 7 human model genes, and their annotated transcripts were complemented by additional isoforms to comprehensively reflect variations of alternative splicing, alternative transcription start- and end-sites, overlapping genes, and antisense transcripts. The 7 synthetic genes contain between 6 and 18 transcript variants each, on average 9.9 physically realized isoforms per gene, and more when counting the additional transcripts provided in the annotations against which a pipeline can be tested. The SIRV isoform module is available in the form of three mixes, E0, E1, and E2, in which the molar ratios of transcripts span roughly 0, 1, and 2 orders of magnitude, respectively.

Condensed Transcription and Isoform Complexity

The structures of seven human model genes (KLK5, LDHD, LGALS17A, DAPK3, HAUS5, USF2, and TESK2) were used as scaffolds for the design of genes SIRV1 to SIRV7. The ENCODE-annotated transcripts as well as additional variants were edited to comprehensively represent transcription variations such as different start- and end-site usage, alternative splicing, overlapping genes, and antisense transcription.

For the sake of simplicity, we will refer to all these transcriptional variants as isoforms, although in a strict sense this would refer only to the splice variants. The term “SIRV isoforms” therefore covers alternative splicing as well as differential promoter and terminator usage, antisense transcription and overlapping genes (Figure 3).

Figure 3 ǀ SIRV isoform design overview. The aim of the SIRV isoform design was to mimic human model genes so that, taken together, they represent all main aspects of alternative splicing and transcription in numerous repeats and variations. The transcript isoforms are shown aligned to a “master gene” (top line), and hence there can be no intron retention event. Therefore, the opposite is described here as “exon splitting”. The sequences themselves have no significant similarities to any known database entries but match eukaryotic gene features in terms of their makeup and exon-intron structure. A5SS and A3SS, alternative 5’/3’ splice sites; MXE, mutually exclusive exons.

Figure 4 illustrates in one example how the human gene KLK5 served as a blueprint for the design of the gene SIRV1. In addition to the 8 realized SIRV1 transcripts, 4 more were designed, which exist only in the over-annotation file provided together with the correct annotation. Conversely, 3 transcripts of the existing SIRV1 set are not present in the insufficient annotation file. In this way, algorithms that detect and quantify transcript isoforms can be challenged for their robustness towards real-life scenarios, in which the transcripts in a sample do not align with the available annotation.

Figure 4 | Design path and exon-intron structures of the SIRV1 gene. The SIRV1 gene was derived from the human KLK5 gene, with transcripts added to the Ensembl-annotated ones to achieve a comprehensive transcriptome complexity. Transcripts in blue are part of SIRV isoform mixes, transcripts in green are only part of an over-annotation. (i) Refers to transcripts that are omitted in an incomplete annotation. The polyadenylated 3’ end is marked in red, indicating sense and antisense orientations.

Together, the 7 SIRV isoform genes comprehensively model transcription and alternative splicing variations in a condensed and redundant manner (Table 1).

Table 1 | Summary of splice and transcription variations per SIRV isoform gene. The occurrences of the different events are counted for each transcript in reference to a hypothetical master transcript of maximal length containing all exon sequences from all transcript variants of a given gene. Therefore, in a formal sense no intron retention can occur, but this event is defined as exon splitting caused by the introduction of an intron sequence (illustrated in Figure 4).

Gene | Alternative 1st exon | Start site variation | Alternative 5’ splice site | Alternative 3’ splice site | Exon skipping | Exon splitting | End site variation | Alternative last exon
SIRV1 | 5 | 4 | 5 | 2 | 2 | 3 | 4 | 1
SIRV2 | 1 | 3 | 3 | 2 | 0 | 3 | 2 | 2
SIRV3 | 1 | 5 | 5 | 4 | 5 | 4 | 7 | 4
SIRV4 | 4 | 2 | 2 | 4 | 2 | 1 | 5 | 3
SIRV5 | 3 | 9 | 6 | 8 | 5 | 17 | 7 | 7
SIRV6 | 9 | 10 | 7 | 26 | 27 | 28 | 13 | 3
SIRV7 | 2 | 5 | 1 | 1 | 31 | 1 | 4 | 3

The transcripts of a SIRV isoform gene are assigned to 1 of 4 SubMixes to enable preset ratios in mixes E0, E1, and E2 (see SIRV-Set 1 in SIRV Sets for more information).

Figure 5 illustrates how the known input amounts of transcripts of a SIRV isoform gene allow for precise modeling of ideal coverage expectations. RNA-Seq introduces biases to the read distribution, and hence the experimental coverage will deviate from the expected one.

Figure 5 | Comparison of the expected and the measured coverages for the SIRV3 locus in the equimolar Mix E0. Top: individual transcripts of SIRV3, with transcripts on the plus strand in blue and those on the minus strand in red. The color code indicates the SubMix allocation. Bottom: the expected SIRV3 coverage is shown as a superposition of individual transcript coverages, in which the terminal sites have been modelled by a transient error function. The measured coverages after read mapping by TopHat2 are shown in grey. The measured coverages and the numbers of splice junction reads were normalized to obtain identical areas under the curves and identical sums of all junctions for the expected and measured data. The measured splice junction reads are shown by the numbers before the brackets, while the expected values are shown inside the brackets. The CoD (Coefficient of Deviation) values are given for the plus and minus strand in the respective colors.
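The superposition described in the caption can be sketched as follows. This is a simplified illustration, not the published model: the ramp width of 30 nt, the use of genomic rather than transcript coordinates for the terminal ramps, and all function names are assumptions, and the CoD calculation is not reproduced because its formula is not given here.

```python
import math


def edge_weight(pos, tx_start, tx_end, width=30.0):
    """Smooth ramp near both transcript termini built from the error function."""
    rise = 0.5 * (1.0 + math.erf((pos - tx_start) / width))
    fall = 0.5 * (1.0 + math.erf((tx_end - pos) / width))
    return rise * fall


def expected_coverage(transcripts, locus_length, width=30.0):
    """Sum per-transcript coverages over a locus.

    `transcripts` is a list of (exons, relative_amount) pairs, where exons
    are sorted half-open (start, end) intervals in locus coordinates.
    """
    coverage = [0.0] * locus_length
    for exons, amount in transcripts:
        tx_start, tx_end = exons[0][0], exons[-1][1]
        for start, end in exons:
            for pos in range(start, end):
                coverage[pos] += amount * edge_weight(pos, tx_start, tx_end, width)
    return coverage


def scale_to_same_area(measured, expected):
    """Rescale measured coverage so both curves have identical area."""
    factor = sum(expected) / sum(measured)
    return [value * factor for value in measured]


# Hypothetical locus: a two-exon transcript plus a single-exon transcript
# present at twice the amount.
transcripts = [([(0, 100), (200, 300)], 1.0), ([(0, 100)], 2.0)]
expected = expected_coverage(transcripts, 300)
measured = scale_to_same_area([1.0] * 300, expected)
```

After rescaling, the measured and expected curves can be compared position by position, since differences in sequencing depth no longer contribute.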

Native Gene Features but Unique Sequences

The SIRV isoform transcripts range in length from 191 to 2528 nt (mean 1134 nt; median 813 nt), which includes a 30 nt long poly(A)-tail. The GC-content varies between 29.5 and 51.2 % (mean 43.0 %; median 43.6 %). The exon sequences were created from a pool of database-derived genomes (gene fragments from viruses and bacteriophage capsid proteins and glycoproteins) and modified by inverting the sequence to lose identity while maintaining a naturally occurring order in the sequences.

In 96.9 % of cases, the splice junctions conform to the canonical GT-AG exon-intron junction rule, with a few exceptions harboring the less frequently occurring variations GC-AG (1.7 %) and AT-AC (0.6 %). Two non-canonical splice sites, CT-AG and CT-AC, account for 0.4 % each. Intron sequences that do not align with exons of another isoform were drawn from random sequences whose GC content was balanced to match the adjacent exonic sequences. The sequence exclusivity was verified by BLAST searches of the exon sequences against the entire NCBI database, including the ERCCs, on the nucleotide and protein level. The artificial SIRV isoform sequences are therefore suitable for noninterfering qualitative and quantitative assessments in the context of known genomic systems and are complementary to the ERCC sequences (NIST SRM 2374). The SIRV isoform sequences were deposited at NCBI’s GenBank (accession numbers KX147759 to -65 for SIRV1 to SIRV7) and can be downloaded in the Downloads section.
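Splice site statistics of this kind can be tallied from intron sequences with a few lines of code. The sketch below assumes that each intron is given as a string in transcript orientation; the function names and the example introns are hypothetical.

```python
from collections import Counter

CANONICAL = ("GT", "AG")
MINOR_VARIANTS = {("GC", "AG"), ("AT", "AC")}


def classify_intron(intron_seq):
    """Classify an intron by its first and last dinucleotide."""
    seq = intron_seq.upper()
    pair = (seq[:2], seq[-2:])
    if pair == CANONICAL:
        return "canonical"
    if pair in MINOR_VARIANTS:
        return "minor variant"
    return "non-canonical"


def motif_frequencies(introns):
    """Percentage of introns per donor-acceptor motif, e.g. 'GT-AG'."""
    counts = Counter(f"{seq[:2].upper()}-{seq[-2:].upper()}" for seq in introns)
    total = sum(counts.values())
    return {motif: 100 * n / total for motif, n in counts.items()}


# Hypothetical intron sequences, far shorter than real introns.
introns = ["GTAAAG", "GTCCAG", "GCTTAG", "CTAAAC"]
print(motif_frequencies(introns))
```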

Correct, insufficient and over-annotations of SIRV isoforms

The a priori knowledge of SIRV transcript sequences and concentrations allows the isoform-specific performance of an RNA-Seq experiment to be assessed. In addition to the correct annotation of the SIRV isoforms, one insufficient annotation and one over-annotation are supplied to enable the testing of NGS data evaluation algorithms for their robustness towards “real-life”, imperfect annotations (see also Figure 4). More annotations can be added to emulate situations of evolving reference annotations, which accumulate transcripts discovered in samples of different origin.
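One simple way to score a pipeline against these annotations is to compare the set of transcripts it reports with the set actually present in the mix. The following sketch assumes transcript identifiers are plain strings; the identifiers and the function name are hypothetical and do not correspond to the names used in the annotation files.

```python
def detection_metrics(detected, truth):
    """Compare reported transcript IDs with the transcripts truly in the mix."""
    detected, truth = set(detected), set(truth)
    true_pos = detected & truth
    return {
        "false_positives": sorted(detected - truth),
        "false_negatives": sorted(truth - detected),
        "precision": len(true_pos) / len(detected) if detected else 0.0,
        "recall": len(true_pos) / len(truth) if truth else 0.0,
    }


# Hypothetical IDs: 8 realized transcripts, of which the pipeline finds 6,
# plus one transcript that exists only in an over-annotation.
truth = {f"tx{i}" for i in range(1, 9)}
detected = {f"tx{i}" for i in range(1, 7)} | {"over_annotated_1"}
print(detection_metrics(detected, truth))
```

Running the same comparison with the correct, insufficient, and over-annotation as the pipeline's reference shows how strongly the reported transcript set depends on the annotation supplied.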

ERCC Module

The ERCC RNA Spike-In Controls provide a set of 92 artificial transcripts with non-overlapping sequences. These were developed by the External RNA Controls Consortium (ERCC), a group of academic, private, and public organizations hosted by the National Institute of Standards and Technology (NIST), to enable the standardized assessment of gene expression platforms such as quantitative RT-PCR, microarrays, and NGS technologies (External RNA Controls Consortium 2005; Baker et al. 2005). Due to their unique sequence identities (Figure 6), the ERCC controls are well suited for measuring technical parameters irrespective of isoforms.

Figure 6 ǀ Single-isoform nature of ERCCs. ERCC transcripts follow the 1 gene, 1 exon, 1 transcript layout, providing each ERCC transcript with a unique sequence identity. Genes (exons) are shown in white, derived transcripts in blue, and the poly(A) tail is indicated in red to specify transcript 5’-3’ orientation.

The ERCCs were used in exemplary studies by the FDA Sequencing Quality Control (SEQC) Consortium and the Association of Biomolecular Resource Facilities (ABRF) (Li et al. 2014a; Li et al. 2014b; SEQC/MAQC-III Consortium 2014; Xu et al. 2014). Comparisons of the assigned and evaluated reads with known concentrations allow for the assessment of dynamic range, dose response, lower limit of detection and efficiency, as well as fold-change response of RNA sequencing pipelines, within the complexity boundaries of monoexonic, non-overlapping RNA sequences.
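A common way to summarize dose response is a straight-line fit of log read counts against log input amounts. The sketch below is one possible approach, not a prescribed procedure: the function names, the `min_reads` threshold, and the example values are illustrative assumptions.

```python
import math


def log_log_fit(known_amounts, read_counts):
    """Least-squares fit of log2(reads) against log2(amount).

    Controls with zero amount or zero reads are excluded. Returns
    (slope, intercept, r_squared); a slope near 1 indicates a linear
    input-output relationship.
    """
    pairs = [(math.log2(a), math.log2(r))
             for a, r in zip(known_amounts, read_counts) if a > 0 and r > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


def lowest_detected(known_amounts, read_counts, min_reads=1):
    """Smallest input amount with at least `min_reads` reads (hypothetical cutoff)."""
    detected = [a for a, r in zip(known_amounts, read_counts) if r >= min_reads]
    return min(detected) if detected else None


# Illustrative values only: input amounts in arbitrary units and read counts.
amounts = [0.5, 1, 2, 4, 8, 16]
reads = [0, 10, 20, 40, 80, 160]
slope, intercept, r_squared = log_log_fit(amounts, reads)
print(slope, r_squared, lowest_detected(amounts, reads))
```

The ratio between the highest and the lowest detected amount gives a simple estimate of the dynamic range observed in the experiment.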

The RNAs are transcribed from a plasmid DNA library of ERCC sequences, available as a standard reference material from NIST (SRM 2374) (National Institute of Standards and Technology). The complete library comprises 96 unique sequences, and 92 of these were mixed in the form of transcripts assigned to four subpools with 23 ERCC controls each. Within each subpool, ERCC abundances span a dynamic range of 2^20 (approximately 10^6). Similar to the SIRV isoforms, the ERCC transcripts contain a triphosphate guanosine at their 5’ end and a poly(A) tail at their 3’ end, which in the case of the ERCCs is 20-26 nt long (SIRVs: 30 nt). ERCCs on their own are available from Thermo Fisher Scientific as ERCC RNA Spike-In Mix (Cat. No. 4456740) and ERCC ExFold RNA Spike-In Mixes (Cat. No. 4456739). The 92 ERCC sequences are available in the Downloads section.