Active genomes are continuously transcribed into a multitude of spliced coding and noncoding RNAs (Okazaki et al. 2002). The aim of RNA sequencing (RNA-Seq) is to determine the types and numbers of these sequences in order to characterize the state of cells, tissues and organs for biological and medical research (Mortazavi et al. 2008, Cloonan et al. 2008, Pan et al. 2008), and to develop comprehensive assays for clinical applications (Byron et al. 2016).
Technical variability in RNA-Seq
RNA sequencing workflows comprise many reactions and operations which are prone to inherent variability. Workflows start with RNA purification, continue with library generation, followed by the sequencing itself, and end with the evaluation of the sequenced fragments (Wang et al. 2009). The first steps impose numerous biases, intended or not, towards particular RNA classes and sequence characteristics, which data processing algorithms try to compensate for afterwards (Li et al. 2010, Meacham et al. 2011, Nakamura et al. 2011, Tarazona et al. 2011, Teng et al. 2016, van Dijk et al. 2014, Wall et al. 2014, Zheng et al. 2011).
Controls in RNA-Seq
External controls are RNA molecules of known sequence that are added in known amounts to a sample. By these means, controls pass together with the endogenous RNA through the same protocol steps and record biases (ERCC Consortium 2005, Hardwick et al. 2016, Leshkowitz et al. 2016).
SIRVs (Spike-in RNA Variant Control Mixes)
To address this gap, Lexogen has conceived Spike-In RNA Variants (SIRVs) for the quantification of mRNA isoforms in Next Generation Sequencing (NGS). SIRVs are a set of 69 artificial transcript variants derived from 7 human model genes, which were complemented by additional isoforms to comprehensively reflect variations of alternative splicing, alternative transcription start and end sites, overlapping genes, and antisense transcripts. The 7 synthetic genes contain between 6 and 18 transcript variants each, which corresponds to an average of 9.9 physically realized alternative isoforms per gene, and more when accounting for the additional provisions in the annotations against which a pipeline can be tested. The SIRVs are provided as three mixes, with molar ratios differing by up to two orders of magnitude. Mix E0 contains all SIRVs at the same molarity.
Validation of RNA-Seq pipelines and concordance of experiments
The a priori knowledge of SIRV transcript sequences and concentrations allows the isoform-specific performance of an RNA-Seq experiment to be assessed. In addition to the correct annotation of the SIRVs, one insufficient annotation and one over-annotation are supplied to enable the testing of NGS data evaluation algorithms for their robustness towards “real life”, imperfect annotations. More annotations can be added to emulate situations of evolving reference annotations which accumulate transcripts discovered in samples of different origin. With SIRVs, the quality metrics precision and accuracy can be measured for mapping, isoform assembly and quantification. Importantly, comparing samples on the basis of the SIRV subsets yields meaningful concordance values which, for the first time, make experiments based on isoform quantification comparable.
Spiking RNA samples with SIRV mixes of known concentrations, together with the a priori knowledge of the gene annotations, allows for a unique assessment of any RNA-Seq workflow that relies on transcript variant data.
Figure 1 ǀ Workflow for using SIRV controls in RNA-Seq. SIRVs are defined artificial RNA molecules which mimic the main aspects of transcriptome complexity. They are added in minuscule amounts to samples before library preparation so that they undergo the very same processing steps as the endogenous RNA. After mapping the reads to the combined genome and SIRVome, the SIRV control data are used to analyse the quality metrics and to categorize the experiments. The small subset of control data can be searched against a database to identify experiments of high concordance, which can then be used for meaningful differential expression analyses. The dotted lines show the decision-making processes: i) whether the complete data set is worth processing further (or whether an experiment needs to be repeated), and ii) which data sets have such good concordance that it is worth comparing the full data sets to each other.
SIRVs are spiked into RNA samples during the purification process, before library preparation. The minuscule spike-in amounts typically target 1 % of the RNA fraction of interest, e.g., total RNA, rRNA-depleted RNA or poly(A)-enriched RNA. Therefore, the spike-in amounts might need to be varied depending on the type and amount of sample. Alternatively, spike-in amounts might be kept constant to measure variations in the sample such as the mRNA content or metabolic state. Because experimental hypotheses and problems vary, we provide with the Experiment Designer in the SIRV Suite an interactive tool for developing working hypotheses based on known or estimated parameters. The Experiment Designer shows the rationale for using particular spike-in amounts.
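The arithmetic behind a 1 % target can be sketched as follows. This is a minimal illustration and not the Experiment Designer itself: the function name `spike_in_mass_ng`, its parameters, and the 3 % mRNA share of total RNA used in the example are hypothetical, and the 1 % target is interpreted here as SIRV mass relative to the mass of the RNA fraction of interest.

```python
def spike_in_mass_ng(total_rna_ng, fraction_of_interest, target_share=0.01):
    """Estimate the SIRV mass (in ng) to add to a sample.

    total_rna_ng         -- mass of total RNA in the sample (ng)
    fraction_of_interest -- assumed mass share of the targeted RNA class
                            within total RNA (1.0 for total RNA itself)
    target_share         -- desired SIRV mass relative to the targeted
                            RNA class (0.01 corresponds to 1 %)
    """
    if total_rna_ng < 0:
        raise ValueError("total_rna_ng must not be negative")
    if not 0 < fraction_of_interest <= 1:
        raise ValueError("fraction_of_interest must be in (0, 1]")
    if not 0 < target_share < 1:
        raise ValueError("target_share must be in (0, 1)")
    return total_rna_ng * fraction_of_interest * target_share


# Example with assumed numbers: 100 ng total RNA, of which an estimated
# 3 % is mRNA (hypothetical value; the true share depends on the sample).
mass_ng = spike_in_mass_ng(100.0, 0.03)
print(f"SIRV spike-in for the poly(A) fraction: {mass_ng * 1000:.1f} pg")

# Targeting total RNA directly needs proportionally more spike-in.
print(f"SIRV spike-in for total RNA: {spike_in_mass_ng(100.0, 1.0):.2f} ng")
```

Keeping the spike-in amount constant instead, as described above, would mean fixing the return value and reading changes in the measured SIRV share as changes in the sample.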
Library Preparation and Sequencing
SIRVs undergo the very same library preparation and sequencing steps as the endogenous RNA. The resulting sequencing data file contains reads from both SIRVs and endogenous RNA.
Reads are distinguished while mapping to the reference genome and the SIRVome. As a result, two linked data sets are generated. Because the SIRV data contain only a fraction of all reads, the evaluation of the SIRV subset is fast, and it yields unique quality metrics:
• precision, as a measure of repeatability,
• the coefficient of deviation (CoD), which compares the measured coverage with the expected coverage, and
• accuracy, which can only be determined from known inputs.
Because data evaluation can be carried out in many different ways, one standardized pipeline is implemented in the Evaluator of the SIRV Suite.
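To make two of these metrics concrete, the sketch below computes a precision and an accuracy value from toy numbers. It is not the Evaluator's implementation: precision is approximated here by the coefficient of variation across replicates, accuracy by the log2 ratio of measured to expected relative abundance, and the transcript names and values are invented for illustration. The CoD is left out because its exact definition is not given in this text.

```python
import math
from statistics import mean, stdev


def relative_abundance(values):
    """Scale a {transcript: value} mapping so that the values sum to 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return {name: value / total for name, value in values.items()}


def precision_cv(replicate_values):
    """Coefficient of variation of replicate measurements (assumed
    stand-in for precision; smaller means more repeatable)."""
    m = mean(replicate_values)
    if m == 0:
        raise ValueError("mean of replicates is zero")
    return stdev(replicate_values) / m


def accuracy_log2(measured, expected):
    """log2 of measured over expected abundance (assumed stand-in for
    accuracy; 0 means the known input was recovered exactly)."""
    if measured <= 0 or expected <= 0:
        raise ValueError("abundances must be positive")
    return math.log2(measured / expected)


# Toy data: three hypothetical transcripts from an equimolar mix such as
# E0, so each is expected at 1/3 of the control signal.
measured = relative_abundance({"T1": 50.0, "T2": 25.0, "T3": 25.0})
expected = relative_abundance({"T1": 1.0, "T2": 1.0, "T3": 1.0})

for name in sorted(measured):
    deviation = accuracy_log2(measured[name], expected[name])
    print(f"{name}: log2(measured/expected) = {deviation:+.2f}")

# Repeatability of one transcript across three hypothetical replicates.
print(f"CV of T1 across replicates: {precision_cv([48.0, 50.0, 52.0]):.3f}")
```

Accuracy can only be computed because the `expected` values are known inputs, whereas the precision estimate needs nothing but the replicates.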
The SIRV data, including metadata, mapped reads and, after processing by the Evaluator, the quality metrics, can be stored in the data repository which is administered by the SIRV Suite. The main function of the Comparator is to compare different SIRV data sets by calculating concordance values based solely on the small subset of SIRV controls. Because the subsets of SIRV data contain only approx. 1 % of the sequencing data, all comparisons require proportionally less computational power and ensure fast processing. Differences between the accompanying SIRV data are seen in similar proportions in the main data of the endogenous RNA. Concordance is independent of the accuracy; it describes the coherence of data sets and identifies endogenous RNA data which are suitable for comparison, e.g., in differential expression analyses.
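The distinction between concordance and accuracy can be illustrated with a small sketch. It does not reproduce the Comparator: the measure used here, the mean absolute log2 ratio between two experiments over their shared controls, is an assumed stand-in for a concordance value, and all transcript names and abundances are invented.

```python
import math


def mean_abs_log2_ratio(experiment_a, experiment_b):
    """Mean absolute log2 ratio over the controls present in both
    {transcript: abundance} mappings. 0 means identical control profiles;
    larger values mean less concordant experiments (assumed measure)."""
    shared = sorted(set(experiment_a) & set(experiment_b))
    if not shared:
        raise ValueError("no shared control transcripts")
    total = 0.0
    for name in shared:
        a, b = experiment_a[name], experiment_b[name]
        if a <= 0 or b <= 0:
            raise ValueError(f"non-positive abundance for {name}")
        total += abs(math.log2(a / b))
    return total / len(shared)


# Hypothetical equimolar controls: each is expected at 1.0.
expected = {"T1": 1.0, "T2": 1.0, "T3": 1.0}

# Two experiments that share the same bias, and a third with another one.
run_1 = {"T1": 2.0, "T2": 1.0, "T3": 0.5}
run_2 = {"T1": 2.0, "T2": 1.0, "T3": 0.5}
run_3 = {"T1": 0.5, "T2": 1.0, "T3": 2.0}

print("run_1 vs expected:", mean_abs_log2_ratio(run_1, expected))
print("run_1 vs run_2:   ", mean_abs_log2_ratio(run_1, run_2))
print("run_1 vs run_3:   ", mean_abs_log2_ratio(run_1, run_3))
```

In this toy case `run_1` and `run_2` deviate from the expected values in the same way, so they are inaccurate yet fully concordant, while `run_3` is equally inaccurate but not concordant with either of them.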
At present comparisons are carried out only in exemplary inter-laboratory studies on reference RNA samples which investigate different RNA treatments, NGS platforms and data evaluation algorithms (SEQC/MAQC-III Consortium 2014; Li et al. 2014).
First attempts to study the quality of RNA-Seq pipelines at the transcript-isoform level were made using mouse spike-in control transcripts (Leshkowitz et al. 2016). They demonstrated that abundance estimation of multiple-isoform spike-ins produces lower duplicate correlations at the transcript level than at the gene level. These experiments used endogenous mouse transcripts that were not expressed, as judged by earlier microarray measurements, which makes this approach time-consuming, costly and, foremost, not generally applicable because each sample requires its own custom-built spike-ins. Although comparisons of different bioinformatics tools reveal severe differences, a consistent series of quality metrics has not been implemented for comparing results from the known controls with the unknown endogenous RNA.
The SIRVs were conceived in autumn 2013, and in July 2014 the SIRV design, quality targets and production were presented and discussed at the first ERCC 2.0 workshop hosted by the National Institute of Standards and Technology Advances in Biological/Medical Measurement Science Program (NIST-ABMS) (Munro and Salit, 2014). SIRVs were introduced with a test program in June 2015, and have been commercially available since September 2015.
In September 2016, the Garvan Institute published the Sequins, a complementary RNA spike-in system that is likewise based on naturally derived, but inverted, sequences. The Sequins represent on average just 2.1, and at most 4, isoforms per gene, with 164 isoforms distributed across 78 genes (Hardwick et al., 2016). As judged by cumulative frequency histograms, the artificial gene loci correspond well to the structure and annotation of the human transcriptome, from which the inverted sequences were drawn in the first place. However, the Sequins map many different features to the same RNA molecules, which hinders the systematic and unambiguous analysis of RNA-Seq pipelines and experiments. Performance at boundary conditions is difficult to resolve because the Sequins, like the monocistronic ERCCs, are distributed across a wide concentration range of up to 6 orders of magnitude. Errors caused by a pipeline therefore cannot be unambiguously attributed to either overly complex annotation patterns or simply low sequence coverage. Despite the large number of isoforms, the proportion of mutually exclusive exons is high and the density of multiple sequence coverage by different isoforms is low.