Data Analysis

General Workflow

Reads from a spike-in RNA-Seq experiment are processed alongside all other reads from the NGS library; this processing comprises quality control, de-multiplexing, and – depending on the library preparation protocol – trimming. The reads are then mapped to a combination of the genomic reference (if available) and the SIRVome (the artificial “genome” detailing the spike-in sequences and annotations). Only a small percentage of the reads are expected to map to the SIRVome, and only this share needs to be analyzed to derive representative information about the majority of reads, which map to the endogenous sample RNA.

SIRV reads can be analyzed with standard and custom bioinformatic tools which compare the measured with the expected read distribution at different levels, from raw read mapping up to transcript identification and quantification.

Data evaluation of the SIRV isoforms is demonstrated in the chapter below using the pipeline established in the “SIRV Suite”. The tool allows cross-sample comparison and referencing.

For the data evaluation of ERCC reads in RNA-Seq experiments, the NIST provides a software package called the “ERCC dashboard” (Munro et al., 2014), and further evaluations are described in publications of the SEQC/MAQC-III Consortium (2014).

A complementary software package for the standardized evaluation of both modules, the SIRV isoforms and the ERCC single-isoform transcripts, is in preparation; it will be available both as command-line tools and as a Galaxy installation.

SIRV Suite

Galaxy SIRV Suite is a set of software tools accompanying the SIRVs product by Lexogen. It allows you to design and evaluate your SIRV experiment, and to compare it to other similar experiments. The SIRV Suite streamlines and unifies the data evaluation process.

DESIGNER Planning of experiments
EVALUATOR Evaluating SIRV data to obtain a standardized but customizable report and data files for downstream applications
COMPARATOR Cross-sample comparisons and referencing
DATA LIBRARIES Database for experiment metadata, SIRV reads, and quality metrics

SIRV Suite Galaxy Workflow

The SIRV Suite is a program package which has been embedded in Galaxy running on a CloudMan instance. Data processing and reporting can be tested by simply uploading some NGS data. The Galaxy installation with SIRV Suite can be cloned to separate CloudMan instances. Alternatively, the SIRV Suite can be installed into already existing, or new, Galaxy installations.

The SIRV Suite can be combined with other bioinformatics and statistics programs to integrate the SIRV data evaluation into existing NGS data analysis workflows.


Figure 1 ǀ SIRV Suite start page. Screenshot of the start page with menu.

Experiment Designer

The Experiment Designer is an interactive tool for designing control experiments by developing working hypotheses based on measured or estimated values for mRNA content, workflow efficiency, and the targeted fraction, e.g. total RNA, rRNA-depleted, poly(A)-enriched, or resultant combinations.


Figure 2 ǀ SIRV Suite Experiment Designer. The interactive tool uses known or estimated specifications of the amount and type of RNA in the sample, and shows the relative abundances of the SIRV isoform mixes within the expected concentration range of the sample RNA.

The Experiment Designer demonstrates how the mass of your RNA sample input is typically reflected in the TPM values of NGS experiments, assuming an average transcript length of 2 kb and a dynamic range of 6 orders of magnitude. The same effect is shown for the SIRVs and, for comparison, the ERCCs. The visualization shows how the concentration range of the controls matches that of the endogenous RNA. The calculations are performed dynamically, so the effect of any parameter change on the samples can be seen immediately.
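The Designer's core conversion from input mass to expected TPM can be sketched as follows. This is a hedged illustration only: the constants and function names are not part of the SIRV Suite, and a uniform average transcript length per fraction is assumed.

```python
# Hypothetical sketch of converting RNA input masses to expected TPM values.
# All names and constants are illustrative, not part of the SIRV Suite.

AVG_NT_MASS_G = 340 / 6.022e23  # approximate mass of one ribonucleotide in grams

def molecules(mass_g: float, mean_length_nt: float) -> float:
    """Approximate number of transcript molecules in a given RNA mass."""
    return mass_g / (mean_length_nt * AVG_NT_MASS_G)

def expected_tpm(spike_mass_g, spike_len_nt, sample_mass_g, sample_len_nt):
    """TPM of the spike-in fraction relative to the combined molecule pool."""
    n_spike = molecules(spike_mass_g, spike_len_nt)
    n_sample = molecules(sample_mass_g, sample_len_nt)
    return 1e6 * n_spike / (n_spike + n_sample)

# Example: 10 pg of SIRV mix (mean ~1.1 kb) spiked into 100 ng of mRNA (mean 2 kb)
tpm = expected_tpm(10e-12, 1100, 100e-9, 2000)
```

Because TPM is a relative measure per million transcripts, the spike-in TPM depends only on molecule ratios, which is why average transcript lengths enter the calculation.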

Attention! Typical experiments require SIRV spike-in amounts in the picogram range, while single-cell experiments require amounts reaching down into the low femtogram range. Guidelines for preparing appropriate dilutions can be found in the SIRV user guide. Good mixing of appropriate volumes within the operating limits of pipettes or dispensers, the use of low-binding plastic ware, and fast, sterile processing are essential when working with RNA.


The endogenous RNA and the SIRV controls together undergo the very same reaction steps of library preparation and sequencing to obtain combined raw read data sets.

Experiment Specifier

The Experiment Specifier is used to gather information about the experiment and its samples. It generates a .JSON experiment specification file for further downstream processing. Samples must carry unique identifiers.
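A minimal sketch of what such a specification file might contain is shown below. The field names are hypothetical; the actual .JSON schema is defined by the SIRV Suite.

```python
# Hypothetical sketch of an experiment specification as the Experiment
# Specifier might produce it; field names are illustrative only.
import json

spec = {
    "experiment_id": "EXP-001",  # hypothetical identifier
    "sirv_mix": "E0",
    "samples": [
        {"sample_id": "S1", "replicate": 1, "spike_in_amount_pg": 10},
        {"sample_id": "S2", "replicate": 2, "spike_in_amount_pg": 10},
    ],
}

# Sample identifiers must be unique within the experiment.
ids = [s["sample_id"] for s in spec["samples"]]
assert len(ids) == len(set(ids)), "sample identifiers must be unique"

with open("experiment_spec.json", "w") as fh:
    json.dump(spec, fh, indent=2)
```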

Experiment Evaluator

The Evaluator processes experimental data from SIRV controls and endogenous RNA. Data can be submitted at different stages of NGS read processing. The result is a .ZIP file which contains a report (.PDF), evaluation values in tabular form (.CSV), and figures (.SVG).


Figure 3 ǀ SIRV Suite Evaluator flow scheme. Different data input types, i, can be processed: demultiplexed raw read files (.FASTQ), quality-controlled and trimmed read files, mapped reads (.BAM), or already computed abundance estimation tables (.CSV files with TPM values).


i1 and i2: because tools for trimming and mapping are currently not preinstalled, read processing and mapping must be carried out in a separate workflow or a coupled Galaxy project. i3: starting from mapped reads in .BAM files, the Evaluator uses both the sample genome and annotation and the SIRVome with different predefined annotations for transcript assembly and abundance estimation. i4: data can be uploaded as .CSV files which contain transcript IDs and TPM values (Transcripts Per Million transcripts).

Transcript Assembly and Abundance Estimation

As one example, the algorithm of Cufflinks (Trapnell et al., 2010) is currently preinstalled; additional programs and algorithms will follow. Alternatively, other transcript assembly programs can be embedded into SIRV Suite installation clones by the user as part of the Galaxy pipeline customization. The resulting transcript assemblies and abundance estimations are .CSV files with a list of transcript IDs and TPM values.


First, although TPM values are normalized relative concentration values, some algorithms produce artefactual data sets, for which a normalization converter has been programmed. It carries out an automated sanity check and identifies prominent outliers. The basis for the TPM artefact identification are trumpet plots of technical replicates, which are shown in one representative average figure as part of the report. If a removal of artefacts is required, it can lead to a minor rescaling of the TPM values. Detailed information about the process can be found in the SIRV Suite comments.

Second, TPM values below a relative quantity threshold are set to a lower threshold value. This threshold is set manually; the rationale depends on the read depth and the granularity of digital reads. According to the definition of the FPKM value (fragments per kilobase of exon per million fragments mapped), each identified transcript, or group of isoforms, can have one or more mapped fragments, or none. Although probabilities can be assigned to isoforms which share the same read sequence, leading to fractions of fragments, a lower threshold sets a nonzero baseline which is experimentally indistinguishable from true zero. This is important not only for the correct interpretation of the results but also for calculating ratios of expression values of transcripts below the detection level. Changes in read depth provide grounds for adjusting the threshold.
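The thresholding step above can be sketched in a few lines; the threshold value used here is an assumption chosen for illustration.

```python
# Minimal sketch of the lower-threshold step; the threshold value of 0.01
# and the transcript IDs are illustrative assumptions.

def apply_threshold(tpm: dict, threshold: float) -> dict:
    """Set all TPM values below the threshold to the threshold itself,
    creating a nonzero baseline that is experimentally indistinguishable
    from true zero."""
    return {tid: max(value, threshold) for tid, value in tpm.items()}

tpm = {"SIRV101": 12.4, "SIRV102": 0.003, "SIRV103": 0.0}
floored = apply_threshold(tpm, threshold=0.01)
# Ratios involving low-abundance transcripts are now bounded by the threshold.
```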

Attention! Different threshold settings can affect the calculation of fold-changes of transcripts in the low abundance range.

Third, SIRV reads are normalized such that the measured and the expected sum of molecules in the SIRV mix are identical. By this means, relative and absolute concentration measures are uncoupled. Absolute read counts are used separately in the read count statistics to measure, e.g., mRNA content or technical variability.
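A minimal sketch of this normalization, assuming measured and expected molecule quantities are given per SIRV transcript:

```python
# Sketch of the SIRV read normalization: measured values are rescaled so
# that the measured and expected sums over the SIRV mix are identical.
# The input layout (one value per transcript ID) is an assumption.

def normalize_sirv(measured: dict, expected: dict) -> dict:
    scale = sum(expected.values()) / sum(measured.values())
    return {tid: value * scale for tid, value in measured.items()}
```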

Quality Metrics

Using the read statistics the Evaluator calculates:

  • The ratio between expected and measured SIRV reads relative to the reads from the endogenous RNA, which is interpreted as, e.g., mRNA content or experimental variability.

The Evaluator generates the experiment quality metrics and visualizations which include:

  • Compact Coverage Visualization (CCV) graphs, which provide an overview of the normalized expected and measured coverages,
  • Coefficients of Deviation (CoD) as a measure for the deviation between measured and expected coverages,
  • A table with normalized counts of identified vs. expected telling junction reads.

Based on the scaled TPM values the Evaluator calculates further:

  • Precision as the mean of all relative SIRV standard deviations (RSD),
  • Accuracy and Differential Accuracy as the medians of the moduli of all log2-fold changes between the measured and the expected relative SIRV concentrations.

The interface of the Evaluator allows users to select the experimental data and to define which quality metrics are calculated and shown in the report.


Figure 4 ǀ SIRV Suite Evaluator. Screenshot showing the form used to enter the experiment, which links the NGS data to the metadata from the Experiment Specifier and selects the data evaluations that enter the report.

The following chapters explain the quality metrics.

mRNA contents

On the basis of the assumption that the endogenous RNA and the spike-in controls are equally targeted by the library preparation due to their similar length distributions (the mean length of the SIRVs is 1134 nt; UHRR experiments show an average length of 1550 nt), the relative mass partition between controls and endogenous RNA can be determined. The propagation of the input amounts to the output read ratio depends on the mRNA content, the relative recovery efficiencies of controls and mRNA, and the experimental variability of spiking a sample with controls.
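Under the stated assumption of equal recovery efficiency, the mRNA content can be estimated from the spike-in mass and the share of reads mapping to the SIRVome. The sketch below is illustrative only; all names and example numbers are assumptions.

```python
# Hedged sketch of estimating mRNA content from the spike-in read share.
# Assumes equal library-preparation efficiency for SIRVs and endogenous
# mRNA; all names and numbers are illustrative.

def mrna_content(sirv_mass_g, total_rna_mass_g, sirv_read_fraction):
    """Estimate the mRNA mass fraction of the total RNA input.

    If SIRVs account for a fraction f of all mapped reads, the targeted
    mRNA mass is approximately sirv_mass * (1 - f) / f.
    """
    mrna_mass_g = sirv_mass_g * (1 - sirv_read_fraction) / sirv_read_fraction
    return mrna_mass_g / total_rna_mass_g

# Example: 6 pg SIRVs in 100 ng total RNA, 0.3% of reads map to the SIRVome
content = mrna_content(6e-12, 100e-9, 0.003)  # roughly 2% mRNA content
```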

Compact Coverage Visualizations (CCV)

CCVs show all expected SIRV coverages and counts of the telling junctions together with the measured coverages, as obtained by the respective mapping, in a scaled and compact format. Mutual intron sequences are reduced to small gaps of the same length. This visualization focuses on relevant, well-covered sequences: the small subset of seven representative genes provides one unique, comparable overview with which to carry out a first sanity check of the experiment’s performance. In contrast, inspecting randomly chosen genes, which could be affected by differential expression, is not a systematic approach for comparing experiments. The measured and expected junction counts are scaled such that they always add up to the same sum, which makes them easier to compare. All mutually exclusive telling junction reads have an expected normalized read count of 1 in SIRV Mix E0, and so forth. Over- and under-represented junctions can be seen immediately.
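The junction-count scaling can be sketched as a simple rescaling to a common total; the function name and target sum are illustrative, not part of the SIRV Suite.

```python
# Sketch of the junction-count scaling: raw telling-junction read counts are
# rescaled so that every experiment sums to the same total, making over- and
# under-represented junctions directly comparable. Names are illustrative.

def scale_junctions(counts: dict, target_sum: float) -> dict:
    scale = target_sum / sum(counts.values())
    return {junction: count * scale for junction, count in counts.items()}
```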

Coefficient of Deviation (CoD)

For the first time, the ground truth of complex input sequences is known, allowing detailed target-performance comparisons for read alignment, relative abundance calculation, and differential expression measurements. NGS workflow-specific read start-site distributions lead to coverage patterns with inherent terminal deficiencies, for which the expected coverage has been adjusted accordingly. However, in the measured coverages these systematic start- and end-site biases are accompanied by a variety of biases which introduce severe local deviations from the expected coverage. To obtain a comparative measure, gene-specific coefficients of deviation (CoD) are calculated and presented for the plus strand annotation (+) and for the minus strand orientation (-). The mean of the CoD values from all 7 genes is reported as one combined CoD value for the sense transcripts (fwd) and one for the less frequently occurring, simpler antisense transcripts (rev).
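The exact CoD formula is defined by the SIRV Suite and is not reproduced here; the following is one plausible form shown purely for illustration, the mean squared deviation between coverages that have each been normalized to unit area.

```python
# Hedged sketch of a coefficient-of-deviation calculation; this is one
# common form, not necessarily the exact SIRV Suite formula: the mean
# squared deviation between per-position coverages that have each been
# normalized to unit area.

def cod(measured, expected):
    m_sum, e_sum = sum(measured), sum(expected)
    m = [x / m_sum for x in measured]
    e = [x / e_sum for x in expected]
    return sum((mi - ei) ** 2 for mi, ei in zip(m, e)) / len(m)

# Identical coverage shapes give a CoD of 0; larger values indicate bias.
```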

CoDs describe the often hidden biases in the sequence data, predominantly caused by an inhomogeneous library preparation, but also by the subsequent sequencing and mapping. The coverage target-performance comparisons highlight the inherent difficulties in deconvoluting read distributions to correctly identify transcript variants and determine concentrations. The distribution of telling reads, splice junctions, and reads towards the termini serves as a reference for the assignment of the remaining reads before calculating relative transcript variant abundances. Although the majority of junction counts might correlate well with the expected values, higher CoD values indicate that numerous telling reads deviate significantly within the context of the individual mixes. If CoD values also differ between mixes, the coverage will affect differential expression measurements.

The CoD neither distinguishes between periodicity and randomness in the biases nor forecasts how well a data evaluation pipeline can cope with bias contributions. Nevertheless, smaller CoD values are expected to correlate with a simpler and less error-prone data evaluation. The CoD values can be taken as a first, indicative measure to characterize the mapped data and to compare data sets for similarity up to this point in the workflow.

Precision (Pre)

measures the scatter of the calculated abundance values. Using the technical replicates, the relative standard deviation (RSD), or coefficient of variation (CV), of the log2-fold changes (LFC) between the measured and the expected values is calculated for each SIRV transcript. The precision is the mean of all SIRV RSDs. The individual SIRV concentrations cover a narrow concentration range of two orders of magnitude; therefore, the precision is not dominated by low-abundance, much more scattered species. However, because the SIRV controls can be spiked in at different relative amounts, the precision of different concentration ranges can be probed with adequately designed spike-in experiments. Precision can also be determined using the RSD values of endogenous RNA in the concentration range of interest.
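The precision metric above can be sketched as follows; the input layout, one list of replicate TPM values per SIRV transcript, is an assumption for illustration.

```python
# Sketch of the precision metric: per-transcript relative standard
# deviations across technical replicates, averaged over all SIRVs.
# The input layout is illustrative.
import statistics

def rsd(values):
    """Relative standard deviation (coefficient of variation)."""
    return statistics.stdev(values) / statistics.mean(values)

def precision(replicate_tpms: dict) -> float:
    """Mean of the per-transcript RSDs over all SIRV transcripts."""
    return statistics.mean(rsd(v) for v in replicate_tpms.values())
```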

Accuracy (Acc)

measures the deviation of the calculated abundance values from the expected values. Accuracy can only be measured in comparison to known controls. The accuracy is the median of the moduli of all LFCs; the moduli treat relative increases and decreases alike across the probed concentration range. The accuracy shows the average fold-deviation between measured and expected values. Although the median, mean, and standard deviation of the LFC moduli are all presented in the report to describe the distribution of values, the median is the value most robust against outliers, whose extent can shift when threshold settings change.
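The accuracy calculation amounts to a median over absolute log2-fold changes; a minimal sketch, assuming measured and expected relative concentrations are given per SIRV transcript:

```python
# Sketch of the accuracy metric: the median of the absolute log2-fold
# changes (LFC moduli) between measured and expected relative SIRV
# concentrations. The input layout is an assumption.
import math
import statistics

def accuracy(measured: dict, expected: dict) -> float:
    lfc_moduli = [abs(math.log2(measured[t] / expected[t])) for t in expected]
    return statistics.median(lfc_moduli)

# A perfect measurement yields an accuracy of 0 (no fold-deviation).
```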

Boxplot correlations highlight the issue of obvious and frequent outliers, i.e., transcript variants that are not well resolved. However, only the heat map allows each SIRV to be inspected in the context of competing transcripts, and shows the abundances as LFCs relative to the expected values. An LFC window of ±0.11 presents the SIRV confidence interval, resulting from the currently achievable accuracy in producing the SIRV mixtures (read more about producing SIRV mixes in the FAQ section). Besides the boxplots, standard correlation plots including Pearson p-values and R2 values are presented in the report to show the distribution of calculated concentration values in an alternative overview.

Differential Accuracy (diff. Acc)

measures the deviation of the calculated abundance LFCs from the expected LFCs. The differential accuracy is the median of the moduli of all LFCs of the measured differential expression (DE) values versus the expected DE values. It can only be calculated for the differences between the SIRV mixes E0, E1, and E2. The expected LFCs range from 1/64 (submix 3 between E2 and E1) to 16 (submix 1 between E2 and E1). The highest measurable LFC values are determined by the threshold level.
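Differential accuracy can be sketched analogously to accuracy, comparing measured against expected LFCs between two mixes; the argument layout is an assumption for illustration.

```python
# Sketch of the differential-accuracy metric between two SIRV mixes (e.g.
# E1 vs. E0): the median of the absolute differences between measured and
# expected log2-fold changes. The argument layout is an assumption.
import math
import statistics

def differential_accuracy(measured_a, measured_b, expected_a, expected_b):
    moduli = [
        abs(math.log2(measured_a[t] / measured_b[t])
            - math.log2(expected_a[t] / expected_b[t]))
        for t in expected_a
    ]
    return statistics.median(moduli)
```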

Experiment Comparator

CoD, precision, and accuracy are independent quality metrics for describing NGS pipelines during validation experiments and for characterizing individual experiments. As a reminder, different experiments are characterized not only by different input materials but also by any change in the data generation and evaluation pipeline. The quality metrics above are derived by comparing the experimental results to the expected outcome. Although it is important to monitor absolute rankings during method development, for the comparison of experimental data the crucial parameter is not the extent of the biases but their consistency. The Comparator determines the difference between experiments based on the consistent, condensed complexity of the SIRVs. The scope of the comparison can be chosen freely, from a pairwise comparison up to searching an entire database (see Data Libraries below).


Figure 5 ǀ SIRV Suite Comparator flow scheme. The control data set is used to carry out pairwise comparisons between experiments using the small subset of SIRV control data. Data sets of high concordance can be filtered, and the extent of expected error rates can be estimated, before making a decision about comparing the complete data sets.

The following pairwise comparison values are calculated.

Pairwise Coefficient of Deviation (CoDN1N2)

Like the CoD value introduced above for a single experiment, the CoDN1N2 is calculated by comparing the normalized coverages of experiments N1 and N2. Identical biases lead to small CoDN1N2 values, approaching zero in the ideal case.

Concordance (Con)

Based on the .CSV files containing the normalized TPM values, the Comparator calculates the concordance values. The concordance is the median of the moduli of all LFCs calculated between the SIRVs of two experiments; it is essentially a relative accuracy measure obtained by comparing two experiments to each other. High concordance is represented by small values.
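The concordance calculation can be sketched as follows, assuming normalized TPM dictionaries for the two experiments:

```python
# Sketch of the concordance metric: the median absolute log2-fold change
# between the normalized SIRV TPM values of two experiments. Small values
# indicate high concordance. The input layout is an assumption.
import math
import statistics

def concordance(tpm_exp1: dict, tpm_exp2: dict) -> float:
    common = sorted(set(tpm_exp1) & set(tpm_exp2))
    return statistics.median(
        abs(math.log2(tpm_exp1[t] / tpm_exp2[t])) for t in common
    )
```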

The SIRV Suite contains the Comparator as a separate program module.


Figure 6 ǀ Comparator data input interface. The screen shot shows the section for entering the data file names which are selected for comparing the control data to each other.

Data Libraries

The database contains SIRV control reference data sets, and accepts new data sets as .BAM, (.BED), and .CSV files with the quality metrics, together with the metadata from the experiment. The tool facilitates comparing data between working groups and laboratories to identify, and subsequently request, potential collaborations. By these means, meta-analyses are carried out only with data sets which meet certain quality criteria and carry similar biases.


Figure 7 ǀ SIRV Suite Data Libraries. The screenshot shows uploaded SIRV data. The data can be searched for similarities in the metadata and compared to each other using the Comparator. The SIRV concordance values provide a rational argument for starting comparisons of the large accompanying data sets of the endogenous RNA.

The data library is managed by the Galaxy instance.

SIRV Suite Data Policy

The data processing service provided by Lexogen GmbH, including SIRV Suite, is a free, public, Internet accessible resource (the “Service”). Data transfer and data storage are not encrypted. If there are restrictions on the way your research data can be stored and used, please consult your local institutional review board or the project principal investigator before uploading it to any public site, including this Service. If you have protected data, large data storage requirements, or short deadlines you are encouraged to set up your own local SIRV Suite instance and not use this Service. Your access to the service may be revoked at any time for reasons deemed necessary by the operators of the Service.

You may choose to register an account with the Service. Your registration data is primarily used so you may persistently store data on the Service and use advanced SIRV Suite features such as sharing and workflows. The operators of the Service will not provide your registration data to any third party unless required to do so by law. Your access to the Service is provided under the condition that you abide by any published quotas on data storage, job submissions, or any other limitations placed on the public Service. Attempts to subvert these limits by creating multiple accounts or through any other method may result in termination of all associated accounts.

The Service is provided to you on an “AS IS” BASIS and WITHOUT WARRANTY, either express or implied, including, without limitation, the warranties of non-infringement, merchantability or fitness for a particular purpose. THE ENTIRE RISK AS TO THE QUALITY OF THE SERVICE IS WITH YOU. This DISCLAIMER OF WARRANTY constitutes an essential part of this service agreement.

Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall Lexogen GmbH be liable to anyone for any indirect, special, incidental, or consequential damages of any character arising as a result of the use of this Service including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.