i1 and i2: Because tools for trimming and mapping are currently not preinstalled, read processing and mapping must be carried out in a separate workflow or a coupled Galaxy project. i3: Starting with mapped reads in .BAM files, the Evaluator uses both the sample genome with its annotation and the SIRVome with different predefined annotations for transcript assembly and abundance estimation. i4: Data can be uploaded as .CSV files which contain transcript IDs and TPM values (transcripts per million transcripts).
Transcript Assembly and Abundance Estimation
As one example, Cufflinks (Trapnell et al. 2010) is currently preinstalled. Additional programs and algorithms will be installed soon. Alternatively, users can embed other transcript assembly programs into SIRV Suite installation clones as part of the Galaxy pipeline customization. The resulting transcript assemblies and abundance estimations are .CSV files with a list of transcript IDs and TPM values.
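As a minimal sketch, such a .CSV file could be read as follows. The column names `transcript_id` and `TPM`, the transcript IDs, and the function name are assumptions made for illustration; the actual file layout produced by SIRV Suite may differ.

```python
import csv
import io

def read_tpm_csv(handle, id_column="transcript_id", tpm_column="TPM"):
    """Read transcript IDs and TPM values from an open CSV file.

    The column names are hypothetical defaults, not the SIRV Suite layout.
    """
    reader = csv.DictReader(handle)
    return {row[id_column]: float(row[tpm_column]) for row in reader}

# Illustrative in-memory file with two made-up transcripts.
example = io.StringIO("transcript_id,TPM\nSIRV_A,12.5\nSIRV_B,0.0\n")
tpm = read_tpm_csv(example)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `StringIO` object.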
First, although TPM values are normalized relative concentration values, some algorithms produce artefactual data sets, for which a normalization converter has been programmed. It carries out an automated sanity check and identifies prominent outliers. The TPM artefact identification is based on trumpet plots of technical replicates, which are shown as one representative average figure in the report. If artefacts have to be removed, this can lead to a minor rescaling of the TPM values. Detailed information about the process can be found in the SIRV Suite comments.
Second, TPM values below a relative quantity threshold are set to a lower threshold value. This threshold is set manually. The rationale depends on the read depth and the granularity of digital reads. By the definition of the FPKM value (fragments per kilobase of exon per million fragments mapped), each identified transcript, or group of isoforms, has one or more mapped fragments, or none. Although probabilities can be assigned to isoforms which share the same read sequence, leading to fractions of fragments, a lower threshold sets a nonzero baseline which is experimentally indistinguishable from true zero. This is important not only for the correct interpretation of the results but also for calculating ratios of expression values of transcripts below the detection level. A change in read depth is a reason to change the threshold.
Attention! Different threshold settings can affect the calculation of fold-changes of transcripts in the low abundance range.
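The effect of the threshold on fold changes can be illustrated with a short sketch. It assumes that the threshold is applied as a simple floor; the function names and all numbers are hypothetical and not taken from SIRV Suite.

```python
import math

def apply_lower_threshold(tpm, threshold):
    """Raise every TPM value below `threshold` to `threshold` (assumed floor rule)."""
    return max(tpm, threshold)

def log2_fold_change(tpm_a, tpm_b, threshold):
    """Log2-fold change between two samples after flooring both values."""
    a = apply_lower_threshold(tpm_a, threshold)
    b = apply_lower_threshold(tpm_b, threshold)
    return math.log2(a / b)

# Hypothetical low-abundance transcript: 8 TPM in one sample, 0.5 TPM in the other.
lfc_low_threshold = log2_fold_change(8.0, 0.5, threshold=0.25)  # floor not reached
lfc_high_threshold = log2_fold_change(8.0, 0.5, threshold=2.0)  # 0.5 is raised to 2.0
```

With these illustrative values the same pair of measurements gives a log2-fold change of 4 at the lower threshold and 2 at the higher one. The floor also keeps the ratio defined when one or both values are zero.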
Third, SIRV reads are normalized such that the measured and the expected sums of molecules in the SIRV mix are identical. In this way the comparison of relative concentration measures is uncoupled from that of absolute ones. Absolute read counts are used separately in the read count statistics to measure, e.g., mRNA content or technical variability.
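One way to picture this normalization is a single scaling factor applied to all measured SIRV values. The sketch below assumes exactly that; the transcript IDs, values, and function name are made up for illustration.

```python
def normalize_to_expected_sum(measured, expected):
    """Scale measured SIRV values so that their sum equals the expected sum."""
    total_measured = sum(measured.values())
    if total_measured == 0:
        raise ValueError("no SIRV signal measured")
    factor = sum(expected.values()) / total_measured
    return {sirv: value * factor for sirv, value in measured.items()}

# Hypothetical expected relative amounts and measured values for three SIRVs.
expected = {"SIRV_A": 1.0, "SIRV_B": 1.0, "SIRV_C": 2.0}
measured = {"SIRV_A": 30.0, "SIRV_B": 10.0, "SIRV_C": 40.0}
scaled = normalize_to_expected_sum(measured, expected)
```

After scaling, the relative proportions between the SIRVs are unchanged; only the total is matched to the expected total.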
Using the read statistics the Evaluator calculates:
- Ratio between expected and measured reads relative to the reads from the endogenous RNA, which is interpreted as, e.g., mRNA content, or experimental variability,
The Evaluator generates the experiment quality metrics and visualizations which include:
- Compact Coverage Visualization (CCV) graphs which provide an overview of the normalized expected and measured coverages,
- Coefficients of Deviation (CoD) as a measure for the deviation between measured and expected coverages,
- Table with normalized counts of identified vs. expected telling junction reads,
Based on the scaled TPM values the Evaluator calculates further:
- Precision as the mean of all relative SIRV standard deviations (RSD),
- Accuracy and Differential Accuracy as the median of the moduli of all log2-fold changes between the measured and the expected relative SIRV concentrations,
The interface of the Evaluator allows the user to select the experimental data and to define which quality metrics are calculated and shown in the report.
Figure 4 ǀ SIRV Suite Evaluator. Screenshot showing the form for entering the experiment, which links the NGS data to the metadata from the Experiment Specifier and selects the data evaluations that enter the report.
The following chapters explain the quality metrics.
Assuming that the endogenous RNA and the spike-in controls are equally targeted by the library preparation due to similar length distributions (the mean length of the SIRVs is 1134 nt, and UHRR experiments show an average length of 1550 nt), the relative mass partition between controls and endogenous RNA can be determined. The propagation of the input amounts to the output read ratio depends on the mRNA content, the relative recovery efficiencies of controls and mRNA, and the experimental variability of spiking a sample with controls.
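The relation between read partition and mRNA content can be sketched under simplifying assumptions: reads are taken to be proportional to input mass, and controls and mRNA are assumed to be recovered with equal efficiency. The function name and all numbers below are hypothetical.

```python
def estimate_mrna_fraction(sirv_reads, endogenous_reads, sirv_mass_ng, total_rna_mass_ng):
    """Estimate the mRNA fraction of total RNA from the read partition.

    Assumes reads are proportional to mass and that controls and mRNA
    are recovered with equal efficiency (a simplification).
    """
    if sirv_reads <= 0:
        raise ValueError("no reads mapped to the controls")
    # Mass of endogenous mRNA implied by the read ratio.
    mrna_mass_ng = sirv_mass_ng * endogenous_reads / sirv_reads
    return mrna_mass_ng / total_rna_mass_ng

# Hypothetical experiment: 0.06 ng controls spiked into 100 ng total RNA,
# 2 % of the mapped reads assigned to the controls.
fraction = estimate_mrna_fraction(20_000, 980_000, 0.06, 100.0)
```

With these made-up inputs the estimate is about 2.9 % mRNA in total RNA. Deviations from the stated assumptions, such as different recovery efficiencies or pipetting variability, shift this estimate directly.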
Compact Coverage Visualizations (CCV)
CCVs show all expected SIRV coverages and counts of the telling junctions together with the measured coverages, as obtained by the respective mapping, in a scaled and compact format. Mutual intron sequences are reduced to small gaps of equal length. This visualization focuses on the relevant, well-covered sequences: the small subset of seven representative genes provides one unique, comparable overview for a first sanity check of the experiment’s performance. In contrast, inspecting randomly chosen genes, which could be affected by differential expression, is not a systematic approach for comparing experiments. The measured and expected junction counts are scaled such that they always add up to the same sum, which makes them easier to compare. All mutually exclusive telling junction reads have an expected normalized read count of 1 in the SIRV mix E0, and so forth. Over- and under-represented junctions can be seen immediately.
Coefficient of Deviation (CoD)
For the first time the ground truth of complex input sequences is known, allowing detailed target-performance comparisons for read alignment, relative abundance calculation, and differential expression measurements. NGS workflow-specific read start-site distributions lead to coverage patterns with inherent terminal deficiencies, for which the expected coverage has been adjusted accordingly. However, in the measured coverages these systematic start- and end-site biases are accompanied by a variety of other biases which introduce severe local deviations from the expected coverage. To obtain a comparative measure, gene-specific coefficients of deviation (CoD) are calculated and presented for the plus strand (+) and for the minus strand (-) annotation. The CoD values of all 7 genes are averaged into one combined CoD value, one CoD value for the sense transcripts (fwd), and one CoD value for the less frequently occurring and simpler antisense transcripts (rev).
CoDs describe the often hidden biases in the sequence data, which are predominantly caused by inhomogeneous library preparation but also by the subsequent sequencing and mapping. The coverage target-performance comparisons highlight the inherent difficulties in deconvoluting read distributions to correctly identify transcript variants and determine concentrations. The distributions of telling reads, splice junctions, and reads towards the termini serve as references for the assignment of the remaining reads before relative transcript variant abundances are calculated. Not surprisingly, although the majority of junction counts might correlate well with the expected values, higher CoD values indicate that numerous telling reads deviate significantly within the context of the individual mixes. If CoD values also differ between mixes, the coverage differences will affect differential expression measurements.
The CoD does not distinguish between periodicity and randomness in the biases, nor does it forecast how well a data evaluation pipeline can cope with bias contributions. Nevertheless, smaller CoD values are expected to correlate with a simpler and less error-prone data evaluation. The CoD values can be taken as a first, indicative measure to characterize the mapped data and to compare data sets for similarity up to this point in the workflow.
Precision
measures the scatter of calculated abundance values. Using the technical replicates, the relative standard deviation (RSD), or coefficient of variation (CV), of the log2-fold changes (LFC) between the measured and the expected values is calculated for each SIRV. The precision is the mean of the RSDs of all SIRVs. The individual SIRV concentrations cover a narrow concentration range of two orders of magnitude. Therefore the precision is not dominated by low-abundance species, which scatter much more. However, because the SIRV controls can be spiked in at different relative amounts, the precision of different concentration ranges can be probed with adequately designed spike-in experiments. Precision can also be determined using the RSD values of endogenous RNA in the concentration range of interest.
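The precision calculation can be sketched as follows. As a simplifying assumption, the RSD is computed here directly on the replicate abundance values of each SIRV (sample standard deviation divided by the mean); the quantity SIRV Suite uses internally may differ. All IDs and values are made up.

```python
from statistics import mean, stdev

def relative_standard_deviation(values):
    """Sample standard deviation divided by the mean of the replicate values."""
    m = mean(values)
    if m == 0:
        raise ValueError("RSD is undefined for a mean of zero")
    return stdev(values) / m

def precision(replicate_values):
    """Mean of the per-SIRV RSDs across all SIRVs."""
    return mean(relative_standard_deviation(v) for v in replicate_values.values())

# Hypothetical abundance values from three technical replicates per SIRV.
replicate_tpm = {
    "SIRV_A": [90.0, 100.0, 110.0],
    "SIRV_B": [45.0, 50.0, 55.0],
}
overall_precision = precision(replicate_tpm)
```

Both example SIRVs scatter by 10 % around their mean, so the overall precision is 0.1 even though their abundance levels differ by a factor of two.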
Accuracy
measures the deviation of the calculated abundance values from the expected values. Accuracy can only be measured in comparison to the known controls. The accuracy is the median of all LFC moduli. LFC moduli treat relative increases and decreases equally across the probed concentration range. The accuracy shows the average fold-deviation between measured and expected values. Although the median, mean, and standard deviation of the LFC moduli are all presented in the report as one way to describe the distribution of values, the median is the value most robust against outliers, whose extent can shift when threshold settings are changed.
Boxplot correlations highlight the issue of obvious and frequent outliers, i.e., transcript variants that are not well resolved. However, only the heat map allows each SIRV to be inspected in the context of competing transcripts, and it shows the abundances as LFC relative to the expected values. An LFC window of ±0.11 represents the SIRV confidence interval resulting from the currently achievable accuracy in producing the SIRV mixtures (read more about producing SIRV mixes in the FAQ section). Besides the boxplots, standard correlation plots including Pearson p-values and R² values are presented in the report to show the distribution of calculated concentration values in an alternative overview.
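A minimal sketch of the accuracy calculation as the median of the LFC moduli is given below. It assumes that all measured values are positive, for example after the lower threshold has been applied. The IDs, values, and function names are illustrative only.

```python
import math
from statistics import mean, median

def lfc_moduli(measured, expected):
    """Absolute log2-fold changes between measured and expected values."""
    return [abs(math.log2(measured[sirv] / expected[sirv])) for sirv in expected]

def accuracy(measured, expected):
    """Median of all LFC moduli."""
    return median(lfc_moduli(measured, expected))

# Hypothetical relative concentrations; SIRV_C is a strong outlier.
expected = {"SIRV_A": 1.0, "SIRV_B": 1.0, "SIRV_C": 1.0}
measured = {"SIRV_A": 2.0, "SIRV_B": 0.5, "SIRV_C": 8.0}

median_modulus = accuracy(measured, expected)
mean_modulus = mean(lfc_moduli(measured, expected))
```

A twofold increase and a twofold decrease both give a modulus of 1. The outlier raises the mean of the moduli to about 1.67, while the median stays at 1, which illustrates why the median is the more robust summary.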
Differential Accuracy (diff. Acc)
measures the deviation of the calculated abundance LFCs from the expected LFCs. The differential accuracy is the median of all LFC moduli of the measured differential expression (DE) values versus the expected DE values. It can only be calculated for the differences between the SIRV mixes E0, E1, and E2. The expected fold changes range from 1/64 (submix 3 between E2 and E1) to 16 (submix 1 between E2 and E1), which corresponds to LFCs from -6 to +4. The highest measurable LFC values are determined by the threshold level.
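The differential accuracy can be sketched in the same way, comparing the measured ratio between two mixes with the expected ratio for each SIRV. The assignment of SIRVs to ratios and all measured values below are hypothetical; only the expected ratios 16 and 1/64 are taken from the text, and measured values are assumed to be positive.

```python
import math
from statistics import median

def differential_accuracy(measured_mix_a, measured_mix_b, expected_ratio):
    """Median modulus of log2(measured ratio / expected ratio) over all SIRVs."""
    moduli = []
    for sirv, ratio in expected_ratio.items():
        measured_ratio = measured_mix_a[sirv] / measured_mix_b[sirv]
        moduli.append(abs(math.log2(measured_ratio / ratio)))
    return median(moduli)

# Expected ratios between two mixes (hypothetical SIRV-to-submix assignment).
expected_ratio = {"SIRV_A": 16.0, "SIRV_B": 1 / 64, "SIRV_C": 1.0}
# Hypothetical measured values in the two mixes.
mix_e2 = {"SIRV_A": 32.0, "SIRV_B": 1.0, "SIRV_C": 4.0}
mix_e1 = {"SIRV_A": 4.0, "SIRV_B": 32.0, "SIRV_C": 4.0}

diff_acc = differential_accuracy(mix_e2, mix_e1, expected_ratio)
```

In this example the measured ratios are 8 and 1/32 instead of 16 and 1/64, so both extreme ratios are compressed by a factor of two. This is the kind of compression a lower threshold can cause at the low-abundance end.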