Mix2 RNA-Seq Data Analysis Software
A software tool for the accurate estimation of RNA concentration from RNA-Seq data.
Fragment bias in RNA-Seq poses a serious challenge to the accurate quantification of gene isoforms. Mix2 makes no assumptions about coverage bias; instead, it fits a mixture model to the data for each gene isoform (Fig. 1). Mix2 can therefore accurately represent, for instance, the 5’ bias shown in Fig. 1 (a and b), whereas Cufflinks is restricted to the uniform distribution (Fig. 1c).
Figure 1 | Exemplary representation of positional fragment bias over a 2000 bp transcript, modeled with a mixture of 8 normal distributions. (a) The green curve shows the combined probability density function over the whole transcript, while the blue curves show the individual mixture components. Panels (b) and (c) display fragment distributions in a locus with two transcripts sharing one junction, as modeled by Mix2 and Cufflinks, respectively. The long and short transcripts start at 5000 and 5500 bp from the beginning of the locus and are 2000 and 1000 bp long, respectively. The junction spans the 6000 to 6499 bp region.
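The mixture in Fig. 1a can be illustrated with a minimal Python sketch. It only evaluates a weighted sum of 8 normal densities along a 2000 bp transcript. It does not reproduce how Mix2 fits the mixture to data. The component means, standard deviations, and weights are hypothetical values chosen to produce a 5’ bias, and the function names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Combined density: weighted sum of the individual component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Assumed setup (not taken from Mix2): 8 components spread evenly
# over a 2000 bp transcript, each with the same standard deviation.
TRANSCRIPT_LENGTH = 2000
N_COMPONENTS = 8
means = [(k + 0.5) * TRANSCRIPT_LENGTH / N_COMPONENTS for k in range(N_COMPONENTS)]
sds = [TRANSCRIPT_LENGTH / N_COMPONENTS] * N_COMPONENTS

# Hypothetical weights that decrease from the 5' end to the 3' end,
# normalized so that they sum to 1.
raw_weights = [N_COMPONENTS - k for k in range(N_COMPONENTS)]
weights = [r / sum(raw_weights) for r in raw_weights]

# Combined density at every position of the transcript. The density is not
# truncated or renormalized to the transcript boundaries in this sketch.
density = [mixture_pdf(pos, weights, means, sds) for pos in range(TRANSCRIPT_LENGTH)]
```

With these assumed weights, the combined density is higher near the 5’ end than near the 3’ end. Making all weights equal gives a much flatter curve, closer to the uniform case in Fig. 1c.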
The Mix2 software yields accurate isoform quantification from RNA-Seq data
Implementation and run-time performance
The Mix2 software runs as a 64-bit Linux command-line tool. For an up-to-date list of supported distributions, please refer to the User Guide of the Mix2 software.
|Mix2||Cufflinks w/o bias correction||Cufflinks with bias correction|
Table 1 | Memory usage and average run-time statistics on the MAQC UHR and HBR datasets. Min stands for run-time in minutes, GB for memory usage in gigabytes. xRT and xMEM are the factors by which run-time and memory usage, respectively, increase in comparison to Mix2.
Mix2 was tested on the publicly available MicroArray Quality Control (MAQC) and Association of Biomolecular Resource Facilities (ABRF) datasets, which contain RNA-Seq data from multiple sequencing facilities and from library preparations that started with RNA at different levels of degradation.
The higher accuracy of the concentration estimates of Mix2 leads to better correlation between qPCR and FPKM fold-changes and consequently to higher accuracy in the detection of differential expression (Fig. 2).
Figure 2 | Correlation between qPCR and FPKM fold changes between UHR and HBR RNA for Mix2 vs Cufflinks, and the ROC curve for a classification experiment based on FPKM values of UHR and HBR RNA lanes. Since the FPKM and qPCR fold changes should be identical, the range of FPKM fold changes was restricted to the range of qPCR values, as shown in (a) and (b), and thus to a range between 10⁻⁴ and 10³. (b) Cufflinks produces a large number of transcripts whose FPKM fold change lies considerably above or below the majority, as can be seen from the long straight clusters at FPKM fold changes of 10⁻⁴ and 10³. The Mix2 model, on the other hand, greatly improves the correlation between qPCR and FPKM fold changes for the UHR and HBR RNA samples and, as shown in the classification experiment (c), leads to a substantially higher accuracy in the detection of differential expression. The dotted line in (c) indicates a false positive rate of 0.1.
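The comparison described in the Figure 2 caption can be sketched in Python as follows. The sketch assumes that the restriction to the qPCR range is a simple clipping of fold changes to the interval from 10⁻⁴ to 10³, and that the correlation is a Pearson correlation on log10 fold changes. Neither detail is confirmed by the text above. The function names are illustrative and the input values are made-up toy numbers, not results from the MAQC data.

```python
import math

# Fold-change range quoted in the Figure 2 caption.
LOW, HIGH = 1e-4, 1e3

def clip_fold_change(fc, low=LOW, high=HIGH):
    """Restrict a fold change to [low, high] (assumed to be simple clipping)."""
    return min(max(fc, low), high)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_fold_change_correlation(qpcr_fc, fpkm_fc):
    """Correlate clipped fold changes on a log10 scale."""
    x = [math.log10(clip_fold_change(v)) for v in qpcr_fc]
    y = [math.log10(clip_fold_change(v)) for v in fpkm_fc]
    return pearson(x, y)

# Made-up toy fold changes (UHR vs HBR) for five hypothetical transcripts.
qpcr = [0.01, 0.5, 1.0, 8.0, 100.0]
fpkm = [1e-9, 0.4, 1.2, 10.0, 1e6]
r = log_fold_change_correlation(qpcr, fpkm)
```

Under this clipping assumption, extreme FPKM fold changes such as the first and last toy values end up exactly on the range limits, which is how straight clusters at 10⁻⁴ and 10³ would arise in a scatter plot.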