A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

Dana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese, Shan Jiang, Sorena Rahmanian, Stefania Forner, Dina Matheos, Weihua Zeng, Brian Williams, Diane Trout, Whitney England, Shu-Hui Chu, Robert C. Spitale, Andrea J. Tenner, Barbara J. Wold, Ali Mortazavi

bioRxiv, doi:10.1101/672931

Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we introduce TALON, the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes. We apply TALON to the GM12878 cell line and show that while both PacBio and ONT technologies perform well at full-transcript discovery and quantification, each displayed distinct technical artifacts. We further apply TALON to mouse hippocampus and cortex transcriptomes and find that 422 genes found in these regions have more reads associated with novel isoforms than with annotated ones. We demonstrate that TALON is capable of tracking both known and novel transcript models as well as their expression levels across datasets, both in simple studies and in larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

Pi-starvation induced transcriptional changes in barley revealed by a comprehensive RNA-Seq and degradome analyses

Pawel Sega, Katarzyna Kruszka, Dawid Bielewicz, Wojciech Karlowski, Przemyslaw Nuc, Zofia Szweykowska-Kulinska, Andrzej Pacak

ResearchSquare, doi:10.21203/rs.2.24665/v1

Background: Small RNAs (sRNAs) are 18–24 nt regulatory elements which are responsible for plant development regulation and participate in many plant stress responses. Insufficient inorganic phosphate (Pi) concentration triggers plant responses to balance the internal Pi level.

Results: In this study, we describe Pi-starvation-responsive small RNAs and transcriptome changes in barley (Hordeum vulgare L.) using Next-Generation Sequencing (NGS) data derived from three different types of NGS libraries: (i) small RNAs, (ii) degraded RNAs, and (iii) functional mRNAs. We find that differentially and significantly expressed miRNAs (DEMs, p-value < 0.05) are represented by 162 (44.88 % of total differentially expressed small RNAs) molecules in shoot and 138 (7.14 %) in root; mainly various miR399 and miR827 isomiRs. The remaining small RNAs (i.e., those without perfect match to reference sequences deposited in miRBase) are considered as differentially expressed other sRNAs (DESs, Bonferroni correction). In roots, a more abundant and diverse set of other sRNAs (1796 unique sequences, 0.13 % from total unique reads obtained under low-Pi) contributes more to the compensation of low-Pi stress than that in shoots (199 unique sequences, 0.01 %). More than 80 % of differentially expressed other sRNAs are upregulated in both organs. Additionally, in barley shoots, upregulation of small RNAs is accompanied by strong induction of two nucleases (S1/P1 endonuclease and 3’-5’ exonuclease). This suggests that most small RNAs may be generated upon endonucleolytic cleavage to increase the internal Pi pool. Transcriptomic profiling of Pi-starved barley shoots identifies 98 differentially expressed genes (DEGs). A majority of the DEGs possess characteristic Pi-responsive cis-regulatory elements (P1BS and/or PHO element), located mostly in the proximal promoter regions. GO analysis shows that the discovered DEGs primarily alter plant defense, plant stress response, nutrient mobilization, or pathways involved in the gathering and recycling of phosphorus from organic pools.

Conclusions: Our results provide comprehensive data to demonstrate complex responses at the RNA level in barley to maintain Pi homeostasis and indicate that barley adapts to Pi scarcity through elicitation of RNA degradation. Novel P-responsive genes were selected as putative candidates to overcome low-Pi stress in barley plants.

Features SIRVs (Spike-in RNA Variant Control Mixes) and SENSE mRNA-Seq Library Prep Kit

Reference-free reconstruction and quantification of transcriptomes from long-read sequencing

Ivan de la Rubia, Joel A. Indi, Silvia Carbonell, Julien Lagarde, M Mar Albà, Eduardo Eyras

bioRxiv, doi:10.1101/2020.02.08.939942

Single-molecule long-read sequencing provides an unprecedented opportunity to measure the transcriptome from any sample. However, current methods for the analysis of transcriptomes from long reads rely on the comparison with a genome or transcriptome reference, or use multiple sequencing technologies. These approaches preclude the cost-effective study of species with no reference available, and the discovery of new genes and transcripts in individuals underrepresented in the reference. Methods for the assembly of DNA long-reads cannot be directly transferred to transcriptomes since their consensus sequences lack the interpretability as genes with multiple transcript isoforms. To address these challenges, we have developed RATTLE, the first method for the reference-free reconstruction and quantification of transcripts from long reads. Using simulated data, transcript isoform spike-ins, and sequencing data from human and mouse tissues, we demonstrate that RATTLE accurately performs read clustering and error-correction. Furthermore, RATTLE predicts transcript sequences and their abundances with accuracy comparable to reference-based methods. RATTLE enables rapid and cost-effective long-read transcriptomics in any sample and any species, without the need of a genome or annotation reference and without using additional technologies.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error-correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain an accuracy of 98.7-99.5%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

The full-length transcriptome of C. elegans using direct RNA sequencing

Nathan P. Roach, Norah Sadowski, Amelia F. Alessi, Winston Timp, James Taylor, and John K. Kim

Genome Research, doi:10.1101/gr.251314.119

Current transcriptome annotations have largely relied on short read lengths intrinsic to the most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. We applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans. Taking advantage of long reads spanning the full length of mRNA transcripts, we provide support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. Of the isoforms identified, 3452 are novel splice isoforms not present in the WormBase WS265 annotation. Furthermore, we identified 16,342 isoforms in the 3′ untranslated region (3′ UTR), 2640 of which are novel and do not fall within 10 bp of existing 3′-UTR data sets and annotations. Combining 3′ UTRs and splice isoforms, we identified 28,858 full-length transcript isoforms. We also determined that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level, and poly(A) tail length and 3′-UTR length. Finally, we have formatted this data as a publicly accessible track hub, enabling researchers to explore this data set easily in a genome browser.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

Principles of Immunopharmacology provides a unique source of essential knowledge on the immune response, its diagnosis and its modification by drugs and chemicals. The 4th edition of this internationally recognized textbook has been revised to include recent developments, but continues the established format, dealing with four related fields in a single volume, thus obviating the need to refer to several different textbooks.

The first section of the book, providing a basic introduction to immunology and its relevance for human disease, has been updated to accommodate new immunological concepts, particularly the role of epigenetics and the latest understanding of cancer immunology. The second section on immunodiagnostics offers a topical description of widely used molecular techniques and a new chapter on imaging techniques. This is followed by a systematic coverage of drugs affecting the immune system, including natural products. This third section contains 15 updated chapters, covering classical immunopharmacological topics such as anti-asthmatic, anti-rheumatic and immunosuppressive drugs, but also deals with antibiotics, plant-derived and dietary agents, with new chapters on monoclonal antibodies, immunotherapy in sepsis and infection, drugs for soft-tissue autoimmunity and cell therapy. The book concludes with a chapter on immunotoxicology and drug safety tests.

Aids to the reader include a two-column format, glossaries of technical terms and appendix reference tables. The emphasis on illustrations is maintained from the first three editions.

The book is a valuable single reference for undergraduate and graduate medical and biomedical students, postgraduate chemistry and pharmacy students, researchers in chemistry, biochemistry and the pharmaceutical industry and researchers lacking basic immunological knowledge, who want to understand the actions of drugs on the immune system.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

The complexity in biological data reflects the heterogeneous nature of biological processes. Computational methods need to preserve as much information regarding the biological process of interest as possible. In this work, we explore three specific tasks about resolving biological heterogeneity. The first task is to infer heterogeneous phylogenetic relationships using molecular data. The common likelihood models for phylogenetic inference often make strong assumptions about the evolution process across different lineages and different mutation sites. We use a convolutional neural network to infer phylogenies instead, allowing the model to describe more heterogeneous evolution processes. The model outperforms commonly used algorithms on diverse simulation datasets. The second task is to infer the clonal composition and phylogeny from bulk DNA sequencing data of tumour samples. Estimating clonal information from bulk data often involves resolving mixture models. Unfortunately, simpler models are often unable to capture complex genetic alteration events in tumour cells, while more sophisticated models incur heavy computational burdens and are hard to converge. We solve the challenge through density-hinted optimization with post hoc adjustment. The model makes conservative predictions but yields better accuracy in assessing co-clustering relationships among the somatic mutations. The third task is to estimate the abundance of splicing transcripts from full-length single-cell RNA sequencing data. Transcript inference from RNA sequencing data needs a plethora of reads for accurate abundance estimation. Yet single-cell sequencing yields far fewer reads than bulk sequencing. To recover transcripts from full-length single-cell RNA sequencing data, we pool reads from similar cells to help assign transcripts without disrupting the cluster structures. These methods describe complex biological processes with minimal runtime overhead. Taking these methods as examples, we will briefly discuss the rationale and some general principles in designing these methods.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

SMUG1 Promotes Telomere Maintenance through Telomerase RNA Processing

Penelope Kroustallaki, Lisa Lirussi, Sergio Carracedo, Panpan You, Q. Ying Esbensen, Alexandra Götz, Laure Jobert, Lene Alsøe, Pål Sætrom, Sarantis Gagos, Hilde Nilsen

Cell Reports, doi:10.1016/j.celrep.2019.07.040

Telomerase biogenesis is a complex process where several steps remain poorly understood. Single-strand-selective uracil-DNA glycosylase (SMUG1) associates with the DKC1-containing H/ACA ribonucleoprotein complex, which is essential for telomerase biogenesis. Herein, we show that SMUG1 interacts with the telomeric RNA component (hTERC) and is required for co-transcriptional processing of the nascent transcript into mature hTERC. We demonstrate that SMUG1 regulates the presence of base modifications in hTERC, in a region between the CR4/CR5 domain and the H box. Increased levels of hTERC base modifications are accompanied by reduced DKC1 binding. Loss of SMUG1 leads to an imbalance between mature hTERC and its processing intermediates, leading to the accumulation of 3′-polyadenylated and 3′-extended intermediates that are degraded in an EXOSC10-independent RNA degradation pathway. Consequently, SMUG1-deprived cells exhibit telomerase deficiency, leading to impaired bone marrow proliferation in Smug1-knockout mice.

Features SIRVs (Spike-in RNA Variant Control Mixes) and SENSE Total RNA-Seq Library Prep Kit

ARID1A and PI3-kinase pathway mutations in the endometrium drive epithelial transdifferentiation and collective invasion

Mike R. Wilson, Jake J. Reske, Jeanne Holladay, Genna E. Wilber, Mary Rhodes, Julie Koeman, Marie Adams, Ben Johnson, Ren-Wei Su, Niraj R. Joshi, Amanda L. Patterson, Hui Shen, Richard E. Leach, Jose M. Teixeira, Asgerally T. Fazleabas & Ronald L. Chandler

Nature Communications, doi:10.1038/s41467-019-11403-6

ARID1A and PI3-Kinase (PI3K) pathway alterations are common in neoplasms originating from the uterine endometrium. Here we show that monoallelic loss of ARID1A in the mouse endometrial epithelium is sufficient for vaginal bleeding when combined with PI3K activation. Sorted mutant epithelial cells display gene expression and promoter chromatin signatures associated with epithelial-to-mesenchymal transition (EMT). We further show that ARID1A is bound to promoters with open chromatin, but ARID1A loss leads to increased promoter chromatin accessibility and the expression of EMT genes. PI3K activation partially rescues the mesenchymal phenotypes driven by ARID1A loss through antagonism of ARID1A target gene expression, resulting in partial EMT and invasion. We propose that ARID1A normally maintains endometrial epithelial cell identity by repressing mesenchymal cell fates, and that coexistent ARID1A and PI3K mutations promote epithelial transdifferentiation and collective invasion. Broadly, our findings support a role for collective epithelial invasion in the spread of abnormal endometrial tissue.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes

Charlotte Soneson, Yao Yao, Anna Bratus-Neuenschwander, Andrea Patrignani, Mark D. Robinson & Shobbir Hussain

Nature Communications, doi:10.1038/s41467-019-11272-z

A platform for highly parallel direct sequencing of native RNA strands was recently described by Oxford Nanopore Technologies, but despite initial efforts it remains crucial to further investigate the technology for quantification of complex transcriptomes. Here we undertake native RNA sequencing of polyA+ RNA from two human cell lines, analysing ~5.2 million aligned native RNA reads. To enable informative comparisons, we also perform relevant ONT direct cDNA- and Illumina-sequencing. We find that while native RNA sequencing does enable some of the anticipated advantages, key unexpected aspects currently hamper its performance, most notably the quite frequent inability to obtain full-length transcripts from single reads, as well as difficulties to unambiguously infer their true transcript of origin. While characterising issues that need to be addressed when investigating more complex transcriptomes, our study highlights that with some defined improvements, native RNA sequencing could be an important addition to the mammalian transcriptomics toolbox.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

Targeting destabilized DNA G-quadruplexes and aberrant splicing in drug-resistant glioblastoma

Deanna M Tiek, Roham Razaghi, Lu Jin, Norah Sadowski, Carla Alamillo-Ferrer, J Robert Hogg, Bassem R Haddad, David H Drewry, Carrow I Wells, Julie E. Pickett, William J Zuercher, Winston Timp, Rebecca B Riggins

bioRxiv, doi:10.1101/661660

Temozolomide (TMZ) is a chemotherapy agent that adds mutagenic adducts to guanine, and is first-line standard of care for the aggressive brain cancer glioblastoma (GBM). Methyl guanine methyl transferase (MGMT) is a DNA repair enzyme that can remove O6-methyl guanine adducts prior to the development of catastrophic mutations, and is associated with TMZ resistance. However, inhibition of MGMT fails to reverse TMZ resistance. Guanines are essential nucleotides in many DNA and RNA secondary structures. In several neurodegenerative diseases (NDs), disruption of these secondary structures is pathogenic. We therefore took a structural view of TMZ resistance, seeking to establish the role of guanine mutations in disrupting critical nucleotide secondary structures. To test whether these have functional impacts on TMZ-resistant GBM, we focused on two specific guanine-rich regions: G-quadruplexes (G4s) and splice sites. Here we report broad sequence- and conformation-based changes in G4s in acquired or intrinsic TMZ resistant vs. sensitive GBM cells, accompanied by nucleolar stress and enrichment of nucleolar RNA:DNA hybrids (r-loops). We further show widespread splice-altering mutations, exon skipping, and deregulation of splicing-regulatory serine/arginine rich (SR) protein phosphorylation in TMZ-resistant GBM cells. The G4-stabilizing ligand TMPyP4 and a novel inhibitor of cdc2-like kinases (CLKs) partially normalize G4 structure and SR protein phosphorylation, respectively, and are preferentially growth-inhibitory in TMZ-resistant cells. Lastly, we report that the G4- and RNA-binding protein EWSR1 forms aberrant cytoplasmic aggregates in response to acute TMZ treatment, and these aggregates are abundant in TMZ resistant cells. Preliminary evidence suggests these cytoplasmic EWSR1 aggregates are also present in GBM clinical samples. This work supports altered nucleotide secondary structure and splicing deregulation as pathogenic features of TMZ-resistant GBM. It further positions cytoplasmic aggregation of EWSR1 as a potential indicator for TMZ resistance, establishes the possibility of successful intervention with splicing modulatory or G4-targeting agents, and provides a new context in which to study aggregating RNA binding proteins.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

ORF Capture-Seq: a versatile method for targeted identification of full-length isoforms

Gloria M. Sheynkman, Katharine S. Tuttle, Elizabeth Tseng, Jason G. Underwood, Liang Yu, Da Dong, Melissa L. Smith, Robert Sebra, Tong Hao, Michael A. Calderwood, David E. Hill, Marc Vidal

bioRxiv, doi:10.1101/604157

Most human protein-coding genes are expressed as multiple isoforms. This in turn greatly expands the functional repertoire of the encoded proteome. While at least one reliable open reading frame (ORF) model has been assigned for every gene, the majority of alternative isoforms remain uncharacterized experimentally. This is primarily due to: i) vast differences of overall levels between different isoforms expressed from common genes, and ii) the difficulty of obtaining contiguous full-length ORF sequences. Here, we present ORF Capture-Seq (OCS), a flexible and cost-effective method that addresses both challenges for targeted full-length isoform sequencing applications using collections of cloned ORFs as probes. As proof-of-concept, we show that an OCS pipeline focused on genes coding for transcription factors increases isoform detection by an order of magnitude, compared to an unenriched sample. In short, OCS enables rapid discovery of isoforms from custom-selected genes and will allow mapping of the full set of human isoforms at reasonable cost.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1, 2 & 3

Human genes have numerous exons that are differentially spliced within pre-mRNA. Understanding how multiple splicing events are coordinated across nascent transcripts requires quantitative analyses of transient RNA processing events in living cells. We developed nanopore analysis of CO-transcriptional Processing (nano-COP), in which nascent RNAs are directly sequenced through nanopores, exposing the dynamics and patterns of RNA splicing without biases introduced by amplification. nano-COP showed that in both human and Drosophila cells, co-transcriptional splicing occurs after RNA polymerase II transcribes several kilobases of pre-mRNA, suggesting that metazoan splicing transpires distally from the transcription machinery. Inhibition of the branch-site recognition complex SF3B globally abolished co-transcriptional splicing in both species. Our findings revealed that splicing order does not strictly follow the order of transcription and is influenced by cis-regulatory elements. In human cells, introns with delayed splicing frequently neighbor alternative exons and are associated with RNA-binding factors. Moreover, neighboring introns in human cells tend to be spliced concurrently, implying that splicing occurs cooperatively. Thus, nano-COP unveils the organizational complexity of metazoan RNA processing.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1

Single-cell RNA sequencing (scRNA-seq) has become an established approach to profile entire transcriptomes of individual cells from different cell types, tissues, species, and organisms. Single-cell tagged reverse transcription sequencing (STRT-seq) is one of the early single-cell methods, which utilizes 5′ tag counting of transcripts. STRT-seq performed on the microfluidics Fluidigm C1 platform (STRT-C1) is a flexible scRNA-seq approach that allows for accurate, sensitive and, importantly, molecular counting of transcripts at the single-cell level. Herein, I describe the STRT-C1 method and the steps involved in capturing 96 cells across a C1 microfluidics chip, cDNA synthesis, and preparing single-cell libraries for Illumina short-read sequencing.

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1

The RNA-to-cDNA conversion step in transcriptomics experiments is widely recognised as inefficient and variable, casting doubt on the ability to do quantitative transcriptomics analyses. Multiple studies have focused on ways to optimise this process, resulting in contradictory recommendations. Here we explore the problem of reverse transcription efficiency using digital PCR and the RT method’s impact on subsequent data analysis. Using synthetic RNA standards, an example experiment is presented, outlining a method to (1) determine relevant efficiency and variability values and then to (2) incorporate this information into downstream analyses as a way to improve the accuracy of quantitative transcriptomics experiments.
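The two steps named in the abstract above can be illustrated with a minimal Python sketch. It assumes a synthetic RNA standard spiked in at a known copy number and quantified by digital PCR in replicate RT reactions. The function names, the replicate values, and the simple divide-by-efficiency correction are hypothetical choices made for illustration, not the procedure described in the publication.

```python
from statistics import mean, stdev


def rt_efficiency(input_copies, measured_cdna_copies):
    """Step 1: estimate RT efficiency and its variability.

    input_copies: known number of RNA standard molecules added per reaction.
    measured_cdna_copies: cDNA copies counted by dPCR in each replicate.
    Returns (mean efficiency, coefficient of variation across replicates).
    """
    efficiencies = [m / input_copies for m in measured_cdna_copies]
    avg = mean(efficiencies)
    cv = stdev(efficiencies) / avg  # sample standard deviation / mean
    return avg, cv


def correct_count(observed_copies, efficiency):
    """Step 2: scale an observed cDNA count back to estimated RNA input."""
    return observed_copies / efficiency


# Hypothetical numbers: 10,000 standard molecules in, three replicate readouts.
eff, cv = rt_efficiency(10_000, [4_000, 5_000, 6_000])
corrected = correct_count(2_500, eff)
print(f"efficiency={eff:.2f}, CV={cv:.2f}, corrected copies={corrected:.0f}")
```

A per-target or per-condition efficiency would be a natural extension, since a single global factor is the simplest possible assumption.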

Features SIRVs (Spike-in RNA Variant Control Mixes) – SIRV-Set 1

High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples.
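The idea behind a concatemeric consensus, where several error-prone copies of the same molecule are combined into one more accurate read, can be sketched in a few lines of Python. This toy example assumes the repeat copies have already been split out and aligned to equal length with "-" for gaps; it then takes a per-column majority vote. The function name and inputs are hypothetical, and the sketch is not the R2C2 software, which relies on dedicated alignment and polishing tools.

```python
from collections import Counter


def column_consensus(aligned_subreads):
    """Majority-vote consensus over pre-aligned repeat copies of one molecule.

    aligned_subreads: list of equal-length strings over A/C/G/T and '-' (gap).
    Columns where the most common symbol is a gap are dropped.
    """
    if not aligned_subreads:
        raise ValueError("need at least one subread")
    length = len(aligned_subreads[0])
    if any(len(s) != length for s in aligned_subreads):
        raise ValueError("subreads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_subreads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Three copies of the same molecule; the second carries one substitution error.
copies = ["ACGTAC", "ACGAAC", "ACGTAC"]
print(column_consensus(copies))
```

With more copies per molecule, a random error at any one position is outvoted more reliably, which is why consensus accuracy improves with the number of repeats.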

Features SIRVs (Spike-in RNA Variant Control Mixes)

Highly parallel direct RNA sequencing on an array of nanopores

Daniel R Garalde, Elizabeth A Snell, Daniel Jachimowicz, Botond Sipos, Joseph H Lloyd, Mark Bruce, Nadia Pantic, Tigist Admassu, Phillip James, Anthony Warland, Michael Jordan, Jonah Ciccone, Sabrina Serra, Jemma Keenan, Samuel Martin, Luke McNeill, E Jayne Wallace, Lakmal Jayasinghe, Chris Wright, Javier Blasco, Stephen Young, Denise Brocklebank, Sissel Juul, James Clarke, Andrew J Heron & Daniel J Turner

Nature Methods, doi:10.1038/nmeth.4577

Sequencing the RNA in a biological sample can unlock a wealth of information, including the identity of bacteria and viruses, the nuances of alternative splicing or the transcriptional state of organisms. However, current methods have limitations due to short read lengths and reverse transcription or amplification biases. Here we demonstrate nanopore direct RNA-seq, a highly parallel, real-time, single-molecule method that circumvents reverse transcription or amplification steps. This method yields full-length, strand-specific RNA sequences and enables the direct detection of nucleotide analogs in RNA.

Features SIRVs (Spike-in RNA Variant Control Mixes)

SpaRC: Scalable Sequence Clustering using Apache Spark* REVIEW

Lizhen Shi, Xiandong Meng, Elizabeth Tseng, Michael Mascagni, Zhong Wang

bioRxiv, doi:10.1101/246496

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large scale sequence data analysis problems. The software is available under the Apache 2.0 license at

Features SIRVs (Spike-in RNA Variant Control Mixes)

Analysing transcriptomes of cell populations is a standard molecular biology approach to understand how cells function. Recent methodological development has allowed performing similar experiments on single cells. This has opened up the possibility to examine samples with limited cell number, such as cells of the early embryo, and to obtain an understanding of heterogeneity within populations such as blood cell types or neurons. There are two major approaches for single-cell transcriptome analysis: quantitative reverse transcription PCR (RT-qPCR) on a limited number of genes of interest, or more global approaches targeting entire transcriptomes using RNA sequencing. RT-qPCR is sensitive, fast and arguably more straightforward, while whole-transcriptome approaches offer an unbiased perspective on a cell’s expression status.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Single-cell transcriptomics serves as a powerful tool to identify cell states within populations of cells, and to dissect underlying heterogeneity at high resolution. Single-cell transcriptomics on pluripotent stem cells has provided new insights into cellular variation, subpopulation structures and the interplay of cell cycle with pluripotency. The single-cell perspective has helped to better understand gene regulation and regulatory networks during exit from pluripotency, cell-fate determination as well as molecular mechanisms driving cellular reprogramming of somatic cells to induced pluripotent stage. Here we review the recent progress and significant findings from application of single-cell technologies on pluripotent stem cells along with a brief outlook on new combinatorial single-cell approaches that further unravel pluripotent stem cell states.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Single-cell RNASeq (scRNASeq) has emerged as a powerful method for quantifying the transcriptome of individual cells. However, the data from scRNASeq experiments is often both noisy and high dimensional, making the computational analysis non-trivial. Here we provide an overview of different experimental protocols and the most popular methods for facilitating the computational analysis. We focus on approaches for identifying biologically important genes, projecting data into lower dimensions and clustering data into putative cell-populations. Finally we discuss approaches to validation and biological interpretation of the identified cell-types or cell-states.

Features SIRVs (Spike-in RNA Variant Control Mixes)

RNA sequencing (RNA-seq) is a genomic approach for the detection and quantitative analysis of messenger RNA molecules in a biological sample and is useful for studying cellular responses. RNA-seq has fueled much discovery and innovation in medicine over recent years. For practical reasons, the technique is usually conducted on samples comprising thousands to millions of cells. However, this has hindered direct assessment of the fundamental unit of biology: the cell. Since the first single-cell RNA-sequencing (scRNA-seq) study was published in 2009, many more have been conducted, mostly by specialist laboratories with unique skills in wet-lab single-cell genomics, bioinformatics, and computation. However, with the increasing commercial availability of scRNA-seq platforms, and the rapid ongoing maturation of bioinformatics approaches, a point has been reached where any biomedical researcher or clinician can use scRNA-seq to make exciting discoveries. In this review, we present a practical guide to help researchers design their first scRNA-seq studies, including introductory information on experimental hardware, protocol choice, quality control, data analysis and biological interpretation.

Features SIRVs (Spike-in RNA Variant Control Mixes)

RNA sequencing (referred to as RNA-Seq with traditional sequencing technologies) has led to unprecedented advances in all fields of biology and medicine. It has been an invaluable tool for the study of human genetics and the pathology associated with disease. Transcript isoform expression and usage, for example, is a prominent source of variation between healthy and diseased tissues in a number of medical conditions, including cancer. RNA sequencing is also instrumental in identifying fusion transcripts present in a growing number of disorders. Sequencing of cDNA has also significantly aided viral pathogen characterisation and timely detection, drastically improving time-to-result compared to the gold-standard viral isolation in cell culture. While alternative methods relying on ELISA and RT-PCR often suffer from limited sensitivity and specificity, as well as from unresponsiveness to rapid viral evolution, RNA sequencing of virus-infected samples overcomes all these issues. The numerous applications of RNA sequencing are not, however, restricted to the human health research field. The approach has also been utilised in agricultural settings, to research, for example, the drought-induced stress response in plants. Lastly, RNA sequencing use in developmental biology has helped elucidate transcriptional program changes associated with various developmental events. Traditional sequencing technologies have, undoubtedly, made comprehensive transcriptome analysis possible and have led to numerous important developments in science. Nevertheless, there are some important limitations in the field that need to be addressed. Here, we will focus on nanopore technology’s suitability to tackle challenges in the areas of full-length transcript identification, isoform characterisation and quantification, and viral detection.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Next-generation sequencing (NGS) provides a broad investigation of the genome, and it is being readily applied for the diagnosis of disease-associated genetic features. However, the interpretation of NGS data remains challenging owing to the size and complexity of the genome and the technical errors that are introduced during sample preparation, sequencing and analysis. These errors can be understood and mitigated through the use of reference standards — well-characterized genetic materials or synthetic spike-in controls that help to calibrate NGS measurements and to evaluate diagnostic performance. The informed use of reference standards, and associated statistical principles, ensures rigorous analysis of NGS data and is essential for its future clinical use.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Nanopore Long-Read RNAseq Reveals Widespread Transcriptional Variation Among the Surface Receptors of Individual B cells

Ashley Byrne, Anna E Beaudin, Hugh E Olsen, Miten Jain, Charles Cole, Theron Palmer, Rebecca M DuBois, E. Camilla Forsberg, Mark Akeson, Christopher Vollmers

Nature Communications, doi:10.1038/ncomms16027

Understanding gene regulation and function requires a genome-wide method capable of capturing both gene expression levels and isoform diversity at the single cell level. Short-read RNAseq, while the current standard for gene expression quantification, is limited in its ability to resolve complex isoforms because it fails to sequence full-length cDNA copies of RNA molecules. Here, we investigated whether RNAseq using the long-read single-molecule Oxford Nanopore MinION sequencing technology (ONT RNAseq) would be able to identify and quantify complex isoforms without sacrificing accurate gene expression quantification. After successfully benchmarking our experimental and computational approaches on a mixture of synthetic transcripts, we analyzed individual murine B1a cells using a new cellular indexing strategy. Using the Mandalorion analysis pipeline we developed, we identified thousands of unannotated transcription start and end sites, as well as hundreds of alternative splicing events in these B1a cells. We also identified hundreds of genes expressed across B1a cells that displayed multiple complex isoforms, including several B cell specific surface receptors and the antibody heavy chain (IGH) locus. Our results show that not only can we identify complex isoforms, but also quantify their expression, at the single cell level.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Challenges and biases in preparing, characterizing, and sequencing DNA and RNA can have significant impacts on research in genomics across all kingdoms of life, including experiments in single-cells, RNA profiling, and metagenomics (across multiple genomes). Technical artifacts and contamination can arise at each point of sample manipulation, extraction, sequencing, and analysis. Thus, the measurement and benchmarking of these potential sources of error are of paramount importance as next-generation sequencing (NGS) projects become more global and ubiquitous. Fortunately, a variety of methods, standards, and technologies have recently emerged that improve measurements in genomics and sequencing, from the initial input material to the computational pipelines that process and annotate the data. Here we review current standards and their applications in genomics, including whole genomes, transcriptomes, mixed genomic samples (metagenomes), and the modified bases within each (epigenomes and epitranscriptomes). These standards, tools, and metrics are critical for quantifying the accuracy of NGS methods, which will be essential for robust approaches in clinical genomics and precision medicine.

Features SIRVs (Spike-in RNA Variant Control Mixes)

By profiling the transcriptomes of individual cells, single-cell RNA sequencing provides unparalleled resolution to study cellular heterogeneity. However, this comes at the cost of high technical noise, including cell-specific biases in capture efficiency and library generation. One strategy for removing these biases is to add a constant amount of spike-in RNA to each cell, and to scale the observed expression values so that the coverage of spike-in RNA is constant across cells. This approach has previously been criticized as its accuracy depends on the precise addition of spike-in RNA to each sample, and on similarities in behaviour (e.g., capture efficiency) between the spike-in and endogenous transcripts. Here, we perform mixture experiments using two different sets of spike-in RNA to quantify the variance in the amount of spike-in RNA added to each well in a plate-based protocol. We also obtain an upper bound on the variance due to differences in behaviour between the two spike-in sets. We demonstrate that both factors are small contributors to the total technical variance and have only minor effects on downstream analyses such as detection of highly variable genes and clustering. Our results suggest that spike-in normalization is reliable enough for routine use in single-cell RNA sequencing data analyses.

Features SIRVs (Spike-in RNA Variant Control Mixes)
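
The scaling strategy described in the abstract above (add a constant amount of spike-in RNA to each cell, then scale expression values so that spike-in coverage is constant across cells) can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the cell-major list layout, and the choice to centre the size factors on a mean of 1 are assumptions made for illustration.

```python
def spike_in_size_factors(spike_counts):
    """Return one size factor per cell from spike-in read counts.

    spike_counts: list of cells, each a list of counts for the spike-in
    transcripts (hypothetical layout, one inner list per cell).
    """
    totals = [sum(cell) for cell in spike_counts]
    if any(total <= 0 for total in totals):
        raise ValueError("every cell needs at least one spike-in read")
    mean_total = sum(totals) / len(totals)
    # Assumption: factors are centred so that their mean is 1.
    return [total / mean_total for total in totals]


def normalize_by_spike_ins(gene_counts, spike_counts):
    """Divide each cell's gene counts by that cell's spike-in size factor."""
    factors = spike_in_size_factors(spike_counts)
    return [
        [count / factor for count in cell]
        for cell, factor in zip(gene_counts, factors)
    ]


# Toy data: three cells with two-fold differences in overall coverage.
spike = [[10, 10], [20, 20], [40, 40]]
genes = [[5, 0], [10, 4], [20, 8]]
normalized = normalize_by_spike_ins(genes, spike)
```

In the toy data the first gene scales exactly with spike-in coverage, so its normalized value is the same in all three cells. The approach relies on the two conditions the abstract examines: the same amount of spike-in RNA is added to every cell, and spike-in and endogenous transcripts behave similarly.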

SIRVs: Spike-In RNA Variants as External Isoform Controls in RNA-Sequencing

Lukas Paul, Petra Kubala, Gudrun Horner, Michael Ante, Igor Hollaender, Alexander Seitz, Torsten Reda

bioRxiv 080747; doi:

Spike-In RNA Variants (SIRVs) enable, for the first time, the validation of RNA sequencing workflows using external isoform transcript controls. 69 transcripts, derived from seven human model genes, cover the eukaryotic transcriptome complexity of start- and end-site variations, alternative splicing, overlapping genes, and antisense transcription in a condensed format. Reference RNA samples were spiked with SIRV mixes and sequenced, and four data evaluation pipelines were challenged, as examples, to account for biases introduced by the RNA-Seq workflow. The deviations of the respective isoform quantifications from the known inputs make it possible to determine the comparability of sequencing experiments and to extrapolate the degree to which alterations in an RNA-Seq workflow affect gene expression measurements. The SIRVs as external isoform controls are an important gauge for inter-experimental comparability and a modular spike-in contribution to clear the way for diagnostic RNA-Seq applications.

Features SIRVs (Spike-in RNA Variant Control Mixes)
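
The idea of measuring how far isoform quantifications deviate from the known spike-in inputs, as described in the abstract above, can be sketched as follows. The metric used here (log2 ratio of measured to expected relative abundance) is an assumption chosen for illustration and is not necessarily the one used by the authors; the function name and the transcript identifiers are hypothetical.

```python
import math


def log2_deviations(measured, expected):
    """Log2 ratio of measured to expected relative abundance per transcript.

    measured, expected: dicts mapping a transcript id to an abundance in
    any consistent unit. Expected abundances must be positive. Transcripts
    that are expected but not detected are reported as -inf.
    """
    measured_total = sum(measured.get(t, 0.0) for t in expected)
    expected_total = sum(expected.values())
    if measured_total <= 0 or expected_total <= 0:
        raise ValueError("need non-zero measured and expected totals")

    deviations = {}
    for transcript, expected_abundance in expected.items():
        measured_frac = measured.get(transcript, 0.0) / measured_total
        expected_frac = expected_abundance / expected_total
        if measured_frac > 0:
            deviations[transcript] = math.log2(measured_frac / expected_frac)
        else:
            deviations[transcript] = float("-inf")
    return deviations


# Hypothetical two-isoform control spiked in at equal amounts.
expected = {"iso_A": 1.0, "iso_B": 1.0}
measured = {"iso_A": 30.0, "iso_B": 10.0}
deviations = log2_deviations(measured, expected)
```

A value of 0 means the isoform was recovered at its expected share; here the hypothetical `iso_B` is recovered at half its expected share, giving a deviation of -1. Comparing such deviation profiles between two workflows is one simple way to judge whether their results are comparable.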

Power Analysis of Single Cell RNA‐Sequencing Experiments

Valentine Svensson, Kedar N Natarajan, Lam-Ha Ly, Ricardo J Miragaia, Charlotte Labalette, Iain C Macaulay, Ana Cvejic, Sarah A Teichmann

Nature Methods 14, 381–387 (2017) doi:10.1038/nmeth.4220

High-throughput single cell RNA sequencing (scRNA-seq) has become an established and powerful method to investigate transcriptomic cell-to-cell variation, and has revealed new cell types, and new insights into developmental processes and stochasticity in gene expression. There are now several published scRNA-seq protocols, which all sequence transcriptomes from a minute amount of starting material. Therefore, a key question is how these methods compare in terms of sensitivity of detection of mRNA molecules, and accuracy of quantification of gene expression. Here, we assessed the sensitivity and accuracy of many published data sets based on standardized spike-ins with a uniform raw data processing pipeline. We developed a flexible and fast UMI counting tool which is compatible with all UMI based protocols. This allowed us to relate these parameters to sequencing depth, and discuss the trade-offs between the different methods. To confirm our results, we performed experiments on cells from the same population using three different protocols. We also investigated the effect of RNA degradation on spike-in molecules, and the average efficiency of scRNA-seq on spike-in molecules versus endogenous RNAs.

Features SIRVs (Spike-in RNA Variant Control Mixes)

Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis

Jason L Weirather, Mariateresa de Cesare, Yunhao Wang, Paolo Piazza, Vittorio Sebastiano, Xiu-Jie Wang, David Buck, Kin Fai Au

F1000Research 2017, 6:100

Background: Given the demonstrated utility of Third Generation Sequencing [Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)] long reads in many studies, a comprehensive analysis and comparison of their data quality and applications is in high demand.

Methods: Based on the transcriptome sequencing data from human embryonic stem cells, we analyzed multiple data features of PacBio and ONT, including error pattern, length, mappability and technical improvements over previous platforms. We also evaluated their application to transcriptome analyses, such as isoform identification and quantification and characterization of transcriptome complexity, by comparing the performance of PacBio, ONT and their corresponding Hybrid-Seq strategies (PacBio+Illumina and ONT+Illumina).

Results: PacBio shows overall better data quality, while ONT provides a higher yield. As with data quality, PacBio performs marginally better than ONT in most aspects for both long reads only and Hybrid-Seq strategies in transcriptome analysis. In addition, Hybrid-Seq shows superior performance over long reads only in most transcriptome analyses.

Conclusions: Both PacBio and ONT sequencing are suitable for full-length single-molecule transcriptome analysis. As this first use of ONT reads in a Hybrid-Seq analysis has shown, both PacBio and ONT can benefit from a combined Illumina strategy. The tools and analytical methods developed here provide a resource for future applications and evaluations of these rapidly-changing technologies.

Features SIRVs (Spike-in RNA Variant Control Mixes)