In higher eukaryotes, multiple transcript isoforms can originate from a single gene, diversifying the transcriptome and, consequently, the proteome. Transcript isoforms arise from variations in transcription initiation, splicing, and polyadenylation, and switches between isoforms occur in many physiological processes, including cellular differentiation and development. Transcript variants are also known to be dysregulated in various diseases, including cancer. Consequently, examining the transcriptome diversity of cells at the isoform level, and not only at the gene level, gives insight into the function of those genes and can be key to resolving mechanisms underpinning physiological and pathological conditions.
The completeness and accuracy of a reference annotation affect all downstream data analyses, from gene expression to predictions of gene function. Even though manual curation of gene annotations is very valuable, it is labor-intensive and time-consuming. Especially for multicellular organisms, in which some cell types may be difficult to access, even the most comprehensive annotation approaches may leave gaps. To understand how transcriptome architecture varies during development and in response to disease, it is therefore valuable to have an automated method that accurately identifies transcript isoforms.
We have invited Michael Schon, PhD, a postdoctoral researcher at Wageningen University & Research, who recently published Bookend, a new software package for precise end-guided transcriptome assembly, to talk to us about transcript assembly, its challenges, and solutions that his work proposes.
Lexogen: Hi, Michael! Congrats on your recent publication, and many thanks for agreeing to share your knowledge and expertise in transcript assembly with us. We thought it would be great to start with some basics first to provide some ground knowledge for beginners in RNA-Seq and transcript assembly before we dive deeper into your work.
So, how do scientists nowadays experimentally identify transcript isoforms?
Michael Schon: Fortunately for RNA biologists, this has been a very active area of research over the past couple of decades, and today we have access to a huge array of methods for sequencing RNA. As no single method is perfect for answering every question about transcript isoform diversity, an entire ecosystem of protocols and platforms has emerged.
One needs to ask some important questions before proceeding with any given protocol. If I can only sequence n bases, how should they be distributed? Do I care about capturing more total molecules or more bases per molecule? For example, many single-cell sequencing methods barcode the 5′ or 3′ ends of RNA molecules from thousands to millions of cells, and modern short-read sequencing platforms can cost-effectively sample billions of these fragments in one flow cell. With this strategy, you can build a detailed quantitative picture of transcription start site usage (5′ ends) or alternative polyadenylation (3′ ends) at cellular resolution in your tissue. That said, short fragments from either end of your RNA tell you little about splicing variation and have no power to show whether the choices of RNA start, end, and splice sites depend on each other. This is where long-read sequencing platforms shine: both PacBio and Oxford Nanopore offer ways to sequence entire molecules from one end to the other. I currently prefer a hybrid approach, combining long- and short-read sequencing. Long reads are an excellent raw material for a qualitative picture of which transcript isoforms exist. Short-read techniques give a higher-resolution quantitative snapshot of these isoforms across many cells or samples.
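The budget question above can be made concrete with a little arithmetic. The following is only a rough sketch: the budget and read lengths are hypothetical round numbers chosen for illustration, not figures from any platform or from the interview, and the function name `molecules_sampled` is made up for this example.

```python
def molecules_sampled(total_bases: int, bases_per_read: int) -> int:
    """Number of reads (at most one per molecule) that fit in a fixed base budget."""
    if bases_per_read <= 0:
        raise ValueError("bases_per_read must be positive")
    return total_bases // bases_per_read


# Hypothetical budget and read lengths, for illustration only.
budget = 1_000_000_000       # total bases we can afford to sequence
short_read_length = 100      # e.g. an end-tagged short fragment
long_read_length = 2_000     # e.g. a full-length cDNA read

short_reads = molecules_sampled(budget, short_read_length)
long_reads = molecules_sampled(budget, long_read_length)

print(f"short reads: {short_reads:,} molecules sampled")
print(f"long reads:  {long_reads:,} molecules sampled")
print(f"ratio: {short_reads // long_reads}x more molecules with short reads")
```

Under these assumed numbers, the same budget samples twenty times more molecules as short fragments, while each long read carries twenty times more bases of a single molecule, which is the trade-off between quantitative depth and isoform structure described above.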
Lexogen: What is transcript assembly, when is it used, and what are the biggest challenges when performing it?
Michael Schon: Investigations into the RNA of new organisms, tissues, or diseases require discovering and cataloging novel transcripts. Any given RNA-Seq experiment renders an incomplete snapshot of the real population, and different techniques capture different kinds of information. It was once standard to randomly fragment RNA before reverse transcription and sequence 50 bases or fewer from one strand of the resulting cDNA. These short, unstranded fragments lack the information to reconstruct full transcript isoforms accurately. Imagine trying to solve a 5,000-piece jigsaw puzzle, but half of it is missing, there are no edge pieces, and you don’t know which side faces up. At this point, you could either give up or try to make a simpler puzzle. Strand-specific protocols and paired-end sequencing brought improvement, but it was still impossible to confidently delineate borders where one transcript ends and another begins.
Template-switching reverse transcription (so-called SMART methods) allowed full-length RNA molecules to be reverse transcribed to cDNA with a distinct sequence tag labeling each end. Whether the end-labeled cDNA is sequenced as short fragments (the Smart-seq family of single-cell protocols) or as full-length molecules (PacBio Iso-seq, ONT Direct cDNA sequencing), it records the precise locations of the RNA 5′ cap and 3′ poly(A) tail. Long reads and end labels are game changers for transcript annotation because they enable a complete description of the original molecule.
Lexogen: How can we check if the transcript assembly is correct?
Michael Schon: Even with end-to-end sequencing data, several types of errors can occur in the steps between RNA and annotation. Oligo-d(T) primers can bind to sites other than the poly(A) tail, causing false 3′ ends. Template switching can occur prematurely on degraded RNA, strong secondary structures, or repetitive sequences to create false 5′ ends or splice sites. Base calling errors are introduced during sequencing. Furthermore, alignment software cannot always map reads correctly and uniquely to a reference genome, and incomplete sampling causes gaps in coverage. As we cannot know the true RNA population of a cell or tissue, there is no way to say with certainty that the assembly is a complete and accurate representation of the underlying RNAs. If we could work with an RNA pool whose exact sequences and abundances are known, it would be possible to directly measure how often errors arise in our assembly. Spike-in RNAs were developed for this purpose: adding synthetic RNA molecules of known sequence and abundance to the sample gives a ground truth to calibrate against. The accuracy of the assembled spike-in sequences provides a proxy for how trustworthy the rest of the assembly will be.
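To illustrate how a spike-in ground truth can be turned into a number, here is a minimal sketch that scores assembled transcript models against known spike-in isoforms. It is not taken from Bookend or any published benchmark: the `Transcript` type, the function names, the `end_tolerance` parameter, and all coordinates are hypothetical, and the matching rule (same strand, identical intron chain, and both ends within a tolerance) is one reasonable assumption about what "end-to-end correct" could mean.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Transcript(NamedTuple):
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs sorted by position


def intron_chain(t: Transcript) -> List[Tuple[int, int]]:
    """Introns are the gaps between consecutive exons."""
    return [(left[1], right[0]) for left, right in zip(t.exons, t.exons[1:])]


def is_end_to_end_match(a: Transcript, t: Transcript, end_tolerance: int = 0) -> bool:
    """Same chromosome and strand, identical introns, and both ends within tolerance."""
    if (a.chrom, a.strand) != (t.chrom, t.strand):
        return False
    if intron_chain(a) != intron_chain(t):
        return False
    start_diff = abs(a.exons[0][0] - t.exons[0][0])
    end_diff = abs(a.exons[-1][1] - t.exons[-1][1])
    return start_diff <= end_tolerance and end_diff <= end_tolerance


def score_assembly(assembled: Sequence[Transcript], truth: Sequence[Transcript],
                   end_tolerance: int = 0) -> Tuple[float, float]:
    """Return (precision, recall) of the assembly against the spike-in truth set."""
    correct = sum(any(is_end_to_end_match(a, t, end_tolerance) for t in truth)
                  for a in assembled)
    found = sum(any(is_end_to_end_match(a, t, end_tolerance) for a in assembled)
                for t in truth)
    precision = correct / len(assembled) if assembled else 0.0
    recall = found / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical spike-in isoforms of one artificial gene (made-up coordinates).
truth = [
    Transcript("spike1", "+", ((100, 200), (300, 400), (500, 600))),
    Transcript("spike1", "+", ((100, 200), (500, 600))),              # exon skipping
    Transcript("spike1", "+", ((150, 200), (300, 400), (500, 600))),  # alternative start
    Transcript("spike1", "-", ((100, 600),)),                         # antisense
]
# Hypothetical assembler output: two correct models and one with a false 3' end.
assembled = [
    Transcript("spike1", "+", ((100, 200), (300, 400), (500, 600))),
    Transcript("spike1", "+", ((100, 200), (500, 600))),
    Transcript("spike1", "+", ((100, 200), (300, 400), (500, 650))),
]

precision, recall = score_assembly(assembled, truth)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case, two of the three assembled models are end-to-end correct and two of the four spike-in isoforms are recovered. Loosening `end_tolerance` makes the scoring more forgiving of imprecise ends, which is a design choice that would need to be stated alongside any reported accuracy.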
Lexogen: You recommend using spike-in RNA controls when studying transcript isoforms and doing transcript assembly. Are there some differences between available spike-in RNA controls regarding their usefulness?
Michael Schon: Like sequencing methods, not all spike-in molecules are equally valuable for all questions. The idea of having a ground truth for assembly is nice on paper, but it only works if the spike-ins have a kind and amount of complexity similar to the real problem. If a child can sled down a hill in their backyard, this doesn’t give me the confidence to send them down a black diamond ski slope.
The ERCC (External RNA Controls Consortium, A/N) spike-ins are a standard set of external RNA controls designed to assess the accuracy of RNA quantification methods. However, each molecule in the ERCC set has one discrete sequence, which differs significantly from the pool of transcripts in, e.g., plant or animal cells. There, one gene can produce many isoforms, and two isoforms can differ by only a few nucleotides along their whole sequence. The benefit of Lexogen’s Spike-in RNA Variants (SIRVs) is that they mimic the complexity of eukaryotic isoforms by modeling human genes. The 69 synthetic isoforms map to an artificial “genome” of seven genes and simulate a dizzying array of transcript variation, including alternate start and end sites, antisense RNA, exon skipping and intron retention, mutually exclusive exons, and even noncanonical splice junctions. This is why the SIRVs are a proper stress test for transcript assembly.
Lexogen: Tell us more about the gaps in transcript assembly your research aims to bridge and how you have tackled these challenges in your recent publication.
Michael Schon: My main interest has been to discover rare transcripts in rare tissues. Much of the work during my PhD was focused on embryonic development in the model plant Arabidopsis thaliana. The tiny amount of RNA we could recover from early embryos necessitated developing and optimizing ultra-low-input protocols. This sparked a love for single-cell sequencing. However, when I used standard assembly software on data from single-cell protocols like Smart-seq2, the number of assembly errors swamped out any genuine signal from novel isoforms or long noncoding RNAs. Looking at how assemblers made their errors, I recognized that they were all missing something: RNA starts and ends. Knowing that this information existed in the raw data, I set out to build an assembler that could identify and utilize end-labeled reads. This idea evolved into Bookend, a novel framework for “end-guided assembly” from short and long reads recently published in Genome Biology.
Lexogen: You were interested in whether you could get an accurate transcript assembly from single-cell RNA-Seq. How did you use SIRVs to answer this question? What could you conclude from your experiments?
Michael Schon: While developing Bookend, I realized it is difficult to prove what a “better assembly” looks like. Even though I could see that accounting for RNA starts and ends during assembly gave much closer concordance with the raw Arabidopsis data, this could not tell me how good or bad the assembly is in absolute terms. Sarah Teichmann’s group published a great benchmarking dataset in 2019 from single mouse embryonic stem cells. They added both ERCCs and SIRV spike-ins (SIRV-Set 2, A/N) to each cell in a 96-well plate, which gave me 96 chances to evaluate Bookend’s performance on single cells from another kingdom, as well as a critical ground truth to see how far from perfect the assemblies still are. By taking advantage of 5′ and 3′ end labels, Bookend assembled more true SIRVs and fewer false SIRVs than the best existing assemblers. That said, the bar was low: at its best, only 56% of Bookend assemblies were end-to-end correct SIRV isoforms, and only 25 of the 69 were found. A few more tricks to boost Bookend’s accuracy on biological RNA (cap detection, meta-assembly) are included in the publication. Still, I think the SIRV results highlight a fundamental limitation of short-read sequencing in resolving the architecture of complex transcripts. Hence, I am eagerly watching as the technology matures for long-read single-cell sequencing!