Long-read sequencing uncovers a complex transcriptome topology in varicella zoster virus

István Prazsák, Norbert Moldován, Zsolt Balázs, Dóra Tombácz, Klára Megyeri, Attila Szűcs, Zsolt Csabai and Zsolt Boldogkői

BMC Genomics, doi: 10.1186/s12864-018-5267-8

Varicella zoster virus (VZV) is a human pathogenic alphaherpesvirus harboring a relatively large DNA molecule. The VZV transcriptome has already been analyzed by microarray and short-read sequencing analyses. However, both approaches have substantial limitations when used for structural characterization of transcript isoforms, even if supplemented with primer extension or other techniques. Among others, they are inefficient in distinguishing between embedded RNA molecules, transcript isoforms, including splice and length variants, as well as between alternative polycistronic transcripts. It has been demonstrated in several studies that long-read sequencing is able to circumvent these problems.

In this work, we report the analysis of the VZV lytic transcriptome using the Oxford Nanopore Technologies sequencing platform. These investigations have led to the identification of 114 novel transcripts, including mRNAs, non-coding RNAs, polycistronic RNAs and complex transcripts, as well as 10 novel spliced transcripts and 25 novel transcription start site isoforms and transcription end site isoforms. A novel class of transcripts, the nroRNAs, is described in this study. These transcripts are encoded by the genomic region located in close vicinity to the viral replication origin. We also show that ORF63 exhibits a complex structural variation encompassing the splice sites of VZV latency transcripts. Additionally, we have detected RNA editing in a novel non-coding RNA molecule.

Our investigations disclosed a composite transcriptomic architecture of VZV, including the discovery of novel RNA molecules and transcript isoforms, as well as a complex meshwork of transcriptional read-throughs and overlaps. The results represent a substantial advance in the annotation of the VZV transcriptome and in understanding the molecular biology of the herpesviruses in general.

Features TeloPrime Full-Length cDNA Amplification Kit

High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples.
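The idea behind a concatemeric consensus can be illustrated with a minimal sketch: rolling circle amplification yields several copies of the same cDNA molecule in one raw read, and combining those copies cancels out random sequencing errors. The example below is a simplification under stated assumptions, not the R2C2 pipeline itself. It assumes the copies have already been split out of the concatemer and aligned to equal length (with `-` marking gaps), and it uses a plain per-column majority vote. The function name and the input sequences are hypothetical.

```python
from collections import Counter


def column_consensus(aligned_copies):
    """Majority-vote consensus over pre-aligned, equal-length copies.

    Assumption: each string is one copy of the same molecule, already
    aligned so that position i refers to the same base in every copy.
    A '-' character marks a gap; columns where the gap wins are dropped.
    """
    if not aligned_copies:
        raise ValueError("need at least one copy")
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("copies must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_copies):
        # Pick the most frequent symbol in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Hypothetical copies of one molecule, each carrying at most one error.
copies = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACG-TGCA"]
print(column_consensus(copies))
```

With more copies per molecule, isolated errors are outvoted more reliably, which is why longer concatemers tend to give more accurate consensus reads.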

Features TeloPrime Full-Length cDNA Amplification Kit

RNA-sequencing has revolutionized transcriptomics and the way we measure gene expression (Wang et al., 2009). As of today, short-read RNA sequencing is more widely used and, due to its low price and high throughput, is the preferred tool for the quantitative analysis of gene expression. However, the annotation of transcript isoforms is rather difficult using only short-read sequencing data, because the reads are shorter than most transcripts (Steijger et al., 2013). Long-read sequencing, on the other hand, can provide full contig information about transcripts, including exon-connectivity, and its merits in transcriptome profiling are being increasingly acknowledged (Sharon et al., 2013; Abdel-Ghany et al., 2016; Wang et al., 2016; Kuo et al., 2017). Due to the relatively low throughput of current long-read sequencing technologies, they can only characterize smaller transcriptomes at high depth (Weirather et al., 2017).

The Human cytomegalovirus (HCMV) is a ubiquitous betaherpesvirus, which can cause mononucleosis-like symptoms in adults (Cohen and Corey, 1985), and severe life-threatening infections in newborns (Wen et al., 2002). Latent HCMV infection has recently been implicated to affect cancer formation (Dziurzynski et al., 2012; Jin et al., 2014). Examining the transcriptome of the virus can go a long way in helping understand its molecular biology. Short-read RNA sequencing studies have discovered splice junctions and non-coding transcripts (Gatherer et al., 2011) and have shown that the most abundant HCMV transcripts are similarly expressed in different cell types (Cheng et al., 2017). Our long-read RNA sequencing experiments using the Pacific Biosciences (PacBio) RSII platform revealed a great number of transcript isoforms, polycistronic RNAs and transcriptional overlaps (Balázs et al., 2017a).

Here, we present the dual-platform long-read RNA sequencing dataset of two HCMV-infected fibroblast samples. We have sequenced the same RNA population that we have previously sequenced with the PacBio RSII platform (Balázs et al., 2017b), but now using the PacBio Sequel and Oxford Nanopore Technologies (ONT) MinION platforms. These data, apart from providing a more profound picture of the lytic HCMV transcriptome, can also be used to compare the current technologies. A further sample was prepared, using lytic HCMV RNAs. This sample was subjected to ONT Cap-selected cDNA sequencing (Cap-Seq) in order to allow better characterization of the transcription start sites, and also to direct (d)RNA sequencing in order to avoid reverse-transcription (RT) and PCR artifacts. We report the sequencing of approximately 100 GB of raw data (Supplementary Table 1). Cap-Seq on the MinION platform yielded the highest read count, while the throughputs of the Sequel platform and of ONT dRNA sequencing both lagged behind (summarized in Figure 1A); both technologies nonetheless offer significant benefits. The Sequel platform is more accurate and the dRNA sequencing is free of RT and PCR artifacts. The read length distribution shows that the Sequel platform has a similar molecule-size preference to the RSII platform, while the MinION platform sequences more short reads (Figure 1B). The length distribution of the non-cap-selected cDNA sequencing reads is different from that of the other ONT reads, because this library was size-selected (>500 nt).

Features TeloPrime Full-Length cDNA Amplification Kit

Comparative genome analysis of programmed DNA elimination in nematodes

Jianbin Wang, Shenghan Gao, Yulia Mostovoy, Yuanyuan Kang, Maxim Zagoskin, Yongqiao Sun, Bing Zhang, Laura K. White, Alice Easton, Thomas B. Nutman, Pui-Yan Kwok, Songnian Hu, Martin K. Nielsen and Richard E. Davis

Genome Research, doi: 10.1101/gr.225730.117

Programmed DNA elimination is a developmentally regulated process leading to the reproducible loss of specific genomic sequences. DNA elimination occurs in unicellular ciliates and a variety of metazoans, including invertebrates and vertebrates. In metazoa, DNA elimination typically occurs in somatic cells during early development, leaving the germline genome intact. Reference genomes for metazoa that undergo DNA elimination are not available. Here, we generated germline and somatic reference genome sequences of the DNA eliminating pig parasitic nematode Ascaris suum and the horse parasite Parascaris univalens. In addition, we carried out in-depth analyses of DNA elimination in the parasitic nematode of humans, Ascaris lumbricoides, and the parasitic nematode of dogs, Toxocara canis. Our analysis of nematode DNA elimination reveals that in all species, repetitive sequences (that differ among the genera) and germline-expressed genes (approximately 1000–2000 or 5%–10% of the genes) are eliminated. Thirty-five percent of these eliminated genes are conserved among these nematodes, defining a core set of eliminated genes that are preferentially expressed during spermatogenesis. Our analysis supports the view that DNA elimination in nematodes silences germline-expressed genes. Over half of the chromosome break sites are conserved between Ascaris and Parascaris, whereas only 10% are conserved in the more divergent T. canis. Analysis of the chromosomal breakage regions suggests a sequence-independent mechanism for DNA breakage followed by telomere healing, with the formation of more accessible chromatin in the break regions prior to DNA elimination. Our genome assemblies and annotations also provide comprehensive resources for analysis of DNA elimination, parasitology research, and comparative nematode genome and epigenome studies.

Features TeloPrime Full-Length cDNA Amplification Kit

Transcriptomic study of Herpes simplex virus type-1 using full-length sequencing techniques

Zsolt Boldogkői, Attila Szűcs, Zsolt Balázs, Donald Sharon, Michael Snyder & Dóra Tombácz

Scientific Data, Article number: 180266 (2018)

Herpes simplex virus type-1 (HSV-1) is a human pathogenic member of the Alphaherpesvirinae subfamily of herpesviruses. The HSV-1 genome is a large double-stranded DNA specifying about 85 protein coding genes. The latest surveys have demonstrated that the HSV-1 transcriptome is much more complex than had been thought before. Here, we provide a long-read sequencing dataset, which was generated by using the RSII and Sequel systems from Pacific Biosciences (PacBio), as well as the MinION sequencing system from Oxford Nanopore Technologies (ONT). This dataset contains 39,096 reads of inserts (ROIs) mapped to the HSV-1 genome (X14112) in RSII sequencing, while Sequel sequencing yielded 77,851 ROIs. The MinION cDNA sequencing altogether resulted in 158,653 reads, while the direct RNA-seq produced 16,516 reads. This dataset can be utilized for the identification of novel HSV RNAs and transcript isoforms, as well as for the comparison of the quality and length of the sequencing reads derived from the currently available long-read sequencing platforms. The various library preparation approaches can also be compared with each other.
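As a small worked example, the per-platform counts quoted in this abstract can be tallied to see how the dataset is distributed across methods. The sketch below only restates the four numbers given above; the dictionary labels are illustrative, and PacBio ROIs and ONT reads are not strictly equivalent units, so the shares are a rough orientation rather than a like-for-like comparison.

```python
# Read counts as quoted in the abstract (labels are illustrative).
read_counts = {
    "PacBio RSII (ROIs)": 39_096,
    "PacBio Sequel (ROIs)": 77_851,
    "ONT MinION cDNA (reads)": 158_653,
    "ONT MinION direct RNA (reads)": 16_516,
}

total = sum(read_counts.values())

# Fraction of the combined dataset contributed by each method.
shares = {method: count / total for method, count in read_counts.items()}

for method, count in read_counts.items():
    print(f"{method:32s} {count:>8,d}  {shares[method]:6.1%}")
print(f"{'Total':32s} {total:>8,d}")
```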

Features TeloPrime Full-Length cDNA Amplification Kit

Lytic Transcriptome Dataset of Varicella Zoster Virus Generated by Long-read Sequencing

Dóra Tombácz, Donald Sharon, Attila Szűcs, Norbert Moldován, Michael Snyder, Zsolt Boldogkői

Frontiers in Genetics, doi: 10.3389/fgene.2018.00460


Varicella zoster virus (VZV) belongs to the Alphaherpesvirinae subfamily of the Herpesviridae family. It is the etiological agent of chickenpox (varicella) caused by primary infection and shingles (zoster), which is due to reactivation of the virus from latency (Kennedy, 2002). Many countries have adopted recommendations for routine immunization of children and susceptible adults against VZV. The VZV virion is composed of an icosahedral nucleocapsid surrounded by a tegument layer, which is covered by an envelope derived from the host cell membrane with incorporated viral glycoproteins (Maresova et al., 2005). The genome of VZV consists of a linear double-stranded DNA molecule and is approximately 125-kbp in size, which contains more than 70 annotated open reading frames (ORFs) (Tyler et al., 2007). The transcription of the virus is strictly regulated by cascade-like processes. First, the immediate-early (IE) transcripts are expressed, which is then followed by the expression of the early (E), and then the late (L) kinetic classes of transcripts (Reichelt et al., 2009). The IE ORF62 gene of VZV encodes the major transactivator, which controls the expression of other viral genes. The viral E genes encode proteins that are used in DNA replication, while L genes code for the structural elements of the virus.
High-throughput short-read sequencing (SRS) techniques have revolutionized transcriptome research (Delseny et al., 2010). These techniques have also been utilized in the investigation of herpesvirus gene expression (e.g. Chambers et al., 1999; Ebrahimi et al., 2003; Baird et al., 2014; Oláh et al., 2015). However, the SRS approach has severe limitations in comparison to long-read sequencing (LRS), including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) platforms. LRS techniques have been used before in transcriptome studies of the herpesviruses (Tombácz et al., 2016; O’Grady et al., 2016; Tombácz et al., 2017; Balázs et al., 2017a; 2017b; Moldován et al., 2018). These studies uncovered a very complex transcriptome, which included the identification of a large number of novel RNA molecules and transcript isoforms (Tombácz et al., 2015; Tombácz et al., 2017; Balázs et al., 2017a). Moreover, an extended meshwork of overlaps between the transcripts was also detected by these studies (Tombácz et al., 2016; Moldován et al., 2018).
The presented data report is aimed toward providing a new, comprehensive transcript catalog of VZV using an LRS approach for the first time. In this study, we applied the ONT MinION device and various full-length cDNA sequencing protocols that capture the entire poly(A)-transcriptome of VZV.

Features TeloPrime Full-Length cDNA Amplification Kit

Transcriptome-wide survey of pseudorabies virus using next- and third-generation sequencing platforms

Dóra Tombácz, Donald Sharon, Attila Szűcs, Norbert Moldován, Michael Snyder & Zsolt Boldogkői

Scientific Data, doi:10.1038/sdata.2018.119

Pseudorabies virus (PRV) is an alphaherpesvirus of swine. PRV has a large double-stranded DNA genome and, as the latest investigations have revealed, a very complex transcriptome. Here, we present a large RNA-Seq dataset, derived from both short- and long-read sequencing. The dataset contains 1.3 million 100 bp paired-end reads that were obtained from the Illumina random-primed libraries, as well as 10 million 50 bp single-end reads generated by the Illumina polyA-seq. The Pacific Biosciences RSII non-amplified method yielded 57,021 reads of inserts (ROIs) aligned to the viral genome, the amplified method resulted in 158,396 PRV-specific ROIs, while we obtained 12,555 ROIs using the Sequel platform. Oxford Nanopore’s MinION device generated 44,006 reads using the regular cDNA-sequencing method, whereas 29,832 and 120,394 reads were produced by using the direct RNA-sequencing and the Cap-selection protocols, respectively. The raw reads were aligned to the PRV reference genome (KJ717942.1). The dataset can be used to compare different sequencing approaches and library preparation methods, as well as to validate and test bioinformatic pipelines.

Features TeloPrime Full-Length cDNA Amplification Kit

Multi-Platform Sequencing Approach Reveals a Novel Transcriptome Profile in Pseudorabies Virus

Norbert Moldován, Dóra Tombácz, Attila Szűcs, Zsolt Csabai, Michael Snyder and Zsolt Boldogkői

Frontiers in Microbiology, doi:10.3389/fmicb.2017.02708

Third-generation sequencing is an emerging technology that is capable of solving several problems that earlier approaches were not able to, including the identification of transcript isoforms and overlapping transcripts. In this study, we used long-read sequencing for the analysis of pseudorabies virus (PRV) transcriptome, including Oxford Nanopore Technologies MinION, PacBio RS-II, and Illumina HiScanSQ platforms. We also used data from our previous short-read and long-read sequencing studies for the comparison of the results and in order to confirm the obtained data. Our investigations identified 19 formerly unknown putative protein-coding genes, all of which are 5′ truncated forms of earlier annotated longer PRV genes. Additionally, we detected 19 non-coding RNAs, including 5′ and 3′ truncated transcripts without in-frame ORFs, antisense RNAs, as well as RNA molecules encoded by those parts of the viral genome where no transcription had been detected before. This study has also led to the identification of three complex transcripts and 50 distinct length isoforms, including transcription start and end variants. We also detected 121 novel transcript overlaps, and two transcripts that overlap the replication origins of PRV. Furthermore, in silico analysis revealed 145 upstream ORFs, many of which are located on the longer 5′ isoforms of the transcripts.
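An in silico search for upstream ORFs (uORFs) can be sketched in a few lines. The abstract does not state the criteria used in the study, so the definition below is an assumption: a uORF is an ATG in the 5′ untranslated region followed by an in-frame stop codon that also lies within that region. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(five_prime_utr):
    """Return (start, end) pairs for ORFs lying entirely inside a 5' UTR.

    Coordinates are 0-based and half-open, so utr[start:end] runs from
    the ATG through the stop codon. Assumption: only ATG starts count,
    and the stop codon must occur in frame before the UTR ends.
    """
    seq = five_prime_utr.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Hypothetical 5' UTR containing one short uORF: ATG AAA TAG.
utr = "GGATGAAATAGCC"
for start, end in find_uorfs(utr):
    print(start, end, utr[start:end])
```

Longer 5′ isoforms of a transcript simply provide more sequence to scan, which is consistent with many of the reported uORFs residing on the longer isoforms.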

Features TeloPrime Full-Length cDNA Amplification Kit

This doctoral thesis consists of two parts. The first part describes a global survey of cis-regulatory divergence in mammalian translation, where I applied mRNA sequencing and deep sequencing-based polysome profiling to quantify translational efficiency in F1 hybrid mice. The F1 progeny between Mus musculus C57BL/6J and Mus spretus SPRET/EiJ was chosen as a model system because the two have the largest number of genetic variants among all mouse strains with high-quality genome assemblies available. This large genomic divergence 1) provides a large number of potential regulatory variants between the two strains and 2) enables a sequencing-based approach to distinguish allelic RNA transcripts. The high quality of the data was demonstrated by employing two independent validation approaches, PacBio full-length sequencing and ribosome profiling. In total, 1008 genes (14.1%) were identified as exhibiting a significant allelic difference in translational efficiency. Several sequence features were associated with the observed allelic divergence in translation, including local RNA secondary structure near the start codon and proximal out-of-frame upstream AUGs. Finally, cis-effects are quantitatively comparable between transcriptional and translational regulation, and these effects are more frequently compensatory between the two processes, suggesting a role of translational regulation in buffering transcriptional noise and thereby maintaining the robustness of protein expression.
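The allelic comparison described above can be illustrated with a minimal sketch. The abstract does not give the formula used in the thesis, so the definition here is an assumption: translational efficiency is taken as the ratio of polysome-associated reads to total mRNA reads for one allele, and allelic divergence as the log2 ratio of the two alleles' efficiencies. The function names, the pseudocount, and the read counts are all hypothetical.

```python
import math


def translational_efficiency(polysome_reads, mrna_reads, pseudocount=1.0):
    """Ratio of polysome-associated to total mRNA reads for one allele.

    Assumption: a pseudocount is added to both counts to avoid division
    by zero for lowly expressed alleles.
    """
    return (polysome_reads + pseudocount) / (mrna_reads + pseudocount)


def allelic_log2_te_ratio(poly_a, mrna_a, poly_b, mrna_b, pseudocount=1.0):
    """log2 of allele A's efficiency over allele B's within one F1 hybrid.

    Both alleles share the same cellular environment, so a nonzero value
    points to a cis-acting difference rather than a trans-acting one.
    """
    te_a = translational_efficiency(poly_a, mrna_a, pseudocount)
    te_b = translational_efficiency(poly_b, mrna_b, pseudocount)
    return math.log2(te_a / te_b)


# Hypothetical allele-specific counts for one gene: equal mRNA levels,
# but allele A is twice as abundant in the polysome fraction.
print(allelic_log2_te_ratio(poly_a=399, mrna_a=199, poly_b=199, mrna_b=199))
```

A real analysis would also need library-size normalization, replicates, and a statistical test before calling a gene significant; the sketch only shows the quantity being compared.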

In the second part, I developed a novel technology, CAPTRE, to measure the translational status of distinct mRNA TL isoforms. In mouse fibroblasts, a total of 22,357 TSSs derived from 10,875 protein-coding genes were identified. Among 4153 genes expressing multiple TSSs, 745 exhibited a significant TE difference between their alternative TL isoforms. Longer isoforms were more frequently associated with lower TE. The global impact of several regulatory elements was also revisited, such as uORFs, cap-adjacent stable RNA secondary structures and the 5′-terminal oligopyrimidine tract. In addition, several novel sequence motifs that can affect translation activity were identified and their effect was validated using two reporter systems. Finally, quantitative models combining different features identified in this study explained approximately 60% of the variance of the TE difference observed between TL isoforms.
This study provides novel mechanistic insights into translational regulation and characterizes the potential coupling between translational and transcriptional regulation in mammalian cells.

Features TeloPrime Full-Length cDNA Amplification Kit

Thyroglobulin Represents a Novel Molecular Architecture of Vertebrates

Guillaume Holzer, Yoshiaki Morishita, Jean-Baptiste Fini, Thibault Lorin, Benjamin Gillet, Sandrine Hughes, Marie Tohmé, Gilbert Deléage, Barbara Demeneix, Peter Arvan and Vincent Laudet

Journal of Biological Chemistry, doi: 10.1074/jbc.M116.719047

Thyroid hormones modulate not only multiple functions in vertebrates (energy metabolism, central nervous system function, seasonal changes in physiology and behavior), but also in some non-vertebrates where they control critical post-embryonic developmental transitions such as metamorphosis. Despite their obvious biological importance, the thyroid hormone precursor protein, thyroglobulin (Tg), has been experimentally investigated only in mammals. This may bias our view of how thyroid hormones are produced in other organisms. In this study, we searched genomic databases and found Tg orthologs in all vertebrates including the sea lamprey (Petromyzon marinus). We cloned a full-size Tg coding sequence from western clawed frog (Xenopus tropicalis) and zebrafish (Danio rerio). Comparisons between the representative mammal, amphibian, teleost fish, and basal vertebrate indicate that all of the different domains of Tg, as well as Tg regional structure, are conserved throughout the vertebrates. Indeed, in Xenopus, zebrafish and lamprey Tgs, key residues, including the hormonogenic tyrosines and the disulfide bond-forming cysteines critical for Tg function, are well conserved, despite overall divergence of amino acid sequences. We uncovered upstream sequences that include start codons of zebrafish and Xenopus Tgs, and experimentally proved that these are full-length secreted proteins, which are specifically recognized by antibodies against rat Tg. By contrast, we have not been able to find any orthologs of Tg among non-vertebrate species. Thus, Tg appears to be a novel protein elaborated as a single event at the base of vertebrates and virtually unchanged thereafter.

Features TeloPrime Full-Length cDNA Amplification Kit

cDNA Library Enrichment of Full Length Transcripts for SMRT Long Read Sequencing

Maria Cartolano, Bruno Huettel, Benjamin Hartwig, Richard Reinhardt, Korbinian Schneeberger

PLoS ONE 11(6):e0157779. doi:10.1371/journal.pone.0157779

The utility of genome assemblies does not only rely on the quality of the assembled genome sequence, but also on the quality of the gene annotations. The Pacific Biosciences Iso-Seq technology is a powerful support for accurate eukaryotic gene model annotation as it allows for direct readout of full-length cDNA sequences without the need for noisy short read-based transcript assembly. We propose the implementation of the TeloPrime Full Length cDNA Amplification kit to the Pacific Biosciences Iso-Seq technology in order to enrich for genuine full-length transcripts in the cDNA libraries. We provide evidence that TeloPrime outperforms the commonly used SMARTer PCR cDNA Synthesis Kit in identifying transcription start and end sites in Arabidopsis thaliana. Furthermore, we show that TeloPrime-based Pacific Biosciences Iso-Seq can be successfully applied to the polyploid genome of bread wheat (Triticum aestivum) not only to efficiently annotate gene models, but also to identify novel transcription sites, gene homeologs, splicing isoforms and previously unidentified gene loci.

Features TeloPrime Full-Length cDNA Amplification Kit

Transcription initiated at alternative sites can produce mRNA isoforms with different 5ʹUTRs, which are potentially subjected to differential translational regulation. However, the prevalence of such isoform‐specific translational control across mammalian genomes is currently unknown. By combining polysome profiling with high‐throughput mRNA 5ʹ end sequencing, we directly measured the translational status of mRNA isoforms with distinct start sites. Among 9,951 genes expressed in mouse fibroblasts, we identified 4,153 that showed significant initiation at multiple sites, of which 745 genes exhibited significant isoform‐divergent translation. Systematic analyses of the isoform‐specific translation revealed that isoforms with longer 5ʹUTRs tended to translate less efficiently. Further investigation of cis‐elements within 5ʹUTRs not only provided novel insights into the regulation by known sequence features, but also led to the discovery of novel regulatory sequence motifs. Quantitative models integrating all these features explained over half of the variance in the observed isoform‐divergent translation. Overall, our study demonstrated the extensive translational regulation by usage of alternative transcription start sites and offered a comprehensive understanding of translational regulation by diverse sequence features embedded in 5ʹUTRs.

Features TeloPrime Full-Length cDNA Amplification Kit