The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

Fairlie Reese et al.
Posted by Likitha Nimmagadda
Score:
Importance – 4
Strength of Evidence – 4
Clarity – 4
Peer Review: This paper presents a novel, comprehensive approach to using long-read RNA sequencing to examine transcript structure diversity in a high-throughput, quantitative manner. The authors use long-read datasets together with matching short-read or microRNA sequencing to compare expression profiles and structural variation across species (human and mouse) and tissues. In doing so, they present a novel quantification method based on “Transcription Triplets,” the combination of a transcript’s start site (TSS), end site (TES), and exon junction chain (EC), which is used to generate a splicing ratio. Using this, they present a number of key findings. First, they show that the identified transcripts contain more novel TSSs than novel TESs. Furthermore, among the expressed protein-coding genes, a significant share of expression was accounted for by transcripts with TPM >= 100, indicating high transcript expression; the authors also show that most human genes have more than one predominant transcript driving expression. Next, they identify Elastin (ELN) as the gene with the most transcript structure diversity, which varies across tissues. Specifically, they show that while ELN has 32 major transcripts in the lung, 31 of them use the same TSS and one of two TESs, so the majority of the variation is driven by alternative ECs. Lastly, they show that in a species comparison there are large changes in transcript structure diversity. Using ARF4 as an example, they found that ARF4 showed high splicing diversity in human samples but high TES diversity in mouse samples.
That said, there are a few shortcomings of this paper that need to be addressed. First, the majority of the analyses presented focus on the human dataset. Given their finding that human and mouse data differ significantly in transcript structural diversity, the authors need to further substantiate this claim by looking at differences in more than just the one gene presented. Next, there is a distinct lack of detail on sample preparation and processing. Long-read sequencing (LR-seq) is highly sensitive to RNA degradation, yet they include samples that are not of high enough quality (RIN < 8), which skews their data and the claims built on it; this warrants discussion and investigation. Lastly, they use data from both cell lines and tissues, but it has been shown that expression profiles differ significantly depending on the source. It would therefore have been helpful to mention this briefly in the discussion and to add a figure showing the variation (if any). The same is true for age, especially for the mouse dataset, so a breakdown by age would also have been useful.
The findings presented in this paper are novel and highly impactful, as they present a strong argument for distinguishing between human and mouse samples when studying gene expression and transcript variation, and for the importance of more than one predominant transcript per gene. However, these concerns need to be addressed to provide certainty in the results.
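As a reader's aid for the triplet summary in the review above, here is a minimal Python sketch of how the distinct TSSs, ECs, and TESs of one gene could be counted and turned into a splicing ratio. The formula used, 2 × N_EC / (N_TSS + N_TES), is an assumption that this review does not state, and the `Triplet` type, the function names, and the toy gene are hypothetical illustrations rather than the authors' code or data.

```python
from collections import namedtuple

# Hypothetical representation: each transcript is identified by the IDs of
# its start site (TSS), exon junction chain (EC), and end site (TES).
Triplet = namedtuple("Triplet", ["tss", "ec", "tes"])


def triplet_counts(transcripts):
    """Count the distinct TSSs, ECs, and TESs used by a gene's transcripts."""
    n_tss = len({t.tss for t in transcripts})
    n_ec = len({t.ec for t in transcripts})
    n_tes = len({t.tes for t in transcripts})
    return n_tss, n_ec, n_tes


def splicing_ratio(n_tss, n_ec, n_tes):
    """Assumed definition: EC count relative to the mean of the TSS and TES counts."""
    if n_tss + n_tes == 0:
        raise ValueError("gene has no TSSs or TESs")
    return 2 * n_ec / (n_tss + n_tes)


# Toy gene (made-up data): one shared TSS, two TESs, four exon junction chains.
toy_gene = [
    Triplet(tss=1, ec=1, tes=1),
    Triplet(tss=1, ec=2, tes=1),
    Triplet(tss=1, ec=3, tes=2),
    Triplet(tss=1, ec=4, tes=2),
]

counts = triplet_counts(toy_gene)
ratio = splicing_ratio(*counts)
print(counts, round(ratio, 3))
```

Under this assumed definition, a gene whose diversity comes mostly from alternative ECs, as the review describes for ELN in lung, would have a ratio well above 1, whereas a gene that varies mainly in its start or end sites would fall below 1.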