Le HS et al. (2013), Probabilistic error correction for RNA sequencing.

ECB-ART-42826

Nucleic Acids Res 2013 May 01;41(10):e109. doi: 10.1093/nar/gkt215.


Probabilistic error correction for RNA sequencing.

Le HS, Schulz MH, McCauley BM, Hinman VF, Bar-Joseph Z.

Abstract
Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov Model (HMM)-based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species has also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated. Supporting website: http://sb.cs.cmu.edu/seecer/.

PubMed ID: 23558750
PMC ID: PMC3664804
Article link: Nucleic Acids Res

Genes referenced: impact LOC100887844 LOC115925415 LOC583082


	Figure 1. An overview of SEECER. Step 1: We select a random read that has not yet been assigned to any contig HMM. Next, we extract all reads with at least k consecutive nucleotides that overlap with the selected read. Step 2: We cluster all reads and then select the most coherent subset as the initial set of the contig HMM. Step 3: We learn an initial HMM using the alignment specified by the k-mer matches of selected reads. Step 4: We use the consensus sequence defined by the contig HMM to extract additional reads from our unassigned set. These additional reads are used to extend the HMM in both directions. Step 5: When no more reads can be found to extend the HMM, we determine for each of the reads that were used to construct the HMM the likelihood of being generated by this contig HMM. For those with a likelihood above a certain threshold, we use the HMM consensus to correct errors. Step 6: We remove the reads that are assigned or corrected from the unassigned pool. See ‘Materials and Methods’ section for complete details.
	Figure 2. Distribution of mismatches to the reference for pair-mapped reads (TopHat alignment) from the human T-cell dataset of 55 M paired-end 45 bp reads. Only reads that are aligned both before and after error correction are shown.
	Figure 3. An illustrative example of how Oases benefits from SEECER error correction. Top: TopHat read alignments in the EIF3CL gene for exons 9–13 before (first track) and after (second track) SEECER correction with human data. Colored dots highlight positions that deviate from the reference sequence in the gray read alignments. Bottom: Summary view of the whole region displaying the longest transfrag assembled. Oases assembled the transcript ENST00000380876 (EIF3CL) to 95% of its length with SEECER-corrected data (red transfrag), whereas it was only assembled to 45% of its length when using the original data (blue transfrag).
	Figure 4. De novo assembly of sea cucumber data. (A) A living sea cucumber Parastichopus parvimensis. (B) Distribution of BlastX matches of sea cucumber transfrags to known sea urchin peptides. The percentages represent the subset of sea urchin peptides that we have significantly matched to at least one transfrag in time point 1 and/or time point 2 and those that were not matched to any transfrag. (C) Ethidium bromide–stained image of PCR products amplified from sea cucumber cDNA. Primer pairs were designed against 14 assembled transfrags, seven of which matched to known peptides of RNAs (top row), and seven others that had no match in the database (bottom row). Standard ladders of 100-bp size are in the first and last lanes. Each lane is followed by the appropriate no-template control to demonstrate that amplification was not due to non-specific contamination.
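
The read-recruitment and correction steps described in the Figure 1 caption (Steps 1 and 5) can be illustrated with a heavily simplified Python sketch. This is not the SEECER implementation: a per-column majority vote stands in for the contig HMM, a mismatch cap stands in for the likelihood threshold, and the clustering and extension steps (Steps 2 to 4) are omitted. All function and parameter names (`kmers`, `build_index`, `recruit`, `consensus`, `correct`, `max_mismatches`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) pairs for every k-length window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(reads, k):
    """Map each k-mer to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for off, kmer in kmers(seq, k):
            index[kmer].append((rid, off))
    return index


def recruit(seed_id, reads, index, k):
    """Step 1 (simplified): collect reads sharing a k-mer with the seed.

    Returns {read_id: shift}, where shift is the read's start position
    relative to the seed's start, taken from the first shared k-mer found.
    """
    shifts = {seed_id: 0}
    for seed_off, kmer in kmers(reads[seed_id], k):
        for rid, off in index.get(kmer, ()):
            shifts.setdefault(rid, seed_off - off)
    return shifts


def consensus(reads, shifts):
    """Majority base per column of the gapless alignment implied by shifts.

    Assumption: this vote replaces the HMM consensus used by SEECER.
    """
    columns = defaultdict(Counter)
    for rid, shift in shifts.items():
        for i, base in enumerate(reads[rid]):
            columns[shift + i][base] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in columns.items()}


def correct(reads, shifts, max_mismatches=1):
    """Step 5 (simplified): rewrite reads that are close to the consensus.

    Assumption: a mismatch cap replaces the HMM likelihood threshold.
    Reads exceeding the cap are left out of the result (not corrected).
    """
    cons = consensus(reads, shifts)
    corrected = {}
    for rid, shift in shifts.items():
        target = "".join(cons[shift + i] for i in range(len(reads[rid])))
        mismatches = sum(a != b for a, b in zip(reads[rid], target))
        if mismatches <= max_mismatches:
            corrected[rid] = target
    return corrected
```

In this sketch a read is only recruited if it shares an exact k-mer with the seed, so a larger `k` reduces spurious recruitment at the cost of missing reads whose errors break every shared k-mer. Because the alignment is gapless and the vote ignores abundance, polymorphisms and splicing, the sketch does not address the problems the abstract says SEECER was designed for.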
