Nucleic Acids Res 2013 May 01;41(10):e109. doi: 10.1093/nar/gkt215.
Probabilistic error correction for RNA sequencing.
Le HS, Schulz MH, McCauley BM, Hinman VF, Bar-Joseph Z.
Abstract
Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov Model (HMM)-based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species has also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated. Supporting website: http://sb.cs.cmu.edu/seecer/.
Figure 1. An overview of SEECER. Step 1: We select a random read that has not yet been assigned to any contig HMM. Next, we extract all reads with at least k consecutive nucleotides that overlap with the selected read. Step 2: We cluster all reads and then select the most coherent subset as the initial set of the contig HMM. Step 3: We learn an initial HMM using the alignment specified by the k-mer matches of selected reads. Step 4: We use the consensus sequence defined by the contig HMM to extract additional reads from our unassigned set. These additional reads are used to extend the HMM in both directions. Step 5: When no more reads can be found to extend the HMM, we determine for each of the reads that were used to construct the HMM the likelihood of being generated by this contig HMM. For those with a likelihood above a certain threshold, we use the HMM consensus to correct errors. Step 6: We remove the reads that are assigned or corrected from the unassigned pool. See ‘Materials and Methods’ section for complete details.
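The read-recruitment and correction steps in Figure 1 can be illustrated with a small sketch. This is not the SEECER implementation: it covers only Step 1 (recruiting reads that share a k-mer with a seed read) and replaces the contig HMM of Steps 3 to 5 with a plain per-column majority vote. Clustering (Step 2) and contig extension (Step 4) are omitted. The function names `build_index`, `recruit` and `consensus_correct`, and the `min_depth` parameter, are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield (offset, k-mer) pairs for every k-mer in the read."""
    for i in range(len(read) - k + 1):
        yield i, read[i:i + k]


def build_index(reads, k):
    """Map each k-mer to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for off, kmer in kmers(read, k):
            index[kmer].append((rid, off))
    return index


def recruit(seed_id, reads, index, k):
    """Step 1 (simplified): find reads sharing at least one k-mer with the seed.

    Returns {read_id: shift}, where shift is the read's start position
    relative to the seed, taken from the first shared k-mer encountered.
    """
    placed = {seed_id: 0}
    for seed_off, kmer in kmers(reads[seed_id], k):
        for rid, off in index[kmer]:
            placed.setdefault(rid, seed_off - off)
    return placed


def consensus_correct(reads, placed, min_depth=3):
    """Majority-vote stand-in for the HMM consensus (assumption, not SEECER's model)."""
    # Count the bases observed in each alignment column.
    columns = defaultdict(Counter)
    for rid, shift in placed.items():
        for i, base in enumerate(reads[rid]):
            columns[shift + i][base] += 1

    corrected = {}
    for rid, shift in placed.items():
        out = []
        for i, base in enumerate(reads[rid]):
            col = columns[shift + i]
            depth = sum(col.values())
            top, count = col.most_common(1)[0]
            # Overwrite only when the column is deep enough and has a strict majority.
            if depth >= min_depth and 2 * count > depth:
                out.append(top)
            else:
                out.append(base)
        corrected[rid] = "".join(out)
    return corrected
```

Calling `build_index(reads, k)`, then `recruit(seed_id, reads, index, k)`, then `consensus_correct(reads, placed)` returns a dictionary of corrected read strings keyed by read index. A majority vote treats every minority base as an error, so unlike the HMM approach described above it would also overwrite genuine polymorphisms and splice variants; it is shown only to make the data flow concrete.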
Figure 2. The distribution of mismatches to the reference for pair-mapped reads (using TopHat alignment) from the human T-cell dataset of 55 M paired-end 45 bp reads. Only reads that are aligned both before and after error correction are shown.
Figure 3. An illustrative example of how Oases benefits from SEECER error correction. Top: TopHat read alignments in the EIF3CL gene for exons 9–13 before (first track) and after (second track) SEECER correction with human data. Colored dots highlight positions with deviations from the reference sequence in the gray read alignments. Bottom: Summary view of the whole region displaying the longest transfrag assembled. Oases assembled the transcript ENST00000380876 (EIF3CL) to 95% of its length with SEECER-corrected data (red transfrag), whereas it was only assembled to 45% of its length when using the original data (blue transfrag).
Figure 4. De novo assembly of sea cucumber data. (A) A living sea cucumber Parastichopus parvimensis. (B) Distribution of BlastX matches of sea cucumber transfrags to known sea urchin peptides. The percentages represent the subset of sea urchin peptides that we have significantly matched to at least one transfrag in time point 1 and/or time point 2 and those that were not matched to any transfrag. (C) Ethidium bromide–stained image of PCR products amplified from sea cucumber cDNA. Primer pairs were designed against 14 assembled transfrags, seven of which matched to known peptides of RNAs (top row), and seven others that had no match in the database (bottom row). Standard ladders of 100-bp size are in the first and last lanes. Each lane is followed by the appropriate no-template control to demonstrate that amplification was not due to non-specific contamination.