Novel massively parallel sequencing technologies provide a highly detailed view of the transcriptome and genome by yielding deep coverage with short reads, but their utility is limited by considerable sequencing-quality problems and the short length of the reads. Trimming sequencing errors in short reads is therefore a vital step that can improve the success rate of reference mapping as well as polymorphism detection. Our algorithm organizes erroneous short sequences originating from a single abundant sequence into a tree structure, in which each child sequence is considered to be derived stochastically from its more abundant parent sequence through sequencing errors.
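The parent/child idea can be illustrated with a minimal Python sketch. It is not the released source code and makes simplifying assumptions: a child is linked to a parent only when the two differ by exactly one base and the parent is strictly more abundant, and whatever probabilistic criterion the actual tool applies is not modelled. All function names (`hamming_neighbors`, `build_parent_map`, `root_of`, `corrected_counts`) are hypothetical.

```python
from collections import Counter


def hamming_neighbors(seq, alphabet="ACGT"):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def build_parent_map(reads):
    """Count distinct reads and link each one to its most abundant
    one-mismatch neighbor, provided that neighbor is strictly more abundant.
    Assumption: abundance alone decides the link (no statistical test)."""
    counts = Counter(reads)
    parent = {}
    for seq, n in counts.items():
        best = None
        for nb in hamming_neighbors(seq):
            m = counts.get(nb, 0)
            if m > n and (best is None or m > counts[best]):
                best = nb
        parent[seq] = best  # None marks the root of a tree
    return counts, parent


def root_of(seq, parent):
    """Follow parent links up to the root. The walk terminates because
    abundance strictly increases at every step."""
    while parent[seq] is not None:
        seq = parent[seq]
    return seq


def corrected_counts(reads):
    """Reassign the count of every read to the root of its tree."""
    counts, parent = build_parent_map(reads)
    merged = Counter()
    for seq, n in counts.items():
        merged[root_of(seq, parent)] += n
    return merged


if __name__ == "__main__":
    reads = ["ACGT"] * 10 + ["ACGA"] * 2 + ["ACCA"] + ["TTTT"] * 5
    print(corrected_counts(reads))
```

In this toy input, `ACCA` attaches to `ACGA`, which in turn attaches to `ACGT`, so both rare variants are folded into the abundant root, while `TTTT` has no more abundant neighbor and remains its own root.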
Source code
Publication
Efficient frequency-based de novo short read clustering for error trimming in next-generation sequencing.
Genome Res. 2009. 19:1309-1315
Data used in the published paper: SRA003629