Publication:
MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences

dc.bibliographiccitation.artnumber: 406
dc.bibliographiccitation.journal: BMC Bioinformatics
dc.bibliographiccitation.volume: 11
dc.contributor.author: Corel, Eduardo
dc.contributor.author: Pitschi, Florian
dc.contributor.author: Laprevotte, Ivan
dc.contributor.author: Grasseau, Gilles
dc.contributor.author: Didier, Gilles
dc.contributor.author: Devauchelle, Claudine
dc.date.accessioned: 2018-11-07T08:41:15Z
dc.date.available: 2018-11-07T08:41:15Z
dc.date.issued: 2010
dc.description.abstract: Background: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. Results: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). Conclusions: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter kappa of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available.
dc.description.sponsorship: Genopole; Deutsche Forschungsgemeinschaft [MO 1048/6-1]
dc.identifier.doi: 10.1186/1471-2105-11-406
dc.identifier.isi: 000281442500003
dc.identifier.pmid: 20673356
dc.identifier.purl: https://resolver.sub.uni-goettingen.de/purl?gs-1/5666
dc.identifier.uri: https://resolver.sub.uni-goettingen.de/purl?gro-2/19426
dc.item.fulltext: With Fulltext
dc.notes.intern: Merged from goescholar
dc.notes.status: to be checked
dc.notes.submitter: Najko
dc.publisher: BioMed Central Ltd
dc.relation.issn: 1471-2105
dc.rights: CC BY 2.0
dc.rights.uri: https://creativecommons.org/licenses/by/2.0
dc.title: MS4-Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
dc.type: journal_article
dc.type.internalPublication: yes
dc.type.peerReviewed: yes
dc.type.status: published
dc.type.version: published_version
dspace.entity.type: Publication

Files

Original bundle

Now showing 1 - 5 of 12
Name: 1471-2105-11-406-S12.GZ
Size: 74.5 KB
Format: Unknown data format

Name: 1471-2105-11-406-S9.PDF
Size: 31.9 KB
Format: Adobe Portable Document Format

Name: 1471-2105-11-406-S2.PNG
Size: 56.4 KB
Format: Portable Network Graphics

Name: 1471-2105-11-406.pdf
Size: 2.13 MB
Format: Adobe Portable Document Format

Name: 1471-2105-11-406-S4.PNG
Size: 72.37 KB
Format: Portable Network Graphics

Collections